| | | |
|---|---|---|
| author | Ingo Molnar <mingo@elte.hu> | 2003-05-24 21:50:32 -0700 |
| committer | Linus Torvalds <torvalds@home.transmeta.com> | 2003-05-24 21:50:32 -0700 |
| commit | 7149345c76a810e3fb8cc9b58706027c310497b3 | |
| tree | 633eaf761ddfb11284ecfd1fcaefaa8ca8955e5e /include/linux | |
| parent | 9bda5f681fd216b6a856b72940195e5335317bf1 | |
[PATCH] support "requeueing" futexes
This addresses a futex-related SMP scalability problem in
glibc. A number of regressions have been reported on the NPTL mailing list
when scaling to many CPUs, for applications that use condition variables and
the pthread_cond_broadcast() API call. With this functionality, test code
shows a slowdown from 0.12 seconds of runtime to over 237 seconds (!) of
runtime, on 4-CPU systems.
pthread condition variables use two futex-backed mutex-like locks: an
internal one for the glibc CV state itself, and a user-supplied mutex
which the API guarantees to take in certain codepaths. (Unfortunately the
user-supplied mutex cannot be used to protect the CV state, so we have to
deal with two locks.)
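For reference, below is a minimal sketch (not from the original posting) of the standard usage pattern that exercises both locks; the names `lock`, `cond`, and `ready` are illustrative:

```c
#include <pthread.h>

/* User-supplied mutex and condition variable; glibc additionally keeps
 * its own futex-backed lock inside the pthread_cond_t. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready;

static void *waiter(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!ready)			/* every woken waiter re-takes 'lock' */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return arg;
}

static void wake_everyone(void)
{
	pthread_mutex_lock(&lock);
	ready = 1;
	/* Wakes all waiters; without requeueing they all race for 'lock'. */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
}
```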
The cause of the slowdown is a 'swarm effect': if lots of threads are
blocked on a condition variable and pthread_cond_broadcast() is done, then
glibc first does a FUTEX_WAKE on the cv-internal mutex, then a
mutex_down() on the user-supplied mutex. I.e., a swarm of threads is created,
all racing to serialize on the user-supplied mutex. The more threads
are used, the more likely it becomes that the scheduler will balance them
over to other CPUs - where they just schedule in, try to lock the mutex, and
go back to sleep. This 'swarm effect' is purely technical, a side-effect of
glibc's use of futexes and the imperfect coupling of the two locks.
The solution to this problem is not to wake up the swarm of threads, but to
'requeue' them from the CV-internal mutex to the user-supplied mutex. The
attached patch adds the FUTEX_REQUEUE feature: FUTEX_REQUEUE requeues N
threads from futex address A to futex address B.
This way glibc can wake up a single thread (which will take the
user-mutex) and requeue the rest, all with a single system call.
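From userspace the operation looks roughly like the sketch below. This is an assumption based on how futex(2) later documented FUTEX_REQUEUE (the requeue limit travels in the otherwise-unused timeout argument slot), not code from the patch itself; `do_requeue`, `cond_futex`, and `mutex_futex` are illustrative names:

```c
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wake nr_wake waiters blocked on 'from' and move up to nr_requeue of
 * the remaining waiters onto 'to', in one system call.  Per futex(2),
 * the requeue limit is passed through the timeout argument slot. */
static long do_requeue(unsigned int *from, unsigned int *to,
		       int nr_wake, int nr_requeue)
{
	return syscall(SYS_futex, from, FUTEX_REQUEUE, nr_wake,
		       (void *)(unsigned long)nr_requeue, to, 0);
}

/* What a requeue-aware pthread_cond_broadcast() effectively does:
 *	do_requeue(&cond_futex, &mutex_futex, 1, INT_MAX);
 * i.e. wake one thread and requeue everyone else onto the mutex. */
```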
Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a
number of people have tested it over the past couple of weeks. Here are
the measurements done by Saurabh Desai:
System: 4xPIII 700MHz

| test | NPTL | 1p | 2p | 4p |
|---|---|---|---|---|
| ./cond-perf -r 100 -n 200 | Default | 0.120s | 0.211s | 237.407s |
| ./cond-perf -r 100 -n 200 | requeue | 0.124s | 0.156s | 0.040s |
| ./cond-perf -r 1000 -n 100 | Default | 0.276s | 0.412s | 0.530s |
| ./cond-perf -r 1000 -n 100 | requeue | 0.349s | 0.503s | 0.550s |
| ./pp -v -n 128 -i 1000 -S 32768 | Default | 1.111s | 1.270s | 16.894s |
| ./pp -v -n 128 -i 1000 -S 32768 | requeue | 1.111s | 1.959s | 2.426s |
| ./pp -v -n 1024 -i 10 -S 32768 | Default | 0.181s | 0.394s | incomplete after 2m+ |
| ./pp -v -n 1024 -i 10 -S 32768 | requeue | 0.166s | 0.254s | 0.341s |

(The pp runs report the time to complete 128 and 1024 games, respectively.)
The speedup with an increasing number of threads is quite significant; in
the 128-thread case it is more than 8x. In the cond-perf test on 4 CPUs
it is almost infinitely faster than the 'swarm of threads' catastrophe
triggered by the old code.
Diffstat (limited to 'include/linux')

| | | |
|---|---|---|
| -rw-r--r-- | include/linux/futex.h | 8 |

1 file changed, 7 insertions(+), 1 deletion(-)
```diff
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b91878c07352..c76dd1ee3076 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -2,10 +2,16 @@
 #define _LINUX_FUTEX_H
 
 /* Second argument to futex syscall */
+
+
 #define FUTEX_WAIT (0)
 #define FUTEX_WAKE (1)
 #define FUTEX_FD (2)
+#define FUTEX_REQUEUE (3)
+
+
+asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
+	struct timespec __user *utime, u32 __user *uaddr2);
-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime);
 
 #endif
```
