user/sven/linux.git - Linux Kernel

Age	Commit message (Collapse)	Author
2006-05-21	[PATCH] Fix a NO_IDLE_HZ timer bug	Zachary Amsden
	Under certain timing conditions, a race during boot occurs where timer ticks are being processed on remote CPUs. The remote timer ticks can increment jiffies, and if this happens during a window when a timeout is very close to expiring but a local tick has not yet been delivered, you can end up with 1) No softirq pending 2) A local timer wheel which is not synced to jiffies 3) No high resolution timer active 4) A local timer which is supposed to fire before the current jiffies value. In this circumstance, the comparison in next_timer_interrupt overflows, because the base of the comparison for high resolution timers is jiffies, but for the softirq timer wheel, it is relative the the current base of the wheel (jiffies_base). Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26	[PATCH] Remove __devinit and __cpuinit from notifier_call definitions	Chandra Seetharaman
	Few of the notifier_chain_register() callers use __init in the definition of notifier_call. It is incorrect as the function definition should be available after the initializations (they do not unregister them during initializations). This patch fixes all such usages to _not_ have the notifier_call __init section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26	[PATCH] Remove __devinitdata from notifier block definitions	Chandra Seetharaman
	Few of the notifier_chain_register() callers use __devinitdata in the definition of notifier_block data structure. It is incorrect as the data structure should be available after the initializations (they do not unregister them during initializations). This was leading to an oops when notifier_chain_register() call is invoked for those callback chains after initialization. This patch fixes all such usages to _not_ have the notifier_block data structure in the init data section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11	[PATCH] timer initialisation fix	Andrew Morton
	We need the boot CPU's tvec_bases[] entry to be initialised super-early in boot, for early_serial_setup(). That runs within setup_arch(), before even per-cpu areas are initialised. The patch changes tvec_bases to use compile-time initialisation, and adds a separate array `tvec_base_done' to keep track of which CPU has had its tvec_bases[] entry initialised (because we can no longer use the zeroness of that tvec_bases[] entry to determine whether it has been initialised). Thanks to Eugene Surovegin <ebs@ebshome.net> for diagnosing this. Cc: Eugene Surovegin <ebs@ebshome.net> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09	[PATCH] x86_64: Fix drift with HPET timer enabled	Jordan Hargrave
	If the HPET timer is enabled, the clock can drift by ~3 seconds a day. This is due to the HPET timer not being initialized with the correct setting (still using PIT count). If HZ changes, this drift can become even more pronounced. HPET patch initializes tick_nsec with correct tick_nsec settings for HPET timer. Vojtech comments: "It's not entirely correct (it assumes the HPET ticks totally exactly), but it's significantly better than assuming the PIT error there." Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-02	BUG_ON() Conversion in kernel/signal.c	Eric Sesterhenn
	this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-31	[PATCH] sched: reduce overhead of calc_load	Jack Steiner
	Currently, count_active_tasks() calls both nr_running() & nr_interruptible(). Each of these functions does a "for_each_cpu" & reads values from the runqueue of each cpu. Although this is not a lot of instructions, each runqueue may be located on different node. Depending on the architecture, a unique TLB entry may be required to access each runqueue. Since there may be more runqueues than cpu TLB entries, a scan of all runqueues can trash the TLB. Each memory reference incurs a TLB miss & refill. In addition, the runqueue cacheline that contains nr_running & nr_uninterruptible may be evicted from the cache between the two passes. This causes unnecessary cache misses. Combining nr_running() & nr_interruptible() into a single function substantially reduces the TLB & cache misses on large systems. This should have no measureable effect on smaller systems. On a 128p IA64 system running a memory stress workload, the new function reduced the overhead of calc_load() from 605 usec/call to 324 usec/call. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31	[PATCH] __mod_timer: simplify ->base changing	Oleg Nesterov
	Since base and new_base are of the same type now, we can save one 'if' branch and simplify the code a bit. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31	[PATCH] kill __init_timer_base in favor of boot_tvec_bases	Oleg Nesterov
	Commit a4a6198b80cf82eb8160603c98da218d1bd5e104: [PATCH] tvec_bases too large for per-cpu data introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at compile time. This means we can kill __init_timer_base and move timer_base_s's content into tvec_t_base_s. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25	[PATCH] remove pps support	Roman Zippel
	This removes the support for pps. It's completely unused within the kernel and is basically in the way for further cleanups. It should be easier to readd proper support for it after the rest has been converted to NTP4 (where the pps mechanisms are quite different from NTP3 anyway). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Adrian Bunk <bunk@stusta.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25	[PATCH] sys_alarm() unsigned signed conversion fixup	Thomas Gleixner
	alarm() calls the kernel with an unsigend int timeout in seconds. The value is stored in the tv_sec field of a struct timeval to setup the itimer. The tv_sec field of struct timeval is of type long, which causes the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX. Before the hrtimer merge (pre 2.6.16) such a negative value was converted to the maximum jiffies timeout by the timeval_to_jiffies conversion. It's not clear whether this was intended or just happened to be done by the timeval_to_jiffies code. hrtimers expect a timeval in canonical form and treat a negative timeout as already expired. This breaks the legitimate usage of alarm() with a timeout value > INT_MAX seconds. For 32 bit machines it is therefor necessary to limit the internal seconds value to avoid API breakage. Instead of doing this in all implementations of sys_alarm the duplicated sys_alarm code is moved into a common function in itimer.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24	[PATCH] timer-irq-driven soft-watchdog, cleanups	Ingo Molnar
	Make the softlockup detector purely timer-interrupt driven, removing softirq-context (timer) dependencies. This means that if the softlockup watchdog triggers, it has truly observed a longer than 10 seconds scheduling delay of a SCHED_FIFO prio 99 task. (the patch also turns off the softlockup detector during the initial bootup phase and does small style fixes) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24	[PATCH] tvec_bases too large for per-cpu data	Jan Beulich
	With internal Xen-enabled kernels we see the kernel's static per-cpu data area exceed the limit of 32k on x86-64, and even native x86-64 kernels get fairly close to that limit. I generally question whether it is reasonable to have data structures several kb in size allocated as per-cpu data when the space there is rather limited. The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit archs, over 8k on 64-bit ones), which now gets converted to use dynamically allocated memory instead. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17	[PATCH] time_interpolator: add __read_mostly	Christoph Lameter
	The pointer to the current time interpolator and the current list of time interpolators are typically only changed during bootup. Adding __read_mostly takes them away from possibly hot cachelines. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06	[PATCH] time: add barrier after updating jiffies_64	Atsushi Nemoto
	Add a compiler barrier so that we don't read jiffies before updating jiffies_64. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06	[PATCH] fix next_timer_interrupt() for hrtimer	Tony Lindgren
	Also from Thomas Gleixner <tglx@linutronix.de> Function next_timer_interrupt() got broken with a recent patch 6ba1b91213e81aa92b5cf7539f7d2a94ff54947c as sys_nanosleep() was moved to hrtimer. This broke things as next_timer_interrupt() did not check hrtimer tree for next event. Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ, VST) implementations, as the system can be in idle when next hrtimer event was supposed to happen. At least ARM and S390 currently use next_timer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02	[PATCH] time_interpolator: Use readq_relaxed() instead of readq().	Christoph Lameter
	On some platforms readq performs additional work to make sure I/O is done in a coherent way. This is not needed for time retrieval as done by the time interpolator. So we can use readq_relaxed instead which will improve performance. It affects sparc64 and ia64 only. Apparently it makes a significant difference on ia64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: john stultz <johnstul@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17	[PATCH] Provide an interface for getting the current tick length	Paul Mackerras
	This provides an interface for arch code to find out how many nanoseconds are going to be added on to xtime by the next call to do_timer. The value returned is a fixed-point number in 52.12 format in nanoseconds. The reason for this format is that it gives the full precision that the timekeeping code is using internally. The motivation for this is to fix a problem that has arisen on 32-bit powerpc in that the value returned by do_gettimeofday drifts apart from xtime if NTP is being used. PowerPC is now using a lockless do_gettimeofday based on reading the timebase register and performing some simple arithmetic. (This method of getting the time is also exported to userspace via the VDSO.) However, the factor and offset it uses were calculated based on the nominal tick length and weren't being adjusted when NTP varied the tick length. Note that 64-bit powerpc has had the lockless do_gettimeofday for a long time now. It also had an extremely hairy routine that got called from the 32-bit compat routine for adjtimex, which adjusted the factor and offset according to what it thought the timekeeping code was going to do. Not only was this only called if a 32-bit task did adjtimex (i.e. not if a 64-bit task did adjtimex), it was also duplicating computations from kernel/timer.c and it wasn't clear that it was (still) correct. The simple solution is to ask the timekeeping code how long the current jiffy will be on each timer interrupt, after calling do_timer. If this jiffy will be a different length from the last one, we then need to compute new values for the factor and offset used in the lockless do_gettimeofday. In this way we can keep xtime and do_gettimeofday in sync, even when NTP is varying the tick length. Note that when adjtimex varies the tick length, it almost always introduces the variation from the next tick on. The only case I could see where adjtimex would vary the length of the current tick is when an old-style adjtime adjustment is being cancelled. (It's not clear to me why the adjustment has to be cancelled immediately rather than from the next tick on.) Thus I don't see any real need for a hook in adjtimex; the rare case of an old-style adjustment being cancelled can be fixed up at the next tick. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07	[PATCH] timer.c NULL noise removal	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-01-10	[PATCH] hrtimer: switch sys_nanosleep to hrtimer	Thomas Gleixner
	convert sys_nanosleep() to use hrtimer_nanosleep() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10	[PATCH] hrtimer: hrtimer core code	Thomas Gleixner
	hrtimer subsystem core. It is initialized at bootup and expired by the timer interrupt, but is otherwise not utilized by any other subsystem yet. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08	[PATCH] kernel/: small cleanups	Adrian Bunk
	This patch contains the following cleanups: - make needlessly global functions static - every file should include the headers containing the prototypes for it's global functions Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] jiffies_64 cleanup	Thomas Gleixner
	Define jiffies_64 in kernel/timer.c rather than having 24 duplicated defines in each architecture. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] remove timer debug field	Andrew Morton
	Remove timer_list.magic and associated debugging code. I originally added this when a spinlock was added to timer_list - this meant that an all-zeroes timer became illegal and init_timer() was required. That spinlock isn't even there any more, although timer.base must now be initialised. I'll keep this debugging code in -mm. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] ntp whitespace cleanup	Andrew Morton
	Fix bizarre 4-space coding style in the NTP code. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] NTP shift_right cleanup	john stultz
	Create a macro shift_right() that avoids the numerous ugly conditionals in the NTP code that look like: if(a < 0) b = -(-a >> shift); else b = a >> shift; Replacing it with: b = shift_right(a, shift); This should have zero effect on the logic, however it should probably have a bit of testing just to be sure. Also replace open-coded min/max with the macros. Signed-off-by : John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] introduce setup_timer() helper	Oleg Nesterov
	Every user of init_timer() also needs to initialize ->function and ->data fields. This patch adds a simple setup_timer() helper for that. The schedule_timeout() is patched as an example of usage. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] TIMERS: add missing compensation for HZ == 250	YOSHIFUJI Hideaki
	Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in second_overflow(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13	[PATCH] schedule_timeout_[un]interruptible() speedup	Andrew Morton
	These functions don't need schedule_timeout()'s barrier. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10	[PATCH] kernel: fix-up schedule_timeout() usage	Nishanth Aravamudan
	Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10	[PATCH] add schedule_timeout_{,un}interruptible() interfaces	Nishanth Aravamudan
	Add schedule_timeout_{,un}interruptible() interfaces so that schedule_timeout() callers don't have to worry about forgetting to add the set_current_state() call beforehand. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] optimize writer path in time_interpolator_get_counter()	Alex Williamson
	Christoph Lameter <clameter@engr.sgi.com> When using a time interpolator that is susceptible to jitter there's potentially contention over a cmpxchg used to prevent time from going backwards. This is unnecessary when the caller holds the xtime write seqlock as all readers will be blocked from returning until the write is complete. We can therefore allow writers to insert a new value and exit rather than fight with CPUs who only hold a reader lock. Signed-off-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] detect soft lockups	Ingo Molnar
	This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP. When enabled then per-CPU watchdog threads are started, which try to run once per second. If they get delayed for more than 10 seconds then a callback from the timer interrupt detects this condition and prints out a warning message and a stack dump (once per lockup incident). The feature is otherwise non-intrusive, it doesnt try to unlock the box in any way, it only gets the debug info out, automatically, and on all CPUs affected by the lockup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de> Signed-off-by: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-23	[PATCH] preempt race in getppid	David Meybohm
	With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to return a bogus value if the parent's task_struct gets reallocated after current->group_leader->real_parent is read: asmlinkage long sys_getppid(void) { int pid; struct task_struct me = current; struct task_struct parent; parent = me->group_leader->real_parent; RACE HERE => for (;;) { pid = parent->tgid; #ifdef CONFIG_SMP { struct task_struct old = parent; / * Make sure we read the pid before re-reading the * parent pointer: */ smp_rmb(); parent = me->group_leader->real_parent; if (old != parent) continue; } #endif break; } return pid; } If the process gets preempted at the indicated point, the parent process can go ahead and call exit() and then get wait()'d on to reap its task_struct. When the preempted process gets resumed, it will not do any further checks of the parent pointer on !CONFIG_SMP: it will read the bad pid and return. So, the same algorithm used when SMP is enabled should be used when preempt is enabled, which will recheck ->real_parent in this case. Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25	[PATCH] kernel/timer: fix msleep_interruptible() comment	Domen Puncer
	The comment for msleep_interruptible() is wrong, as it will ignore wait-queue events, but will wake up early for signals. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23	[PATCH] preempt_count is int - remove cast and don't assign to unsigned type	Jesper Juhl
	In kernel/sched.c the return value from preempt_count() is cast to an int. That made sense when preempt_count was defined as different types on is not needed and should go away. The patch removes the cast. In kernel/timer.c the return value from preempt_count() is assigned to a variable of type u32 and then that unsigned value is later compared to preempt_count(). Since preempt_count() returns an int, an int is what should be used to store its return value. Storing the result in an unsigned 32bit integer made a tiny bit of sense back when preempt_count was different types on different archs, but no more - let's not play signed vs unsigned comparison games when we don't have to. The patch modifies the code to use an int to hold the value. While I was around that bit of code I also made two changes to a nearby (related) printk() - I modified it to specify the loglevel explicitly and also broke the line into a few pieces to avoid it being longer than 80 chars and clarified the text a bit. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23	[PATCH] timers: introduce try_to_del_timer_sync()	Oleg Nesterov
	This patch splits del_timer_sync() into 2 functions. The new one, try_to_del_timer_sync(), returns -1 when it hits executing timer. It can be used in interrupt context, or when the caller hold locks which can prevent completion of the timer's handler. NOTE. Currently it can't be used in interrupt context in UP case, because ->running_timer is used only with CONFIG_SMP. Should the need arise, it is possible to kill #ifdef CONFIG_SMP in set_running_timer(), it is cheap. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23	[PATCH] timers fixes/improvements	Oleg Nesterov
	This patch tries to solve following problems: 1. del_timer_sync() is racy. The timer can be fired again after del_timer_sync have checked all cpus and before it will recheck timer_pending(). 2. It has scalability problems. All cpus are scanned to determine if the timer is running on that cpu. With this patch del_timer_sync is O(1) and no slower than plain del_timer(pending_timer), unless it has to actually wait for completion of the currently running timer. The only restriction is that the recurring timer should not use add_timer_on(). 3. The timers are not serialized wrt to itself. If CPU_0 does mod_timer(jiffies+1) while the timer is currently running on CPU 1, it is quite possible that local interrupt on CPU_0 will start that timer before it finished on CPU_1. 4. The timers locking is suboptimal. __mod_timer() takes 3 locks at once and still requires wmb() in del_timer/run_timers. The new implementation takes 2 locks sequentially and does not need memory barriers. Currently ->base != NULL means that the timer is pending. In that case ->base.lock is used to lock the timer. __mod_timer also takes timer->lock because ->base can be == NULL. This patch uses timer->entry.next != NULL as indication that the timer is pending. So it does __list_del(), entry->next = NULL instead of list_del() when the timer is deleted. The ->base field is used for hashed locking only, it is initialized in init_timer() which sets ->base = per_cpu(tvec_bases). When the tvec_bases.lock is locked, it means that all timers which are tied to this base via timer->base are locked, and the base itself is locked too. So __run_timers/migrate_timers can safely modify all timers which could be found on ->tvX lists (pending timers). When the timer's base is locked, and the timer removed from ->entry list (which means that _run_timers/migrate_timers can't see this timer), it is possible to set timer->base = NULL and drop the lock: the timer remains locked. This patch adds lock_timer_base() helper, which waits for ->base != NULL, locks the ->base, and checks it is still the same. __mod_timer() schedules the timer on the local CPU and changes it's base. However, it does not lock both old and new bases at once. It locks the timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base, and adds this timer. This simplifies the code, because AB-BA deadlock is not possible. __mod_timer() also ensures that the timer's base is not changed while the timer's handler is running on the old base. __run_timers(), del_timer() do not change ->base anymore, they only clear pending flag. So del_timer_sync() can test timer->base->running_timer == timer to detect whether it is running or not. We don't need timer_list->lock anymore, this patch kills it. We also don't need barriers. del_timer() and __run_timers() used smp_wmb() before clearing timer's pending flag. It was needed because __mod_timer() did not lock old_base if the timer is not pending, so __mod_timer()->list_add() could race with del_timer()->list_del(). With this patch these functions are serialized through base->lock. One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch adds global struct timer_base_s { spinlock_t lock; struct timer_list *running_timer; } __init_timer_base; which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s struct are replaced by struct timer_base_s t_base. It is indeed ugly. But this can't have scalability problems. The global __init_timer_base.lock is used only when __mod_timer() is called for the first time AND the timer was compile time initialized. After that the timer migrates to the local CPU. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01	[PATCH] use smp_mb/wmb/rmb where possible	akpm@osdl.org
	Replace a number of memory barriers with smp_ variants. This means we won't take the unnecessary hit on UP machines. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-11	[PATCH] Make lots of things static	Adrian Bunk
	This is a megarollup of ~60 patches which give various things static scope. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-07	[PATCH] posix-timers: CPU clock support for POSIX timers	Roland McGrath
	POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME, not only the clock_* calls but also timer_* calls must support the thread and process CPU time clocks. This patch provides that support, building on my recent additions to support these clocks in the POSIX clock_* interfaces. This patch will not work without those changes, as well as the patch fixing the timer lock-siglock deadlock problem. The apparent pervasive changes to posix-timers.c are simply that some fields of struct k_itimer have changed name and moved into a union. This was appropriate since the data structures required for the existing real-time timer support and for the new thread/process CPU-time timers are quite different. The glibc patches to support CPU time clocks using the new kernel support is in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and that includes tests for the timer support (if you build glibc with NPTL). From: Christoph Lameter <clameter@sgi.com> Your patch breaks the mmtimer driver because it used k_itimer values for its own purposes. Here is a fix by defining an additional structure in k_itimer (same approach for mmtimer as the cpu timers): From: Roland McGrath <roland@redhat.com> Fix bug identified by Alexander Nyberg <alexn@dsv.su.se> > The problem arises from code touching the union in alloc_posix_timer() > which makes firing go non-zero. When firing is checked in > posix_cpu_timer_set() it will be positive causing an infinite loop. > > So either the below fix or preferably move the INIT_LIST_HEAD(x) from > alloc_posix_timer() to somewhere later where it doesn't disturb the other > union members. Thanks for finding this problem. The latter is what I think is the right solution. This patch does that, and also removes some superfluous rezeroing. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-07	[PATCH] base-small: shrink timer hashes	Matt Mackall
	CONFIG_BASE_SMALL reduce timer list hashes Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-02-14	[TIMER]: Export avenrun for packet scheduler meta ematch.	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2005-01-20	[PATCH] avoid sparse warning due to time-interpolator	David Mosberger
	The "addr" member in the time-interpolator is sometimes used as a function-pointer and sometimes as an I/O-memory pointer. The attached patch tells sparse that this is OK. Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11	[PATCH] cputime: introduce cputime	Martin Schwidefsky
	This patch introduces the concept of (virtual) cputime. Each architecture can define its method to measure cputime. The main idea is to define a cputime_t type and a set of operations on it (see asm-generic/cputime.h). Then use the type for utime, stime, cutime, cstime, it_virt_value, it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations for each access to these variables. The default implementation is jiffies based and the effect of this patch for architectures which use the default implementation should be neglectible. There is a second type cputime64_t which is necessary for the kernel_stat cpu statistics. The default cputime_t is 32 bit and based on HZ, this will overflow after 49.7 days. This is not enough for kernel_stat (ihmo not enough for a processes too), so it is necessary to have a 64 bit type. The third thing that gets introduced by this patch is an additional field for the /proc/stat interface: cpu steal time. An architecture can account cpu steal time by calls to the account_stealtime function. The cpu which backs a virtual processor doesn't spent all of its time for the virtual cpu. To get meaningful cpu usage numbers this involuntary wait time needs to be accounted and exported to user space. From: Hugh Dickins <hugh@veritas.com> The p->signal check in account_system_time is insufficient. If the timer interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set, another cpu may release_task (NULLifying p->signal) in between account_system_time's check and check_rlimit's dereference. Nor should account_it_prof risk send_sig. But surely account_user_time is safe? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] Fix kernel/timer.c comment typo	Vasia Pupkin
	Signed-off-by: Vasia Pupkin <ptushnik@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] Lock initializer cleanup (Core)	Thomas Gleixner
	Kernel core files converted to use the new lock initializers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] remove the BKL by turning it into a semaphore	Ingo Molnar
	This is the current remove-BKL patch. I test-booted it on x86 and x64, trying every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other architectures should compile as well. (most of the testing was done with the zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should work fine.) this is the debugging-enabled variant of the patch which has two main debugging features: - debug potentially illegal smp_processor_id() use. Has caught a number of real bugs - e.g. look at the printk.c fix in the patch. - make it possible to enable/disable the BKL via a .config. If this goes upstream we dont want this of course, but for now it gives people a chance to find out whether any particular problem was caused by this patch. This patch has one important fix over the previous BKL patch: on PREEMPT kernels if we preempted BKL-using code then the code still auto-dropped the BKL by mistake. This caused a number of breakages for testers, which breakages went away once this bug was fixed. Also the debugging mechanism has been improved alot relative to the previous BKL patch. Would be nice to test-drive this in -mm. There will likely be some more smp_processor_id() false positives but they are 1) harmless 2) easy to fix up. We could as well find more real smp_processor_id() related breakages as well. The most noteworthy fact is that no BKL-using code was found yet that relied on smp_processor_id(), which is promising from a compatibility POV. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] task_struct.exit_state usage	Roland McGrath
	I just did a quick audit of the use of exit_state and the EXIT_* bit macros. I guess I didn't really review these changes very closely when you did them originally. :-( I found several places that seem like lossy cases of query-replace without enough thought about the code. Linus has previously said the >= tests ought to be & tests instead. But for exit_state, it can only ever be 0, EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as testing & (EXIT_DEAD\|EXIT_ZOMBIE), and maybe its code is a tiny bit better. The case like in choose_new_parent is just confusing, to have the always-false test for EXIT_* bits in ->state there too. The two cases in wants_signal and do_process_times are actual regressions that will give us back old bugs in race conditions. These places had s/TASK/EXIT/ but not s/state/exit_state/, and now there tests for exiting tasks are now wrong and never catching them. I take it back: there is no regression in wants_signal in practice I think, because of the PF_EXITING test that makes the EXIT_* state checks superfluous anyway. So that is just another cosmetic case of confusing code. But in do_process_times, there is that SIGXCPU-while-exiting race condition back again. Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-21	[PATCH] del_timer() vs. mod_timer() SMP race	Benjamin Herrenschmidt
	We just spent some days fighting a rare race in one of the distro's who backported some of timer.c from 2.6 to 2.4 (though they missed a bit). The actual race we found didn't happen in 2.6 _but_ code inspection showed that a similar race is still present in 2.6, explanation below: Code removing a timer from a list (run_timers or del_timer) takes that CPU list lock, does list_del, then timer->base = NULL. It is mandatory that this timer->base = NULL is visible to other CPUs only after the list_del() is complete. If not, then mod timer could see it NULL, thus take it's own CPU list lock and not the one for the CPU the timer was beeing removed from the list, and thus the list_add in mod_timer() could race with the list_del() from run_timers() or del_timer(). Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the right spot in 2.6, but didn't in the "backport" we were fighting with. However, del_timer() doesn't have such a barrier, and thus is subject to this race in 2.6 as well. This patch fixes it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>