summaryrefslogtreecommitdiff
path: root/kernel/timer.c
AgeCommit message (Collapse)Author
2006-11-24fix sys_getppid oopses on debug kernelKirill Korotaev
sys_getppid() optimization can access a freed memory. On kernels with DEBUG_SLAB turned ON, this results in Oops. As Dave Hansen noted, this optimization is also unsafe for memory hotplug. So this patch always takes the lock to be safe. Signed-off-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-17[PATCH] time_interpolator: add __read_mostlyChristoph Lameter
The pointer to the current time interpolator and the current list of time interpolators are typically only changed during bootup. Adding __read_mostly takes them away from possibly hot cachelines. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] time: add barrier after updating jiffies_64Atsushi Nemoto
Add a compiler barrier so that we don't read jiffies before updating jiffies_64. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] fix next_timer_interrupt() for hrtimerTony Lindgren
Also from Thomas Gleixner <tglx@linutronix.de> Function next_timer_interrupt() got broken with a recent patch 6ba1b91213e81aa92b5cf7539f7d2a94ff54947c as sys_nanosleep() was moved to hrtimer. This broke things as next_timer_interrupt() did not check hrtimer tree for next event. Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ, VST) implementations, as the system can be in idle when next hrtimer event was supposed to happen. At least ARM and S390 currently use next_timer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] time_interpolator: Use readq_relaxed() instead of readq().Christoph Lameter
On some platforms readq performs additional work to make sure I/O is done in a coherent way. This is not needed for time retrieval as done by the time interpolator. So we can use readq_relaxed instead which will improve performance. It affects sparc64 and ia64 only. Apparently it makes a significant difference on ia64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: john stultz <johnstul@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] Provide an interface for getting the current tick lengthPaul Mackerras
This provides an interface for arch code to find out how many nanoseconds are going to be added on to xtime by the next call to do_timer. The value returned is a fixed-point number in 52.12 format in nanoseconds. The reason for this format is that it gives the full precision that the timekeeping code is using internally. The motivation for this is to fix a problem that has arisen on 32-bit powerpc in that the value returned by do_gettimeofday drifts apart from xtime if NTP is being used. PowerPC is now using a lockless do_gettimeofday based on reading the timebase register and performing some simple arithmetic. (This method of getting the time is also exported to userspace via the VDSO.) However, the factor and offset it uses were calculated based on the nominal tick length and weren't being adjusted when NTP varied the tick length. Note that 64-bit powerpc has had the lockless do_gettimeofday for a long time now. It also had an extremely hairy routine that got called from the 32-bit compat routine for adjtimex, which adjusted the factor and offset according to what it thought the timekeeping code was going to do. Not only was this only called if a 32-bit task did adjtimex (i.e. not if a 64-bit task did adjtimex), it was also duplicating computations from kernel/timer.c and it wasn't clear that it was (still) correct. The simple solution is to ask the timekeeping code how long the current jiffy will be on each timer interrupt, after calling do_timer. If this jiffy will be a different length from the last one, we then need to compute new values for the factor and offset used in the lockless do_gettimeofday. In this way we can keep xtime and do_gettimeofday in sync, even when NTP is varying the tick length. Note that when adjtimex varies the tick length, it almost always introduces the variation from the next tick on. The only case I could see where adjtimex would vary the length of the current tick is when an old-style adjtime adjustment is being cancelled. (It's not clear to me why the adjustment has to be cancelled immediately rather than from the next tick on.) Thus I don't see any real need for a hook in adjtimex; the rare case of an old-style adjustment being cancelled can be fixed up at the next tick. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] timer.c NULL noise removalAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-01-10[PATCH] hrtimer: switch sys_nanosleep to hrtimerThomas Gleixner
convert sys_nanosleep() to use hrtimer_nanosleep() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: hrtimer core codeThomas Gleixner
hrtimer subsystem core. It is initialized at bootup and expired by the timer interrupt, but is otherwise not utilized by any other subsystem yet. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kernel/: small cleanupsAdrian Bunk
This patch contains the following cleanups: - make needlessly global functions static - every file should include the headers containing the prototypes for it's global functions Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] jiffies_64 cleanupThomas Gleixner
Define jiffies_64 in kernel/timer.c rather than having 24 duplicated defines in each architecture. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] remove timer debug fieldAndrew Morton
Remove timer_list.magic and associated debugging code. I originally added this when a spinlock was added to timer_list - this meant that an all-zeroes timer became illegal and init_timer() was required. That spinlock isn't even there any more, although timer.base must now be initialised. I'll keep this debugging code in -mm. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] ntp whitespace cleanupAndrew Morton
Fix bizarre 4-space coding style in the NTP code. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] NTP shift_right cleanupjohn stultz
Create a macro shift_right() that avoids the numerous ugly conditionals in the NTP code that look like: if(a < 0) b = -(-a >> shift); else b = a >> shift; Replacing it with: b = shift_right(a, shift); This should have zero effect on the logic, however it should probably have a bit of testing just to be sure. Also replace open-coded min/max with the macros. Signed-off-by : John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] introduce setup_timer() helperOleg Nesterov
Every user of init_timer() also needs to initialize ->function and ->data fields. This patch adds a simple setup_timer() helper for that. The schedule_timeout() is patched as an example of usage. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] TIMERS: add missing compensation for HZ == 250YOSHIFUJI Hideaki
Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in second_overflow(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13[PATCH] schedule_timeout_[un]interruptible() speedupAndrew Morton
These functions don't need schedule_timeout()'s barrier. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] kernel: fix-up schedule_timeout() usageNishanth Aravamudan
Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] add schedule_timeout_{,un}interruptible() interfacesNishanth Aravamudan
Add schedule_timeout_{,un}interruptible() interfaces so that schedule_timeout() callers don't have to worry about forgetting to add the set_current_state() call beforehand. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] optimize writer path in time_interpolator_get_counter()Alex Williamson
Christoph Lameter <clameter@engr.sgi.com> When using a time interpolator that is susceptible to jitter there's potentially contention over a cmpxchg used to prevent time from going backwards. This is unnecessary when the caller holds the xtime write seqlock as all readers will be blocked from returning until the write is complete. We can therefore allow writers to insert a new value and exit rather than fight with CPUs who only hold a reader lock. Signed-off-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] detect soft lockupsIngo Molnar
This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP. When enabled then per-CPU watchdog threads are started, which try to run once per second. If they get delayed for more than 10 seconds then a callback from the timer interrupt detects this condition and prints out a warning message and a stack dump (once per lockup incident). The feature is otherwise non-intrusive, it doesnt try to unlock the box in any way, it only gets the debug info out, automatically, and on all CPUs affected by the lockup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de> Signed-off-by: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-23[PATCH] preempt race in getppidDavid Meybohm
With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to return a bogus value if the parent's task_struct gets reallocated after current->group_leader->real_parent is read: asmlinkage long sys_getppid(void) { int pid; struct task_struct *me = current; struct task_struct *parent; parent = me->group_leader->real_parent; RACE HERE => for (;;) { pid = parent->tgid; #ifdef CONFIG_SMP { struct task_struct *old = parent; /* * Make sure we read the pid before re-reading the * parent pointer: */ smp_rmb(); parent = me->group_leader->real_parent; if (old != parent) continue; } #endif break; } return pid; } If the process gets preempted at the indicated point, the parent process can go ahead and call exit() and then get wait()'d on to reap its task_struct. When the preempted process gets resumed, it will not do any further checks of the parent pointer on !CONFIG_SMP: it will read the bad pid and return. So, the same algorithm used when SMP is enabled should be used when preempt is enabled, which will recheck ->real_parent in this case. Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] kernel/timer: fix msleep_interruptible() commentDomen Puncer
The comment for msleep_interruptible() is wrong, as it will ignore wait-queue events, but will wake up early for signals. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] preempt_count is int - remove cast and don't assign to unsigned typeJesper Juhl
In kernel/sched.c the return value from preempt_count() is cast to an int. That made sense when preempt_count was defined as different types on is not needed and should go away. The patch removes the cast. In kernel/timer.c the return value from preempt_count() is assigned to a variable of type u32 and then that unsigned value is later compared to preempt_count(). Since preempt_count() returns an int, an int is what should be used to store its return value. Storing the result in an unsigned 32bit integer made a tiny bit of sense back when preempt_count was different types on different archs, but no more - let's not play signed vs unsigned comparison games when we don't have to. The patch modifies the code to use an int to hold the value. While I was around that bit of code I also made two changes to a nearby (related) printk() - I modified it to specify the loglevel explicitly and also broke the line into a few pieces to avoid it being longer than 80 chars and clarified the text a bit. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] timers: introduce try_to_del_timer_sync()Oleg Nesterov
This patch splits del_timer_sync() into 2 functions. The new one, try_to_del_timer_sync(), returns -1 when it hits executing timer. It can be used in interrupt context, or when the caller hold locks which can prevent completion of the timer's handler. NOTE. Currently it can't be used in interrupt context in UP case, because ->running_timer is used only with CONFIG_SMP. Should the need arise, it is possible to kill #ifdef CONFIG_SMP in set_running_timer(), it is cheap. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] timers fixes/improvementsOleg Nesterov
This patch tries to solve following problems: 1. del_timer_sync() is racy. The timer can be fired again after del_timer_sync have checked all cpus and before it will recheck timer_pending(). 2. It has scalability problems. All cpus are scanned to determine if the timer is running on that cpu. With this patch del_timer_sync is O(1) and no slower than plain del_timer(pending_timer), unless it has to actually wait for completion of the currently running timer. The only restriction is that the recurring timer should not use add_timer_on(). 3. The timers are not serialized wrt to itself. If CPU_0 does mod_timer(jiffies+1) while the timer is currently running on CPU 1, it is quite possible that local interrupt on CPU_0 will start that timer before it finished on CPU_1. 4. The timers locking is suboptimal. __mod_timer() takes 3 locks at once and still requires wmb() in del_timer/run_timers. The new implementation takes 2 locks sequentially and does not need memory barriers. Currently ->base != NULL means that the timer is pending. In that case ->base.lock is used to lock the timer. __mod_timer also takes timer->lock because ->base can be == NULL. This patch uses timer->entry.next != NULL as indication that the timer is pending. So it does __list_del(), entry->next = NULL instead of list_del() when the timer is deleted. The ->base field is used for hashed locking only, it is initialized in init_timer() which sets ->base = per_cpu(tvec_bases). When the tvec_bases.lock is locked, it means that all timers which are tied to this base via timer->base are locked, and the base itself is locked too. So __run_timers/migrate_timers can safely modify all timers which could be found on ->tvX lists (pending timers). When the timer's base is locked, and the timer removed from ->entry list (which means that _run_timers/migrate_timers can't see this timer), it is possible to set timer->base = NULL and drop the lock: the timer remains locked. This patch adds lock_timer_base() helper, which waits for ->base != NULL, locks the ->base, and checks it is still the same. __mod_timer() schedules the timer on the local CPU and changes it's base. However, it does not lock both old and new bases at once. It locks the timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base, and adds this timer. This simplifies the code, because AB-BA deadlock is not possible. __mod_timer() also ensures that the timer's base is not changed while the timer's handler is running on the old base. __run_timers(), del_timer() do not change ->base anymore, they only clear pending flag. So del_timer_sync() can test timer->base->running_timer == timer to detect whether it is running or not. We don't need timer_list->lock anymore, this patch kills it. We also don't need barriers. del_timer() and __run_timers() used smp_wmb() before clearing timer's pending flag. It was needed because __mod_timer() did not lock old_base if the timer is not pending, so __mod_timer()->list_add() could race with del_timer()->list_del(). With this patch these functions are serialized through base->lock. One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch adds global struct timer_base_s { spinlock_t lock; struct timer_list *running_timer; } __init_timer_base; which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s struct are replaced by struct timer_base_s t_base. It is indeed ugly. But this can't have scalability problems. The global __init_timer_base.lock is used only when __mod_timer() is called for the first time AND the timer was compile time initialized. After that the timer migrates to the local CPU. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] use smp_mb/wmb/rmb where possibleakpm@osdl.org
Replace a number of memory barriers with smp_ variants. This means we won't take the unnecessary hit on UP machines. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-11[PATCH] Make lots of things staticAdrian Bunk
This is a megarollup of ~60 patches which give various things static scope. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-07[PATCH] posix-timers: CPU clock support for POSIX timersRoland McGrath
POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME, not only the clock_* calls but also timer_* calls must support the thread and process CPU time clocks. This patch provides that support, building on my recent additions to support these clocks in the POSIX clock_* interfaces. This patch will not work without those changes, as well as the patch fixing the timer lock-siglock deadlock problem. The apparent pervasive changes to posix-timers.c are simply that some fields of struct k_itimer have changed name and moved into a union. This was appropriate since the data structures required for the existing real-time timer support and for the new thread/process CPU-time timers are quite different. The glibc patches to support CPU time clocks using the new kernel support is in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and that includes tests for the timer support (if you build glibc with NPTL). From: Christoph Lameter <clameter@sgi.com> Your patch breaks the mmtimer driver because it used k_itimer values for its own purposes. Here is a fix by defining an additional structure in k_itimer (same approach for mmtimer as the cpu timers): From: Roland McGrath <roland@redhat.com> Fix bug identified by Alexander Nyberg <alexn@dsv.su.se> > The problem arises from code touching the union in alloc_posix_timer() > which makes firing go non-zero. When firing is checked in > posix_cpu_timer_set() it will be positive causing an infinite loop. > > So either the below fix or preferably move the INIT_LIST_HEAD(x) from > alloc_posix_timer() to somewhere later where it doesn't disturb the other > union members. Thanks for finding this problem. The latter is what I think is the right solution. This patch does that, and also removes some superfluous rezeroing. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-07[PATCH] base-small: shrink timer hashesMatt Mackall
CONFIG_BASE_SMALL reduce timer list hashes Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-02-14[TIMER]: Export avenrun for packet scheduler meta ematch.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-01-20[PATCH] avoid sparse warning due to time-interpolatorDavid Mosberger
The "addr" member in the time-interpolator is sometimes used as a function-pointer and sometimes as an I/O-memory pointer. The attached patch tells sparse that this is OK. Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11[PATCH] cputime: introduce cputimeMartin Schwidefsky
This patch introduces the concept of (virtual) cputime. Each architecture can define its method to measure cputime. The main idea is to define a cputime_t type and a set of operations on it (see asm-generic/cputime.h). Then use the type for utime, stime, cutime, cstime, it_virt_value, it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations for each access to these variables. The default implementation is jiffies based and the effect of this patch for architectures which use the default implementation should be neglectible. There is a second type cputime64_t which is necessary for the kernel_stat cpu statistics. The default cputime_t is 32 bit and based on HZ, this will overflow after 49.7 days. This is not enough for kernel_stat (ihmo not enough for a processes too), so it is necessary to have a 64 bit type. The third thing that gets introduced by this patch is an additional field for the /proc/stat interface: cpu steal time. An architecture can account cpu steal time by calls to the account_stealtime function. The cpu which backs a virtual processor doesn't spent all of its time for the virtual cpu. To get meaningful cpu usage numbers this involuntary wait time needs to be accounted and exported to user space. From: Hugh Dickins <hugh@veritas.com> The p->signal check in account_system_time is insufficient. If the timer interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set, another cpu may release_task (NULLifying p->signal) in between account_system_time's check and check_rlimit's dereference. Nor should account_it_prof risk send_sig. But surely account_user_time is safe? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07[PATCH] Fix kernel/timer.c comment typoVasia Pupkin
Signed-off-by: Vasia Pupkin <ptushnik@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07[PATCH] Lock initializer cleanup (Core)Thomas Gleixner
Kernel core files converted to use the new lock initializers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07[PATCH] remove the BKL by turning it into a semaphoreIngo Molnar
This is the current remove-BKL patch. I test-booted it on x86 and x64, trying every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other architectures should compile as well. (most of the testing was done with the zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should work fine.) this is the debugging-enabled variant of the patch which has two main debugging features: - debug potentially illegal smp_processor_id() use. Has caught a number of real bugs - e.g. look at the printk.c fix in the patch. - make it possible to enable/disable the BKL via a .config. If this goes upstream we dont want this of course, but for now it gives people a chance to find out whether any particular problem was caused by this patch. This patch has one important fix over the previous BKL patch: on PREEMPT kernels if we preempted BKL-using code then the code still auto-dropped the BKL by mistake. This caused a number of breakages for testers, which breakages went away once this bug was fixed. Also the debugging mechanism has been improved alot relative to the previous BKL patch. Would be nice to test-drive this in -mm. There will likely be some more smp_processor_id() false positives but they are 1) harmless 2) easy to fix up. We could as well find more real smp_processor_id() related breakages as well. The most noteworthy fact is that no BKL-using code was found yet that relied on smp_processor_id(), which is promising from a compatibility POV. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04[PATCH] task_struct.exit_state usageRoland McGrath
I just did a quick audit of the use of exit_state and the EXIT_* bit macros. I guess I didn't really review these changes very closely when you did them originally. :-( I found several places that seem like lossy cases of query-replace without enough thought about the code. Linus has previously said the >= tests ought to be & tests instead. But for exit_state, it can only ever be 0, EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better. The case like in choose_new_parent is just confusing, to have the always-false test for EXIT_* bits in ->state there too. The two cases in wants_signal and do_process_times are actual regressions that will give us back old bugs in race conditions. These places had s/TASK/EXIT/ but not s/state/exit_state/, and now there tests for exiting tasks are now wrong and never catching them. I take it back: there is no regression in wants_signal in practice I think, because of the PF_EXITING test that makes the EXIT_* state checks superfluous anyway. So that is just another cosmetic case of confusing code. But in do_process_times, there is that SIGXCPU-while-exiting race condition back again. Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-21[PATCH] del_timer() vs. mod_timer() SMP raceBenjamin Herrenschmidt
We just spent some days fighting a rare race in one of the distro's who backported some of timer.c from 2.6 to 2.4 (though they missed a bit). The actual race we found didn't happen in 2.6 _but_ code inspection showed that a similar race is still present in 2.6, explanation below: Code removing a timer from a list (run_timers or del_timer) takes that CPU list lock, does list_del, then timer->base = NULL. It is mandatory that this timer->base = NULL is visible to other CPUs only after the list_del() is complete. If not, then mod timer could see it NULL, thus take it's own CPU list lock and not the one for the CPU the timer was beeing removed from the list, and thus the list_add in mod_timer() could race with the list_del() from run_timers() or del_timer(). Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the right spot in 2.6, but didn't in the "backport" we were fighting with. However, del_timer() doesn't have such a barrier, and thus is subject to this race in 2.6 as well. This patch fixes it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-31[PATCH] fix IBM cyclone clock and some cleanupChristoph Lameter
- fix broken IBM cyclone time interpolator support - add support for cyclic timers through an addition of a mask in the timer interpolator structure - Allow time_interpolator_update() and time_interpolator_get_offset() to be invoked without an active time interpolator (necessary since the cyclone clock is initialized late in ACPI processing) - remove obsolete function time_interpolator_resolution() - add a mask to all struct time_interpolator setups in the kernel - Make time interpolators work on 32bit platforms Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27[PATCH] unexport add_timer_on()Arjan van de Ven
add_timer_on() isn't used by modules (in fact it's only used ONCE, in workqueue.c) and it's not even a good api for drivers, in fact, the comment for it says * This is not very scalable on SMP. Double adds are not possible. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-24[PATCH] Fix msleep to sleep _at_least_ the requested amountBenjamin Herrenschmidt
Makes sure msleep() sleeps at least the amount provided, since schedule_timeout() doesn't guarantee a full jiffy. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18[PATCH] cleanup: move call to update_process_times.Martin Schwidefsky
For non-smp kernels the call to update_process_times is done in the do_timer function. It is more consistent with smp kernels to move this call to the architecture file which calls do_timer. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18[PATCH] fix & clean up zombie/dead task handling & preemptionIngo Molnar
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we had. Right now, the moment procfs does a down() that sleeps in proc_pid_flush() [it could] our TASK_DEAD state is zapped and we might be back to TASK_RUNNING to and we trigger this assert: schedule(); BUG(); /* Avoid "noreturn function does return". */ for (;;) ; I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state field, to allow the detaching of exit-signal/parent/wait-handling from descheduling a dead task. Dead-task freeing is done via PF_DEAD. Tested the patch on x86 SMP and UP, but all architectures should work fine. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18[PATCH] xtime value may become incorrectVladimir Grouzdev
The xtime value may become incorrect when the update_wall_time(ticks) function is called with "ticks" > 1. In such a case, the xtime variable is updated multiple times inside the loop but it is normalized only once outside of the loop. This bug was reported at: http://bugme.osdl.org/show_bug.cgi?id=3403 Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18[PATCH] add missing linux/syscalls.h includesArnd Bergmann
I found that the prototypes for sys_waitid and sys_fcntl in <linux/syscalls.h> don't match the implementation. In order to keep all prototypes in sync in the future, now include the header from each file implementing any syscall. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18[PATCH] make rlimit settings per-process instead of per-threadRoland McGrath
POSIX specifies that the limit settings provided by getrlimit/setrlimit are shared by the whole process, not specific to individual threads. This patch changes the behavior of those calls to comply with POSIX. I've moved the struct rlimit array from task_struct to signal_struct, as it has the correct sharing properties. (This reduces kernel memory usage per thread in multithreaded processes by around 100/200 bytes for 32/64 machines respectively.) I took a fairly minimal approach to the locking issues with the newly shared struct rlimit array. It turns out that all the code that is checking limits really just needs to look at one word at a time (one rlim_cur field, usually). It's only the few places like getrlimit itself (and fork), that require atomicity in accessing a whole struct rlimit, so I just used a spin lock for them and no locking for most of the checks. If it turns out that readers of struct rlimit need more atomicity where they are now cheap, or less overhead where they are now atomic (e.g. fork), then seqcount is certainly the right thing to use for them instead of readers using the spin lock. Though it's in signal_struct, I didn't use siglock since the access to rlimits never needs to disable irqs and doesn't overlap with other siglock uses. Instead of adding something new, I overloaded task_lock(task->group_leader) for this; it is used for other things that are not likely to happen simultaneously with limit tweaking. To me that seems preferable to adding a word, but it would be trivial (and arguably cleaner) to add a separate lock for these users (or e.g. just use seqlock, which adds two words but is optimal for readers). Most of the changes here are just the trivial s/->rlim/->signal->rlim/. I stumbled across what must be a long-standing bug, in reparent_to_init. It does: memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim))); when surely it was intended to be: memcpy(current->rlim, init_task.rlim, sizeof(current->rlim)); As rlim is an array, the * in the sizeof expression gets the size of the first element, so this just changes the first limit (RLIMIT_CPU). This is for kernel threads, where it's clear that resetting all the rlimits is what you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is superfluous since it will now already have been reset to RLIM_INFINITY. The other subtlety is removing: tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY; in exit_notify, which was to avoid a race signalling during self-reaping exit. As the limit is now shared, a dying thread should not change it for others. Instead, I avoid that race by checking current->state before the RLIMIT_CPU check. (Adding one new conditional in that path is now required one way or another, since if not for this check there would also be a new race with self-reaping exit later on clearing current->signal that would have to be checked for.) The one loose end left by this patch is with process accounting. do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the accounting record. I left this as it was, but it is now changing a limit that might be shared by other threads still running. I left this in a dubious state because it seems to me that processing accounting may already be more generally a dubious state when it comes to NPTL threads. I would think you would want one record per process, with aggregate data about all threads that ever lived in it, not a separate record for each thread. I don't use process accounting myself, but if anyone is interested in testing it out I could provide a patch to change it this way. One final note, this is not 100% to POSIX compliance in regards to rlimits. POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not to each individual thread. I will provide patches later on to achieve that change, assuming this patch goes in first. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-02[PATCH] msleep_interruptible(): fix whitespaceMaximilian Attems
thanks Xu for noticing, some whitespace found it's way there. clean that up. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-02[PATCH] sparc64: time interpolator build fixAndrew Morton
We need io.h for readq(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-09-16[PATCH] time interpolators logic fixChristoph Lameter
Report the resolution of the time source correctly for time interpolators with a frequency over 1 Ghz. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-09-07[PATCH] Time interpolator: Scalability enhancements and high resolution time ↵Christoph Lameter
for IA64 This has been in the ia64 (and hence -mm) trees for a couple of months. Changelog: * Affects only architectures which define CONFIG_TIME_INTERPOLATION (currently only IA64) * Genericize time interpolation, make time interpolators easily usable and provide instructions on how to use the interpolator for other architectures. * Provide nanosecond resolution for clock_gettime and an accuracy up to the time interpolator time base. * clock_getres() reports resolution of underlying time basis which is typically <50ns and may be 1ns on some systems. * Make time interpolator self-tuning to limit time jumps and to make the interpolators work correctly on systems with broken time base specifications. * SMP scalability: Make clock_gettime and gettimeofday scale O(1) by removing the cmpxchg for most clocks (tested for up to 512 CPUs) * IA64: provide asm fastcall that doubles the performance of gettimeofday and clock_gettime on SGI and other IA64 systems (asm fastcalls scale O(1) together with the scalability fixes). * IA64: provide nojitter kernel option so that IA64 systems with correctly synchronized ITC counters may also enjoy the scalability enhancements. Performance measurements for single calls (ITC cycles): A. 4 way Intel IA64 SMP system (kmart) ITC offsets: kmart:/usr/src/noship-tests # dmesg|grep synchr CPU 1: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles) CPU 2: synchronized ITC with CPU 0 (last diff 2 cycles, maxerr 417 cycles) CPU 3: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles) A.1. Current kernel code kmart:/usr/src/noship-tests # ./dmt gettimeofday cycles: 3737 220 215 215 215 215 215 215 215 215 clock_gettime(REAL) cycles: 4058 575 564 576 565 566 558 558 558 558 clock_gettime(MONO) cycles: 1583 621 609 609 609 609 609 609 609 609 clock_gettime(PROCESS) cycles: 71428 298 259 259 259 259 259 259 259 259 clock_gettime(THREAD) cycles: 3982 336 290 298 298 298 298 286 286 286 A.2 New code using cmpxchg kmart:/usr/src/noship-tests # ./dmt gettimeofday cycles: 3145 213 216 213 213 213 213 213 213 213 clock_gettime(REAL) cycles: 3185 230 210 210 210 210 210 210 210 210 clock_gettime(MONO) cycles: 284 217 217 216 216 216 216 216 216 216 clock_gettime(PROCESS) cycles: 68857 289 270 259 259 259 259 259 259 259 clock_gettime(THREAD) cycles: 3862 339 298 298 298 298 290 286 286 286 A.3 New code with cmpxchg switched off (nojitter kernel option) kmart:/usr/src/noship-tests # ./dmt gettimeofday cycles: 3195 219 219 212 212 212 212 212 212 212 clock_gettime(REAL) cycles: 3003 228 205 205 205 205 205 205 205 205 clock_gettime(MONO) cycles: 279 209 209 209 208 208 208 208 208 208 clock_gettime(PROCESS) cycles: 65849 292 259 259 268 270 270 259 259 259 B. SGI SN2 system running 512 IA64 CPUs. B.1. Current kernel code [root@ascender noship-tests]# ./dmt gettimeofday cycles: 17221 1028 1007 1004 1004 1004 1010 25928 1002 1003 clock_gettime(REAL) cycles: 10388 1099 1055 1044 1064 1063 1051 1056 1061 1056 clock_gettime(MONO) cycles: 2363 96 96 96 96 96 96 96 96 96 clock_gettime(PROCESS) cycles: 46537 804 660 666 666 666 666 666 666 666 clock_gettime(THREAD) cycles: 10945 727 710 684 685 686 685 686 685 686 B.2 New code ascender:~/noship-tests # ./dmt gettimeofday cycles: 3874 610 588 588 588 588 588 588 588 588 clock_gettime(REAL) cycles: 3893 612 588 582 588 588 588 588 588 588 clock_gettime(MONO) cycles: 686 595 595 588 588 588 588 588 588 588 clock_gettime(PROCESS) cycles: 290759 322 269 269 259 265 265 265 259 259 clock_gettime(THREAD) cycles: 5153 358 306 298 296 304 290 298 298 298 Scalability of time functions (in time it takes to do a million calls): ======================================================================= A. 4 way Intel IA SMP system (kmart) A.1 Current code kmart:/usr/src/noship-tests # ./todscale -n1000000 CPUS WALL WALL/CPUS 1 0.192 0.192 2 1.125 0.563 4 9.229 2.307 A.2 New code using cmpxchg kmart:/usr/src/noship-tests # ./todscale CPUS WALL WALL/CPUS 1 0.188 0.188 2 0.457 0.229 4 0.413 0.103 (the measurement with 4 cpus may fluctuate up to 15.x somehow) A.3 New code without cmpxchg (nojitter kernel option) kmart:/usr/src/noship-tests # ./todscale -n10000000 CPUS WALL WALL/CPUS 1 0.180 0.180 2 0.180 0.090 4 0.252 0.063 B. SGI SN2 system running 512 IA64 CPUs. The system has a global monotonic clock and therefore has no need for compensation. Current code uses a cmpxchg. New code has no cmpxchg. B.1 current code ascender:~/noship-tests # ./todscale CPUS WALL WALL/CPUS 1 0.850 0.850 2 1.767 0.884 4 6.124 1.531 8 20.777 2.597 16 57.693 3.606 32 164.688 5.146 64 456.647 7.135 128 1093.371 8.542 256 2778.257 10.853 (System crash at 512 CPUs) B.2 New code ascender:~/noship-tests # ./todscale -n1000000 CPUS WALL WALL/CPUS 1 0.426 0.426 2 0.429 0.215 4 0.436 0.109 8 0.452 0.057 16 0.454 0.028 32 0.457 0.014 64 0.459 0.007 128 0.466 0.004 256 0.474 0.002 512 0.518 0.001 Clock Accuracy ============== A. 4 CPU SMP system A.1 Old code kmart:/usr/src/noship-tests # ./cdisp Gettimeofday() = 1092124757.270305000 CLOCK_REALTIME= 1092124757.270382000 resolution= 0.000976563 CLOCK_MONOTONIC= 89.696726590 resolution= 0.000976563 CLOCK_PROCESS_CPUTIME_ID= 0.001242507 resolution= 0.000000001 CLOCK_THREAD_CPUTIME_ID= 0.001255310 resolution= 0.000000001 A.2 New code kmart:/usr/src/noship-tests # ./cdisp Gettimeofday() = 1092124478.194530000 CLOCK_REALTIME= 1092124478.194603399 resolution= 0.000000001 CLOCK_MONOTONIC= 88.198315204 resolution= 0.000000001 CLOCK_PROCESS_CPUTIME_ID= 0.001241235 resolution= 0.000000001 CLOCK_THREAD_CPUTIME_ID= 0.001254747 resolution= 0.000000001 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>