|
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.
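For illustration, a minimal sketch of the conversion pattern (my_obj and obj_list are hypothetical names; smp_wmb() compiles down to a plain compiler barrier on !CONFIG_SMP kernels):
#include <linux/list.h>

struct my_obj {
	struct list_head entry;
	int ready;
};
static LIST_HEAD(obj_list);

static void publish(struct my_obj *obj)
{
	list_add(&obj->entry, &obj_list);
	/* was: wmb(), a hardware barrier even on UP */
	smp_wmb();	/* real barrier on SMP, compiler barrier on UP */
	obj->ready = 1;
}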
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is a megarollup of ~60 patches which give various things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME,
not only the clock_* calls but also timer_* calls must support the thread and
process CPU time clocks. This patch provides that support, building on my
recent additions to support these clocks in the POSIX clock_* interfaces.
This patch will not work without those changes, as well as the patch fixing
the timer lock-siglock deadlock problem.
The apparent pervasive changes to posix-timers.c are simply that some fields
of struct k_itimer have changed name and moved into a union. This was
appropriate since the data structures required for the existing real-time
timer support and for the new thread/process CPU-time timers are quite
different.
The glibc patches to support CPU time clocks using the new kernel support are
in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and they
include tests for the timer support (if you build glibc with NPTL).
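As a hedged userspace illustration of what this enables (not part of the patch itself; assumes the glibc support above and linking with -lrt):
#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	timer_t tid;
	struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
	                        .sigev_signo = SIGALRM };
	struct itimerspec its = { .it_value = { .tv_sec = 2 } };

	/* arm a timer against the process CPU-time clock */
	if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &tid) < 0) {
		perror("timer_create");	/* kernel or glibc lacks support */
		return 1;
	}
	timer_settime(tid, 0, &its, NULL);
	for (;;)
		;	/* burn CPU; SIGALRM fires after 2s of CPU time */
}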
From: Christoph Lameter <clameter@sgi.com>
Your patch breaks the mmtimer driver because it used k_itimer values for
its own purposes. Here is a fix by defining an additional structure in
k_itimer (same approach for mmtimer as the cpu timers):
From: Roland McGrath <roland@redhat.com>
Fix bug identified by Alexander Nyberg <alexn@dsv.su.se>
> The problem arises from code touching the union in alloc_posix_timer()
> which makes firing go non-zero. When firing is checked in
> posix_cpu_timer_set() it will be positive causing an infinite loop.
>
> So either the below fix or preferably move the INIT_LIST_HEAD(x) from
> alloc_posix_timer() to somewhere later where it doesn't disturb the other
> union members.
Thanks for finding this problem. The latter is what I think is the right
solution. This patch does that, and also removes some superfluous rezeroing.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
CONFIG_BASE_SMALL: reduce the size of the timer list hashes.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The "addr" member in the time-interpolator is sometimes used as a
function-pointer and sometimes as an I/O-memory pointer. The attached
patch tells sparse that this is OK.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based and the effect of this patch for architectures which use the default
implementation should be negligible.
There is a second type cputime64_t which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ, this will
overflow after 49.7 days. This is not enough for kernel_stat (imho not
enough for a process either), so it is necessary to have a 64 bit type.
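For flavor, an abridged sketch of the default jiffies-based definitions along the lines of the asm-generic/cputime.h approach described above (not the complete header; u64 is the kernel type):
typedef unsigned long cputime_t;

#define cputime_zero		(0UL)
#define cputime_add(a, b)	((a) + (b))
#define cputime_gt(a, b)	((a) > (b))
#define jiffies_to_cputime(j)	(j)
#define cputime_to_jiffies(ct)	(ct)

/* 64 bit variant for kernel_stat: immune to the 49.7 day overflow */
typedef u64 cputime64_t;

#define cputime64_zero		(0ULL)
#define cputime64_add(a, b)	((a) + (b))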
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Vasia Pupkin <ptushnik@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Kernel core files converted to use the new lock initializers.
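The conversion pattern, sketched with a hypothetical lock name:
/* old style */
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

/* new style */
static DEFINE_SPINLOCK(my_lock);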
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is the current remove-BKL patch. I test-booted it on x86 and x64, trying
every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other
architectures should compile as well. (most of the testing was done with the
zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should
work fine.)
this is the debugging-enabled variant of the patch which has two main
debugging features:
- debug potentially illegal smp_processor_id() use. Has caught a number
of real bugs - e.g. look at the printk.c fix in the patch.
- make it possible to enable/disable the BKL via a .config. If this
goes upstream we don't want this, of course, but for now it gives
people a chance to find out whether any particular problem was caused
by this patch.
This patch has one important fix over the previous BKL patch: on PREEMPT
kernels if we preempted BKL-using code then the code still auto-dropped the
BKL by mistake. This caused a number of breakages for testers, which
went away once this bug was fixed.
Also the debugging mechanism has been improved a lot relative to the previous
BKL patch.
Would be nice to test-drive this in -mm. There will likely be some more
smp_processor_id() false positives but they are 1) harmless 2) easy to fix up.
We may also find more real smp_processor_id()-related breakages.
The most noteworthy fact is that no BKL-using code was found yet that relied
on smp_processor_id(), which is promising from a compatibility POV.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I just did a quick audit of the use of exit_state and the EXIT_* bit
macros. I guess I didn't really review these changes very closely when you
did them originally. :-(
I found several places that seem like lossy cases of query-replace without
enough thought about the code. Linus has previously said the >= tests
ought to be & tests instead. But for exit_state, it can only ever be 0,
EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as
testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better.
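For illustration, the two equivalent forms as tiny helpers (hypothetical function names):
#include <linux/sched.h>

/* explicit bit test */
static inline int task_exiting_bits(struct task_struct *p)
{
	return p->exit_state & (EXIT_ZOMBIE | EXIT_DEAD);
}

/* same result, since exit_state only ever holds 0, EXIT_ZOMBIE or EXIT_DEAD */
static inline int task_exiting_nonzero(struct task_struct *p)
{
	return p->exit_state != 0;
}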
The case like in choose_new_parent is just confusing, to have the
always-false test for EXIT_* bits in ->state there too.
The two cases in wants_signal and do_process_times are actual regressions
that will give us back old bugs in race conditions. These places had
s/TASK/EXIT/ but not s/state/exit_state/, and now their tests for exiting
tasks are wrong and never catch them. I take it back: there is no
regression in wants_signal in practice I think, because of the PF_EXITING
test that makes the EXIT_* state checks superfluous anyway. So that is
just another cosmetic case of confusing code. But in do_process_times,
there is that SIGXCPU-while-exiting race condition back again.
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We just spent some days fighting a rare race in one of the distros, which backported
some of timer.c from 2.6 to 2.4 (though they missed a bit).
The actual race we found didn't happen in 2.6 _but_ code inspection showed that a
similar race is still present in 2.6, explanation below:
Code removing a timer from a list (run_timers or del_timer) takes that CPU list
lock, does list_del, then timer->base = NULL.
It is mandatory that this timer->base = NULL is visible to other CPUs only after
the list_del() is complete. If not, then mod_timer() could see it as NULL, take its
own CPU's list lock rather than the one for the CPU the timer was being removed from,
and thus the list_add() in mod_timer() could race with the list_del() from
run_timers() or del_timer().
Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the
right spot in 2.6, but didn't in the "backport" we were fighting with.
However, del_timer() doesn't have such a barrier, and thus is subject to this race in
2.6 as well. This patch fixes it.
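A sketch of the ordering del_timer() needs (simplified: type names approximate the 2.6 timer code, and the real function also re-checks timer->base after taking the lock):
static int del_timer_sketch(struct timer_list *timer)
{
	tvec_base_t *base = timer->base;
	unsigned long flags;

	if (!base)
		return 0;
	spin_lock_irqsave(&base->lock, flags);
	list_del(&timer->entry);
	smp_wmb();		/* make the list_del visible first... */
	timer->base = NULL;	/* ...before anyone can see base == NULL */
	spin_unlock_irqrestore(&base->lock, flags);
	return 1;
}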
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
- fix broken IBM cyclone time interpolator support
- add support for cyclic timers through an addition of a mask
in the timer interpolator structure
- Allow time_interpolator_update() and time_interpolator_get_offset()
to be invoked without an active time interpolator
(necessary since the cyclone clock is initialized late in ACPI
processing)
- remove obsolete function time_interpolator_resolution()
- add a mask to all struct time_interpolator setups in the
kernel
- Make time interpolators work on 32bit platforms
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
add_timer_on() isn't used by modules (in fact it's only used ONCE, in
workqueue.c), and it's not even a good API for drivers; in fact, its comment says:
* This is not very scalable on SMP. Double adds are not possible.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Makes sure msleep() sleeps at least the amount provided, since
schedule_timeout() doesn't guarantee a full jiffy.
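A sketch of the resulting shape (close to, though not necessarily identical to, the actual msleep()):
void msleep(unsigned int msecs)
{
	/* +1 jiffy: the current tick is already partially elapsed, so
	 * schedule_timeout() alone can return up to one jiffy early */
	unsigned long timeout = msecs_to_jiffies(msecs) + 1;

	while (timeout) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		timeout = schedule_timeout(timeout);
	}
}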
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
For non-smp kernels the call to update_process_times is done in the
do_timer function. It is more consistent with smp kernels to move this
call to the architecture file which calls do_timer.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had. Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and then we trigger this assert:
	schedule();
	BUG();
	/* Avoid "noreturn function does return". */
	for (;;) ;
I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task. Dead-task freeing is done via PF_DEAD.
Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The xtime value may become incorrect when the update_wall_time(ticks)
function is called with "ticks" > 1. In such a case, the xtime variable is
updated multiple times inside the loop but it is normalized only once
outside of the loop.
This bug was reported at:
http://bugme.osdl.org/show_bug.cgi?id=3403
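The shape of the fix, sketched (names per 2.6 kernel/timer.c; not the verbatim hunk): normalize xtime inside the loop.
static void update_wall_time(unsigned long ticks)
{
	do {
		ticks--;
		update_wall_time_one_tick();
		/* normalize every iteration, not once after the loop */
		if (xtime.tv_nsec >= 1000000000) {
			xtime.tv_nsec -= 1000000000;
			xtime.tv_sec++;
			second_overflow();
		}
	} while (ticks);
}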
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32/64
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
	memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
	memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
The other subtlety is removing:
	tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be in a generally dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note: this is not 100% POSIX compliance with regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Thanks to Xu for noticing: some whitespace found its way in there.
Clean that up.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We need io.h for readq().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Report the resolution of the time source correctly for time interpolators
with a frequency over 1 GHz.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
for IA64
This has been in the ia64 (and hence -mm) trees for a couple of months.
Changelog:
* Affects only architectures which define CONFIG_TIME_INTERPOLATION
(currently only IA64)
* Genericize time interpolation, make time interpolators easily usable
and provide instructions on how to use the interpolator for other
architectures.
* Provide nanosecond resolution for clock_gettime and an accuracy
up to the time interpolator time base.
* clock_getres() reports resolution of underlying time basis which
is typically <50ns and may be 1ns on some systems.
* Make time interpolator self-tuning to limit time jumps
and to make the interpolators work correctly on systems with
broken time base specifications.
* SMP scalability: Make clock_gettime and gettimeofday scale O(1)
by removing the cmpxchg for most clocks (tested for up to 512 CPUs)
* IA64: provide asm fastcall that doubles the performance
of gettimeofday and clock_gettime on SGI and other IA64 systems
(asm fastcalls scale O(1) together with the scalability fixes).
* IA64: provide nojitter kernel option so that IA64 systems with
correctly synchronized ITC counters may also enjoy the
scalability enhancements.
Performance measurements for single calls (ITC cycles):
A. 4 way Intel IA64 SMP system (kmart)
ITC offsets:
kmart:/usr/src/noship-tests # dmesg|grep synchr
CPU 1: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles)
CPU 2: synchronized ITC with CPU 0 (last diff 2 cycles, maxerr 417 cycles)
CPU 3: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles)
A.1. Current kernel code
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3737 220 215 215 215 215 215 215 215 215
clock_gettime(REAL) cycles: 4058 575 564 576 565 566 558 558 558 558
clock_gettime(MONO) cycles: 1583 621 609 609 609 609 609 609 609 609
clock_gettime(PROCESS) cycles: 71428 298 259 259 259 259 259 259 259 259
clock_gettime(THREAD) cycles: 3982 336 290 298 298 298 298 286 286 286
A.2 New code using cmpxchg
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3145 213 216 213 213 213 213 213 213 213
clock_gettime(REAL) cycles: 3185 230 210 210 210 210 210 210 210 210
clock_gettime(MONO) cycles: 284 217 217 216 216 216 216 216 216 216
clock_gettime(PROCESS) cycles: 68857 289 270 259 259 259 259 259 259 259
clock_gettime(THREAD) cycles: 3862 339 298 298 298 298 290 286 286 286
A.3 New code with cmpxchg switched off (nojitter kernel option)
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3195 219 219 212 212 212 212 212 212 212
clock_gettime(REAL) cycles: 3003 228 205 205 205 205 205 205 205 205
clock_gettime(MONO) cycles: 279 209 209 209 208 208 208 208 208 208
clock_gettime(PROCESS) cycles: 65849 292 259 259 268 270 270 259 259 259
B. SGI SN2 system running 512 IA64 CPUs.
B.1. Current kernel code
[root@ascender noship-tests]# ./dmt
gettimeofday cycles: 17221 1028 1007 1004 1004 1004 1010 25928 1002 1003
clock_gettime(REAL) cycles: 10388 1099 1055 1044 1064 1063 1051 1056 1061 1056
clock_gettime(MONO) cycles: 2363 96 96 96 96 96 96 96 96 96
clock_gettime(PROCESS) cycles: 46537 804 660 666 666 666 666 666 666 666
clock_gettime(THREAD) cycles: 10945 727 710 684 685 686 685 686 685 686
B.2 New code
ascender:~/noship-tests # ./dmt
gettimeofday cycles: 3874 610 588 588 588 588 588 588 588 588
clock_gettime(REAL) cycles: 3893 612 588 582 588 588 588 588 588 588
clock_gettime(MONO) cycles: 686 595 595 588 588 588 588 588 588 588
clock_gettime(PROCESS) cycles: 290759 322 269 269 259 265 265 265 259 259
clock_gettime(THREAD) cycles: 5153 358 306 298 296 304 290 298 298 298
Scalability of time functions (in time it takes to do a million calls):
=======================================================================
A. 4 way Intel IA64 SMP system (kmart)
A.1 Current code
kmart:/usr/src/noship-tests # ./todscale -n1000000
CPUS WALL WALL/CPUS
1 0.192 0.192
2 1.125 0.563
4 9.229 2.307
A.2 New code using cmpxchg
kmart:/usr/src/noship-tests # ./todscale
CPUS WALL WALL/CPUS
1 0.188 0.188
2 0.457 0.229
4 0.413 0.103
(the measurement with 4 cpus may fluctuate up to 15.x somehow)
A.3 New code without cmpxchg (nojitter kernel option)
kmart:/usr/src/noship-tests # ./todscale -n10000000
CPUS WALL WALL/CPUS
1 0.180 0.180
2 0.180 0.090
4 0.252 0.063
B. SGI SN2 system running 512 IA64 CPUs.
The system has a global monotonic clock and therefore has
no need for compensation. Current code uses a cmpxchg. New
code has no cmpxchg.
B.1 current code
ascender:~/noship-tests # ./todscale
CPUS WALL WALL/CPUS
1 0.850 0.850
2 1.767 0.884
4 6.124 1.531
8 20.777 2.597
16 57.693 3.606
32 164.688 5.146
64 456.647 7.135
128 1093.371 8.542
256 2778.257 10.853
(System crash at 512 CPUs)
B.2 New code
ascender:~/noship-tests # ./todscale -n1000000
CPUS WALL WALL/CPUS
1 0.426 0.426
2 0.429 0.215
4 0.436 0.109
8 0.452 0.057
16 0.454 0.028
32 0.457 0.014
64 0.459 0.007
128 0.466 0.004
256 0.474 0.002
512 0.518 0.001
Clock Accuracy
==============
A. 4 CPU SMP system
A.1 Old code
kmart:/usr/src/noship-tests # ./cdisp
Gettimeofday() = 1092124757.270305000
CLOCK_REALTIME= 1092124757.270382000 resolution= 0.000976563
CLOCK_MONOTONIC= 89.696726590 resolution= 0.000976563
CLOCK_PROCESS_CPUTIME_ID= 0.001242507 resolution= 0.000000001
CLOCK_THREAD_CPUTIME_ID= 0.001255310 resolution= 0.000000001
A.2 New code
kmart:/usr/src/noship-tests # ./cdisp
Gettimeofday() = 1092124478.194530000
CLOCK_REALTIME= 1092124478.194603399 resolution= 0.000000001
CLOCK_MONOTONIC= 88.198315204 resolution= 0.000000001
CLOCK_PROCESS_CPUTIME_ID= 0.001241235 resolution= 0.000000001
CLOCK_THREAD_CPUTIME_ID= 0.001254747 resolution= 0.000000001
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Maximilian Attems <janitor@sternwelten.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a minor problem in the start_kernel function. start_kernel will
enable interrupts after calling profile_init. However, before that, the
time_init function on the IA64 platform could already enable interrupts.
See this call sequence:
start_kernel
  -> time_init
    -> ia64_init_itm
      -> register_time_interpolator
        -> write_seqlock_irq
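write_seqlock_irq() ends by unconditionally re-enabling interrupts, which is what turns them on too early here. A hedged sketch of the usual remedy (whether or not the actual patch takes exactly this route) is the state-preserving variant:
	unsigned long flags;

	write_seqlock_irqsave(&xtime_lock, flags);
	/* ... install the time interpolator ... */
	write_sequnlock_irqrestore(&xtime_lock, flags);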
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Yao Jun <junx.yao@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Anton prompted me to get this patch merged. It changes the core buffer
sync algorithm of OProfile to avoid global locks wherever possible. Anton
tested an earlier version of this patch with some success. I've lightly
tested this applied against 2.6.8.1-mm3 on my two-way machine.
The changes also have the happy side-effect of losing fewer samples after
munmap operations, and removing the blind spot of tasks exiting inside the
kernel.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This issue was discussed on lkml and linux-ia64. The patch introduces
"getnstimeofday" and removes all the code scaling gettimeofday to
nanoseconds. It makes it possible for the posix-timer functions to return
higher accuracy.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a longstanding off-by-one error that results from an incorrect
comparison when checking whether a process has consumed CPU time in
excess of its RLIMIT_CPU limits.
This means, for example, that if we use setrlimit() to set the soft CPU
limit (rlim_cur) to 5 seconds and the hard limit (rlim_max) to 10 seconds,
then the process only receives a SIGXCPU signal after consuming 6 seconds
of CPU time, and, if it continues consuming CPU after handling that
signal, only receives SIGKILL after consuming 11 seconds of CPU time.
The fix is trivial.
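The shape of that trivial fix, sketched with illustrative field paths:
	if (psecs >= p->rlim[RLIMIT_CPU].rlim_max)	/* was: > */
		send_sig(SIGKILL, p, 1);		/* hard limit */
	else if (psecs >= p->rlim[RLIMIT_CPU].rlim_cur)	/* was: > */
		send_sig(SIGXCPU, p, 1);		/* soft limit */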
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Below is a patch that tries to sanitize the dropping of unneeded system-call
stubs in generic code. In some instances, it would be possible to move the
optional system-call stubs into a library routine which would avoid the need
for #ifdefs, but in many cases, doing so would require making several
functions global (and possibly exporting additional data-structures in
header-files). Furthermore, it would inhibit (automatic) inlining in the
cases where the stubs are needed. For these reasons, the patch
keeps the #ifdef-approach.
This has been tested on ia64 and there were no objections from the
arch-maintainers (and one positive response). The patch should be safe but
arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo
macros should be removed for their architecture (I'm quite sure that's the
case, but I wanted to play it safe and only preserved the status-quo in that
regard).
|
|
|
|
From: Geoff Gustafson <geoff@linux.jf.intel.com>,
"Chen, Kenneth W" <kenneth.w.chen@intel.com>,
Ingo Molnar <mingo@elte.hu>,
me.
The big-SMP guys are seeing high CPU load due to del_timer_sync()'s
inefficiencies. The callers are fs/aio.c and schedule_timeout().
We note that neither of these callers' timer handlers actually re-add the
timer - they are single-shot.
So we don't need all that complexity in del_timer_sync() - we can just run
del_timer() and if that worked we know the timer is dead.
Add del_single_shot_timer(), export it to modules and use it in AIO and
schedule_timeout().
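A sketch of the resulting helper (mainline eventually carried this idea as del_singleshot_timer_sync(); details approximate):
int del_singleshot_timer_sync(struct timer_list *timer)
{
	int ret = del_timer(timer);

	if (!ret) {
		/* not pending: the handler may be running right now, but
		 * a single-shot handler never re-arms, so one sync wait
		 * is enough */
		ret = del_timer_sync(timer);
		BUG_ON(ret);	/* it must not have re-added itself */
	}
	return ret;
}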
(these numbers are for an earlier patch, but they'll be close)
                    32p       4p
Before:
  Warm cache     29,000      505
  Cold cache     37,800     1220
After:
  Warm cache         95       88
  Cold cache      1,800      140
[Measurements are CPU cycles spent in a call to del_timer_sync, the average
of 1000 calls. 32p is 16-node NUMA, 4p is SMP.]
(I cleaned up a few things and added some commentary)
|
|
From: Paul Jackson <pj@sgi.com>
With a hotplug capable kernel, there is a requirement to distinguish a
possible CPU from one actually present. The set of possible CPU numbers
doesn't change during a single system boot, but the set of present CPUs
changes as CPUs are physically inserted into or removed from a system. The
cpu_possible_map does not change once initialized at boot, but the
cpu_present_map changes dynamically as CPUs are inserted or removed.
Paul Jackson <pj@sgi.com> provided an expanded explanation:
Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following
cpu maps being available. All the following maps are fixed size bitmaps of
size NR_CPUS.
#ifdef CONFIG_HOTPLUG_CPU
cpu_possible_map - map with all NR_CPUS bits set
cpu_present_map - map with bit 'cpu' set iff cpu is populated
cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
#else
cpu_possible_map - map with bit 'cpu' set iff cpu is populated
cpu_present_map - copy of cpu_possible_map
cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
#endif
In either case, NR_CPUS is fixed at compile time, as the static size of these
bitmaps. The cpu_possible_map is fixed at boot time, as the set of CPU id's
that it is possible might ever be plugged in at anytime during the life of
that system boot. The cpu_present_map is dynamic(*), representing which CPUs
are currently plugged in. And cpu_online_map is the dynamic subset of
cpu_present_map, indicating those CPUs available for scheduling.
If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS
bits set, otherwise it is just the set of CPUs that ACPI reports present at
boot.
If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on
what ACPI reports as currently plugged in, otherwise cpu_present_map is just a
copy of cpu_possible_map.
(*) Well, cpu_present_map is dynamic in the hotplug case. If not hotplug,
it's the same as cpu_possible_map, hence fixed at boot.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch adds a system control that allows switching off the jiffies timer
interrupt while a cpu sleeps in idle. This is useful for a system running
with virtual cpus under z/VM.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This addresses the issue with get_wchan() that the various functions acting
as scheduling-related primitives are not, in fact, contiguous in the text
segment. It creates an ELF section for scheduling primitives to be placed
in, and places currently-detected (i.e. skipped during stack decoding)
scheduling primitives and others like io_schedule() and down(), which are
currently missed by get_wchan() code, into this section also.
The net effects are more reliability of get_wchan()'s results and the new
ability, made use of by this code, to arbitrarily place scheduling
primitives in the source code without disturbing get_wchan()'s accuracy.
Suggestions by Arnd Bergmann and Matthew Wilcox regarding reducing the
invasiveness of the patch were incorporated during prior rounds of review.
I've at least tried to sweep all arches in this patch.
|
|
Every pointer in <syscalls.h> had better be a user
pointer. Also add some others that a quick sanity check
picked up on.
|
|
Various files keep per-cpu caches which need to be freed/moved when a
CPU goes down. All under CONFIG_HOTPLUG_CPU ifdefs.
scsi.c: drain dead cpu's scsi_done_q onto this cpu.
buffer.c: brelse the bh_lrus queue for dead cpu.
timer.c: migrate timers from dead cpu, being careful of lock order vs
__mod_timer.
radix_tree.c: free dead cpu's radix_tree_preloads
page_alloc.c: empty dead cpu's nr_pagecache_local into nr_pagecache, and
free pages on cpu's local cache.
slab.c: stop reap_timer for dead cpu, adjust each cache's free limit, and
free each slab cache's per-cpu block.
swap.c: drain dead cpu's lru_add_pvecs into ours, and empty its committed_space
counter into global counter.
dev.c: drain device queues from dead cpu into this one.
flow.c: drain dead cpu's flow cache.
|
|
From: john stultz <johnstul@us.ibm.com>
In developing the ia64-cyclone patch, which implements a cyclone based time
interpolator, I found the following bug which could cause time
inconsistencies.
In update_wall_time_one_tick(), which is called each timer interrupt, we
call time_interpolator_update(delta_nsec) where delta_nsec is approximately
NSEC_PER_SEC/HZ. This directly correlates with the changes to xtime which
occurs in update_wall_time_one_tick().
However in update_wall_time(), on a second overflow, we again call
time_interpolator_update(NSEC_PER_SEC). However, while the components of
xtime are being changed, the overall value of xtime does not change (nsec is
decremented by NSEC_PER_SEC and sec is incremented). Thus this call to
time_interpolator_update is incorrect.
This patch removes the incorrect call to time_interpolator_update and was
found to resolve the time inconsistencies I had seen while developing the
ia64-cyclone patch.
|
|
From: Gerd Knorr <kraxel@suse.de>
Current gccs error out if a function's declaration and definition disagree
about the register passing convention.
The patch adds a new `fastcall' declaration primitive, and uses that in all
the FASTCALL functions which we could find. A number of inconsistencies were
fixed up along the way.
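The idea, sketched for i386 (other architectures would define fastcall empty):
/* i386: pass up to three arguments in registers */
#define fastcall __attribute__((regparm(3)))

/* the convention must appear in the declaration ... */
fastcall unsigned long demo(unsigned long arg);

/* ... and identically in the definition, or current gcc errors out */
fastcall unsigned long demo(unsigned long arg)
{
	return arg + 1;
}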
|
|
From: Kurt Garloff <garloff@suse.de>
When calling
	alarm(1); alarm(0);
the second alarm can wrongly return 2. This makes an LSB test fail.
It happens due to rounding errors in the timeval-to-jiffies conversion and
back. On i386 with HZ == 1000 there should be no need for rounding
errors IMVHO, but they occur even there. On HZ=1024 platforms, they may
even be unavoidable.
The attached patch fixes the return value of alarm().
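A hypothetical round trip showing how the rounding inflates the result (helper names invented for illustration):
#include <stdio.h>

#define HZ 1000

/* arming rounds up (+1 jiffy of slack) so we sleep at least as
 * long as requested */
static unsigned long secs_to_jiffies_arm(unsigned long s)
{
	return s * HZ + 1;
}

/* reading the remainder back with another round-up re-inflates it:
 * 1001 jiffies -> 2 seconds */
static unsigned long jiffies_to_secs_roundup(unsigned long j)
{
	return (j + HZ - 1) / HZ;
}

int main(void)
{
	unsigned long j = secs_to_jiffies_arm(1);
	printf("alarm(1); alarm(0) reports %lu\n", jiffies_to_secs_roundup(j));
	return 0;
}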
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Some places use cpu_online() where they should be using cpu_possible, most
commonly for tallying statistics. This makes no difference without hotplug
CPU.
Use the for_each_cpu() macro in those places, providing good examples (and
making the external hotplug CPU patch smaller).
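A hedged sketch of the tallying pattern (event_count is hypothetical; for_each_cpu() of this era iterates cpu_possible_map and was later renamed for_each_possible_cpu()):
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, event_count);

static unsigned long total_events(void)
{
	unsigned long sum = 0;
	int cpu;

	/* possible CPUs, so counts from currently-offline CPUs survive */
	for_each_cpu(cpu)
		sum += per_cpu(event_count, cpu);
	return sum;
}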
|
|
The latter has buggy restart functionality and is a lot
more complicated anyway.
|
|
From: Stephen Hemminger <shemminger@osdl.org>
The following will prevent adjtime from causing time regression. It delays
starting the adjtime mechanism for one tick, and keeps gettimeofday inside
the window.
Only fixes i386, but changes to other arch would be similar.
Running a simple clock test program and playing with adjtime demonstrates
that this fixes the problem (and 2.6.0-test6 is broken). But given the
fragile nature of the timer code, it should go through some more testing
before inclusion.
|
|
flag (and order it on SMP), so that del_timer_sync() always sees the
timer either pending or running if it is active.
|
|
Cset exclude: mingo@elte.hu[torvalds]|ChangeSet|20031012025453|05000
|
|
This fixes two del_timer_sync() races that are still in the timer code.
The first race was actually triggered in a 2.4 backport of the 2.6 timer
code. The second race was never triggered - it is mostly theoretical on
a standalone kernel. (It's more likely in any virtualized or otherwise
preemptable environment.)
Both races happen when self-rearming timers are used. One mainstream
example is kernel/itimer.c. The effect of the races is that
del_timer_sync() lets a timer keep running instead of synchronizing with it,
causing logic bugs (and crashes) in the affected kernel code. One
typical incarnation of the race is a double add_timer().
race #1:
this code in __run_timers() is running on CPU0:
	list_del(&timer->entry);
	timer->base = NULL;
	[*]
	set_running_timer(base, timer);
	spin_unlock_irq(&base->lock);
	[**]
	fn(data);
	spin_lock_irq(&base->lock);
CPU0 gets stuck at the [*] code-point briefly - after the timer->base has
been set to NULL, but before the base->running_timer pointer has been set
up. This is a fundamentally volatile scenario, as there's _zero_ knowledge
in the data structures that this timer is about to be executed!
Now CPU1 comes along and calls del_timer_sync(). It will find nothing -
neither timer->base nor base->running_timer will cause it to synchronize.
It will return and report that the timer has been deleted - shortly
afterwards CPU1 continues to execute the timer fn, which will cause
crashes.
This particular race is easy to fix by reordering the timer->base
clearing with set_running_timer(), and putting a wmb() between them, but
there are more races:
race #2
The checking of del_timer_sync() for 'pending or running timer' is
fundamentally fragile. E.g. if CPU0 gets stuck at the [***] point below:
		base = &per_cpu(tvec_bases, i);
		if (base->running_timer == timer) {
			while (base->running_timer == timer) {
				cpu_relax();
				preempt_check_resched();
			}
			[***]
			break;
		}
	}
	smp_rmb();
	if (timer_pending(timer))
		goto del_again;
then del_timer_sync() has already decided that this timer is not running
(we just finished loop-waiting for it), but we have not done the
timer_pending() check yet.
If the timer has re-armed itself, and if the timer expires on CPU1 (this
needs a long delay on CPU0 but that's not hard to achieve eg. in UML or
with kernel preemption enabled), then CPU1 could start to expire the
timer and get to the [**] point in __run_timers (see above), then CPU1
gets stalled and CPU0 is unstalled, then the timer_pending() check in
del_timer_sync() will not notice the running timer, and del_timer_sync()
returns - while CPU1 is just about to run the timer!
Fixing this second race is hard - it involves a heavy race-check
operation that has to lock all bases, and has to re-check the
base->running_timer value, and timer_pending condition atomically.
This fix also fixes the first race, due to forcing del_timer_sync() to
always observe the timer state atomically, so the [*] code point will
always synchronize with del_timer_sync().
The patch is ugly but safe, and it has fixed the crashes in the 2.4
backport. I tested the patch on 2.6.0-test7 with some heavy itimer use
and it works fine. Removing self-arming timers safely is the sole
purpose of del_timer_sync(), so there's no way around this overhead, I
think. I believe we should ultimately fix all major del_timer_sync()
users to not use self-arming timers - having del_timer_sync() in the
thread-exit path is now a considerable source of SMP overhead. But this
is out of the scope of current 2.6 fixes of course, and we have to
support self-arming timers as well.
|
|
|
|
From Tejun's posting:
>
> This patch fixes a race between del_timer_sync and recursive timers.
> Current implementation allows the value of timer->base that is used
> for timer_pending test to be fetched before finishing running_timer
> test, so it's possible for a recursive time to be pending after
> del_timer_sync. Adding smp_rmb before timer_pending removes the race.
|
|
Patch from Julie DeWandel.
This patch has solved the crashes observed during TPC-C runs on the
16-way box. (I'm confident it will fix the other reported cases as
well.)
The race is the setting of timer->base to NULL, by del_timer() or
__run_timers(). If new_base == old_base in __mod_timer() then we do not
re-check timer->base after getting the lock. (the only case where we do
not have to re-check the base is in the !old_base case, but the else
branch also includes the old_base==new_base case.)
The __run_timers() case made the lock_timer() patch not work fully - we
cannot use lock_timer() in __run_timers() due to lock ordering.
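The shape of the repair in __mod_timer(), sketched (simplified; the real locking is more involved):
	spin_lock_irqsave(&new_base->lock, flags);
	if (timer->base != old_base) {
		/* del_timer() or __run_timers() changed timer->base
		 * before we took the lock: drop it and retry */
		spin_unlock_irqrestore(&new_base->lock, flags);
		goto repeat;
	}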
|