|
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.
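For illustration, a minimal sketch of the conversion pattern (my_obj and obj_list are hypothetical names; smp_wmb() compiles down to a plain compiler barrier on !CONFIG_SMP kernels):
#include <linux/list.h>

struct my_obj {
	struct list_head entry;
	int ready;
};
static LIST_HEAD(obj_list);

static void publish(struct my_obj *obj)
{
	list_add(&obj->entry, &obj_list);
	/* was: wmb(), a hardware barrier even on UP */
	smp_wmb();	/* real barrier on SMP, compiler barrier on UP */
	obj->ready = 1;
}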
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is a megarollup of ~60 patches which give various things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME,
not only the clock_* calls but also timer_* calls must support the thread and
process CPU time clocks. This patch provides that support, building on my
recent additions to support these clocks in the POSIX clock_* interfaces.
This patch will not work without those changes, as well as the patch fixing
the timer lock-siglock deadlock problem.
The apparent pervasive changes to posix-timers.c are simply that some fields
of struct k_itimer have changed name and moved into a union. This was
appropriate since the data structures required for the existing real-time
timer support and for the new thread/process CPU-time timers are quite
different.
The glibc patches to support CPU time clocks using the new kernel support are
in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and they
include tests for the timer support (if you build glibc with NPTL).
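As a hedged userspace illustration of what this enables (not part of the patch itself; assumes the glibc support above and linking with -lrt):
#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	timer_t tid;
	struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
	                        .sigev_signo = SIGALRM };
	struct itimerspec its = { .it_value = { .tv_sec = 2 } };

	/* arm a timer against the process CPU-time clock */
	if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &tid) < 0) {
		perror("timer_create");	/* kernel or glibc lacks support */
		return 1;
	}
	timer_settime(tid, 0, &its, NULL);
	for (;;)
		;	/* burn CPU; SIGALRM fires after 2s of CPU time */
}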
From: Christoph Lameter <clameter@sgi.com>
Your patch breaks the mmtimer driver because it used k_itimer values for
its own purposes. Here is a fix by defining an additional structure in
k_itimer (same approach for mmtimer as the cpu timers):
From: Roland McGrath <roland@redhat.com>
Fix bug identified by Alexander Nyberg <alexn@dsv.su.se>
> The problem arises from code touching the union in alloc_posix_timer()
> which makes firing go non-zero. When firing is checked in
> posix_cpu_timer_set() it will be positive causing an infinite loop.
>
> So either the below fix or preferably move the INIT_LIST_HEAD(x) from
> alloc_posix_timer() to somewhere later where it doesn't disturb the other
> union members.
Thanks for finding this problem. The latter is what I think is the right
solution. This patch does that, and also removes some superfluous rezeroing.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
CONFIG_BASE_SMALL: reduce the size of the timer list hashes.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The "addr" member in the time-interpolator is sometimes used as a
function-pointer and sometimes as an I/O-memory pointer. The attached
patch tells sparse that this is OK.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based and the effect of this patch for architectures which use the default
implementation should be negligible.
There is a second type cputime64_t which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ, this will
overflow after 49.7 days. This is not enough for kernel_stat (imho not
enough for a process either), so it is necessary to have a 64 bit type.
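For flavor, an abridged sketch of the default jiffies-based definitions along the lines of the asm-generic/cputime.h approach described above (not the complete header; u64 is the kernel type):
typedef unsigned long cputime_t;

#define cputime_zero		(0UL)
#define cputime_add(a, b)	((a) + (b))
#define cputime_gt(a, b)	((a) > (b))
#define jiffies_to_cputime(j)	(j)
#define cputime_to_jiffies(ct)	(ct)

/* 64 bit variant for kernel_stat: immune to the 49.7 day overflow */
typedef u64 cputime64_t;

#define cputime64_zero		(0ULL)
#define cputime64_add(a, b)	((a) + (b))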
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Vasia Pupkin <ptushnik@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Kernel core files converted to use the new lock initializers.
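The conversion pattern, sketched with a hypothetical lock name:
/* old style */
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

/* new style */
static DEFINE_SPINLOCK(my_lock);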
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is the current remove-BKL patch. I test-booted it on x86 and x64, trying
every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other
architectures should compile as well. (most of the testing was done with the
zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should
work fine.)
this is the debugging-enabled variant of the patch which has two main
debugging features:
- debug potentially illegal smp_processor_id() use. Has caught a number
of real bugs - e.g. look at the printk.c fix in the patch.
- make it possible to enable/disable the BKL via a .config. If this
goes upstream we don't want this, of course, but for now it gives
people a chance to find out whether any particular problem was caused
by this patch.
This patch has one important fix over the previous BKL patch: on PREEMPT
kernels if we preempted BKL-using code then the code still auto-dropped the
BKL by mistake. This caused a number of breakages for testers, which
went away once this bug was fixed.
Also the debugging mechanism has been improved a lot relative to the previous
BKL patch.
Would be nice to test-drive this in -mm. There will likely be some more
smp_processor_id() false positives but they are 1) harmless 2) easy to fix up.
We may also find more real smp_processor_id()-related breakages.
The most noteworthy fact is that no BKL-using code was found yet that relied
on smp_processor_id(), which is promising from a compatibility POV.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I just did a quick audit of the use of exit_state and the EXIT_* bit
macros. I guess I didn't really review these changes very closely when you
did them originally. :-(
I found several places that seem like lossy cases of query-replace without
enough thought about the code. Linus has previously said the >= tests
ought to be & tests instead. But for exit_state, it can only ever be 0,
EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as
testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better.
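For illustration, the two equivalent forms as tiny helpers (hypothetical function names):
#include <linux/sched.h>

/* explicit bit test */
static inline int task_exiting_bits(struct task_struct *p)
{
	return p->exit_state & (EXIT_ZOMBIE | EXIT_DEAD);
}

/* same result, since exit_state only ever holds 0, EXIT_ZOMBIE or EXIT_DEAD */
static inline int task_exiting_nonzero(struct task_struct *p)
{
	return p->exit_state != 0;
}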
The case like in choose_new_parent is just confusing, to have the
always-false test for EXIT_* bits in ->state there too.
The two cases in wants_signal and do_process_times are actual regressions
that will give us back old bugs in race conditions. These places had
s/TASK/EXIT/ but not s/state/exit_state/, and now their tests for exiting
tasks are wrong and never catch them. I take it back: there is no
regression in wants_signal in practice I think, because of the PF_EXITING
test that makes the EXIT_* state checks superfluous anyway. So that is
just another cosmetic case of confusing code. But in do_process_times,
there is that SIGXCPU-while-exiting race condition back again.
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We just spent some days fighting a rare race in one of the distros, which backported
some of timer.c from 2.6 to 2.4 (though they missed a bit).
The actual race we found didn't happen in 2.6 _but_ code inspection showed that a
similar race is still present in 2.6, explanation below:
Code removing a timer from a list (run_timers or del_timer) takes that CPU list
lock, does list_del, then timer->base = NULL.
It is mandatory that this timer->base = NULL is visible to other CPUs only after
the list_del() is complete. If not, then mod_timer() could see it as NULL, take its
own CPU's list lock rather than the one for the CPU the timer was being removed from,
and thus the list_add() in mod_timer() could race with the list_del() from
run_timers() or del_timer().
Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the
right spot in 2.6, but didn't in the "backport" we were fighting with.
However, del_timer() doesn't have such a barrier, and thus is subject to this race in
2.6 as well. This patch fixes it.
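A sketch of the ordering del_timer() needs (simplified: type names approximate the 2.6 timer code, and the real function also re-checks timer->base after taking the lock):
static int del_timer_sketch(struct timer_list *timer)
{
	tvec_base_t *base = timer->base;
	unsigned long flags;

	if (!base)
		return 0;
	spin_lock_irqsave(&base->lock, flags);
	list_del(&timer->entry);
	smp_wmb();		/* make the list_del visible first... */
	timer->base = NULL;	/* ...before anyone can see base == NULL */
	spin_unlock_irqrestore(&base->lock, flags);
	return 1;
}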
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
- fix broken IBM cyclone time interpolator support
- add support for cyclic timers through an addition of a mask
in the timer interpolator structure
- Allow time_interpolator_update() and time_interpolator_get_offset()
to be invoked without an active time interpolator
(necessary since the cyclone clock is initialized late in ACPI
processing)
- remove obsolete function time_interpolator_resolution()
- add a mask to all struct time_interpolator setups in the
kernel
- Make time interpolators work on 32bit platforms
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
add_timer_on() isn't used by modules (in fact it's only used ONCE, in
workqueue.c), and it's not even a good API for drivers; in fact, its comment says:
* This is not very scalable on SMP. Double adds are not possible.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Makes sure msleep() sleeps at least the amount provided, since
schedule_timeout() doesn't guarantee a full jiffy.
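A sketch of the resulting shape (close to, though not necessarily identical to, the actual msleep()):
void msleep(unsigned int msecs)
{
	/* +1 jiffy: the current tick is already partially elapsed, so
	 * schedule_timeout() alone can return up to one jiffy early */
	unsigned long timeout = msecs_to_jiffies(msecs) + 1;

	while (timeout) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		timeout = schedule_timeout(timeout);
	}
}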
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
For non-smp kernels the call to update_process_times is done in the
do_timer function. It is more consistent with smp kernels to move this
call to the architecture file which calls do_timer.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had. Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and then we trigger this assert:
	schedule();
	BUG();
	/* Avoid "noreturn function does return". */
	for (;;) ;
I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task. Dead-task freeing is done via PF_DEAD.
Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The xtime value may become incorrect when the update_wall_time(ticks)
function is called with "ticks" > 1. In such a case, the xtime variable is
updated multiple times inside the loop but it is normalized only once
outside of the loop.
This bug was reported at:
http://bugme.osdl.org/show_bug.cgi?id=3403
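The shape of the fix, sketched (names per 2.6 kernel/timer.c; not the verbatim hunk): normalize xtime inside the loop.
static void update_wall_time(unsigned long ticks)
{
	do {
		ticks--;
		update_wall_time_one_tick();
		/* normalize every iteration, not once after the loop */
		if (xtime.tv_nsec >= 1000000000) {
			xtime.tv_nsec -= 1000000000;
			xtime.tv_sec++;
			second_overflow();
		}
	} while (ticks);
}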
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32/64
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
	memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
	memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
The other subtlety is removing:
	tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be in a generally dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note: this is not 100% POSIX compliance with regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Thanks to Xu for noticing: some whitespace found its way in there.
Clean that up.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We need io.h for readq().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Report the resolution of the time source correctly for time interpolators
with a frequency over 1 GHz.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
for IA64
This has been in the ia64 (and hence -mm) trees for a couple of months.
Changelog:
* Affects only architectures which define CONFIG_TIME_INTERPOLATION
(currently only IA64)
* Genericize time interpolation, make time interpolators easily usable
and provide instructions on how to use the interpolator for other
architectures.
* Provide nanosecond resolution for clock_gettime and an accuracy
up to the time interpolator time base.
* clock_getres() reports resolution of underlying time basis which
is typically <50ns and may be 1ns on some systems.
* Make time interpolator self-tuning to limit time jumps
and to make the interpolators work correctly on systems with
broken time base specifications.
* SMP scalability: Make clock_gettime and gettimeofday scale O(1)
by removing the cmpxchg for most clocks (tested for up to 512 CPUs)
* IA64: provide asm fastcall that doubles the performance
of gettimeofday and clock_gettime on SGI and other IA64 systems
(asm fastcalls scale O(1) together with the scalability fixes).
* IA64: provide nojitter kernel option so that IA64 systems with
correctly synchronized ITC counters may also enjoy the
scalability enhancements.
Performance measurements for single calls (ITC cycles):
A. 4 way Intel IA64 SMP system (kmart)
ITC offsets:
kmart:/usr/src/noship-tests # dmesg|grep synchr
CPU 1: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles)
CPU 2: synchronized ITC with CPU 0 (last diff 2 cycles, maxerr 417 cycles)
CPU 3: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles)
A.1. Current kernel code
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3737 220 215 215 215 215 215 215 215 215
clock_gettime(REAL) cycles: 4058 575 564 576 565 566 558 558 558 558
clock_gettime(MONO) cycles: 1583 621 609 609 609 609 609 609 609 609
clock_gettime(PROCESS) cycles: 71428 298 259 259 259 259 259 259 259 259
clock_gettime(THREAD) cycles: 3982 336 290 298 298 298 298 286 286 286
A.2 New code using cmpxchg
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3145 213 216 213 213 213 213 213 213 213
clock_gettime(REAL) cycles: 3185 230 210 210 210 210 210 210 210 210
clock_gettime(MONO) cycles: 284 217 217 216 216 216 216 216 216 216
clock_gettime(PROCESS) cycles: 68857 289 270 259 259 259 259 259 259 259
clock_gettime(THREAD) cycles: 3862 339 298 298 298 298 290 286 286 286
A.3 New code with cmpxchg switched off (nojitter kernel option)
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3195 219 219 212 212 212 212 212 212 212
clock_gettime(REAL) cycles: 3003 228 205 205 205 205 205 205 205 205
clock_gettime(MONO) cycles: 279 209 209 209 208 208 208 208 208 208
clock_gettime(PROCESS) cycles: 65849 292 259 259 268 270 270 259 259 259
B. SGI SN2 system running 512 IA64 CPUs.
B.1. Current kernel code
[root@ascender noship-tests]# ./dmt
gettimeofday cycles: 17221 1028 1007 1004 1004 1004 1010 25928 1002 1003
clock_gettime(REAL) cycles: 10388 1099 1055 1044 1064 1063 1051 1056 1061 1056
clock_gettime(MONO) cycles: 2363 96 96 96 96 96 96 96 96 96
clock_gettime(PROCESS) cycles: 46537 804 660 666 666 666 666 666 666 666
clock_gettime(THREAD) cycles: 10945 727 710 684 685 686 685 686 685 686
B.2 New code
ascender:~/noship-tests # ./dmt
gettimeofday cycles: 3874 610 588 588 588 588 588 588 588 588
clock_gettime(REAL) cycles: 3893 612 588 582 588 588 588 588 588 588
clock_gettime(MONO) cycles: 686 595 595 588 588 588 588 588 588 588
clock_gettime(PROCESS) cycles: 290759 322 269 269 259 265 265 265 259 259
clock_gettime(THREAD) cycles: 5153 358 306 298 296 304 290 298 298 298
Scalability of time functions (in time it takes to do a million calls):
=======================================================================
A. 4 way Intel IA64 SMP system (kmart)
A.1 Current code
kmart:/usr/src/noship-tests # ./todscale -n1000000
CPUS WALL WALL/CPUS
1 0.192 0.192
2 1.125 0.563
4 9.229 2.307
A.2 New code using cmpxchg
kmart:/usr/src/noship-tests # ./todscale
CPUS WALL WALL/CPUS
1 0.188 0.188
2 0.457 0.229
4 0.413 0.103
(the measurement with 4 cpus may fluctuate up to 15.x somehow)
A.3 New code without cmpxchg (nojitter kernel option)
kmart:/usr/src/noship-tests # ./todscale -n10000000
CPUS WALL WALL/CPUS
1 0.180 0.180
2 0.180 0.090
4 0.252 0.063
B. SGI SN2 system running 512 IA64 CPUs.
The system has a global monotonic clock and therefore has
no need for compensation. Current code uses a cmpxchg. New
code has no cmpxchg.
B.1 current code
ascender:~/noship-tests # ./todscale
CPUS WALL WALL/CPUS
1 0.850 0.850
2 1.767 0.884
4 6.124 1.531
8 20.777 2.597
16 57.693 3.606
32 164.688 5.146
64 456.647 7.135
128 1093.371 8.542
256 2778.257 10.853
(System crash at 512 CPUs)
B.2 New code
ascender:~/noship-tests # ./todscale -n1000000
CPUS WALL WALL/CPUS
1 0.426 0.426
2 0.429 0.215
4 0.436 0.109
8 0.452 0.057
16 0.454 0.028
32 0.457 0.014
64 0.459 0.007
128 0.466 0.004
256 0.474 0.002
512 0.518 0.001
Clock Accuracy
==============
A. 4 CPU SMP system
A.1 Old code
kmart:/usr/src/noship-tests # ./cdisp
Gettimeofday() = 1092124757.270305000
CLOCK_REALTIME= 1092124757.270382000 resolution= 0.000976563
CLOCK_MONOTONIC= 89.696726590 resolution= 0.000976563
CLOCK_PROCESS_CPUTIME_ID= 0.001242507 resolution= 0.000000001
CLOCK_THREAD_CPUTIME_ID= 0.001255310 resolution= 0.000000001
A.2 New code
kmart:/usr/src/noship-tests # ./cdisp
Gettimeofday() = 1092124478.194530000
CLOCK_REALTIME= 1092124478.194603399 resolution= 0.000000001
CLOCK_MONOTONIC= 88.198315204 resolution= 0.000000001
CLOCK_PROCESS_CPUTIME_ID= 0.001241235 resolution= 0.000000001
CLOCK_THREAD_CPUTIME_ID= 0.001254747 resolution= 0.000000001
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Maximilian Attems <janitor@sternwelten.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a minor problem in the start_kernel function. start_kernel will
enable interrupts after calling profile_init. However, before that, the
time_init function on the IA64 platform could already enable interrupts.
See this call sequence:
start_kernel
  -> time_init
    -> ia64_init_itm
      -> register_time_interpolator
        -> write_seqlock_irq
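write_seqlock_irq() ends by unconditionally re-enabling interrupts, which is what turns them on too early here. A hedged sketch of the usual remedy (whether or not the actual patch takes exactly this route) is the state-preserving variant:
	unsigned long flags;

	write_seqlock_irqsave(&xtime_lock, flags);
	/* ... install the time interpolator ... */
	write_sequnlock_irqrestore(&xtime_lock, flags);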
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Yao Jun <junx.yao@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Anton prompted me to get this patch merged. It changes the core buffer
sync algorithm of OProfile to avoid global locks wherever possible. Anton
tested an earlier version of this patch with some success. I've lightly
tested this applied against 2.6.8.1-mm3 on my two-way machine.
The changes also have the happy side-effect of losing fewer samples after
munmap operations, and removing the blind spot of tasks exiting inside the
kernel.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This issue was discussed on lkml and linux-ia64. The patch introduces
"getnstimeofday" and removes all the code scaling gettimeofday to
nanoseconds. It makes it possible for the posix-timer functions to return
higher accuracy.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a longstanding off-by-one error that results from an incorrect
comparison when checking whether a process has consumed CPU time in
excess of its RLIMIT_CPU limits.
This means, for example, that if we use setrlimit() to set the soft CPU
limit (rlim_cur) to 5 seconds and the hard limit (rlim_max) to 10 seconds,
then the process only receives a SIGXCPU signal after consuming 6 seconds
of CPU time, and, if it continues consuming CPU after handling that
signal, only receives SIGKILL after consuming 11 seconds of CPU time.
The fix is trivial.
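The shape of that trivial fix, sketched with illustrative field paths:
	if (psecs >= p->rlim[RLIMIT_CPU].rlim_max)	/* was: > */
		send_sig(SIGKILL, p, 1);		/* hard limit */
	else if (psecs >= p->rlim[RLIMIT_CPU].rlim_cur)	/* was: > */
		send_sig(SIGXCPU, p, 1);		/* soft limit */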
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Below is a patch that tries to sanitize the dropping of unneeded system-call
stubs in generic code. In some instances, it would be possible to move the
optional system-call stubs into a library routine which would avoid the need
for #ifdefs, but in many cases, doing so would require making several
functions global (and possibly exporting additional data-structures in
header-files). Furthermore, it would inhibit (automatic) inlining in the
cases where the stubs are needed. For these reasons, the patch
keeps the #ifdef-approach.
This has been tested on ia64 and there were no objections from the
arch-maintainers (and one positive response). The patch should be safe but
arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo
macros should be removed for their architecture (I'm quite sure that's the
case, but I wanted to play it safe and only preserved the status-quo in that
regard).
|
|
|
|
From: Geoff Gustafson <geoff@linux.jf.intel.com>,
"Chen, Kenneth W" <kenneth.w.chen@intel.com>,
Ingo Molnar <mingo@elte.hu>,
me.
The big-SMP guys are seeing high CPU load due to del_timer_sync()'s
inefficiencies. The callers are fs/aio.c and schedule_timeout().
We note that neither of these callers' timer handlers actually re-add the
timer - they are single-shot.
So we don't need all that complexity in del_timer_sync() - we can just run
del_timer() and if that worked we know the timer is dead.
Add del_single_shot_timer(), export it to modules and use it in AIO and
schedule_timeout().
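A sketch of the resulting helper (mainline eventually carried this idea as del_singleshot_timer_sync(); details approximate):
int del_singleshot_timer_sync(struct timer_list *timer)
{
	int ret = del_timer(timer);

	if (!ret) {
		/* not pending: the handler may be running right now, but
		 * a single-shot handler never re-arms, so one sync wait
		 * is enough */
		ret = del_timer_sync(timer);
		BUG_ON(ret);	/* it must not have re-added itself */
	}
	return ret;
}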
(these numbers are for an earlier patch, but they'll be close)
                    32p       4p
Before:
  Warm cache     29,000      505
  Cold cache     37,800     1220
After:
  Warm cache         95       88
  Cold cache      1,800      140
[Measurements are CPU cycles spent in a call to del_timer_sync, the average
of 1000 calls. 32p is 16-node NUMA, 4p is SMP.]
(I cleaned up a few things and added some commentary)
|
|
From: Paul Jackson <pj@sgi.com>
With a hotplug capable kernel, there is a requirement to distinguish a
possible CPU from one actually present. The set of possible CPU numbers
doesn't change during a single system boot, but the set of present CPUs
changes as CPUs are physically inserted into or removed from a system. The
cpu_possible_map does not change once initialized at boot, but the
cpu_present_map changes dynamically as CPUs are inserted or removed.
Paul Jackson <pj@sgi.com> provided an expanded explanation:
Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following
cpu maps being available. All the following maps are fixed size bitmaps of
size NR_CPUS.
#ifdef CONFIG_HOTPLUG_CPU
cpu_possible_map - map with all NR_CPUS bits set
cpu_present_map - map with bit 'cpu' set iff cpu is populated
cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
#else
cpu_possible_map - map with bit 'cpu' set iff cpu is populated
cpu_present_map - copy of cpu_possible_map
cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
#endif
In either case, NR_CPUS is fixed at compile time, as the static size of these
bitmaps. The cpu_possible_map is fixed at boot time, as the set of CPU id's
that it is possible might ever be plugged in at anytime during the life of
that system boot. The cpu_present_map is dynamic(*), representing which CPUs
are currently plugged in. And cpu_online_map is the dynamic subset of
cpu_present_map, indicating those CPUs available for scheduling.
If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS
bits set, otherwise it is just the set of CPUs that ACPI reports present at
boot.
If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on
what ACPI reports as currently plugged in, otherwise cpu_present_map is just a
copy of cpu_possible_map.
(*) Well, cpu_present_map is dynamic in the hotplug case. If not hotplug,
it's the same as cpu_possible_map, hence fixed at boot.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch adds a system control that allows switching off the jiffies timer
interrupt while a cpu sleeps in idle. This is useful for a system running
with virtual cpus under z/VM.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This addresses the issue with get_wchan() that the various functions acting
as scheduling-related primitives are not, in fact, contiguous in the text
segment. It creates an ELF section for scheduling primitives to be placed
in, and places currently-detected (i.e. skipped during stack decoding)
scheduling primitives and others like io_schedule() and down(), which are
currently missed by get_wchan() code, into this section also.
The net effects are more reliability of get_wchan()'s results and the new
ability, made use of by this code, to arbitrarily place scheduling
primitives in the source code without disturbing get_wchan()'s accuracy.
Suggestions by Arnd Bergmann and Matthew Wilcox regarding reducing the
invasiveness of the patch were incorporated during prior rounds of review.
I've at least tried to sweep all arches in this patch.
|
|
Every pointer in <syscalls.h> had better be a user
pointer. Also add some others that a quick sanity check
picked up on.
|
|
Various files keep per-cpu caches which need to be freed/moved when a
CPU goes down. All under CONFIG_HOTPLUG_CPU ifdefs.
scsi.c: drain dead cpu's scsi_done_q onto this cpu.
buffer.c: brelse the bh_lrus queue for dead cpu.
timer.c: migrate timers from dead cpu, being careful of lock order vs
__mod_timer.
radix_tree.c: free dead cpu's radix_tree_preloads
page_alloc.c: empty dead cpu's nr_pagecache_local into nr_pagecache, and
free pages on cpu's local cache.
slab.c: stop reap_timer for dead cpu, adjust each cache's free limit, and
free each slab cache's per-cpu block.
swap.c: drain dead cpu's lru_add_pvecs into ours, and empty its committed_space
counter into global counter.
dev.c: drain device queues from dead cpu into this one.
flow.c: drain dead cpu's flow cache.
|
|
From: john stultz <johnstul@us.ibm.com>
In developing the ia64-cyclone patch, which implements a cyclone based time
interpolator, I found the following bug which could cause time
inconsistencies.
In update_wall_time_one_tick(), which is called each timer interrupt, we
call time_interpolator_update(delta_nsec) where delta_nsec is approximately
NSEC_PER_SEC/HZ. This directly correlates with the changes to xtime which
occurs in update_wall_time_one_tick().
However in update_wall_time(), on a second overflow, we again call
time_interpolator_update(NSEC_PER_SEC). However, while the components of
xtime are being changed, the overall value of xtime does not change (nsec is
decremented by NSEC_PER_SEC and sec is incremented). Thus this call to
time_interpolator_update is incorrect.
This patch removes the incorrect call to time_interpolator_update and was
found to resolve the time inconsistencies I had seen while developing the
ia64-cyclone patch.
|
|
From: Gerd Knorr <kraxel@suse.de>
Current gccs error out if a function's declaration and definition disagree
about the register passing convention.
The patch adds a new `fastcall' declaration primitive, and uses that in all
the FASTCALL functions which we could find. A number of inconsistencies were
fixed up along the way.
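The idea, sketched for i386 (other architectures would define fastcall empty):
/* i386: pass up to three arguments in registers */
#define fastcall __attribute__((regparm(3)))

/* the convention must appear in the declaration ... */
fastcall unsigned long demo(unsigned long arg);

/* ... and identically in the definition, or current gcc errors out */
fastcall unsigned long demo(unsigned long arg)
{
	return arg + 1;
}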
|
|
From: Kurt Garloff <garloff@suse.de>
When calling
	alarm(1); alarm(0);
the second alarm can wrongly return 2. This makes an LSB test fail.
It happens due to rounding errors in the timeval-to-jiffies conversion and
back. On i386 with HZ == 1000 there should be no need for rounding
errors IMVHO, but they occur even there. On HZ=1024 platforms, they may
even be unavoidable.
The attached patch fixes the return value of alarm().
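A hypothetical round trip showing how the rounding inflates the result (helper names invented for illustration):
#include <stdio.h>

#define HZ 1000

/* arming rounds up (+1 jiffy of slack) so we sleep at least as
 * long as requested */
static unsigned long secs_to_jiffies_arm(unsigned long s)
{
	return s * HZ + 1;
}

/* reading the remainder back with another round-up re-inflates it:
 * 1001 jiffies -> 2 seconds */
static unsigned long jiffies_to_secs_roundup(unsigned long j)
{
	return (j + HZ - 1) / HZ;
}

int main(void)
{
	unsigned long j = secs_to_jiffies_arm(1);
	printf("alarm(1); alarm(0) reports %lu\n", jiffies_to_secs_roundup(j));
	return 0;
}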
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Some places use cpu_online() where they should be using cpu_possible, most
commonly for tallying statistics. This makes no difference without hotplug
CPU.
Use the for_each_cpu() macro in those places, providing good examples (and
making the external hotplug CPU patch smaller).
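A hedged sketch of the tallying pattern (event_count is hypothetical; for_each_cpu() of this era iterates cpu_possible_map and was later renamed for_each_possible_cpu()):
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, event_count);

static unsigned long total_events(void)
{
	unsigned long sum = 0;
	int cpu;

	/* possible CPUs, so counts from currently-offline CPUs survive */
	for_each_cpu(cpu)
		sum += per_cpu(event_count, cpu);
	return sum;
}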
|
|
The latter has buggy restart functionality and is a lot
more complicated anyway.
|
|
From: Stephen Hemminger <shemminger@osdl.org>
The following will prevent adjtime from causing time regression. It delays
starting the adjtime mechanism for one tick, and keeps gettimeofday inside
the window.
Only fixes i386, but changes to other arch would be similar.
Running a simple clock test program and playing with adjtime demonstrates
that this fixes the problem (and 2.6.0-test6 is broken). But given the
fragile nature of the timer code, it should go through some more testing
before inclusion.
|
|
flag (and order it on SMP), so that del_timer_sync() always sees the
timer either pending or running if it is active.
|
|
Cset exclude: mingo@elte.hu[torvalds]|ChangeSet|20031012025453|05000
|
|
This fixes two del_timer_sync() races that are still in the timer code.
The first race was actually triggered in a 2.4 backport of the 2.6 timer
code. The second race was never triggered - it is mostly theoretical on
a standalone kernel. (It's more likely in any virtualized or otherwise
preemptable environment.)
Both races happen when self-rearming timers are used. One mainstream
example is kernel/itimer.c. The effect of the races is that
del_timer_sync() lets a timer keep running instead of synchronizing with it,
causing logic bugs (and crashes) in the affected kernel code. One
typical incarnation of the race is a double add_timer().
race #1:
this code in __run_timers() is running on CPU0:
	list_del(&timer->entry);
	timer->base = NULL;
	[*]
	set_running_timer(base, timer);
	spin_unlock_irq(&base->lock);
	[**]
	fn(data);
	spin_lock_irq(&base->lock);
CPU0 gets stuck at the [*] code-point briefly - after the timer->base has
been set to NULL, but before the base->running_timer pointer has been set
up. This is a fundamentally volatile scenario, as there's _zero_ knowledge
in the data structures that this timer is about to be executed!
Now CPU1 comes along and calls del_timer_sync(). It will find nothing -
neither timer->base nor base->running_timer will cause it to synchronize.
It will return and report that the timer has been deleted - shortly
afterwards CPU1 continues to execute the timer fn, which will cause
crashes.
This particular race is easy to fix by reordering the timer->base
clearing with set_running_timer(), and putting a wmb() between them, but
there are more races:
race #2
The checking of del_timer_sync() for 'pending or running timer' is
fundamentally fragile. E.g. if CPU0 gets stuck at the [***] point below:
		base = &per_cpu(tvec_bases, i);
		if (base->running_timer == timer) {
			while (base->running_timer == timer) {
				cpu_relax();
				preempt_check_resched();
			}
			[***]
			break;
		}
	}
	smp_rmb();
	if (timer_pending(timer))
		goto del_again;
then del_timer_sync() has already decided that this timer is not running
(we just finished loop-waiting for it), but we have not done the
timer_pending() check yet.
If the timer has re-armed itself, and if the timer expires on CPU1 (this
needs a long delay on CPU0 but that's not hard to achieve eg. in UML or
with kernel preemption enabled), then CPU1 could start to expire the
timer and get to the [**] point in __run_timers (see above), then CPU1
gets stalled and CPU0 is unstalled, then the timer_pending() check in
del_timer_sync() will not notice the running timer, and del_timer_sync()
returns - while CPU1 is just about to run the timer!
Fixing this second race is hard - it involves a heavy race-check
operation that has to lock all bases, and has to re-check the
base->running_timer value, and timer_pending condition atomically.
This fix also fixes the first race, due to forcing del_timer_sync() to
always observe the timer state atomically, so the [*] code point will
always synchronize with del_timer_sync().
The patch is ugly but safe, and it has fixed the crashes in the 2.4
backport. I tested the patch on 2.6.0-test7 with some heavy itimer use
and it works fine. Removing self-arming timers safely is the sole
purpose of del_timer_sync(), so there's no way around this overhead, I
think. I believe we should ultimately fix all major del_timer_sync()
users to not use self-arming timers - having del_timer_sync() in the
thread-exit path is now a considerable source of SMP overhead. But this
is out of the scope of current 2.6 fixes of course, and we have to
support self-arming timers as well.
|
|
|
|
From Tejun's posting:
>
> This patch fixes a race between del_timer_sync and recursive timers.
> Current implementation allows the value of timer->base that is used
> for timer_pending test to be fetched before finishing running_timer
> test, so it's possible for a recursive time to be pending after
> del_timer_sync. Adding smp_rmb before timer_pending removes the race.
|
|
Patch from Julie DeWandel.
This patch has solved the crashes observed during TPC-C runs on the
16-way box. (I'm confident it will fix the other reported cases as
well.)
The race is the setting of timer->base to NULL, by del_timer() or
__run_timers(). If new_base == old_base in __mod_timer() then we do not
re-check timer->base after getting the lock. (the only case where we do
not have to re-check the base is in the !old_base case, but the else
branch also includes the old_base==new_base case.)
The __run_timers() case made the lock_timer() patch not work fully - we
cannot use lock_timer() in __run_timers() due to lock ordering.
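The shape of the repair in __mod_timer(), sketched (simplified; the real locking is more involved):
	spin_lock_irqsave(&new_base->lock, flags);
	if (timer->base != old_base) {
		/* del_timer() or __run_timers() changed timer->base
		 * before we took the lock: drop it and retry */
		spin_unlock_irqrestore(&new_base->lock, flags);
		goto repeat;
	}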
|