| Age | Commit message (Collapse) | Author |
|
sys_getppid() optimization can access a freed memory. On kernels with
DEBUG_SLAB turned ON, this results in Oops. As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.
So this patch always takes the lock to be safe.
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The pointer to the current time interpolator and the current list of time
interpolators are typically only changed during bootup. Adding
__read_mostly takes them away from possibly hot cachelines.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a compiler barrier so that we don't read jiffies before updating
jiffies_64.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Also from Thomas Gleixner <tglx@linutronix.de>
Function next_timer_interrupt() got broken with a recent patch
6ba1b91213e81aa92b5cf7539f7d2a94ff54947c as sys_nanosleep() was moved to
hrtimer. This broke things as next_timer_interrupt() did not check hrtimer
tree for next event.
Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ,
VST) implementations, as the system can be in idle when next hrtimer event
was supposed to happen. At least ARM and S390 currently use
next_timer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
On some platforms readq performs additional work to make sure I/O is done
in a coherent way. This is not needed for time retrieval as done by the
time interpolator. So we can use readq_relaxed instead which will improve
performance.
It affects sparc64 and ia64 only. Apparently it makes a significant
difference on ia64.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This provides an interface for arch code to find out how many
nanoseconds are going to be added on to xtime by the next call to
do_timer. The value returned is a fixed-point number in 52.12 format
in nanoseconds. The reason for this format is that it gives the
full precision that the timekeeping code is using internally.
The motivation for this is to fix a problem that has arisen on 32-bit
powerpc in that the value returned by do_gettimeofday drifts apart
from xtime if NTP is being used. PowerPC is now using a lockless
do_gettimeofday based on reading the timebase register and performing
some simple arithmetic. (This method of getting the time is also
exported to userspace via the VDSO.) However, the factor and offset
it uses were calculated based on the nominal tick length and weren't
being adjusted when NTP varied the tick length.
Note that 64-bit powerpc has had the lockless do_gettimeofday for a
long time now. It also had an extremely hairy routine that got called
from the 32-bit compat routine for adjtimex, which adjusted the
factor and offset according to what it thought the timekeeping code
was going to do. Not only was this only called if a 32-bit task did
adjtimex (i.e. not if a 64-bit task did adjtimex), it was also
duplicating computations from kernel/timer.c and it wasn't clear that
it was (still) correct.
The simple solution is to ask the timekeeping code how long the
current jiffy will be on each timer interrupt, after calling
do_timer. If this jiffy will be a different length from the last one,
we then need to compute new values for the factor and offset used in
the lockless do_gettimeofday. In this way we can keep xtime and
do_gettimeofday in sync, even when NTP is varying the tick length.
Note that when adjtimex varies the tick length, it almost always
introduces the variation from the next tick on. The only case I could
see where adjtimex would vary the length of the current tick is when
an old-style adjtime adjustment is being cancelled. (It's not clear
to me why the adjustment has to be cancelled immediately rather than
from the next tick on.) Thus I don't see any real need for a hook in
adjtimex; the rare case of an old-style adjustment being cancelled can
be fixed up at the next tick.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
convert sys_nanosleep() to use hrtimer_nanosleep()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
hrtimer subsystem core. It is initialized at bootup and expired by the timer
interrupt, but is otherwise not utilized by any other subsystem yet.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch contains the following cleanups:
- make needlessly global functions static
- every file should include the headers containing the prototypes for
it's global functions
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Define jiffies_64 in kernel/timer.c rather than having 24 duplicated
defines in each architecture.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Remove timer_list.magic and associated debugging code.
I originally added this when a spinlock was added to timer_list - this meant
that an all-zeroes timer became illegal and init_timer() was required.
That spinlock isn't even there any more, although timer.base must now be
initialised.
I'll keep this debugging code in -mm.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix bizarre 4-space coding style in the NTP code.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Create a macro shift_right() that avoids the numerous ugly conditionals in the
NTP code that look like:
if(a < 0)
b = -(-a >> shift);
else
b = a >> shift;
Replacing it with:
b = shift_right(a, shift);
This should have zero effect on the logic, however it should probably have
a bit of testing just to be sure.
Also replace open-coded min/max with the macros.
Signed-off-by : John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Every user of init_timer() also needs to initialize ->function and ->data
fields. This patch adds a simple setup_timer() helper for that.
The schedule_timeout() is patched as an example of usage.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in
second_overflow().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
These functions don't need schedule_timeout()'s barrier.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add schedule_timeout_{,un}interruptible() interfaces so that
schedule_timeout() callers don't have to worry about forgetting to add the
set_current_state() call beforehand.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Christoph Lameter <clameter@engr.sgi.com>
When using a time interpolator that is susceptible to jitter there's
potentially contention over a cmpxchg used to prevent time from going
backwards. This is unnecessary when the caller holds the xtime write
seqlock as all readers will be blocked from returning until the write is
complete. We can therefore allow writers to insert a new value and exit
rather than fight with CPUs who only hold a reader lock.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP.
When enabled then per-CPU watchdog threads are started, which try to run
once per second. If they get delayed for more than 10 seconds then a
callback from the timer interrupt detects this condition and prints out a
warning message and a stack dump (once per lockup incident). The feature
is otherwise non-intrusive, it doesnt try to unlock the box in any way, it
only gets the debug info out, automatically, and on all CPUs affected by
the lockup.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de>
Signed-off-by: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to
return a bogus value if the parent's task_struct gets reallocated after
current->group_leader->real_parent is read:
asmlinkage long sys_getppid(void)
{
int pid;
struct task_struct *me = current;
struct task_struct *parent;
parent = me->group_leader->real_parent;
RACE HERE => for (;;) {
pid = parent->tgid;
#ifdef CONFIG_SMP
{
struct task_struct *old = parent;
/*
* Make sure we read the pid before re-reading the
* parent pointer:
*/
smp_rmb();
parent = me->group_leader->real_parent;
if (old != parent)
continue;
}
#endif
break;
}
return pid;
}
If the process gets preempted at the indicated point, the parent process
can go ahead and call exit() and then get wait()'d on to reap its
task_struct. When the preempted process gets resumed, it will not do any
further checks of the parent pointer on !CONFIG_SMP: it will read the
bad pid and return.
So, the same algorithm used when SMP is enabled should be used when
preempt is enabled, which will recheck ->real_parent in this case.
Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The comment for msleep_interruptible() is wrong, as it will ignore
wait-queue events, but will wake up early for signals.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In kernel/sched.c the return value from preempt_count() is cast to an int.
That made sense when preempt_count was defined as different types on is not
needed and should go away. The patch removes the cast.
In kernel/timer.c the return value from preempt_count() is assigned to a
variable of type u32 and then that unsigned value is later compared to
preempt_count(). Since preempt_count() returns an int, an int is what
should be used to store its return value. Storing the result in an
unsigned 32bit integer made a tiny bit of sense back when preempt_count was
different types on different archs, but no more - let's not play signed vs
unsigned comparison games when we don't have to. The patch modifies the
code to use an int to hold the value. While I was around that bit of code
I also made two changes to a nearby (related) printk() - I modified it to
specify the loglevel explicitly and also broke the line into a few pieces
to avoid it being longer than 80 chars and clarified the text a bit.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch splits del_timer_sync() into 2 functions. The new one,
try_to_del_timer_sync(), returns -1 when it hits executing timer.
It can be used in interrupt context, or when the caller hold locks which
can prevent completion of the timer's handler.
NOTE. Currently it can't be used in interrupt context in UP case, because
->running_timer is used only with CONFIG_SMP.
Should the need arise, it is possible to kill #ifdef CONFIG_SMP in
set_running_timer(), it is cheap.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch tries to solve following problems:
1. del_timer_sync() is racy. The timer can be fired again after
del_timer_sync have checked all cpus and before it will recheck
timer_pending().
2. It has scalability problems. All cpus are scanned to determine
if the timer is running on that cpu.
With this patch del_timer_sync is O(1) and no slower than plain
del_timer(pending_timer), unless it has to actually wait for
completion of the currently running timer.
The only restriction is that the recurring timer should not use
add_timer_on().
3. The timers are not serialized wrt to itself.
If CPU_0 does mod_timer(jiffies+1) while the timer is currently
running on CPU 1, it is quite possible that local interrupt on
CPU_0 will start that timer before it finished on CPU_1.
4. The timers locking is suboptimal. __mod_timer() takes 3 locks
at once and still requires wmb() in del_timer/run_timers.
The new implementation takes 2 locks sequentially and does not
need memory barriers.
Currently ->base != NULL means that the timer is pending. In that case
->base.lock is used to lock the timer. __mod_timer also takes timer->lock
because ->base can be == NULL.
This patch uses timer->entry.next != NULL as indication that the timer is
pending. So it does __list_del(), entry->next = NULL instead of list_del()
when the timer is deleted.
The ->base field is used for hashed locking only, it is initialized
in init_timer() which sets ->base = per_cpu(tvec_bases). When the
tvec_bases.lock is locked, it means that all timers which are tied
to this base via timer->base are locked, and the base itself is locked
too.
So __run_timers/migrate_timers can safely modify all timers which could
be found on ->tvX lists (pending timers).
When the timer's base is locked, and the timer removed from ->entry list
(which means that _run_timers/migrate_timers can't see this timer), it is
possible to set timer->base = NULL and drop the lock: the timer remains
locked.
This patch adds lock_timer_base() helper, which waits for ->base != NULL,
locks the ->base, and checks it is still the same.
__mod_timer() schedules the timer on the local CPU and changes it's base.
However, it does not lock both old and new bases at once. It locks the
timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
and adds this timer. This simplifies the code, because AB-BA deadlock is not
possible. __mod_timer() also ensures that the timer's base is not changed
while the timer's handler is running on the old base.
__run_timers(), del_timer() do not change ->base anymore, they only clear
pending flag.
So del_timer_sync() can test timer->base->running_timer == timer to detect
whether it is running or not.
We don't need timer_list->lock anymore, this patch kills it.
We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
before clearing timer's pending flag. It was needed because __mod_timer()
did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
could race with del_timer()->list_del(). With this patch these functions are
serialized through base->lock.
One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
adds global
struct timer_base_s {
spinlock_t lock;
struct timer_list *running_timer;
} __init_timer_base;
which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
struct are replaced by struct timer_base_s t_base.
It is indeed ugly. But this can't have scalability problems. The global
__init_timer_base.lock is used only when __mod_timer() is called for the first
time AND the timer was compile time initialized. After that the timer migrates
to the local CPU.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is a megarollup of ~60 patches which give various things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME,
not only the clock_* calls but also timer_* calls must support the thread and
process CPU time clocks. This patch provides that support, building on my
recent additions to support these clocks in the POSIX clock_* interfaces.
This patch will not work without those changes, as well as the patch fixing
the timer lock-siglock deadlock problem.
The apparent pervasive changes to posix-timers.c are simply that some fields
of struct k_itimer have changed name and moved into a union. This was
appropriate since the data structures required for the existing real-time
timer support and for the new thread/process CPU-time timers are quite
different.
The glibc patches to support CPU time clocks using the new kernel support is
in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and that
includes tests for the timer support (if you build glibc with NPTL).
From: Christoph Lameter <clameter@sgi.com>
Your patch breaks the mmtimer driver because it used k_itimer values for
its own purposes. Here is a fix by defining an additional structure in
k_itimer (same approach for mmtimer as the cpu timers):
From: Roland McGrath <roland@redhat.com>
Fix bug identified by Alexander Nyberg <alexn@dsv.su.se>
> The problem arises from code touching the union in alloc_posix_timer()
> which makes firing go non-zero. When firing is checked in
> posix_cpu_timer_set() it will be positive causing an infinite loop.
>
> So either the below fix or preferably move the INIT_LIST_HEAD(x) from
> alloc_posix_timer() to somewhere later where it doesn't disturb the other
> union members.
Thanks for finding this problem. The latter is what I think is the right
solution. This patch does that, and also removes some superfluous rezeroing.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
CONFIG_BASE_SMALL reduce timer list hashes
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The "addr" member in the time-interpolator is sometimes used as a
function-pointer and sometimes as an I/O-memory pointer. The attached
patch tells sparse that this is OK.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based and the effect of this patch for architectures which use the default
implementation should be neglectible.
There is a second type cputime64_t which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ, this will
overflow after 49.7 days. This is not enough for kernel_stat (ihmo not
enough for a processes too), so it is necessary to have a 64 bit type.
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spent all of its time for the virtual
cpu. To get meaningful cpu usage numbers this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Vasia Pupkin <ptushnik@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Kernel core files converted to use the new lock initializers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is the current remove-BKL patch. I test-booted it on x86 and x64, trying
every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other
architectures should compile as well. (most of the testing was done with the
zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should
work fine.)
this is the debugging-enabled variant of the patch which has two main
debugging features:
- debug potentially illegal smp_processor_id() use. Has caught a number
of real bugs - e.g. look at the printk.c fix in the patch.
- make it possible to enable/disable the BKL via a .config. If this
goes upstream we dont want this of course, but for now it gives
people a chance to find out whether any particular problem was caused
by this patch.
This patch has one important fix over the previous BKL patch: on PREEMPT
kernels if we preempted BKL-using code then the code still auto-dropped the
BKL by mistake. This caused a number of breakages for testers, which
breakages went away once this bug was fixed.
Also the debugging mechanism has been improved alot relative to the previous
BKL patch.
Would be nice to test-drive this in -mm. There will likely be some more
smp_processor_id() false positives but they are 1) harmless 2) easy to fix up.
We could as well find more real smp_processor_id() related breakages as well.
The most noteworthy fact is that no BKL-using code was found yet that relied
on smp_processor_id(), which is promising from a compatibility POV.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I just did a quick audit of the use of exit_state and the EXIT_* bit
macros. I guess I didn't really review these changes very closely when you
did them originally. :-(
I found several places that seem like lossy cases of query-replace without
enough thought about the code. Linus has previously said the >= tests
ought to be & tests instead. But for exit_state, it can only ever be 0,
EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as
testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better.
The case like in choose_new_parent is just confusing, to have the
always-false test for EXIT_* bits in ->state there too.
The two cases in wants_signal and do_process_times are actual regressions
that will give us back old bugs in race conditions. These places had
s/TASK/EXIT/ but not s/state/exit_state/, and now there tests for exiting
tasks are now wrong and never catching them. I take it back: there is no
regression in wants_signal in practice I think, because of the PF_EXITING
test that makes the EXIT_* state checks superfluous anyway. So that is
just another cosmetic case of confusing code. But in do_process_times,
there is that SIGXCPU-while-exiting race condition back again.
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We just spent some days fighting a rare race in one of the distro's who backported
some of timer.c from 2.6 to 2.4 (though they missed a bit).
The actual race we found didn't happen in 2.6 _but_ code inspection showed that a
similar race is still present in 2.6, explanation below:
Code removing a timer from a list (run_timers or del_timer) takes that CPU list
lock, does list_del, then timer->base = NULL.
It is mandatory that this timer->base = NULL is visible to other CPUs only after
the list_del() is complete. If not, then mod timer could see it NULL, thus take it's
own CPU list lock and not the one for the CPU the timer was beeing removed from the
list, and thus the list_add in mod_timer() could race with the list_del() from
run_timers() or del_timer().
Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the
right spot in 2.6, but didn't in the "backport" we were fighting with.
However, del_timer() doesn't have such a barrier, and thus is subject to this race in
2.6 as well. This patch fixes it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
- fix broken IBM cyclone time interpolator support
- add support for cyclic timers through an addition of a mask
in the timer interpolator structure
- Allow time_interpolator_update() and time_interpolator_get_offset()
to be invoked without an active time interpolator
(necessary since the cyclone clock is initialized late in ACPI
processing)
- remove obsolete function time_interpolator_resolution()
- add a mask to all struct time_interpolator setups in the
kernel
- Make time interpolators work on 32bit platforms
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
add_timer_on() isn't used by modules (in fact it's only used ONCE, in
workqueue.c) and it's not even a good api for drivers, in fact, the comment
for it says
* This is not very scalable on SMP. Double adds are not possible.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Makes sure msleep() sleeps at least the amount provided, since
schedule_timeout() doesn't guarantee a full jiffy.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
For non-smp kernels the call to update_process_times is done in the
do_timer function. It is more consistent with smp kernels to move this
call to the architecture file which calls do_timer.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had. Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could] our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING to and we trigger this assert:
schedule();
BUG();
/* Avoid "noreturn function does return". */
for (;;) ;
I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task. Dead-task freeing is done via PF_DEAD.
Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The xtime value may become incorrect when the update_wall_time(ticks)
function is called with "ticks" > 1. In such a case, the xtime variable is
updated multiple times inside the loop but it is normalized only once
outside of the loop.
This bug was reported at:
http://bugme.osdl.org/show_bug.cgi?id=3403
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32/64
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
The other subtlety is removing:
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that processing accounting may already
be more generally a dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note, this is not 100% to POSIX compliance in regards to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
thanks Xu for noticing, some whitespace found it's way there.
clean that up.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We need io.h for readq().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Report the resolution of the time source correctly for time interpolators
with a frequency over 1 Ghz.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
for IA64
This has been in the ia64 (and hence -mm) trees for a couple of months.
Changelog:
* Affects only architectures which define CONFIG_TIME_INTERPOLATION
(currently only IA64)
* Genericize time interpolation, make time interpolators easily usable
and provide instructions on how to use the interpolator for other
architectures.
* Provide nanosecond resolution for clock_gettime and an accuracy
up to the time interpolator time base.
* clock_getres() reports resolution of underlying time basis which
is typically <50ns and may be 1ns on some systems.
* Make time interpolator self-tuning to limit time jumps
and to make the interpolators work correctly on systems with
broken time base specifications.
* SMP scalability: Make clock_gettime and gettimeofday scale O(1)
by removing the cmpxchg for most clocks (tested for up to 512 CPUs)
* IA64: provide asm fastcall that doubles the performance
of gettimeofday and clock_gettime on SGI and other IA64 systems
(asm fastcalls scale O(1) together with the scalability fixes).
* IA64: provide nojitter kernel option so that IA64 systems with
correctly synchronized ITC counters may also enjoy the
scalability enhancements.
Performance measurements for single calls (ITC cycles):
A. 4 way Intel IA64 SMP system (kmart)
ITC offsets:
kmart:/usr/src/noship-tests # dmesg|grep synchr
CPU 1: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles)
CPU 2: synchronized ITC with CPU 0 (last diff 2 cycles, maxerr 417 cycles)
CPU 3: synchronized ITC with CPU 0 (last diff 1 cycles, maxerr 417 cycles)
A.1. Current kernel code
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3737 220 215 215 215 215 215 215 215 215
clock_gettime(REAL) cycles: 4058 575 564 576 565 566 558 558 558 558
clock_gettime(MONO) cycles: 1583 621 609 609 609 609 609 609 609 609
clock_gettime(PROCESS) cycles: 71428 298 259 259 259 259 259 259 259 259
clock_gettime(THREAD) cycles: 3982 336 290 298 298 298 298 286 286 286
A.2 New code using cmpxchg
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3145 213 216 213 213 213 213 213 213 213
clock_gettime(REAL) cycles: 3185 230 210 210 210 210 210 210 210 210
clock_gettime(MONO) cycles: 284 217 217 216 216 216 216 216 216 216
clock_gettime(PROCESS) cycles: 68857 289 270 259 259 259 259 259 259 259
clock_gettime(THREAD) cycles: 3862 339 298 298 298 298 290 286 286 286
A.3 New code with cmpxchg switched off (nojitter kernel option)
kmart:/usr/src/noship-tests # ./dmt
gettimeofday cycles: 3195 219 219 212 212 212 212 212 212 212
clock_gettime(REAL) cycles: 3003 228 205 205 205 205 205 205 205 205
clock_gettime(MONO) cycles: 279 209 209 209 208 208 208 208 208 208
clock_gettime(PROCESS) cycles: 65849 292 259 259 268 270 270 259 259 259
B. SGI SN2 system running 512 IA64 CPUs.
B.1. Current kernel code
[root@ascender noship-tests]# ./dmt
gettimeofday cycles: 17221 1028 1007 1004 1004 1004 1010 25928 1002 1003
clock_gettime(REAL) cycles: 10388 1099 1055 1044 1064 1063 1051 1056 1061 1056
clock_gettime(MONO) cycles: 2363 96 96 96 96 96 96 96 96 96
clock_gettime(PROCESS) cycles: 46537 804 660 666 666 666 666 666 666 666
clock_gettime(THREAD) cycles: 10945 727 710 684 685 686 685 686 685 686
B.2 New code
ascender:~/noship-tests # ./dmt
gettimeofday cycles: 3874 610 588 588 588 588 588 588 588 588
clock_gettime(REAL) cycles: 3893 612 588 582 588 588 588 588 588 588
clock_gettime(MONO) cycles: 686 595 595 588 588 588 588 588 588 588
clock_gettime(PROCESS) cycles: 290759 322 269 269 259 265 265 265 259 259
clock_gettime(THREAD) cycles: 5153 358 306 298 296 304 290 298 298 298
Scalability of time functions (in time it takes to do a million calls):
=======================================================================
A. 4 way Intel IA SMP system (kmart)
A.1 Current code
kmart:/usr/src/noship-tests # ./todscale -n1000000
CPUS WALL WALL/CPUS
1 0.192 0.192
2 1.125 0.563
4 9.229 2.307
A.2 New code using cmpxchg
kmart:/usr/src/noship-tests # ./todscale
CPUS WALL WALL/CPUS
1 0.188 0.188
2 0.457 0.229
4 0.413 0.103
(the measurement with 4 cpus may fluctuate up to 15.x somehow)
A.3 New code without cmpxchg (nojitter kernel option)
kmart:/usr/src/noship-tests # ./todscale -n10000000
CPUS WALL WALL/CPUS
1 0.180 0.180
2 0.180 0.090
4 0.252 0.063
B. SGI SN2 system running 512 IA64 CPUs.
The system has a global monotonic clock and therefore has
no need for compensation. Current code uses a cmpxchg. New
code has no cmpxchg.
B.1 current code
ascender:~/noship-tests # ./todscale
CPUS WALL WALL/CPUS
1 0.850 0.850
2 1.767 0.884
4 6.124 1.531
8 20.777 2.597
16 57.693 3.606
32 164.688 5.146
64 456.647 7.135
128 1093.371 8.542
256 2778.257 10.853
(System crash at 512 CPUs)
B.2 New code
ascender:~/noship-tests # ./todscale -n1000000
CPUS WALL WALL/CPUS
1 0.426 0.426
2 0.429 0.215
4 0.436 0.109
8 0.452 0.057
16 0.454 0.028
32 0.457 0.014
64 0.459 0.007
128 0.466 0.004
256 0.474 0.002
512 0.518 0.001
Clock Accuracy
==============
A. 4 CPU SMP system
A.1 Old code
kmart:/usr/src/noship-tests # ./cdisp
Gettimeofday() = 1092124757.270305000
CLOCK_REALTIME= 1092124757.270382000 resolution= 0.000976563
CLOCK_MONOTONIC= 89.696726590 resolution= 0.000976563
CLOCK_PROCESS_CPUTIME_ID= 0.001242507 resolution= 0.000000001
CLOCK_THREAD_CPUTIME_ID= 0.001255310 resolution= 0.000000001
A.2 New code
kmart:/usr/src/noship-tests # ./cdisp
Gettimeofday() = 1092124478.194530000
CLOCK_REALTIME= 1092124478.194603399 resolution= 0.000000001
CLOCK_MONOTONIC= 88.198315204 resolution= 0.000000001
CLOCK_PROCESS_CPUTIME_ID= 0.001241235 resolution= 0.000000001
CLOCK_THREAD_CPUTIME_ID= 0.001254747 resolution= 0.000000001
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|