Patch from Julie DeWandel.
This patch has solved the crashes observed during TPC-C runs on the
16-way box. (I'm confident it will fix the other reported cases as
well.)
The race is the setting of timer->base to NULL, by del_timer() or
__run_timers(). If new_base == old_base in __mod_timer() then we do not
re-check timer->base after getting the lock. (the only case where we do
not have to re-check the base is in the !old_base case, but the else
branch also includes the old_base==new_base case.)
The __run_timers() case made the lock_timer() patch not work fully - we
cannot use lock_timer() in __run_timers() due to lock ordering.
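A sketch of the fixed retry logic in __mod_timer() (simplified; base
selection and error paths omitted):

  repeat:
          old_base = timer->base;
          /* ... pick new_base = this CPU's timer vector ... */
          spin_lock_irqsave(&new_base->lock, flags);
          /* re-check even when new_base == old_base: del_timer() or
             __run_timers() may have set timer->base to NULL meanwhile */
          if (timer->base != old_base) {
                  spin_unlock_irqrestore(&new_base->lock, flags);
                  goto repeat;
          }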
|
|
From: Ingo Molnar <mingo@elte.hu>
It unifies the functionality of add_timer() and mod_timer(), and makes any
combination of the timer API calls completely SMP-safe. del_timer() is still
not using the timer lock.
This patch fixes the only timer bug in 2.6 I'm aware of: the del_timer_sync()
+ add_timer() combination in kernel/itimer.c is buggy. This was correct code
in 2.4, because there it was safe to do an add_timer() from the timer handler
itself, parallel to a del_timer_sync().
If we want to make this safe in 2.6 too (which I think we want to) then we
have to make add_timer() almost equivalent to mod_timer(), locking-wise. And
once we are at this point I think it's much cleaner to actually make
add_timer() a variant of mod_timer(). (There's no locking cost for
add_timer(), only the cost of an extra branch. And we've removed another
commonly used function from the icache.)
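Roughly, the result (a sketch, not the verbatim patch):

  void add_timer(struct timer_list *timer)
  {
          BUG_ON(timer_pending(timer));   /* add_timer() on a pending
                                             timer is illegal */
          __mod_timer(timer, timer->expires);
  }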
|
|
From: Albert Cahalan <albert@users.sourceforge.net>
This should improve timekeeping a bit @ 1000 HZ.
|
|
From: Peter Chubb <peterc@gelato.unsw.edu.au>
Currently, do_setitimer() is used in several files, but doesn't appear
in any header. Thus its declaration is repeated in some files, and
its use causes a warning in others (because there is no declaration
present).
This patch:
-- adds a couple of declarations to linux/times.h
-- removes the (now duplicate) declarations from other files.
|
|
In add_timer_internal() we simply leave the timer pending forever if the
expiry is in more than 0xffffffff jiffies. This means more than 48 days on
e.g. ia64 - which is not an unrealistic timeout. IIRC crond is happy to use
extremely large timeouts.
It's better to time out early (if you can call 48 days "early") than to
not time out at all.
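One way to implement the early timeout is to clamp oversized expiries when
the timer is queued - a hedged sketch, the actual patch may differ in
detail:

  /* in add_timer_internal(): */
  if (timer->expires - base->timer_jiffies > 0xffffffffUL)
          timer->expires = base->timer_jiffies + 0xffffffffUL;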
|
|
In general, it is better to use get_cpu_var() and __get_cpu_var()
to access per-cpu variables on this CPU than to use smp_processor_id()
and per_cpu(). In the current default implementation they are equivalent,
but on IA64 the former is already faster, and other archs will follow.
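For example, with a counter declared via DEFINE_PER_CPU(int, foo) (the
name is illustrative), these are equivalent today but the second form lets
the architecture optimise:

  per_cpu(foo, smp_processor_id())++;     /* explicit CPU lookup */
  __get_cpu_var(foo)++;                   /* this CPU directly */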
|
|
From: John Stultz, George Anzinger, Eric Piel
There was confusion over the definition of TICK_USEC. TICK_USEC is
supposed to be based on USER_HZ, however a recent change caused TICK_USEC
to be based on HZ. This broke the adjtimex() interface on systems where
USER_HZ != HZ. This patch reverts the change to TICK_USEC, removes an
added mis-use of the value and fixes some incorrect comments that could
lead to this sort of confusion.
Also this patch resolves the related LTP adjtimex failures.
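For reference, the USER_HZ-based definition this restores looks like the
following (quoted approximately from the timex headers of that era);
TICK_NSEC, by contrast, stays based on the real HZ:

  #define TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)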
|
|
From: george anzinger <george@mvista.com>
This patch addresses issues of roundoff error in the time keeping and NTP
code as follows:
The conversion of "actual jiffies" to TICK_USEC and then to TICK_NSEC
introduced large errors when HZ is not a power of 10 (e.g. 1024 on
ia64). Most of this is avoided by converting directly to TICK_NSEC.
The calculation of MAX_SEC_IN_JIFFIES (the largest timespec or timeval the
kernel will attempt) had overflow problems in the 64-bit machines. We
introduce a different equation for those machines.
The NTP frequency update code was allowing a microsecond of error to
accumulate before applying the correction. We change FINEUSEC to FINENSEC
to do the correction as soon as a full nanosecond has accumulated.
The initial calculation of time_freq for NTP had severe roundoff errors for
HZ not a power of 10 (e.g. 1024). A new equation fixes this.
clock_nanosleep is changed to round up to the next jiffy to cover starting
between jiffies.
|
|
From: george anzinger <george@mvista.com>
This patch does the following:
Pushes down the change from timeval to timespec in the settime routines.
Fixes two places where time was set without updating the monotonic clock
offset. (Changes sys_stime() to call do_settimeofday() and changes
clock_warp to do the update directly.) These were bugs!
Changes the uptime code to use the posix_clock_monotonic notion of uptime
instead of jiffies. This time will track NTP changes and so should be
better than your standard wristwatch (if you're using NTP).
Changes posix_clock_monotonic to start at 0 on boot (was set to start at
initial jiffies).
Fixes a bug (never experienced) in timer_create() in posix-timers.c where
we "could" have released timer_id 0 if "id resources" were low.
Adds a test in do_settimeofday() to reject (EINVAL) attempts to use
unnormalized times. This is passed back up to both settimeofday() and
posix_setclock().
Warning: Requires changes in .../arch/???/kernel/time.c to change
do_settimeofday() to return an error if time is not normalized and to use a
timespec instead of timeval for its input.
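The normalization test amounts to something like this (a sketch; each
architecture's do_settimeofday() needs the same check and the timespec
argument):

  int do_settimeofday(struct timespec *tv)
  {
          if (tv->tv_nsec < 0 || tv->tv_nsec >= NSEC_PER_SEC)
                  return -EINVAL;         /* unnormalized time */
          /* ... set xtime and update the monotonic clock offset ... */
          return 0;
  }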
|
|
Trivial patch: when these were introduced cpu.h didn't exist.
|
|
- Add comment about slab ctor behaviour (Ingo Oeser)
- mm/slab.c:fprob() shows up in profiles a lot. Rename it to something more
meaningful.
- fatfs printk warning fix (Randy Dunlap)
- give the time interpolator list and lock file-static scope (hch)
|
|
From: Christoph Hellwig <hch@lst.de>
- don't add one level of indentation when taking a lock
- remove useless ti_global struct
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Basically, what the patch does is provide two hooks such that platforms
(and subplatforms) can provide time-interpolation in a way that guarantees
that two causally related gettimeofday() calls will never see time going
backwards (unless there is a settimeofday() call, of course).
There is some evidence that the current scheme does work: we use it on ia64
both for cycle-counter-based interpolation and the SGI folks use it with a
chipset-based high-performance counter.
It seems like enough platforms do this sort of thing to provide _some_
support in the core, especially because it's rather tricky to guarantee
that time never goes backwards (short of a settimeofday, of course).
This patch is based on something Jes Sorensen wrote for the SGI Itanium 2
platform (which has a chipset-internal high-res clock). I adapted it so it
can be used for cycle-counter interpolation also. The net effect is that
"last_time_offset" can be removed completely from the kernel.
The basic idea behind the patch is simply: every time you advance xtime by
N nanoseconds, you call update_wall_time_hook(NSEC). Every time the time
gets set (i.e., discontinuity is OK), reset_wall_time_hook() is called.
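In other words, the call sites look roughly like this (a sketch; the hook
names are the ones quoted above):

  /* in update_wall_time(), after advancing xtime by 'nsec': */
  update_wall_time_hook(nsec);    /* time must only move forward */

  /* in do_settimeofday(), where a discontinuity is allowed: */
  reset_wall_time_hook();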
|
|
Don't depend on undefined preprocessor symbols evaluating to zero.
|
|
The POSIX CLOCK_MONOTONIC currently has only 1/HZ resolution. Further, it is
tied to jiffies (i.e. is a restatement of jiffies) rather than "xtime" or the
gettimeofday() clock.
This patch changes CLOCK_MONOTONIC to be a restatement of gettimeofday() plus
an offset that removes any clock-setting activity from CLOCK_MONOTONIC. The
offset represents the difference between CLOCK_MONOTONIC and
gettimeofday(), and is updated whenever the gettimeofday() clock is
set, to back the clock-setting change out of CLOCK_MONOTONIC (which, by the
standard, cannot be set).
With this change CLOCK_REALTIME (a direct restatement of gettimeofday()),
CLOCK_MONOTONIC and gettimeofday() will all tick at the same time and at
the same rate, and all will be affected by NTP adjustments (save those which
actually set the time).
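A minimal sketch of the mechanism, assuming a wall_to_monotonic offset that
is adjusted on every clock set (get_wall_time() is a hypothetical xtime
snapshot helper):

  void clock_get_monotonic(struct timespec *ts)
  {
          get_wall_time(ts);
          ts->tv_sec += wall_to_monotonic.tv_sec;
          ts->tv_nsec += wall_to_monotonic.tv_nsec;
          if (ts->tv_nsec >= NSEC_PER_SEC) {      /* renormalize */
                  ts->tv_nsec -= NSEC_PER_SEC;
                  ts->tv_sec++;
          }
  }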
|
|
Noted by David Mosberger:
"If someone happens to arm a periodic timer at exactly 256 jiffies (as
ohci happens to do on platforms with HZ=1024), then you end up getting
an endless loop of timer activations, causing a machine hang.
The problem is that __run_timers updates base->timer_jiffies _before_
running the callback routines. If a callback re-arms the timer at
exactly 256 jiffies, add_timer() will reinsert the timer into the list
that we're currently processing, which of course will cause the timer to
expire immediately again, etc., etc., ad nauseam..."
The answer here is to move the whole expired list onto a local list head and
to not look back.
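Roughly, as a sketch using the existing list helpers (simplified; base->lock
handling and the tv1/index naming are glossed over):

  LIST_HEAD(head);
  struct timer_list *timer;

  list_splice_init(base->tv1.vec + index, &head);
  while (!list_empty(&head)) {
          timer = list_entry(head.next, struct timer_list, entry);
          list_del(&timer->entry);
          timer->function(timer->data);   /* re-arms go into the live
                                             wheel, never into 'head' */
  }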
|
|
|
|
swap.h is basically the header for MM internals rather than the
public API (mm_internal.h would have been a better name...). Stop
including it in mm.h - this only needs moving one function that
should be in swap.h anyway to the right place, and fixing up a bunch
of places that use it.
|
|
From: george anzinger <george@mvista.com>
The recently-added code which avoids a lockup when a timer handler re-adds
the timer to expire immediately can be simplified.
If we change __run_timers() to increment base->timer_jiffies _before_ running
the timers, then any re-additions will not be inserted in the list which
__run_timers is presently walking.
|
|
From: george anzinger <george@mvista.com>
Remove the `index' field from the timer structures. It contains the same
info as the timer_jiffies field.
So just use the base->timer_jiffies field directly.
|
|
From: Tim Schmielau <tim@physik3.uni-rostock.de>
Fixes the problem wherein nanosleep() is sleeping for the wrong duration.
When starting out with timer_jiffies=0, the timer cascade is (unnecessarily)
triggered on the first timer interrupt, incrementing all the higher indices.
When starting with any other initial jiffies value, we miss that and end up
with all higher indices being off by one.
|
|
This is a forward-port of Andrea's fix in 2.4.
If a timer handler re-adds a timer to go off right now, __run_timers() will
never terminate. (I wrote a test. It happens.)
Fix that up by teaching internal_add_timer() to detect when it is being
called from within the context of __run_timers() and to park newly-added
timers onto a temp list instead. These timers are then added for real by
__run_timers(), after it has finished processing all pending timers.
|
|
- Use list_head functions rather than open-coding them
- Use time comparison macros rather than open-coding them (example below)
- Hide some ifdefs
- uninline internal_add_timer(). Saves half a kilobyte of text.
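For the second item, the conversion looks like this (expire() is just a
placeholder):

  /* open-coded, wrap-safe comparison: */
  if ((long)(jiffies - timeout) >= 0)
          expire();
  /* the same test with the helper macro: */
  if (time_after_eq(jiffies, timeout))
          expire();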
|
|
From: Tim Schmielau <tim@physik3.uni-rostock.de>
Force jiffies to start out at five minutes before wrap, to find
jiffy-wrapping bugs.
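The definition this amounts to (as it later appears in the jiffies header;
300*HZ is five minutes' worth of ticks):

  #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

  unsigned long volatile jiffies = INITIAL_JIFFIES;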
|
|
x86-64 vsyscalls require mapping the sequence number used by
gettimeofday in a magic way, so that userland can access it via
vsyscalls for user space time-of-day access.
Instead of putting the magic into generic code I just made it possible to
move it into architecture-specific files.
|
|
This is version 23 or so of the POSIX timer code.
Internal changelog:
- Changed the signals code to match the new order of things. Also the
new xtime_lock code needed to be picked up. It made some things a lot
simpler.
- Fixed a spin lock hand off problem in locking timers (thanks
to Randy).
- Fixed nanosleep to test for out of bound nanoseconds
(thanks to Julie).
- Fixed a couple of id deallocation bugs that left old ids
lying around (hey, I get this one).
- This version has a new timer id manager. Andrew Morton
suggested elimination of recursion (done) and I added code
to allow it to release unused nodes. The prior version only
released the leaf nodes. (The id manager uses radix tree
type nodes.) Also added is a reuse count so ids will not
repeat for at least 256 alloc/free cycles.
- The changes for the new sys_call restart now allow one
restart function to handle both nanosleep and clock_nanosleep.
Saves a bit of code, nice.
- All the requested changes and Lindent too :).
- I also broke clock_nanosleep() apart much the same way
nanosleep() was with the 2.5.50-bk5 changes.
TIMER STORMS
The POSIX clocks and timers code prevents "timer storms" by
not putting repeating timers back in the timer list until
the signal is delivered for the prior expiry. Timer events
missed by this delay are accounted for in the timer overrun
count. The net result is MUCH lower system overhead while
presenting the same info to the user as would be the case if
an interrupt and timer processing were required for each
increment in the overrun count.
|
|
Add "seqlock" infrastructure for doing low-overhead optimistic reader
locks (writer increments a sequence number, reader verifies that no
writers came in during the critical region, and lots of careful memory
barriers to take care of business).
Make xtime/get_jiffies_64() use this new locking.
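Typical usage of the new primitives, writer and reader side (sketch):

  seqlock_t xtime_lock = SEQLOCK_UNLOCKED;
  unsigned long seq;

  /* writer: */
  write_seqlock_irq(&xtime_lock);
  /* ... update xtime, jiffies_64 ... */
  write_sequnlock_irq(&xtime_lock);

  /* reader: loops only if a writer slipped in */
  do {
          seq = read_seqbegin(&xtime_lock);
          /* ... snapshot xtime ... */
  } while (read_seqretry(&xtime_lock, seq));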
|
|
Use 64 bit jiffies for reporting uptime.
|
|
|
|
using the new system call restart infrastructure.
This breaks the compat layer - it really needs to do its own version
of restarting, since the restarting depends on the types.
|
|
This is the generic part of the start of the compatibility syscall
layer. I think I have made it generic enough that each architecture can
define what compatibility means.
To use this, an architecture must create asm/compat.h and provide
typedefs for (currently) 'compat_time_t', 'struct compat_timeval' and
'struct compat_timespec'.
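A sketch of what a minimal asm/compat.h provides - the field layout has to
match the 32-bit ABI being emulated:

  typedef s32 compat_time_t;

  struct compat_timeval {
          compat_time_t   tv_sec;
          s32             tv_usec;
  };

  struct compat_timespec {
          compat_time_t   tv_sec;
          s32             tv_nsec;
  };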
|
|
This changes sys_getppid() to be more POSIX-threading conformant.
sys_getppid() needs to return the PID of the "process' parent" (i.e. the
tgid of the parent thread), not the thread parent's PID. The patch has
no effect on non-CLONE_THREAD users, for them current->group_leader ==
current. The effect on CLONE_THREAD threads is that getppid() does not
return any PID within the thread group anymore. Plus if a threaded
application starts up a (non-thread) child then the child sees the
process PID of the parent process, not the thread PID of the parent
thread.
In theory we could introduce the getttid() variant to get the TID of
the parent thread, but I doubt it would be of any use. (And we can add
it if the need arises.)
The lockless algorithm is still safe because the ->group_leader pointer
never changes asynchronously. (The ->real_parent pointer might still
change asynchronously, so the SMP checks are still needed.)
I've also updated the comments (they referenced the nonexistent p_ooptr
field), plus I've changed the mb() to rmb() - we need to order the
reads; we don't do any global writes that need predictable ordering.
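The resulting semantics, stripped of the SMP re-read loop, amount to:

  asmlinkage long sys_getppid(void)
  {
          return current->group_leader->real_parent->tgid;
  }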
|
|
Patch from Bill Irwin. It has the potential to break userspace
monitoring tools a little bit, and I'm rather uncertain about
how useful the per-process per-cpu accounting is.
Bill sent this out as an RFC on July 29:
"These statistics severely bloat the task_struct and nothing in
userspace can rely on them as they're conditional on CONFIG_SMP. If
anyone is using them (or just wants them around), please speak up."
And nobody spoke up.
If we apply this, the contents of /proc/783/cpu will go from
cpu 1 1
cpu0 0 0
cpu1 0 0
cpu2 1 1
cpu3 0 0
to
cpu 1 1
And we shall save 256 bytes from the ia32 task_struct.
On my SMP build with NR_CPUS=32:
Without this patch, sizeof(task_struct) is 1824, slab uses a 1-order
allocation and we are getting 2 task_structs per page.
With this patch, sizeof(task_struct) is 1568, slab uses a 2-order
allocation and we are getting 2.5 task_structs per page.
So it seems worthwhile.
(Maybe this highlights a shortcoming in slab. For the 1824-byte case
it could have used a 0-order allocation)
|
|
The timer code is attempting to replicate the softirq characteristics at
the tasklet level, which is a little pointless. This patch converts
timers to be a first-class softirq citizen.
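Concretely, the timer tick registers as its own softirq instead of a
tasklet - a sketch using the softirq interfaces:

  static void run_timer_softirq(struct softirq_action *a)
  {
          /* ... __run_timers() on this CPU's base ... */
  }

  void __init init_timers(void)
  {
          open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
  }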
|
|
If two CPUs run mod_timer against the same not-pending timer then they
have no locking relationship. They can both see the timer as
not-pending and they both add the timer to their cpu-local list. The
CPU which gets there second corrupts the first CPU's lists.
This was causing Dave Hansen's 8-way to oops after a couple of minutes
of specweb testing.
I believe that to fix this we need locking which is associated with the
timer itself. The easy fix is hashed spinlocking based on the timer's
address. The hard fix is a lock inside the timer itself.
It is hard because init_timer() becomes compulsory, to initialise that
spinlock. An unknown number of code paths in the kernel just wipe the
timer to all-zeroes and start using it.
I chose the hard way - it is cleaner and more idiomatic. The patch
also adds a "magic number" to the timer so we can detect when a timer
was not correctly initialised. A warning and stack backtrace are
generated and the timer is fixed up. After 16 such warnings the
warning mechanism shuts itself up until a reboot.
It took six patches to my kernel to stop the warnings from coming out.
The uninitialised timers are extremely easy to find and fix. But it
will take some time to weed them all out. Maybe we should go for
the hashed locking...
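The warn-and-fix-up path looks roughly like this (a sketch; TIMER_MAGIC is
the new magic value):

  static void check_timer(struct timer_list *timer)
  {
          if (timer->magic != TIMER_MAGIC) {
                  static int fixups = 16;

                  if (fixups) {
                          fixups--;
                          printk(KERN_WARNING "uninitialised timer!\n");
                          dump_stack();
                  }
                  init_timer(timer);      /* fix it up and carry on */
          }
  }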
Note that the new timer->lock means that we can clean up some awkward
"oh we raced, let's try again" code in timer.c. But to do that we'd
also need to take timer->lock in the commonly-called del_timer(), so I
left it as-is.
The lock is not needed in add_timer() because concurrent
add_timer()/add_timer() and concurrent add_timer()/mod_timer() are
illegal.
|
|
Patch from Ravikiran G Thirumalai <kiran@in.ibm.com>
1. Break out disk stats from kernel_stat and move disk stat to blkdev.h
2. Group cpu stat in kernel_stat and make them "per_cpu" instead of
the NR_CPUS array
3. Remove EXPORT_SYMBOL(kstat) from ksyms.c (as I noticed that no module is
using kstat)
|
|
Patch from Dipankar Sarma <dipankar@in.ibm.com>
This patch changes the per-CPU data in timer management (tvec_bases)
to use per_cpu data area and makes it safe for cpu_possible allocation
by using CPU notifiers. End result - saving space.
Depends on cpu_possible patch.
|
|
add_timer_on is like add_timer, except it takes a target CPU on which
to add the timer.
The slab code needs per-cpu timers for shrinking the per-cpu caches.
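Usage mirrors add_timer() with an explicit target (sketch; 'reap_timer' is
an illustrative per-cpu timer, not the actual slab name):

  void add_timer_on(struct timer_list *timer, int cpu);

  /* e.g. arming each online CPU's cache-shrink timer: */
  for (cpu = 0; cpu < NR_CPUS; cpu++)
          if (cpu_online(cpu))
                  add_timer_on(&per_cpu(reap_timer, cpu), cpu);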
|
|
This implements a simple hook into the profiling timer for x86 so that
non-perfctr machines can still use oprofile. This has proven useful for
laptops and the like.
It also reduces header dependencies a bit by centralising the readprofile
code.
|
|
This is my latest timer patchset, it makes del_timer_sync() a bit more
robust wrt. code that re-adds timers from the timer handler.
Other changes in the patch:
- clean up cascading a bit.
- do not save flags in __run_timer_list - we enter from an irqs-enabled
tasklet.
|
|
Comment above getpid() is wrong.
This patch fixes it, and expands the comment to explain why on earth
we have getpid() returning ->tgid and not ->pid.
|
|
I think I have found it and it only hits on a 64 bit machine.
If the timeout is big enough we still need to initialise timer->entry.
Otherwise bad things happen when we hit del_timer().
|
|
This does a number of timer subsystem enhancements:
- simplified timer initialization, now it's the cheapest possible thing:
  static inline void init_timer(struct timer_list *timer)
  {
          timer->base = NULL;
  }
since the timer functions already did a !timer->base check this did not
have any effect on their fastpath.
- the rule from now on is that timer->base is set upon activation of the
timer, and cleared upon deactivation. This also made it possible to:
- reorganize all the timer handling code to not assume anything about
timer->entry.next and timer->entry.prev - this also removed lots of
unnecessary cleaning of these fields. Removed lots of unnecessary list
operations from the fastpath.
- simplified del_timer_sync(): it now uses del_timer() plus some simple
synchronization code (sketched below). Note that this also fixes a bug: if mod_timer (or
add_timer) moves a currently executing timer to another CPU's timer
vector, then del_timer_sync() does not synchronize with the handler
properly.
- bugfix: moved run_local_timers() from scheduler_tick() into
update_process_times(). scheduler_tick() might be called from the fork
code, which would not quite have the intended effect ...
- removed the APIC-timer-IRQ shifting done on SMP, Dipankar Sarma's
testing shows no negative effects.
- cleaned up include/linux/timer.h:
- removed the timer_t typedef, and fixed up kernel/workqueue.c to use
the 'struct timer_list' name instead.
- removed unnecessary includes
- renamed the 'list' field to 'entry' (it's an entry not a list head)
- exchanged the 'function' and 'data' fields. This, besides being
more logical, also unearthed the last few remaining places that
initialized timers by assuming some given field ordering, the patch
also fixes these places. (fs/xfs/pagebuf/page_buf.c,
net/core/profile.c and net/ipv4/inetpeer.c)
- removed the defunct sync_timers(), timer_enter() and timer_exit()
prototypes.
- added docbook-style comments.
- other kernel/timer.c changes:
- base->running_timer does not have to be volatile ...
- added consistent comments to all the important functions.
- made the sync-waiting in del_timer_sync preempt- and lowpower-
friendly.
I've compiled, booted & tested the patched kernel on x86 UP and SMP. I
have tried moderately high networking load as well, to make sure the timer
changes are correct - they appear to be.
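The simplified del_timer_sync() amounts to roughly the following;
timer_running_anywhere() stands in for the real per-base ->running_timer
checks:

  void del_timer_sync(struct timer_list *timer)
  {
          for (;;) {
                  del_timer(timer);
                  if (!timer_running_anywhere(timer))
                          break;
                  cpu_relax();    /* the preempt- and lowpower-
                                     friendly wait */
          }
  }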
|
|
This is the smptimers patch plus the removal of old BHs and a rewrite of
task-queue handling.
Basically with the removal of TIMER_BH I think the time is right to get
rid of old BHs forever, and to do a massive cleanup of all related
fields. The following five basic 'execution context' abstractions are
supported by the kernel:
- hardirq
- softirq
- tasklet
- keventd-driven task-queues
- process contexts
I've done the following cleanups/simplifications to task-queues:
- removed the ability to define your own task-queue, what can be done is
to schedule_task() a given task to keventd, and to flush all pending
tasks.
This is actually a quite easy transition, since 90% of all task-queue
users in the kernel used BH_IMMEDIATE - which is very similar in
functionality to keventd.
I believe task-queues should not be removed from the kernel altogether.
It's true that they were written as a candidate replacement for BHs
originally, but they do make sense in a different way: it's perhaps the
easiest interface to do deferred processing from IRQ context, in
performance-uncritical code areas. They are easier to use than
tasklets.
Code that cares about performance should convert to tasklets - as the
timer code and the serial subsystem have done already. For extreme
performance softirqs should be used - the net subsystem does this.
And we can do this for 2.6 - there are only a couple of areas left after
fixing all the BH_IMMEDIATE places.
I have moved all the taskqueue handling code into kernel/context.c, and
only kept the basic 'queue a task' definitions in include/linux/tqueue.h.
I've converted three of the most commonly used BH_IMMEDIATE users:
tty_io.c, floppy.c and random.c. [random.c might need more thought
though.]
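A converted BH_IMMEDIATE user ends up looking roughly like this (the my_*
names are illustrative):

  static void my_deferred_work(void *data)
  {
          /* runs later, in keventd's process context */
  }

  static struct tq_struct my_task = {
          .routine = my_deferred_work,
  };

  static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
  {
          schedule_task(&my_task);        /* instead of marking BH_IMMEDIATE */
  }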
I've also cleaned up kernel/timer.c over that of the stock smptimers
patch: privatized the timer-vec definitions (nothing needs it,
init_timer() used it mistakenly) and cleaned up the code. Plus i've moved
some code around that does not belong into timer.c, and within timer.c
i've organized data and functions along functionality and further
separated the base timer code from the NTP bits.
net_bh_lock: I have removed it, since it would synchronize to nothing. The
old protocol handlers should still run on UP, and on SMP the kernel prints
a warning upon use. Alexey, is this approach fine with you?
scalable timers: I've further improved the patch ported to 2.5 by wli and
Dipankar. There is only one pending issue I can see, the question of
whether to migrate timers in mod_timer() or not. I'm quite convinced that
they should be migrated, but I might be wrong. It's a 10-line change to
switch between migrating and non-migrating timers, we can do performance
tests later on. The current, more complex migration code is pretty fast
and has been stable under extremely high networking loads in the past 2
years, so we can immediately switch to the simpler variant if someone
proves it improves performance. (I'd say if non-migrating timers improve
Apache performance on one of the bigger NUMA boxes then the point is
proven, no further thought will be needed.)
|
|
and does the wrong thing for higher HZ values anyway.
|
|
I've been playing with different HZ values in the 2.4 kernel for a while
now, and apparently Linus also has decided to introduce a USER_HZ
constant (I used CLOCKS_PER_SEC) while raising the HZ value on x86 to
1000.
On x86 timekeeping has shown to be relatively fragile when raising HZ (OK,
I tried HZ=2048 which is quite high) because of the way the interrupt
timer is configured to fire HZ times each second. This is done by
configuring a divisor in the timer chip (LATCH) which divides a fixed
input clock (1193180 Hz) and makes the chip fire interrupts at the
resulting frequency.
Now comes the catch: NTP requires a clock accuracy of 500 ppm. For some
HZ values the clock is not accurate enough to meet this requirement,
hence NTP won't work well.
An example HZ value is 1020 which exceeds the 500 ppm requirement. In
this case the best approximation is 1019.8 Hz. The xtime.tv_usec value
is raised by 980 each tick, which means that after one
second the tv_usec value has increased by 999404 (it should be 1000000),
which is an accuracy of 596 ppm.
Some more examples:
HZ Accuracy (ppm)
---- --------------
100 17
1000 151
1024 632
2000 687
2008 343
2011 18
2048 1249
What I've been doing is replacing tv_usec by tv_nsec, meaning xtime is now
a timespec instead of a timeval. This allows the accuracy to be
improved by a factor of 1000 for any (well ... any?) HZ value.
Of course all kinds of calculations had to be adjusted as well. The
ACTHZ constant is introduced to approximate the actual HZ value; it's
used to do some approximations of other related values.
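The constants involved, approximately as they end up in the headers (LATCH
is the divisor programmed into the PIT, ACTHZ the interrupt rate that
divisor actually yields, as a fixed-point value shifted left by 8):

  #define CLOCK_TICK_RATE 1193180                         /* PIT input clock, in Hz */
  #define LATCH   ((CLOCK_TICK_RATE + HZ/2) / HZ)         /* rounded divisor */
  #define ACTHZ   (SH_DIV(CLOCK_TICK_RATE, LATCH, 8))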
|
|
|
|
I've noticed that xtime_lock and timerlist_lock end up on the same
cacheline all the time (at least on x86). Not a good thing for
loads with high xxx_timer and do_gettimeofday counts I guess (networking etc).
Here's a trivial fix.
|
|
- introduce new type of context-switch locking, this is a must-have for
ia64 and sparc64.
- load_balance() bug noticed by Scott Rhine and myself: scan the
whole list to find imbalance number of tasks, not just the tail
of the list.
- sched_yield() fix: use current->array not rq->active.
|
|
Stop using "struct tms" internally - always use timer ticks (or one of
the sane timeval/timespec types) instead.
Explicitly convert to clock_t when copying to user space for the old
broken interfaces that still use "clock_t".
Clean up and unify jiffies<->timeval conversion.
|