| Age | Commit message | Author |
|
|
|
using the new system call restart infrastructure.
This breaks the compat layer - it really needs to do its own version
of restarting, since the restarting depends on the types.
|
|
This is the generic part of the start of the compatibility syscall
layer. I think I have made it generic enough that each architecture can
define what compatibility means.
To use this, an architecture must create asm/compat.h and provide
typedefs for (currently) 'compat_time_t', 'struct compat_timeval' and
'struct compat_timespec'.
|
|
This changes sys_getppid() to be more POSIX-threading conformant.
sys_getppid() needs to return the PID of the "process' parent" (ie. the
tgid of the parent thread), not the thread parent's PID. The patch has
no effect on non-CLONE_THREAD users, for them current->group_leader ==
current. The effect on CLONE_THREAD threads is that getppid() does not
return any PID within the thread group anymore. Plus if a threaded
application starts up a (non-thread) child then the child sees the
process PID of the parent process, not the thread PID of the parent
thread.
in theory we could introduce the getttid() variant to get to the TID of
the parent thread, but i doubt it would be of any use. (and we can add
it if the need arises.)
The lockless algorithm is still safe because the ->group_leader pointer
never changes asynchronously. (the ->real_parent pointer might still
change asynchronously so the SMP checks are still needed.)
I've also updated the comments (they referenced the nonexistent p_ooptr
field.), plus i've changed the mb() to rmb() - we need to order the
reads, we don't do any global writes that need some predictable ordering.
|
|
Patch from Bill Irwin. It has the potential to break userspace
monitoring tools a little bit, and I'm rather uncertain about
how useful the per-process per-cpu accounting is.
Bill sent this out as an RFC on July 29:
"These statistics severely bloat the task_struct and nothing in
userspace can rely on them as they're conditional on CONFIG_SMP. If
anyone is using them (or just wants them around), please speak up."
And nobody spoke up.
If we apply this, the contents of /proc/783/cpu will go from
cpu 1 1
cpu0 0 0
cpu1 0 0
cpu2 1 1
cpu3 0 0
to
cpu 1 1
And we shall save 256 bytes from the ia32 task_struct.
On my SMP build with NR_CPUS=32:
Without this patch, sizeof(task_struct) is 1824, slab uses a 1-order
allocation and we are getting 2 task_structs per page.
With this patch, sizeof(task_struct) is 1568, slab uses a 2-order
allocation and we are getting 2.5 task_structs per page.
So it seems worthwhile.
(Maybe this highlights a shortcoming in slab. For the 1824-byte case
it could have used a 0-order allocation)
|
|
The timer code is attempting to replicate the softirq characteristics at
the tasklet level, which is a little pointless. This patch converts
timers to be a first-class softirq citizen.
|
|
If two CPUs run mod_timer against the same not-pending timer then they
have no locking relationship. They can both see the timer as
not-pending and they both add the timer to their cpu-local list. The
CPU which gets there second corrupts the first CPU's lists.
This was causing Dave Hansen's 8-way to oops after a couple of minutes
of specweb testing.
I believe that to fix this we need locking which is associated with the
timer itself. The easy fix is hashed spinlocking based on the timer's
address. The hard fix is a lock inside the timer itself.
It is hard because init_timer() becomes compulsory, to initialise that
spinlock. An unknown number of code paths in the kernel just wipe the
timer to all-zeroes and start using it.
I chose the hard way - it is cleaner and more idiomatic. The patch
also adds a "magic number" to the timer so we can detect when a timer
was not correctly initialised. A warning and stack backtrace is
generated and the timer is fixed up. After 16 such warnings the
warning mechanism shuts itself up until a reboot.
It took six patches to my kernel to stop the warnings from coming out.
The uninitialised timers are extremely easy to find and fix. But it
will take some time to weed them all out. Maybe we should go for
the hashed locking...
Note that the new timer->lock means that we can clean up some awkward
"oh we raced, let's try again" code in timer.c. But to do that we'd
also need to take timer->lock in the commonly-called del_timer(), so I
left it as-is.
The lock is not needed in add_timer() because concurrent
add_timer()/add_timer() and concurrent add_timer()/mod_timer() are
illegal.
|
|
Patch from Ravikiran G Thirumalai <kiran@in.ibm.com>
1. Break out disk stats from kernel_stat and move disk stat to blkdev.h
2. Group cpu stat in kernel_stat and make them "per_cpu" instead of
the NR_CPUS array
3. Remove EXPORT_SYMBOL(kstat) from ksyms.c (as I noticed that no module is
using kstat)
|
|
Patch from Dipankar Sarma <dipankar@in.ibm.com>
This patch changes the per-CPU data in timer management (tvec_bases)
to use per_cpu data area and makes it safe for cpu_possible allocation
by using CPU notifiers. End result - saving space.
Depends on cpu_possible patch.
|
|
add_timer_on is like add_timer, except it takes a target CPU on which
to add the timer.
The slab code needs per-cpu timers for shrinking the per-cpu caches.
|
|
This implements a simple hook into the profiling timer for x86 so that
non-perfctr machines can still use oprofile. This has proven useful for
laptops and the like.
It also reduces header dependencies a bit by centralising readprofile
code
|
|
This is my latest timer patchset, it makes del_timer_sync() a bit more
robust wrt. code that re-adds timers from the timer handler.
Other changes in the patch:
- clean up cascading a bit.
- do not save flags in __run_timer_list - we enter from an irqs-enabled
tasklet.
|
|
Comment above getpid() is wrong.
This patch fixes it, and expands the comment to explain why on earth
we have getpid() returning ->tgid and not ->pid.
|
|
I think I have found it and it only hits on a 64 bit machine.
If the timeout is big enough we still need to initialise timer->entry.
Otherwise bad things happen when we hit del_timer.
|
|
This does a number of timer subsystem enhancements:
- simplified timer initialization, now it's the cheapest possible thing:
static inline void init_timer(struct timer_list * timer)
{
timer->base = NULL;
}
since the timer functions already did a !timer->base check this did not
have any effect on their fastpath.
- the rule from now on is that timer->base is set upon activation of the
timer, and cleared upon deactivation. This also made it possible to:
- reorganize all the timer handling code to not assume anything about
timer->entry.next and timer->entry.prev - this also removed lots of
unnecessary cleaning of these fields. Removed lots of unnecessary list
operations from the fastpath.
- simplified del_timer_sync(): it now uses del_timer() plus some simple
synchronization code. Note that this also fixes a bug: if mod_timer (or
add_timer) moves a currently executing timer to another CPU's timer
vector, then del_timer_sync() does not synchronize with the handler
properly.
- bugfix: moved run_local_timers() from scheduler_tick() into
update_process_times() .. scheduler_tick() might be called from the fork
code which will not quite have the intended effect ...
- removed the APIC-timer-IRQ shifting done on SMP, Dipankar Sarma's
testing shows no negative effects.
- cleaned up include/linux/timer.h:
- removed the timer_t typedef, and fixed up kernel/workqueue.c to use
the 'struct timer_list' name instead.
- removed unnecessary includes
- renamed the 'list' field to 'entry' (it's an entry not a list head)
- exchanged the 'function' and 'data' fields. This, besides being
more logical, also unearthed the last few remaining places that
initialized timers by assuming some given field ordering, the patch
also fixes these places. (fs/xfs/pagebuf/page_buf.c,
net/core/profile.c and net/ipv4/inetpeer.c)
- removed the defunct sync_timers(), timer_enter() and timer_exit()
prototypes.
- added docbook-style comments.
- other kernel/timer.c changes:
- base->running_timer does not have to be volatile ...
- added consistent comments to all the important functions.
- made the sync-waiting in del_timer_sync preempt- and lowpower-
friendly.
i've compiled, booted & tested the patched kernel on x86 UP and SMP. I
have tried moderately high networking load as well, to make sure the timer
changes are correct - they appear to be.
|
|
This is the smptimers patch plus the removal of old BHs and a rewrite of
task-queue handling.
Basically with the removal of TIMER_BH i think the time is right to get
rid of old BHs forever, and to do a massive cleanup of all related
fields. The following five basic 'execution context' abstractions are
supported by the kernel:
- hardirq
- softirq
- tasklet
- keventd-driven task-queues
- process contexts
I've done the following cleanups/simplifications to task-queues:
- removed the ability to define your own task-queue, what can be done is
to schedule_task() a given task to keventd, and to flush all pending
tasks.
This is actually a quite easy transition, since 90% of all task-queue
users in the kernel used BH_IMMEDIATE - which is very similar in
functionality to keventd.
I believe task-queues should not be removed from the kernel altogether.
It's true that they were written as a candidate replacement for BHs
originally, but they do make sense in a different way: it's perhaps the
easiest interface to do deferred processing from IRQ context, in
performance-uncritical code areas. They are easier to use than
tasklets.
code that cares about performance should convert to tasklets - as the
timer code and the serial subsystem have done already. For extreme
performance softirqs should be used - the net subsystem does this.
and we can do this for 2.6 - there are only a couple of areas left after
fixing all the BH_IMMEDIATE places.
i have moved all the taskqueue handling code into kernel/context.c, and
only kept the basic 'queue a task' definitions in include/linux/tqueue.h.
I've converted three of the most commonly used BH_IMMEDIATE users:
tty_io.c, floppy.c and random.c. [random.c might need more thought
though.]
i've also cleaned up kernel/timer.c over that of the stock smptimers
patch: privatized the timer-vec definitions (nothing needs it,
init_timer() used it mistakenly) and cleaned up the code. Plus i've moved
some code around that does not belong into timer.c, and within timer.c
i've organized data and functions along functionality and further
separated the base timer code from the NTP bits.
net_bh_lock: i have removed it, since it would synchronize to nothing. The
old protocol handlers should still run on UP, and on SMP the kernel prints
a warning upon use. Alexey, is this approach fine with you?
scalable timers: i've further improved the patch ported to 2.5 by wli and
Dipankar. There is only one pending issue i can see, the question of
whether to migrate timers in mod_timer() or not. I'm quite convinced that
they should be migrated, but i might be wrong. It's a 10 lines change to
switch between migrating and non-migrating timers, we can do performance
tests later on. The current, more complex migration code is pretty fast
and has been stable under extremely high networking loads in the past 2
years, so we can immediately switch to the simpler variant if someone
proves it improves performance. (I'd say if non-migrating timers improve
Apache performance on one of the bigger NUMA boxes then the point is
proven, no further thought will be needed.)
|
|
and does the wrong thing for higher HZ values anyway.
|
|
I've been playing with different HZ values in the 2.4 kernel for a while
now, and apparently Linus also has decided to introduce a USER_HZ
constant (I used CLOCKS_PER_SEC) while raising the HZ value on x86 to
1000.
On x86 timekeeping has been shown to be relatively fragile when raising HZ (OK,
I tried HZ=2048 which is quite high) because of the way the interrupt
timer is configured to fire HZ times each second. This is done by
configuring a divisor in the timer chip (LATCH) which divides a certain
clock (1193180) and makes the chip fire interrupts at the resulting
frequency.
Now comes the catch: NTP requires a clock accuracy of 500 ppm. For some
HZ values the clock is not accurate enough to meet this requirement,
hence NTP won't work well.
One example is HZ=1020, which fails the 500 ppm requirement. In
this case the best approximation is 1019.8 Hz. The xtime.tv_usec value
is raised by 980 each tick, which means that after one
second tv_usec has increased by 999404 (it should be 1000000),
which is an error of 596 ppm.
Some more examples:
HZ Accuracy (ppm)
---- --------------
100 17
1000 151
1024 632
2000 687
2008 343
2011 18
2048 1249
What I've been doing is replace tv_usec by tv_nsec, meaning xtime is now
a timespec instead of a timeval. This allows the accuracy to be
improved by a factor of 1000 for any (well ... any?) HZ value.
Of course all kinds of calculations had to be improved as well. The
ACTHZ constant is introduced to approximate the actual HZ value; it's
used to do some approximations of other related values.
|
|
|
|
I've noticed that xtime_lock and timerlist_lock end up on the same
cacheline all the time (at least on x86). Not a good thing for
loads with high xxx_timer and do_gettimeofday counts I guess (networking etc).
Here's a trivial fix.
|
|
- introduce new type of context-switch locking, this is a must-have for
ia64 and sparc64.
- load_balance() bug noticed by Scott Rhine and myself: scan the
whole list to find imbalance number of tasks, not just the tail
of the list.
- sched_yield() fix: use current->array not rq->active.
|
|
Stop using "struct tms" internally - always use timer ticks (or one of
the sane timeval/timespec types) instead.
Explicitly convert to clock_t when copying to user space for the old
broken interfaces that still use "clock_t".
Clean up and unify jiffies<->timeval conversion.
|
|
This micropatch adds the unlikely() macro to the add_timer() bug check code.
Without this patch, gcc 3.1 does a bad thing: it reorders the printk() into
the middle of the function body.
|
|
Nobody's using it any more, kill:
|
|
This is actually part of the work I've been doing to remove BHs, but it
stands by itself.
|
|
x86-64 needs its own special declaration of jiffies_64.
Prepare for this by moving the jiffies_64 declaration from
kernel/timer.c down into each architecture.
|
|
Looks like sys_sysinfo has not been touched in years. Among other
things, it uses a global cli() for protection; I switched it to an
existing rwlock. I also pulled it out of info.c and stuck it in timer.c
(I chose timer.c because it shares dependencies there already).
The details:
- move sys_sysinfo to kernel/timer.c from kernel/info.c:
why one small syscall got its own file is beyond me.
- delete kernel/info.c
- stop the global cli! now grab a read_lock on xtime_lock.
this is safe as we moved the write_unlock on xtime_lock
down one line to cover the calculating of avenrun.
- trivial code cleanup
|
|
On ia64 MP machines, we use the cycle counter register of each CPU to
obtain fine-grained time-stamps. At boot-time, we synchronize the
counters as close as possible (similar to x86, though with a different
algorithm). But even with this synchronization, there is still a
small (really: tiny) chance that a process bouncing from one CPU to
another could observe time going backwards. To guard against this, I
maintain a global variable called "last_time_offset" which keeps track
of the largest time-interpolation value returned so far. Most of this
is in platform-specific code (arch/ia64/kernel/time.c), but there are
a handful of places in platform-independent code where this variable
needs to be cleared to zero. This is what the patch below does. I
didn't put it inside CONFIG_IA64 because I think this can be useful
for other platforms, too. I suppose I could put it inside CONFIG_SMP
though this would make the code uglier. If you think it's OK, please
apply, otherwise, I'd appreciate your feedback.
|
|
This is William Irwin's algorithmically O(1) version of
count_active_tasks (which is currently O(n) for n total tasks on the
system).
I like it a lot: we become O(1) because now we count uninterruptible
tasks, so we can return (nr_uninterruptible + nr_running). It does not
introduce any overhead or hurt the case for small n, so I have no
complaints.
This copy has a small optimization over the original posting, but is
otherwise the same thing wli posted earlier. I have tested to make sure
this returns accurate results and that the kernel profile improves.
|
|
Ok, here it is. The following arch are not covered:
Mips, Mips64 in 32-bit mode, parisc in __LP64__ mode.
In addition, x86_64 mentions jiffies in the existing script.
This may be a problem.
|
|
This patch (#1) just converts the task_struct to use struct list_head rather
than direct pointers for maintaining the children list.
|
|
|
|
- David Howells: abstract out "current->need_resched" as "need_resched()"
- Frank Davis: ide-tape update for bio
- various: header file fixups
- Jens Axboe: fix up bio/ide/highmem issues
- Kai Germaschewski: ISDN update
- Tim Waugh: parport update
- Patrick Mochel: initcall update
- Greg KH: USB and Compaq PCI hotplug updates
|
|
- Kai Germaschewski: ISDN updates
- Al Viro: start moving buffer cache indexing to "struct block_device *"
- Greg KH: USB update
- Russell King: fix up some ARM merge issues
- Ingo Molnar: scalable scheduler
|
|
- Christoph Hellwig: scsi_register_module cleanup
- Mikael Pettersson: apic.c LVTERR fixes
- Russell King: ARM update (including bio update for icside)
- Jens Axboe: more bio updates
- Al Viro: make ready to switch bread away from kdev_t..
- Davide Libenzi: scheduler cleanups
- Anders Gustafsson: LVM fixes for bio
- Richard Gooch: devfs update
|
|
- various: fix some module exports uncovered by stricter error checking
- Urban Widmark: make smbfs use same error define names as samba and win32
- Greg KH: USB update
- Tom Rini: MPC8xx ppc update
- Matthew Wilcox: rd.c page cache flushing fix
- Richard Gooch: devfs race fix: rwsem for symlinks
- Björn Wesen: Cris arch update
- Nikita Danilov: reiserfs cleanup
- Tim Waugh: parport update
- Peter Rival: update alpha SMP bootup to match wait_init_idle fixes
- Trond Myklebust: lockd/grace period fix
|
|
- remember to increment the version number
- Chris Mason: reiserfs mark_journal_new and bh leak fix
- Richard Gooch: devfs update
- Alexander Viro: further FS cleanup (superblock list)
- David Woodhouse: MTD update
- Kai Germaschewski: ISDN update (stanford checker fixes etc)
- Rich Baum: gcc-3.0 warning fixes
- Jeff Garzik: network driver updates
- Geert Uytterhoeven: m68k fbdev logo merge glitch fix
- Andrea Arcangeli: fix signal return path
- David Miller: Sparc updates
- Johannes Erdfelt: USB update
- Carsten Otte, Andries Brouwer: don't clear blk_size unconditionally
on partition check
- Martin Frey: alpha Sable irq fix
- Paul Mackerras: PPC softirq update
- Patrick Mochel: PCI power management infrastructure
- Robert Siemer: miroSOUND driver update
- Neil Brown: knfsd updates, including ability to export ReiserFS filesystems
- Trond Myklebust: NFS readdir fixup, don't update atime on client
- Andrew Morton: truncate_inode_pages speedup
- Paul Menage: make inode quota count all inodes..
|
|
|