| Age | Commit message (Collapse) | Author |
|
FASTCALL() is always expanded to empty, remove it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
> I remember I talked with Arjan about this time ago. Basically, since 1)
> you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
> are used, you can see a wake_up() from inside another wake_up(), but they
> will never refer to the same lock instance.
> Think about:
>
> dfd = socket(...);
> efd1 = epoll_create();
> efd2 = epoll_create();
> epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
> epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
>
> When a packet arrives to the device underneath "dfd", the net code will
> issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
> callback wakeup entry on that queue, and the wake_up() performed by the
> "dfd" net code will end up in ep_poll_callback(). At this point epoll
> (efd1) notices that it may have some event ready, so it needs to wake up
> the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
> that ends up in another wake_up(), after having checked about the
> recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to
> avoid stack blasting. Never hit the same queue, to avoid loops like:
>
> epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
> epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
> epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
>
> The code "if (tncur->wq == wq || ..." prevents re-entering the same
> queue/lock.
Since the epoll code is very careful to not nest same instance locks
allow the recursion.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
|
|
Also move wake_up_locked() to be with the related functions
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
|
|
clean up the sleep_on() APIs:
- do not use fastcall
- replace fragile macro magic with proper inline functions
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
kernel: INFO: trying to register non-static key.
kernel: the code is fine but needs lockdep annotation.
kernel: turning off the locking correctness validator.
kernel: [<c04051ed>] show_trace_log_lvl+0x58/0x16a
kernel: [<c04057fa>] show_trace+0xd/0x10
kernel: [<c0405913>] dump_stack+0x19/0x1b
kernel: [<c043b1e2>] __lock_acquire+0xf0/0x90d
kernel: [<c043bf70>] lock_acquire+0x4b/0x6b
kernel: [<c061472f>] _spin_lock_irqsave+0x22/0x32
kernel: [<c04363d3>] prepare_to_wait+0x17/0x4b
kernel: [<f89a24b6>] lpfc_do_work+0xdd/0xcc2 [lpfc]
kernel: [<c04361b9>] kthread+0xc3/0xf2
kernel: [<c0402005>] kernel_thread_helper+0x5/0xb
Another case of non-static lockdep keys; duplicate the paradigm set by
DECLARE_COMPLETION_ONSTACK and introduce DECLARE_WAIT_QUEUE_HEAD_ONSTACK.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Greg KH <gregkh@suse.de>
Cc: Markus Lidel <markus.lidel@shadowconnect.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
allyesconfig vmlinux size delta:
text data bss dec filename
20736884 6073834 3075176 29885894 vmlinux.before
20721009 6073966 3075176 29870151 vmlinux.after
~18 bytes per callsite, 15K of text size (~0.1%) saved.
(as an added bonus this also removes a lockdep annotation.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Create one lock class for all waitqueue locks in the kernel. Has no effect on
non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Locking init improvement:
- introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
to pass in the name string of locks, used by debugging
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
|
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch. This should now allow not to include sched.h
from module.h, which is done by a followup patch.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In the upcoming aio_down patch, it is useful to store a private data
pointer in the kiocb's wait_queue. Since we provide our own wake up
function and do not require the task_struct pointer, it makes sense to
convert the task pointer into a generic private pointer.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use LIST_HEAD_INIT rather than doing it by hand in DEFINE_WAIT.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Unify the spinlock initialization as far as possible.
Signed-off-by: Amit Gud <gud@eth.net>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
New kernel-doc comments for might_sleep & wait_event_*
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Addresses bug #3863, from <daveh@dmh2000.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
no user in sight
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Some of the parameters to __wait_on_bit() and __wait_on_bit_lock() are
redundant, as the wait_bit_queue parameter holds the flags word and the bit
number. This patch updates __wait_on_bit() and __wait_on_bit_lock() to
fetch that information from the wait_bit_queue passed to them and so reduce
the number of parameters so that -mregparm may be more effective.
Incremental atop the complete out-of-lining of the contention cases and the
fastcall and wait_on_bit_lock()/test_and_set_bit() fixes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Move the slow paths of wait_on_bit() and wait_on_bit_lock() out of line.
Also uninline wake_up_bit() to reduce the number of callsites generated,
and adjust loop startup in __wait_on_bit_lock() to properly reflect its
usage in the contention case.
Incremental atop the fastcall and wait_on_bit_lock()/test_and_set_bit()
fixes. Successfully tested on x86-64.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Eliminate the bh waitqueue hashtable using bit_waitqueue() via
wait_on_bit() and wake_up_bit() to locate the waitqueue head associated
with a bit.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Consolidate bit waiting code patterns for page waitqueues using
__wait_on_bit() and __wait_on_bit_lock().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Eliminate specialized page and bh waitqueue hashing structures in favor of
a standardized structure, using wake_up_bit() to wake waiters using the
standardized wait_bit_key structure.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There appears to be one case missing from the wait_event() family - the
uninterruptible timeout wait. The following patch adds this.
This wait is particularly useful when (eg) you wish to pass work off to an
interrupt handler to perform, but also want to know if the hardware has
decided to go gaga under you. You don't want to sit around waiting for
something that'll never happen - you want to go and wack the gremlin which
caused the failure and retry.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a new system call `waitid'. This is a new POSIX call that
subsumes the rest of the wait* family and can do some things the older
calls cannot. A minor addition is the ability to select what kinds of
status to check for with a mask of independent bits, so you can wait for
just stops and not terminations, for example. A more significant
improvement is the WNOWAIT flag, which allows for polling child status
without reaping. This interface fills in a siginfo_t with the same details
that a SIGCHLD for the status change has; some of that info (e.g. si_uid)
is not available via wait4 or other calls.
I've added a new system call that has the parameter conventions of the
POSIX function because that seems like the cleanest thing. This patch
includes the actual system call table additions for i386 and x86-64; other
architectures will need to assign the system call number, and 64-bit ones
may need to implement 32-bit compat support for it as I did for x86-64.
The new features could instead be provided by some new kludge inventions in
the wait4 system call interface (that's what BSD did). If kludges are
preferable to adding a system call, I can work up something different.
I added a struct rusage field si_rusage to siginfo_t in the SIGCHLD case
(this does not affect the size or layout of the struct). This is not part
of the POSIX interface, but it makes it so that `waitid' subsumes all the
functionality of `wait4'. Future kernel ABIs (new arch's or whatnot) can
have only the `waitid' system call and the rest of the wait* family
including wait3 and wait4 can be implemented in user space using waitid.
There is nothing in user space as yet that would make use of the new field.
Most of the new functionality is implemented purely in the waitid system
call itself. POSIX also provides for the WCONTINUED flag to report when a
child process had been stopped by job control and then resumed with
SIGCONT. Corresponding to this, a SIGCHLD is now generated when a child
resumes (unless SA_NOCLDSTOP is set), with the value CLD_CONTINUED in
siginfo_t.si_code. To implement this, some additional bookkeeping is
required in the signal code handling job control stops.
The motivation for this work is to make it possible to implement the POSIX
semantics of the `waitid' function in glibc completely and correctly. If
changing either the system call interface used to accomplish that, or any
details of the kernel implementation work, would improve the chances of
getting this incorporated, I am more than happy to work through any issues.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Daniel McNeil <daniel@osdl.org>
From: Chris Mason <mason@suse.com>
AIO: retry infrastructure fixes and enhancements
Reorganises, comments and fixes the AIO retry logic. Fixes
and enhancements include:
- Split iocb setup and execution in io_submit
(also fixes io_submit error reporting)
- Use aio workqueue instead of keventd for retries
- Default high level retry methods
- Subtle use_mm/unuse_mm fix
- Code commenting
- Fix aio process hang on EINVAL (Daniel McNeil)
- Hold the context lock across unuse_mm
- Acquire task_lock in use_mm()
- Allow fops to override the retry method with their own
- Elevated ref count for AIO retries (Daniel McNeil)
- set_fs needed when calling use_mm
- Flush workqueue on __put_ioctx (Chris Mason)
- Fix io_cancel to work with retries (Chris Mason)
- Read-immediate option for socket/pipe retry support
Note on default high-level retry methods support
================================================
High-level retry methods allows an AIO request to be executed as a series of
non-blocking iterations, where each iteration retries the remaining part of
the request from where the last iteration left off, by reissuing the
corresponding AIO fop routine with modified arguments representing the
remaining I/O. The retries are "kicked" via the AIO waitqueue callback
aio_wake_function() which replaces the default wait queue entry used for
blocking waits.
The high level retry infrastructure is responsible for running the
iterations in the mm context (address space) of the caller, and ensures that
only one retry instance is active at a given time, thus relieving the fops
themselves from having to deal with potential races of that sort.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use the more SMP-friendly prepare_to_wait()/finish_wait() in wait_event() and
friends.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch defines a macro that does exactly what
wait_event_interruptible() does except that it adds the current task to the
wait queue as an exclusive task (i.e., sets the WQ_FLAG_EXCLUSIVE flag)
rather than as a non-exclusive task as wait_event_interruptible() does.
This allows one to do a wake_up_nr() to wake up a specific number of tasks.
I'm in the process of submitting a patch to linux-ia64 that requires this
capability. (Its subject line is "[PATCH 3/4] SGI Altix cross partition
functionality".)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This patch provides an additional argument to __wake_up_common() so that the
information wakefunc.patch made waiters ready to receive may be passed to them
by wakers. This is provided as a separate patch so that the overhead of the
additional argument to __wake_up_common() can be measured in isolation. No
change in performance was observable here.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This patch series is solving the "thundering herd" problem that occurs in the
mainline implementation of hashed waitqueues. There are two sources of
spurious wakeups in such arrangements:
(a) Hash collisions that place waiters on different objects on the same
waitqueue, which wakes threads falsely when any of the objects hashed to
the same queue receives a wakeup. i.e. loss of information about which
object a wakeup event is related to.
(b) Loss of information about which object a given waiter is waiting on.
This precludes wake-one semantics for mutual exclusion scenarios. For
instance, a lock bit may be slept on. If there are any waiters on the
object, a lock bit release event must wake at least one of them so as to
prevent deadlock. But without information as to which waiter is waiting
on which object, we must resort to waking all waiters who could possibly
be waiting on it. Now, as the lock bit provides mutual exclusion, only
one of the waiters woken can proceed, and the remainder will go back to
sleep and wait for another event, creating unnecessary system load. Once
wake-one semantics are established, only one of the waiters waiting to
acquire a lock bit need to be woken, which measurably reduces system load
and improves efficiency (i.e. it's the subject of the benchmarking I've
been sending to you).
Even beyond the measurable efficiency gains, there are reasons of robustness
and responsiveness to motivate addressing the issue of thundering herds. In a
real-life scenario I've been personally involved in resolving, the thundering
herd issue caused powerful modern SMP machines with fast IO systems to be
unresponsive to user input for a minute at a time or more. Analogues of these
patches for the distro kernels involved fully resolved the issue to the
customer's satisfaction and obviated workarounds to limit the pagecache's
size.
The latest spin of these patches basically shoves more pieces of the logic
into the wakeup functions, with some efficiency gains from sharing the hot
codepath with the rest of the kernel, and a slightly larger diff than the
patches with the newly-introduced entrypoint. Writing these was motivated by
the push to insulate sched.c from more of the details of wakeup semantics by
putting more of the logic into the wakeup functions. In order to accomplish
this while still solving (b), the wakeup functions grew a new argument for
communication about what object a wakeup event is related to to be passed by
the waker.
=========
This patch provides an additional argument to wakeup functions so that
information may be passed from the waker to the waiter. This is provided as a
separate patch so that the overhead of the additional argument can be measured
in isolation. No change in performance was observable here.
|
|
It has no callers, is using the non-existent spin_lock_irqrestore(), and is
obviously very untested. Kill.
|
|
This fixes the scheduler's sync-wakeup code to be consistent on UP as
well.
Right now there's a behavioral difference between an UP kernel and an
SMP kernel running on a UP box: sync wakeups (which are only activated
on SMP) can cause a wakeup of a higher prio task, without preemption.
On UP kernels this does not happen. This difference in wakeup behavior
is bad.
This patch activates sync wakeups on UP as well - in the cases sync
wakeups are done the waker knows that it will schedule away soon, so
this 'delay preemption' decision is correct on UP as well.
|
|
This patch removes all the wait_queue handling code from sched.h and puts
it in wait.h with the rest of the wait_queue handling code. Note that
sched.h must continue to include wait.h for the wait_queue_head_t embedded
in struct task. However there may be files which only need wait.h now.
|
|
This is worth a whopping 2% on spwecweb on an 8-way. Which is faintly
surprising because __wake_up and other wait/wakeup functions are not
apparent in the specweb profiles which I've seen.
The main objective of this is to reduce the CPU cost of the wait/wakeup
operation. When a task is woken up, its waitqueue is removed from the
waitqueue_head by the waker (ie: immediately), rather than by the woken
process.
This means that a subsequent wakeup does not need to revisit the
just-woken task. It also means that the just-woken task does not need
to take the waitqueue_head's lock, which may well reside in another
CPU's cache.
I have no decent measurements on the effect of this change - possibly a
20-30% drop in __wake_up cost in Badari's 40-dds-to-40-disks test (it
was the most expensive function), but it's inconclusive. And no
quantitative testing of which I am aware has been performed by
networking people.
The API is very simple to use (Linus thought it up):
my_func(waitqueue_head_t *wqh)
{
DEFINE_WAIT(wait);
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
if (!some_test)
schedule();
finish_wait(wqh, &wait);
}
or:
DEFINE_WAIT(wait);
while (!some_test_1) {
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
if (!some_test_2)
schedule();
...
}
finish_wait(wqh, &wait);
You need to bear in mind that once prepare_to_wait has been performed,
your task could be removed from the waitqueue_head and placed into
TASK_RUNNING at any time. You don't know whether or not you're still
on the waitqueue_head.
Running prepare_to_wait() when you're already on the waitqueue_head is
fine - it will do the right thing.
Running finish_wait() when you're actually not on the waitqueue_head is
fine.
Running finish_wait() when you've _never_ been on the waitqueue_head is
fine, as ling as the DEFINE_WAIT() macro was used to initialise the
waitqueue.
You don't need to fiddle with current->state. prepare_to_wait() and
finish_wait() will do that. finish_wait() will always return in state
TASK_RUNNING.
There are plenty of usage examples in vm-wakeups.patch and
tcp-wakeups.patch.
|
|
These are the completely generic bits (linux/init_task.h and linux/wait.h).
From: Art Haas <ahaas@neosoft.com>
Here's the latest diffs for the files in include/linux.
Patches are against 2.5.31.
|
|
This adds support for wait queue function callbacks, which are used by
aio to build async read / write operations on top of existing wait
queues at points that would normally block a process.
|
|
This patch removes the whole wq_lock_t abstraction, forcing the behavior
to be that of a standard spinlock and changes all the wq_lock code in
the tree appropriately.
Removes lots of code - always a Good Thing to me. New behavior is same
as previous behavior (USE_RW_WAIT_QUEUE_SPINLOCK unset).
|
|
- Jens Axboe: more bio updates, fix some request list bogosity under load
- Al Viro: export seq_xxx functions
- Manfred Spraul: include file cleanups, pc110pad compile fix
- David Woodhouse: fix JFFS2 write error handling
- Dave Jones: start merging up with 2.4.x patches
- Manfred Spraul: coredump fixes, FS event counter cleanups
- me: fix SCSI CD-ROM sectorsize BIO breakage
|
|
- Russell King: /proc/cpuinfo for ARM
- Paul Mackerras: PPC update (cpuinfo etc)
- Nicolas Aspert: fix Intel 8xx agptlb flush
- Marko Myllynen: "Lindent" doesn't really need bash ;)
- Alexander Viro: /proc/cpuinfo for s390/s390x/sh, /proc/pci cleanup
- Alexander Viro: make lseek work on seqfiles
|
|
- Greg KH: start migration to new "min()/max()"
- Roman Zippel: move affs over to "min()/max()".
- Vojtech Pavlik: VIA update (make sure not to IRQ-unmask a vt82c576)
- Jan Kara: quota bug-fix (don't decrement quota for non-counted inode)
- Anton Altaparmakov: more NTFS updates
- Al Viro: make nosuid/noexec/nodev be per-mount flags, not per-filesystem
- Alan Cox: merge input/joystick layer differences, driver and alpha merge
- Keith Owens: scsi Makefile cleanup
- Trond Myklebust: fix oopsable race in locking code
- Jean Tourrilhes: IrDA update
|
|
- Chris Mason: daemonize reiserfs commit thread
- Alan Cox: syncup (AFFS might even work, and official VIA workarounds)
- Jeff Garzik: network driver updates
- Paul Mackerras: PPP update
- David Howells: more rw-sem cleanups, updates. Slowly getting somewhere.
|
|
- Ingo Molnar/Al Viro: don't use bforget() on ext2 (and minix) metadata
where we may not be the only owner of the buffer! FS corruption.
- Andi Kleen: IPv6 packet re-assembly fix.
- David Howells: fix up rwsem implementation
- Alan Cox: more merging (S/390 down, ARM to go).
- Jens Axboe: LVM and loop fixes
|
|
- driver sync up with Alan
- Andrew Morton: wakeup cleanup and race fix
- Paul Mackerras: macintosh driver updates.
- don't trust "page_count()" on reserved pages!
- Russell King: fix serious IDE multimode write bug!
- me, Jens, others: fix elevator problem
- ARM, MIPS and cris architecture updates
- alpha updates: better page clear/copy, avoid kernel lock in execve
- USB and firewire updates
- ISDN updates
- Irda updates
|
|
|