|
Do all timer zapping in exit_itimers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The patch "MCA recovery improvements" added do_exit to mca_drv.c.
That's fine when the mca recovery code is built in the kernel
(CONFIG_IA64_MCA_RECOVERY=y) but breaks building the mca recovery
code as a module (CONFIG_IA64_MCA_RECOVERY=m).
Most users are currently building this as a module, as loading
and unloading the module provides a very convenient way to turn
on/off error recovery.
This patch exports do_exit, so mca_drv.c can build as a module.
Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck <tony.luck@intel.com>
|
|
Another large rollup of various patches from Adrian which make things static
where they were needlessly exported.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I have recompiled the Linux kernel 2.6.11.5 documentation for myself and our
university students again. The documentation could be extended to cover more
sources that carry structured comments in recent 2.6 kernels, and I have
tried to proceed with that task. I have done this several times since the
2.6.0 days, and it gets boring to make the same changes again and again. The
kernel compiles after these changes for i386 and ARM targets. I have added
references to some more files into the kernel-api book, and added some
section names as well. So please check that the changes do not break
anything and that the categories are not too badly skewed.
I have changed kernel-doc to accept the "fastcall" and "asmlinkage" words
reserved by kernel convention. Most of the other changes are modifications
to the comments to make kernel-doc happy: accepting some parameter
descriptions and not bailing out on errors. I changed <pid> to @pid in
descriptions, moved some #ifdefs before comments to correct the
function-to-comment bindings, etc.
You can see the result of the modified documentation build at
http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
Some more sources are ready to be included in the kernel-doc generated
documentation. They have been added into kernel-api for now. Some more
section names were added, and probably some more chaos introduced as a
result of quick cleanup work.
Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It's old sanity checking that may have been useful for debugging, but
is just bogus these days.
Noticed by Mattia Belletti.
|
|
This patch hides reparent_to_init(). reparent_to_init() should only be
called by daemonize().
Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is my cpuset patch, with the following changes in the last two weeks:
1) Updated to 2.6.8.1-mm1
2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty,
not copied from parent. Needed to avoid breaking exclusive property.
3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top
cpuset from cpu_possible_map after smp_init() called.
4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages()
if the current task's cpuset mems_allowed has changed. Use a cpuset
generation number, bumped on any cpuset memory placement change,
to make this check efficient. Update the task's mems_allowed from
its cpuset, if the cpuset has changed.
5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset,
then update its cpus_allowed, using set_cpus_allowed().
6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to
reflect above changes (4) and (5).
I continue to recommend the following patch for inclusion in your 2.6.9-*mm
series, when that opens. It provides an important facility for high
performance computing on large systems. Simon Derr of Bull (France) and
myself are the primary authors. Erich Focht has indicated that NEC is also
a potential user of this patch on the TX-7 NUMA machines, and that he
"would very much welcome the inclusion of cpusets."
I offer this update to lkml, in order to invite continued feedback.
The one prerequisite patch for this cpuset patch was just posted before this
one. That was a patch to provide a new bitmap list format, of which
cpusets is the first user.
This patch has been built on top of 2.6.8.1-mm1, for these arches:
i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64
with and without CONFIG_CPUSET. It has been booted and tested on ia64
(sn2_defconfig, SN2 hardware). The 'alpha' arch also built, except for
what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in
the final link step.
===
Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to
a set of tasks.
Cpusets constrain the CPU and Memory placement of tasks to only the
processor and memory resources within a task's current cpuset. They form a
nested hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic job
placement on large systems.
Cpusets require small kernel hooks in init, exit, fork, mempolicy,
sched_setaffinity, page_alloc and vmscan. And they require a "struct
cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t
(to go along with the "cpus_allowed" cpumask_t that's already there) in
each task struct.
These hooks:
1) establish and propagate cpusets,
2) enforce CPU placement in sched_setaffinity,
3) enforce Memory placement in mbind and sys_set_mempolicy,
4) restrict page allocation and scanning to mems_allowed, and
5) restrict migration and set_cpus_allowed to cpus_allowed.
The other required hook, restricting task scheduling to CPUs in a task's
cpus_allowed mask, is already present.
Cpusets extend the usefulness of the existing placement support that was
added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and
mbind() and set_mempolicy() for memory placement. On smaller or dedicated
use systems, the existing calls are often sufficient.
On larger NUMA systems running more than one performance-critical job,
it is necessary to be able to manage jobs in their entirety. This includes
providing a job with exclusive CPU and memory that no other job can use,
and being able to list all tasks currently in a cpuset.
A given job running within a cpuset would likely use the existing
placement calls to manage its CPU and memory placement in more detail.
Cpusets are named, nested sets of CPUs and Memory Nodes. Each cpuset is
represented by a directory in the cpuset virtual file system, normally
mounted at /dev/cpuset.
Each cpuset directory provides the following files, which can be
read and written:
cpus:
List of CPUs allowed to tasks in that cpuset.
mems:
List of Memory Nodes allowed to tasks in that cpuset.
tasks:
List of pid's of tasks in that cpuset.
cpu_exclusive:
Flag (0 or 1) - if set, cpuset has exclusive use of
its CPUs (no sibling or cousin cpuset may overlap CPUs).
mem_exclusive:
Flag (0 or 1) - if set, cpuset has exclusive use of
its Memory Nodes (no sibling or cousin may overlap).
notify_on_release:
Flag (0 or 1) - if set, then /sbin/cpuset_release_agent
will be invoked, with the name (/dev/cpuset relative path)
of that cpuset in argv[1], when the last user of it (task
or child cpuset) goes away. This supports automatic
cleanup of abandoned cpusets.
In addition one new filetype is added to the /proc file system:
/proc/<pid>/cpuset:
For each task (pid), list its cpuset path, relative to the
root of the cpuset file system. This file is read-only.
New cpusets are created using 'mkdir' (at the shell or in C). Old ones are
removed using 'rmdir'. The above files are accessed using read(2) and
write(2) system calls, or shell commands such as 'cat' and 'echo'.
The CPUs and Memory Nodes in a given cpuset are always a subset of its
parent. The root cpuset has all possible CPUs and Memory Nodes in the
system. A cpuset may be exclusive (cpu or memory) only if its parent is
similarly exclusive.
See further Documentation/cpusets.txt, at the top of the following
patch.
/proc interface:
It is useful, when learning about and making new uses of cpusets and
placement, to be able to see the current values of a task's cpus_allowed and
mems_allowed, which are the actual placement used by the kernel scheduler and
memory allocator.
The cpus_allowed and mems_allowed values are also needed by user space apps
that are micromanaging placement, such as when fine-tuning the placement an
app has obtained within its cpuset using sched_setaffinity, mbind and
set_mempolicy.
The cpus_allowed value is also available via the sched_getaffinity system
call. But since the entire rest of the cpuset API, including the display
of mems_allowed added here, is via an ascii style presentation in /proc and
/dev/cpuset, it is worth the extra couple lines of code to display
cpus_allowed in the same way.
This patch adds the display of these two fields to the 'status' file in the
/proc/<pid> directory of each task. The fields are only added if
CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed
field of each task). The new output lines look like:
$ tail -2 /proc/1/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Mems_allowed: ffffffff,ffffffff
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Simon Derr <simon.derr@bull.net>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that setitimer, getitimer, and alarm work on a per-process
basis. Currently, Linux implements these for individual threads. This patch
fixes these semantics for the ITIMER_PROF timer (which generates SIGPROF) and
the ITIMER_VIRTUAL timer (which generates SIGVTALRM), making them shared by
all threads in a process (thread group). This patch should be applied after
the one that fixes ITIMER_REAL.
The essential machinery for these timers is tied into the new posix-timers
code for process CPU clocks and timers. This patch requires the cputimers
patch and its dependencies.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that setitimer, getitimer, and alarm work on a per-process
basis. Currently, Linux implements these for individual threads. This patch
fixes these semantics for the ITIMER_REAL timer (which generates SIGALRM),
making it shared by all threads in a process (thread group).
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It was intended that such things would not be possible because getting into
that code in the first place should be ruled out while exiting. That
removes the requirement for any special case check in the common path.
But, it was done too late since it hadn't occurred to me that ->live going
zero itself created a problem.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME,
not only the clock_* calls but also timer_* calls must support the thread and
process CPU time clocks. This patch provides that support, building on my
recent additions to support these clocks in the POSIX clock_* interfaces.
This patch will not work without those changes, as well as the patch fixing
the timer lock-siglock deadlock problem.
The apparent pervasive changes to posix-timers.c are simply that some fields
of struct k_itimer have changed name and moved into a union. This was
appropriate since the data structures required for the existing real-time
timer support and for the new thread/process CPU-time timers are quite
different.
The glibc patches to support CPU time clocks using the new kernel support are
in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and they
include tests for the timer support (if you build glibc with NPTL).
From: Christoph Lameter <clameter@sgi.com>
Your patch breaks the mmtimer driver because it used k_itimer values for
its own purposes. Here is a fix by defining an additional structure in
k_itimer (same approach for mmtimer as the cpu timers):
From: Roland McGrath <roland@redhat.com>
Fix bug identified by Alexander Nyberg <alexn@dsv.su.se>
> The problem arises from code touching the union in alloc_posix_timer()
> which makes firing go non-zero. When firing is checked in
> posix_cpu_timer_set() it will be positive causing an infinite loop.
>
> So either the below fix or preferably move the INIT_LIST_HEAD(x) from
> alloc_posix_timer() to somewhere later where it doesn't disturb the other
> union members.
Thanks for finding this problem. The latter is what I think is the right
solution. This patch does that, and also removes some superfluous rezeroing.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In the 2.6.11 development cycle, function calls were added to lots
of hot vm paths to do accounting. I think these should not go into the
final 2.6.11 release, because these statistics can be collected in a
different way that does not require updating counters from frequently used
vm code paths and is consistent with the methods used elsewhere in the
kernel to obtain statistics.
These function calls are
acct_update_integrals -> Account for processes based on stime changes
update_mem_hiwater -> takes rss and total_vm hiwater marks.
acct_update_integrals is only useful to call if stime changes; otherwise
it simply returns. It is therefore best to relocate the call to
acct_update_integrals into the function that updates stime, which is
account_system_time, and remove it from the vm code paths.
update_mem_hiwater finds the rss hiwater mark. We call that from timer
context as well. This means that processes' high-water marks are now
sampled statistically, at timer-interrupt time rather than
deterministically. This may or may not be a problem.
This means that the rss limit is not always updated if rss is increased,
and is thus not as accurate. But the benefit is that the rss checks do not
pollute the vm paths and that it is consistent with the rss limit check.
The following patch removes acct_update_integrals and update_mem_hiwater
from the hot vm paths.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
From: Jay Lan <jlan@sgi.com>
The new "move-accounting-function-calls-out-of-critical-vm-code-paths"
patch in 2.6.11-rc3-mm2 was different from the code I tested.
In particular, it mistakenly dropped the accounting routine calls
in fs/exec.c. The calls in do_execve() are needed to properly
initialize accounting fields. Specifically, the tsk->acct_stimexpd
needs to be initialized to tsk->stime.
I have discussed this with Christoph Lameter and he gave me full
blessings to bring the calls back.
Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
They had their time and place, but right now they are using
infrastructure that is getting re-done, and we're better off
just dropping them.
|
|
If one thread uses ptrace on another thread in the same thread group, there
can be a deadlock when calling exec. The ptrace_stop change ensures that
no tracing stop can be entered for a queued signal, or exit tracing, if the
tracer is part of the same dying group. The exit_notify change prevents a
ptrace zombie from sticking around if its tracer is in the midst of a group
exit (which an exec fakes), so these zombies don't hold up de_thread's
synchronization. The de_thread change ensures the new thread group leader
doesn't wind up ptracing itself, which would produce its own deadlocks.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based, and the effect of this patch for architectures that use the default
implementation should be negligible.
There is a second type cputime64_t which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ; this will
overflow after 49.7 days. That is not enough for kernel_stat (imho not
enough for processes either), so it is necessary to have a 64 bit type.
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers, this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When a thread stops for ptrace exit tracing, it cannot be resumed by
SIGKILL. Once PF_EXITING is set, SIGKILL will not cause a wakeup from stop
(see wants_signal in kernel/signal.c). This patch moves the ptrace stop
for exit tracing before the setting of PF_EXITING.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The __ptrace_unlink code that checks for TASK_TRACED fixed the problem of a
thread being left in TASK_TRACED when no longer being ptraced.
However, an oversight in the original fix made it fail to handle the
case where the child is ptraced by its real parent.
Fixed thus.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use the existing "tty_sem" to protect against the process tty changes
too.
|
|
__exit_mm() is an inlined version of exit_mm(). This patch unifies them.
Saves 356 byte in exit.o.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I just did a quick audit of the use of exit_state and the EXIT_* bit
macros. I guess I didn't really review these changes very closely when you
did them originally. :-(
I found several places that seem like lossy cases of query-replace without
enough thought about the code. Linus has previously said the >= tests
ought to be & tests instead. But for exit_state, it can only ever be 0,
EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as
testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better.
The case like in choose_new_parent is just confusing, to have the
always-false test for EXIT_* bits in ->state there too.
The two cases in wants_signal and do_process_times are actual regressions
that will give us back old bugs in race conditions. These places had
s/TASK/EXIT/ but not s/state/exit_state/, so their tests for exiting
tasks are now wrong and never catch them. I take it back: there is no
regression in wants_signal in practice, I think, because the PF_EXITING
test makes the EXIT_* state checks superfluous anyway. So that is
just another cosmetic case of confusing code. But in do_process_times,
that SIGXCPU-while-exiting race condition is back again.
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is really no point in each task_struct having its own waitchld_exit.
In the only use of it, the waitchld_exit of each thread in a group gets
woken up at the same time. So, there might as well just be one wait queue
for the whole thread group. This patch does that by moving the field from
task_struct to signal_struct. It should have no effect on the behavior,
but saves a little work and a little storage in the multithreaded case.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
After my last change, there are plenty of unused bits available in the new
flags word in signal_struct. This patch moves the `group_exit' flag into
one of those bits, saving a word in signal_struct.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The `sig_avoid_stop_race' checks fail to catch a related race scenario that
can happen. I don't think this has been seen in nature, but it could
happen in the same sorts of situations where the observed problems come up
that those checks work around. This patch takes a different approach to
catching this race condition. The new approach plugs the hole, and I think
is also cleaner.
The issue is a race between one CPU processing a stop signal while another
CPU processes a SIGCONT or SIGKILL. There is a window in stop-signal
processing where the siglock must be released. If a SIGCONT or SIGKILL
comes along here on another CPU, then the stop signal in the midst of being
processed needs to be discarded rather than having the stop take place
after the SIGCONT or SIGKILL has been generated. The existing workaround
checks for this case explicitly by looking for a pending SIGCONT or SIGKILL
after reacquiring the lock.
However, there is another problem related to the same race issue. In the
window where the processing of the stop signal has released the siglock,
the stop signal is not represented in the pending set any more, but it is
still "pending" and not "delivered" in POSIX terms. The SIGCONT coming in
this window is required to clear all pending stop signals. But, if a stop
signal has been dequeued but not yet processed, the SIGCONT generation will
fail to clear it (in handle_stop_signal). Likewise, a SIGKILL coming here
should prevent the stop processing and make the thread die immediately
instead. The `sig_avoid_stop_race' code checks for this by examining the
pending set to see if SIGCONT or SIGKILL is in it. But this fails to
handle the case where another CPU running another thread in the same
process has already dequeued the signal (so it no longer can be found in
the pending set). We must catch this as well, so that the same problems do
not arise when another thread on another CPU acted real fast.
I've fixed this by dumping the `sig_avoid_stop_race' kludge in favor of a
little explicit bookkeeping. Now, dequeuing any stop signal sets a flag
saying that a pending stop signal has been taken on by some CPU since the
last time all pending stop signals were cleared due to SIGCONT/SIGKILL.
The processing of stop signals checks the flag after the window where it
released the lock, and abandons the signal if the flag has been cleared. The
code that clears pending stop signals on SIGCONT generation also clears
this flag. The various places that are trying to ensure the process dies
quickly (SIGKILL or other unhandled signals) also clear the flag. I've
made this a general flags word in signal_struct, and replaced the
stop_state field with flag bits in this word.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch offers a common method of collecting memory usage accounting
data for various accounting packages, including BSD accounting, ELSA, CSA
and any other acct packages that use a common layer of data collection.
New struct fields are added to mm_struct to save high watermarks of rss
usage as well as virtual memory usage.
New struct fields are added to task_struct to collect accumulated rss usage
and vm usages.
These data are collected on per process basis.
Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
switch_uid() doesn't care about tasklist_lock, so do it outside
the lock and avoid a subtle (and very very unlikely to trigger)
AB-BA deadlock.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Klaus Dittrich observed this bug and posted a test case for it.
This patch fixes both that failure mode and some others possible. What
Klaus saw was a false negative (i.e. ECHILD when there was a child)
when the group leader was a zombie but delayed because other children
live; in the test program this happens in a race between the two threads
dying on a signal.
The change to the TASK_TRACED case avoids a potential false positive
(blocking, or WNOHANG returning 0, when there are really no children
left), in the race condition where my_ptrace_child returns zero.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We should not touch "self_exec_id" here. It was the parent that changed,
not us.
|
|
Only set the flag in the cases when the exit state is not either
TASK_DEAD or TASK_ZOMBIE.
(TASK_DEAD or TASK_ZOMBIE will either race or we'll return the
information, so no need to note them).
I confirmed that this fixes the problem, and I also ran some LTP tests.
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
released the tasklist_lock.
Since it released the lock, the process lists may not
be valid any more, and we must repeat the loop rather than
continue with the next parent.
Use -EAGAIN to show this condition (separate from the
normal -EFAULT that may happen if rusage information could
not be copied to user space).
|
|
This clarifies more of the x86 caller/callee stack ownership
issues by making the exception and interrupt handler assembler
interfaces use register calling conventions.
System calls still use the stack.
Tested with "crashme" on UP/SMP.
|
|
The session leader should disassociate from its controlling terminal and
send SIGHUP signals only when the whole session leader process dies.
Currently, this gets done when any thread in that process dies, which is
wrong. This patch fixes it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch changes process accounting to write just one record for a
process with many NPTL threads, rather than one record for each thread. No
record is written until the last thread exits. The process's record shows
the cumulative time of all the threads that ever lived in that process
(thread group). This seems like the clearly right thing and I assume it is
what anyone using process accounting really would like to see.
There is a race condition between multiple threads exiting at the same time
to decide which one should write the accounting record. I couldn't think
of anything clever using existing bookkeeping that would get this right, so
I added another counter for this. (There may be some potential to clean up
existing places that figure out how many non-zombie threads are in the
group, now that this count is available.)
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Not exactly a thing we want done from modules, and no module uses it
anyway.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The feature set the patch includes:
- Key attributes:
- Key type
- Description (by which a key of a particular type can be selected)
- Payload
- UID, GID and permissions mask
- Expiry time
- Keyrings (just a type of key that holds links to other keys)
- User-defined keys
- Key revocation
- Access controls
- Per user key-count and key-memory consumption quota
- Three std keyrings per task: per-thread, per-process, session
- Two std keyrings per user: per-user and default-user-session
- prctl() functions for key and keyring creation and management
- Kernel interfaces for filesystem, blockdev, net stack access
- JIT key creation by usermode helper
There are also two utility programs available:
(*) http://people.redhat.com/~dhowells/keys/keyctl.c
A comprehensive key management tool, permitting all the interfaces
available to userspace to be exercised.
(*) http://people.redhat.com/~dhowells/keys/request-key
An example shell script (to be installed in /sbin) for instantiating a
key.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had. Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and we trigger this assert:
schedule();
BUG();
/* Avoid "noreturn function does return". */
for (;;) ;
I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task. Dead-task freeing is done via PF_DEAD.
Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a race between PTRACE_ATTACH and the real parent calling wait.
For a moment, the task is put in PT_PTRACED but with its parent still
pointing to its real_parent. In this circumstance, if the real parent
calls wait without the WUNTRACED flag, he can see a stopped child status,
which wait should never return without WUNTRACED when the caller is not
using ptrace. Here it is not the caller that is using ptrace, but some
third party.
This patch avoids this race condition by adding the PT_ATTACHED flag to
distinguish a real parent from a ptrace_attach parent when PT_PTRACED is
set, and then having wait use this flag to confirm that things are in order
and not consider the child ptraced when its ->ptrace flags are set but its
parent links have not yet been switched. (ptrace_check_attach also uses it
similarly to rule out a possible race with a bogus ptrace call by the real
parent during ptrace_attach.)
While looking into this, I noticed that every arch's sys_execve has:
current->ptrace &= ~PT_DTRACE;
with no locking at all. So, if an exec happens in a race with
PTRACE_ATTACH, you could wind up with ->ptrace not having PT_PTRACED set
because this store clobbered it. That will cause later BUG hits because
the parent links indicate ptracedness but the flag is not set. The patch
corrects all the places I found to use task_lock around diddling ->ptrace
when it's possible to be racing with ptrace_attach. (The ptrace operation
code itself doesn't have this issue because it already excludes anyone else
being in ptrace_attach.)
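The locking pattern described above can be sketched as follows (an
illustrative kernel-style fragment, not the literal patch hunk; task_lock()
and the PT_* flags are as in the sched.h/ptrace.h headers of this era):

```c
/* Sketch of the fix: do the read-modify-write of ->ptrace under
 * task_lock(), so a concurrent ptrace_attach() setting PT_PTRACED
 * cannot be clobbered by this store in the exec path. */
task_lock(current);
current->ptrace &= ~PT_DTRACE;
task_unlock(current);
```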
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies the new WCONTINUED flag for waitpid, not just for waitid.
I overlooked this addition when I implemented waitid. The real work was
already done to support waitid, but waitpid needs to report the results as
well.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32-bit/64-bit
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
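The scheme above might be sketched like this (an illustrative kernel-style
fragment, not the literal getrlimit implementation): readers that need an
atomic snapshot of a whole struct rlimit take task_lock on the group
leader, while single-word checks take no lock at all.

```c
/* Sketch: consistent two-word snapshot of one rlimit, using
 * task_lock(current->group_leader) as the (overloaded) lock.
 * A bare one-word check like
 *	if (size > current->signal->rlim[RLIMIT_FSIZE].rlim_cur)
 * needs no locking. */
struct rlimit val;

task_lock(current->group_leader);
val = current->signal->rlim[resource];
task_unlock(current->group_leader);
```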
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
The other subtlety is removing:
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be more generally dubious when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note, this is not 100% to POSIX compliance in regards to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
A hack to prevent the compiler from generating tailcalls in these two
functions.
With CONFIG_REGPARM=y, the tailcalled code ends up stomping on the
syscall's argument frame which corrupts userspace's registers.
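One common way to defeat tail-call optimization (a sketch of the general
technique, not necessarily the exact hunk in this patch) is a compiler
barrier between the final call and the return:

```c
/* Sketch: an empty asm with a "memory" clobber after the last call
 * forces the compiler to keep this frame live, so the call cannot be
 * turned into a jump that reuses (and stomps on) the syscall's
 * register-argument frame under CONFIG_REGPARM. */
asmlinkage long sys_waitpid(pid_t pid, unsigned int __user *stat_addr,
			    int options)
{
	long ret = sys_wait4(pid, stat_addr, options, NULL);
	/* prevent tail-call to sys_wait4: */
	asm volatile("" : : : "memory");
	return ret;
}
```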
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
As I explained in the waitid patches, I added the si_rusage field to
siginfo_t with the idea of having the siginfo_t waitid fills in contain all
the information that wait4 or any such call could ever tell you. Nothing
in POSIX nor anywhere else specifies this field in siginfo_t.
When Ulrich and I hashed out the system call interface we wanted, we looked
at siginfo_t and decided there was plenty of space to throw in si_rusage.
Well, it turns out we didn't check the 64-bit platforms. There struct
rusage is ridiculously large (lots of longs for things that are never in a
million years going to hit 2^32), and my changes bumped up the size of
siginfo_t. Changing that size is more trouble than it's worth.
This patch reverts the changes to the siginfo_t structure types,
and no longer provides the rusage details in SIGCHLD signal data.
Instead, I added a fifth argument to the waitid system call to fill in rusage.
waitid is the name of the POSIX function with four arguments. It might
make sense to rename the system call `waitsys' to follow SGI's system call
with the same arguments, or `wait5' in the mindless tradition. But, feh.
I just added the argument to sys_waitid, rather than worrying about
changing the name in all the tables (and choosing a new stupid name).
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This adds a new state TASK_TRACED that is used in place of TASK_STOPPED
when a thread stops because it is ptraced. Now ptrace operations are only
permitted when the target is in TASK_TRACED state, not in TASK_STOPPED.
This means that if a process is stopped normally by a job control signal
and then you PTRACE_ATTACH to it, you will have to send it a SIGCONT before
you can do any ptrace operations on it. (The SIGCONT will be reported to
ptrace and then you can discard it instead of passing it through when you
call PTRACE_CONT et al.)
If a traced child gets orphaned while in TASK_TRACED state, it morphs into
TASK_STOPPED state. This makes it again possible to resume or destroy the
process with SIGCONT or SIGKILL.
All non-signal tracing stops should now be done via ptrace_notify. I've
updated the syscall tracing code in several architectures to do this
instead of replicating the work by hand. I also fixed several that were
unnecessarily repeating some of the checks in ptrace_check_attach. Calling
ptrace_check_attach alone is sufficient, and the old checks repeated before
are now incorrect, not just superfluous.
I've closed a race in ptrace_check_attach. With this, we should have a
robust guarantee that when ptrace starts operating, the task will be in
TASK_TRACED state and won't come out of it. This is because the only way
to resume from TASK_TRACED is via ptrace operations, and only the one
parent thread attached as the tracer can do those.
This patch also cleans up the do_notify_parent and do_notify_parent_cldstop
code so that the dead and stopped cases are completely disjoint. The
notify_parent function is gone.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes strange and obscure pid implementation in current kernels:
- it removes calling of put_task_struct() from detach_pid()
under tasklist_lock. This allows to use blocking calls
in security_task_free() hooks (in __put_task_struct()).
- it saves some space = 5*5 ints = 100 bytes in task_struct
- it's smaller and tidier, more straightforward and doesn't use
any knowledge about pid usage and assignment.
- it removes pid_links, and pid_struct no longer holds reference
counters on task_struct. Instead, new pid_structs are linked
together and only one of them is inserted in the hash list.
Signed-off-by: Kirill Korotaev (kksx@mail.ru)
Signed-off-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
This patch changes the rusage bookkeeping and the semantics of the
getrusage and times calls in a couple of ways.
The first change is in the c* fields counting dead child processes. POSIX
requires that children that have died be counted in these fields when they
are reaped by a wait* call, and that if they are never reaped (e.g.
because of ignoring SIGCHLD, or exiting yourself first) then they are
never counted. These were counted in release_task for all threads. I've
changed it so they are counted in wait_task_zombie, i.e. exactly when
being reaped.
POSIX also specifies for RUSAGE_CHILDREN that the report include the reaped
child processes of the calling process, i.e. whole thread group in Linux,
not just ones forked by the calling thread. POSIX specifies tms_c[us]time
fields in the times call the same way. I've moved the c* fields that
contain this information into signal_struct, where the single set of
counters accumulates data from any thread in the group that calls wait*.
Finally, POSIX specifies getrusage and times as returning cumulative totals
for the whole process (aka thread group), not just the calling thread.
I've added fields in signal_struct to accumulate the stats of detached
threads as they die. The process stats are the sums of these records plus
the stats of each remaining live/zombie thread. The times and getrusage
calls, and the internal uses for filling in wait4 results and siginfo_t,
now iterate over the threads in the thread group and sum up their stats
along with the stats recorded for threads already dead and gone.
I added a new value RUSAGE_GROUP (-3) for the getrusage system call rather
than changing the behavior of the old RUSAGE_SELF (0). POSIX specifies
RUSAGE_SELF to mean all threads, so the glibc getrusage call will just
translate it to RUSAGE_GROUP for new kernels. I did this thinking that
someone somewhere might want the old behavior with an old glibc and a new
kernel (it is only different if they are using CLONE_THREAD anyway).
However, I've changed the times system call to conform to POSIX as well and
did not provide any backward compatibility there. In that case there is
nothing easy like a parameter value to use, it would have to be a new
system call number. That seems pretty pointless. Given that, I wonder if
it is worth bothering to preserve the compatible RUSAGE_SELF behavior by
introducing RUSAGE_GROUP instead of just changing RUSAGE_SELF's meaning.
Comments?
I've done some basic testing on x86 and x86-64, and all the numbers come
out right after these fixes. (I have a test program that shows a few
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a new system call `waitid'. This is a new POSIX call that
subsumes the rest of the wait* family and can do some things the older
calls cannot. A minor addition is the ability to select what kinds of
status to check for with a mask of independent bits, so you can wait for
just stops and not terminations, for example. A more significant
improvement is the WNOWAIT flag, which allows for polling child status
without reaping. This interface fills in a siginfo_t with the same details
that a SIGCHLD for the status change has; some of that info (e.g. si_uid)
is not available via wait4 or other calls.
I've added a new system call that has the parameter conventions of the
POSIX function because that seems like the cleanest thing. This patch
includes the actual system call table additions for i386 and x86-64; other
architectures will need to assign the system call number, and 64-bit ones
may need to implement 32-bit compat support for it as I did for x86-64.
The new features could instead be provided by some new kludge inventions in
the wait4 system call interface (that's what BSD did). If kludges are
preferable to adding a system call, I can work up something different.
I added a struct rusage field si_rusage to siginfo_t in the SIGCHLD case
(this does not affect the size or layout of the struct). This is not part
of the POSIX interface, but it makes it so that `waitid' subsumes all the
functionality of `wait4'. Future kernel ABIs (new arch's or whatnot) can
have only the `waitid' system call and the rest of the wait* family
including wait3 and wait4 can be implemented in user space using waitid.
There is nothing in user space as yet that would make use of the new field.
Most of the new functionality is implemented purely in the waitid system
call itself. POSIX also provides for the WCONTINUED flag to report when a
child process had been stopped by job control and then resumed with
SIGCONT. Corresponding to this, a SIGCHLD is now generated when a child
resumes (unless SA_NOCLDSTOP is set), with the value CLD_CONTINUED in
siginfo_t.si_code. To implement this, some additional bookkeeping is
required in the signal code handling job control stops.
The motivation for this work is to make it possible to implement the POSIX
semantics of the `waitid' function in glibc completely and correctly. If
changing either the system call interface used to accomplish that, or any
details of the kernel implementation work, would improve the chances of
getting this incorporated, I am more than happy to work through any issues.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Anton prompted me to get this patch merged. It changes the core buffer
sync algorithm of OProfile to avoid global locks wherever possible. Anton
tested an earlier version of this patch with some success. I've lightly
tested this applied against 2.6.8.1-mm3 on my two-way machine.
The changes also have the happy side-effect of losing less samples after
munmap operations, and removing the blind spot of tasks exiting inside the
kernel.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When the initial thread in a multi-threaded program dies (the thread group
leader), its child processes are wrongly orphaned, and thereafter when
other threads die their child processes are also orphaned even though live
threads remain in the parent process that can call wait. I have a small
(under 100 lines), POSIX-compliant test program that demonstrates this
using -lpthread (NPTL) if anyone is interested in seeing it.
The bug is that forget_original_parent moves children to the dead parent's
group leader if it's alive, but if not it orphans them. I've changed it so
it instead reparents children to any other live thread in the dead parent's
group (not even preferring the group leader). Children go to init only if
there are no live threads in the parent's group at all. These are the
correct semantics for fork children of POSIX threads.
The second part of the change is to do the CLONE_PARENT behavior always for
CLONE_THREAD, i.e. make sure that each new thread's parent link points to
the real parent of the process and never another thread in its own group.
Without this, when the group leader dies leaving a sole live thread in the
group, forget_original_parent will try to reparent that thread to itself
because it's a child of the dying group leader. Rather than handling this
case specially to reparent to the group leader's parent, it's more
efficient just to make sure that no one ever has a parent link to inside his
own thread group. Now the reparenting work never needs to be done for
threads created in the same group when their creator thread dies. The only
change from losing the who-created-whom information is when you look at
"PPid:" in /proc/PID/task/TID/status. For purposes of all direct system
calls, it was already as if CLONE_THREAD threads had the parent of the
group leader. (POSIX provides no way to keep track of which thread created
which other thread with pthread_create.)
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|