|
Do all timer zapping in exit_itimers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The patch "MCA recovery improvements" added do_exit to mca_drv.c.
That's fine when the mca recovery code is built in the kernel
(CONFIG_IA64_MCA_RECOVERY=y) but breaks building the mca recovery
code as a module (CONFIG_IA64_MCA_RECOVERY=m).
Most users are currently building this as a module, as loading
and unloading the module provides a very convenient way to turn
on/off error recovery.
This patch exports do_exit, so mca_drv.c can build as a module.
Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck <tony.luck@intel.com>
|
|
Another large rollup of various patches from Adrian which make things static
where they were needlessly exported.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I have recompiled the Linux kernel 2.6.11.5 documentation for myself and our
university students again. The documentation could be extended to cover more
sources that carry structured comments in recent 2.6 kernels, and I have
tried to proceed with that task. I have done this several times since the
2.6.0 days, and it gets boring to make the same changes again and again. The
kernel compiles after these changes for i386 and ARM targets. I have added
references to some more files into the kernel-api book, and added some
section names as well. So please check that the changes do not break
anything and that the categories are not too badly skewed.
I have changed kernel-doc to accept the "fastcall" and "asmlinkage" words
reserved by kernel convention. Most of the other changes are modifications
to the comments to make kernel-doc happy: accepting some parameter
descriptions and not bailing out on errors. I changed <pid> to @pid in
descriptions, moved some #ifdefs before comments to correct the
function-to-comment bindings, etc.
You can see the result of the modified documentation build at
http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
Some more sources are ready to be included in the kernel-doc generated
documentation. They have been added into kernel-api for now. Some more
section names were added, and probably some more chaos introduced as a
result of quick cleanup work.
Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It's old sanity checking that may have been useful for debugging, but
is just bogus these days.
Noticed by Mattia Belletti.
|
|
This patch hides reparent_to_init(). reparent_to_init() should only be
called by daemonize().
Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is my cpuset patch, with the following changes in the last two weeks:
1) Updated to 2.6.8.1-mm1
2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty,
not copied from parent. Needed to avoid breaking exclusive property.
3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top
cpuset from cpu_possible_map after smp_init() called.
4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages()
if the current task's cpuset mems_allowed has changed. Use a cpuset
generation number, bumped on any cpuset memory placement change,
to make this check efficient. Update the task's mems_allowed from
its cpuset, if the cpuset has changed.
5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset,
then update its cpus_allowed, using set_cpus_allowed().
6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to
reflect above changes (4) and (5).
I continue to recommend the following patch for inclusion in your 2.6.9-*mm
series, when that opens. It provides an important facility for high
performance computing on large systems. Simon Derr of Bull (France) and
myself are the primary authors. Erich Focht has indicated that NEC is also
a potential user of this patch on the TX-7 NUMA machines, and that he
"would very much welcome the inclusion of cpusets."
I offer this update to lkml, in order to invite continued feedback.
The one prerequisite patch for this cpuset patch was just posted before this
one. That was a patch to provide a new bitmap list format, of which
cpusets is the first user.
This patch has been built on top of 2.6.8.1-mm1, for these arches:
i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64
with and without CONFIG_CPUSET. It has been booted and tested on ia64
(sn2_defconfig, SN2 hardware). The 'alpha' arch also built, except for
what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in
the final link step.
===
Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to
a set of tasks.
Cpusets constrain the CPU and Memory placement of tasks to only the
processor and memory resources within a task's current cpuset. They form a
nested hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic job
placement on large systems.
Cpusets require small kernel hooks in init, exit, fork, mempolicy,
sched_setaffinity, page_alloc and vmscan. And they require a "struct
cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t
(to go along with the "cpus_allowed" cpumask_t that's already there) in
each task struct.
These hooks:
1) establish and propagate cpusets,
2) enforce CPU placement in sched_setaffinity,
3) enforce Memory placement in mbind and sys_set_mempolicy,
4) restrict page allocation and scanning to mems_allowed, and
5) restrict migration and set_cpus_allowed to cpus_allowed.
The other required hook, restricting task scheduling to CPUs in a task's
cpus_allowed mask, is already present.
Cpusets extend the usefulness of the existing placement support that was
added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and
mbind() and set_mempolicy() for memory placement. On smaller or dedicated
use systems, the existing calls are often sufficient.
On larger NUMA systems running more than one performance-critical job,
it is necessary to be able to manage jobs in their entirety. This includes
providing a job with exclusive CPU and memory that no other job can use,
and being able to list all tasks currently in a cpuset.
A given job running within a cpuset would likely use the existing
placement calls to manage its CPU and memory placement in more detail.
Cpusets are named, nested sets of CPUs and Memory Nodes. Each cpuset is
represented by a directory in the cpuset virtual file system, normally
mounted at /dev/cpuset.
Each cpuset directory provides the following files, which can be
read and written:
cpus:
List of CPUs allowed to tasks in that cpuset.
mems:
List of Memory Nodes allowed to tasks in that cpuset.
tasks:
List of pid's of tasks in that cpuset.
cpu_exclusive:
Flag (0 or 1) - if set, cpuset has exclusive use of
its CPUs (no sibling or cousin cpuset may overlap CPUs).
mem_exclusive:
Flag (0 or 1) - if set, cpuset has exclusive use of
its Memory Nodes (no sibling or cousin may overlap).
notify_on_release:
Flag (0 or 1) - if set, then /sbin/cpuset_release_agent
will be invoked, with the name (/dev/cpuset relative path)
of that cpuset in argv[1], when the last user of it (task
or child cpuset) goes away. This supports automatic
cleanup of abandoned cpusets.
In addition one new filetype is added to the /proc file system:
/proc/<pid>/cpuset:
For each task (pid), list its cpuset path, relative to the
root of the cpuset file system. This file is read-only.
New cpusets are created using 'mkdir' (at the shell or in C). Old ones are
removed using 'rmdir'. The above files are accessed using read(2) and
write(2) system calls, or shell commands such as 'cat' and 'echo'.
The CPUs and Memory Nodes in a given cpuset are always a subset of its
parent. The root cpuset has all possible CPUs and Memory Nodes in the
system. A cpuset may be exclusive (cpu or memory) only if its parent is
similarly exclusive.
See further Documentation/cpusets.txt, at the top of the following
patch.
/proc interface:
It is useful, when learning about and making new uses of cpusets and
placement, to be able to see the current values of a task's cpus_allowed and
mems_allowed, which are the actual placement used by the kernel scheduler and
memory allocator.
The cpus_allowed and mems_allowed values are also needed by user space apps
that are micromanaging placement, such as when fine-tuning the placement an
app has obtained within its cpuset using sched_setaffinity, mbind and
set_mempolicy.
The cpus_allowed value is also available via the sched_getaffinity system
call. But since the entire rest of the cpuset API, including the display
of mems_allowed added here, is via an ascii style presentation in /proc and
/dev/cpuset, it is worth the extra couple lines of code to display
cpus_allowed in the same way.
This patch adds the display of these two fields to the 'status' file in the
/proc/<pid> directory of each task. The fields are only added if
CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed
field of each task). The new output lines look like:
$ tail -2 /proc/1/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Mems_allowed: ffffffff,ffffffff
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Simon Derr <simon.derr@bull.net>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that setitimer, getitimer, and alarm work on a per-process
basis. Currently, Linux implements these for individual threads. This patch
fixes these semantics for the ITIMER_PROF timer (which generates SIGPROF) and
the ITIMER_VIRTUAL timer (which generates SIGVTALRM), making them shared by
all threads in a process (thread group). This patch should be applied after
the one that fixes ITIMER_REAL.
The essential machinery for these timers is tied into the new posix-timers
code for process CPU clocks and timers. This patch requires the cputimers
patch and its dependencies.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that setitimer, getitimer, and alarm work on a per-process
basis. Currently, Linux implements these for individual threads. This patch
fixes these semantics for the ITIMER_REAL timer (which generates SIGALRM),
making it shared by all threads in a process (thread group).
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It was intended that such things would not be possible because getting into
that code in the first place should be ruled out while exiting. That
removes the requirement for any special case check in the common path.
But, it was done too late since it hadn't occurred to me that ->live going
zero itself created a problem.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that when you claim _POSIX_CPUTIME and _POSIX_THREAD_CPUTIME,
not only the clock_* calls but also timer_* calls must support the thread and
process CPU time clocks. This patch provides that support, building on my
recent additions to support these clocks in the POSIX clock_* interfaces.
This patch will not work without those changes, as well as the patch fixing
the timer lock-siglock deadlock problem.
The apparent pervasive changes to posix-timers.c are simply that some fields
of struct k_itimer have changed name and moved into a union. This was
appropriate since the data structures required for the existing real-time
timer support and for the new thread/process CPU-time timers are quite
different.
The glibc patches to support CPU time clocks using the new kernel support are
in http://people.redhat.com/roland/glibc/kernel-cpuclocks.patch, and they
include tests for the timer support (if you build glibc with NPTL).
From: Christoph Lameter <clameter@sgi.com>
Your patch breaks the mmtimer driver because it used k_itimer values for
its own purposes. Here is a fix by defining an additional structure in
k_itimer (same approach for mmtimer as the cpu timers):
From: Roland McGrath <roland@redhat.com>
Fix bug identified by Alexander Nyberg <alexn@dsv.su.se>
> The problem arises from code touching the union in alloc_posix_timer()
> which makes firing go non-zero. When firing is checked in
> posix_cpu_timer_set() it will be positive causing an infinite loop.
>
> So either the below fix or preferably move the INIT_LIST_HEAD(x) from
> alloc_posix_timer() to somewhere later where it doesn't disturb the other
> union members.
Thanks for finding this problem. The latter is what I think is the right
solution. This patch does that, and also removes some superfluous rezeroing.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In the 2.6.11 development cycle, function calls were added to lots
of hot vm paths to do accounting. I think these should not go into the
final 2.6.11 release, because these statistics can be collected in a
different way that does not require updating counters from frequently used
vm code paths and is consistent with the methods used elsewhere in the
kernel to obtain statistics.
These function calls are
acct_update_integrals -> Account for processes based on stime changes
update_mem_hiwater -> takes rss and total_vm hiwater marks.
acct_update_integrals is only useful to call if stime changes; otherwise
it simply returns. It is therefore best to relocate the call to
acct_update_integrals into the function that updates stime, which is
account_system_time, and remove it from the vm code paths.
update_mem_hiwater finds the rss hiwater mark. We call that from timer
context as well. This means that processes' high-water marks are now
sampled statistically, at timer-interrupt time rather than
deterministically. This may or may not be a problem.
This means that the rss limit is not always updated if rss is increased,
and is thus not as accurate. But the benefit is that the rss checks do not
pollute the vm paths and that it is consistent with the rss limit check.
The following patch removes acct_update_integrals and update_mem_hiwater
from the hot vm paths.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
From: Jay Lan <jlan@sgi.com>
The new "move-accounting-function-calls-out-of-critical-vm-code-paths"
patch in 2.6.11-rc3-mm2 was different from the code I tested.
In particular, it mistakenly dropped the accounting routine calls
in fs/exec.c. The calls in do_execve() are needed to properly
initialize accounting fields. Specifically, the tsk->acct_stimexpd
needs to be initialized to tsk->stime.
I have discussed this with Christoph Lameter and he gave me full
blessings to bring the calls back.
Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
They had their time and place, but right now they are using
infrastructure that is getting re-done, and we're better off
just dropping them.
|
|
If one thread uses ptrace on another thread in the same thread group, there
can be a deadlock when calling exec. The ptrace_stop change ensures that
no tracing stop can be entered for a queued signal, or exit tracing, if the
tracer is part of the same dying group. The exit_notify change prevents a
ptrace zombie from sticking around if its tracer is in the midst of a group
exit (which an exec fakes), so these zombies don't hold up de_thread's
synchronization. The de_thread change ensures the new thread group leader
doesn't wind up ptracing itself, which would produce its own deadlocks.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based, and the effect of this patch for architectures that use the default
implementation should be negligible.
There is a second type cputime64_t which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ; this will
overflow after 49.7 days. That is not enough for kernel_stat (imho not
enough for processes either), so it is necessary to have a 64 bit type.
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers, this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When a thread stops for ptrace exit tracing, it cannot be resumed by
SIGKILL. Once PF_EXITING is set, SIGKILL will not cause a wakeup from stop
(see wants_signal in kernel/signal.c). This patch moves the ptrace stop
for exit tracing before the setting of PF_EXITING.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The __ptrace_unlink code that checks for TASK_TRACED fixed the problem of a
thread being left in TASK_TRACED when no longer being ptraced.
However, an oversight in the original fix made it fail to handle the
case where the child is ptraced by its real parent.
Fixed thus.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use the existing "tty_sem" to protect against the process tty changes
too.
|
|
__exit_mm() is an inlined version of exit_mm(). This patch unifies them.
Saves 356 byte in exit.o.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I just did a quick audit of the use of exit_state and the EXIT_* bit
macros. I guess I didn't really review these changes very closely when you
did them originally. :-(
I found several places that seem like lossy cases of query-replace without
enough thought about the code. Linus has previously said the >= tests
ought to be & tests instead. But for exit_state, it can only ever be 0,
EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as
testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better.
The case like in choose_new_parent is just confusing, to have the
always-false test for EXIT_* bits in ->state there too.
The two cases in wants_signal and do_process_times are actual regressions
that will give us back old bugs in race conditions. These places had
s/TASK/EXIT/ but not s/state/exit_state/, so their tests for exiting
tasks are now wrong and never catch them. I take it back: there is no
regression in wants_signal in practice, I think, because the PF_EXITING
test makes the EXIT_* state checks superfluous anyway. So that is
just another cosmetic case of confusing code. But in do_process_times,
that SIGXCPU-while-exiting race condition is back again.
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is really no point in each task_struct having its own waitchld_exit.
In the only use of it, the waitchld_exit of each thread in a group gets
woken up at the same time. So, there might as well just be one wait queue
for the whole thread group. This patch does that by moving the field from
task_struct to signal_struct. It should have no effect on the behavior,
but saves a little work and a little storage in the multithreaded case.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
After my last change, there are plenty of unused bits available in the new
flags word in signal_struct. This patch moves the `group_exit' flag into
one of those bits, saving a word in signal_struct.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The `sig_avoid_stop_race' checks fail to catch a related race scenario that
can happen. I don't think this has been seen in nature, but it could
happen in the same sorts of situations where the observed problems come up
that those checks work around. This patch takes a different approach to
catching this race condition. The new approach plugs the hole, and I think
is also cleaner.
The issue is a race between one CPU processing a stop signal while another
CPU processes a SIGCONT or SIGKILL. There is a window in stop-signal
processing where the siglock must be released. If a SIGCONT or SIGKILL
comes along here on another CPU, then the stop signal in the midst of being
processed needs to be discarded rather than having the stop take place
after the SIGCONT or SIGKILL has been generated. The existing workaround
checks for this case explicitly by looking for a pending SIGCONT or SIGKILL
after reacquiring the lock.
However, there is another problem related to the same race issue. In the
window where the processing of the stop signal has released the siglock,
the stop signal is not represented in the pending set any more, but it is
still "pending" and not "delivered" in POSIX terms. The SIGCONT coming in
this window is required to clear all pending stop signals. But, if a stop
signal has been dequeued but not yet processed, the SIGCONT generation will
fail to clear it (in handle_stop_signal). Likewise, a SIGKILL coming here
should prevent the stop processing and make the thread die immediately
instead. The `sig_avoid_stop_race' code checks for this by examining the
pending set to see if SIGCONT or SIGKILL is in it. But this fails to
handle the case where another CPU running another thread in the same
process has already dequeued the signal (so it no longer can be found in
the pending set). We must catch this as well, so that the same problems do
not arise when another thread on another CPU acted real fast.
I've fixed this by dumping the `sig_avoid_stop_race' kludge in favor of a
little explicit bookkeeping. Now, dequeuing any stop signal sets a flag
saying that a pending stop signal has been taken on by some CPU since the
last time all pending stop signals were cleared due to SIGCONT/SIGKILL.
The processing of stop signals checks the flag after the window where it
released the lock, and abandons the signal if the flag has been cleared. The
code that clears pending stop signals on SIGCONT generation also clears
this flag. The various places that are trying to ensure the process dies
quickly (SIGKILL or other unhandled signals) also clear the flag. I've
made this a general flags word in signal_struct, and replaced the
stop_state field with flag bits in this word.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch offers a common method of collecting memory usage accounting
data for various accounting packages, including BSD accounting, ELSA, CSA
and any other acct packages that use a common layer of data collection.
New struct fields are added to mm_struct to save high watermarks of rss
usage as well as virtual memory usage.
New struct fields are added to task_struct to collect accumulated rss usage
and vm usages.
These data are collected on per process basis.
Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
switch_uid() doesn't care about tasklist_lock, so do it outside
the lock and avoid a subtle (and very very unlikely to trigger)
AB-BA deadlock.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Klaus Dittrich observed this bug and posted a test case for it.
This patch fixes both that failure mode and some others possible. What
Klaus saw was a false negative (i.e. ECHILD when there was a child)
when the group leader was a zombie but delayed because other children
live; in the test program this happens in a race between the two threads
dying on a signal.
The change to the TASK_TRACED case avoids a potential false positive
(blocking, or WNOHANG returning 0, when there are really no children
left), in the race condition where my_ptrace_child returns zero.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We should not touch "self_exec_id" here. It was the parent that changed,
not us.
|
|
Only set the flag in the cases when the exit state is not either
TASK_DEAD or TASK_ZOMBIE.
(TASK_DEAD or TASK_ZOMBIE will either race or we'll return the
information, so no need to note them).
I confirmed that this fixes the problem, and I also ran some LTP tests.
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
released the tasklist_lock.
Since it released the lock, the process lists may not
be valid any more, and we must repeat the loop rather than
continue with the next parent.
Use -EAGAIN to show this condition (separate from the
normal -EFAULT that may happen if rusage information could
not be copied to user space).
|
|
This clarifies more of the x86 caller/callee stack ownership
issues by making the exception and interrupt handler assembler
interfaces use register calling conventions.
System calls still use the stack.
Tested with "crashme" on UP/SMP.
|
|
The session leader should disassociate from its controlling terminal and
send SIGHUP signals only when the whole session leader process dies.
Currently, this gets done when any thread in that process dies, which is
wrong. This patch fixes it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch changes process accounting to write just one record for a
process with many NPTL threads, rather than one record for each thread. No
record is written until the last thread exits. The process's record shows
the cumulative time of all the threads that ever lived in that process
(thread group). This seems like the clearly right thing and I assume it is
what anyone using process accounting really would like to see.
There is a race condition between multiple threads exiting at the same time
to decide which one should write the accounting record. I couldn't think
of anything clever using existing bookkeeping that would get this right, so
I added another counter for this. (There may be some potential to clean up
existing places that figure out how many non-zombie threads are in the
group, now that this count is available.)
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Not exactly a thing we want done from modules, and no module uses it
anyway.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The feature set the patch includes:
- Key attributes:
- Key type
- Description (by which a key of a particular type can be selected)
- Payload
- UID, GID and permissions mask
- Expiry time
- Keyrings (just a type of key that holds links to other keys)
- User-defined keys
- Key revocation
- Access controls
- Per user key-count and key-memory consumption quota
- Three std keyrings per task: per-thread, per-process, session
- Two std keyrings per user: per-user and default-user-session
- prctl() functions for key and keyring creation and management
- Kernel interfaces for filesystem, blockdev, net stack access
- JIT key creation by usermode helper
There are also two utility programs available:
(*) http://people.redhat.com/~dhowells/keys/keyctl.c
A comprehensive key management tool, permitting all the interfaces
available to userspace to be exercised.
(*) http://people.redhat.com/~dhowells/keys/request-key
An example shell script (to be installed in /sbin) for instantiating a
key.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had. Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and we trigger this assert:
schedule();
BUG();
/* Avoid "noreturn function does return". */
for (;;) ;
I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task. Dead-task freeing is done via PF_DEAD.
Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a race between PTRACE_ATTACH and the real parent calling wait.
For a moment, the task is put in PT_PTRACED but with its parent still
pointing to its real_parent. In this circumstance, if the real parent
calls wait without the WUNTRACED flag, he can see a stopped child status,
which wait should never return without WUNTRACED when the caller is not
using ptrace. Here it is not the caller that is using ptrace, but some
third party.
This patch avoids this race condition by adding the PT_ATTACHED flag to
distinguish a real parent from a ptrace_attach parent when PT_PTRACED is
set, and then having wait use this flag to confirm that things are in order
and not consider the child ptraced when its ->ptrace flags are set but its
parent links have not yet been switched. (ptrace_check_attach also uses it
similarly to rule out a possible race with a bogus ptrace call by the real
parent during ptrace_attach.)
While looking into this, I noticed that every arch's sys_execve has:
current->ptrace &= ~PT_DTRACE;
with no locking at all. So, if an exec happens in a race with
PTRACE_ATTACH, you could wind up with ->ptrace not having PT_PTRACED set
because this store clobbered it. That will cause later BUG hits because
the parent links indicate ptracedness but the flag is not set. The patch
corrects all the places I found to use task_lock around diddling ->ptrace
when it's possible to be racing with ptrace_attach. (The ptrace operation
code itself doesn't have this issue because it already excludes anyone else
being in ptrace_attach.)
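The locking pattern described above can be sketched as follows (an
illustrative kernel-style fragment, not the literal patch hunk; task_lock()
and the PT_* flags are as in the sched.h/ptrace.h headers of this era):

```c
/* Sketch of the fix: do the read-modify-write of ->ptrace under
 * task_lock(), so a concurrent ptrace_attach() setting PT_PTRACED
 * cannot be clobbered by this store in the exec path. */
task_lock(current);
current->ptrace &= ~PT_DTRACE;
task_unlock(current);
```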
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies the new WCONTINUED flag for waitpid, not just for waitid.
I overlooked this addition when I implemented waitid. The real work was
already done to support waitid, but waitpid needs to report the results as
well.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32-bit/64-bit
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
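The scheme above might be sketched like this (an illustrative kernel-style
fragment, not the literal getrlimit implementation): readers that need an
atomic snapshot of a whole struct rlimit take task_lock on the group
leader, while single-word checks take no lock at all.

```c
/* Sketch: consistent two-word snapshot of one rlimit, using
 * task_lock(current->group_leader) as the (overloaded) lock.
 * A bare one-word check like
 *	if (size > current->signal->rlim[RLIMIT_FSIZE].rlim_cur)
 * needs no locking. */
struct rlimit val;

task_lock(current->group_leader);
val = current->signal->rlim[resource];
task_unlock(current->group_leader);
```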
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
The other subtlety is removing:
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be more generally dubious when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note, this is not 100% to POSIX compliance in regards to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
A hack to prevent the compiler from generating tailcalls in these two
functions.
With CONFIG_REGPARM=y, the tailcalled code ends up stomping on the
syscall's argument frame which corrupts userspace's registers.
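One common way to defeat tail-call optimization (a sketch of the general
technique, not necessarily the exact hunk in this patch) is a compiler
barrier between the final call and the return:

```c
/* Sketch: an empty asm with a "memory" clobber after the last call
 * forces the compiler to keep this frame live, so the call cannot be
 * turned into a jump that reuses (and stomps on) the syscall's
 * register-argument frame under CONFIG_REGPARM. */
asmlinkage long sys_waitpid(pid_t pid, unsigned int __user *stat_addr,
			    int options)
{
	long ret = sys_wait4(pid, stat_addr, options, NULL);
	/* prevent tail-call to sys_wait4: */
	asm volatile("" : : : "memory");
	return ret;
}
```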
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
As I explained in the waitid patches, I added the si_rusage field to
siginfo_t with the idea of having the siginfo_t waitid fills in contain all
the information that wait4 or any such call could ever tell you. Nothing
in POSIX nor anywhere else specifies this field in siginfo_t.
When Ulrich and I hashed out the system call interface we wanted, we looked
at siginfo_t and decided there was plenty of space to throw in si_rusage.
Well, it turns out we didn't check the 64-bit platforms. There struct
rusage is ridiculously large (lots of longs for things that are never in a
million years going to hit 2^32), and my changes bumped up the size of
siginfo_t. Changing that size is more trouble than it's worth.
This patch reverts the changes to the siginfo_t structure types,
and no longer provides the rusage details in SIGCHLD signal data.
Instead, I added a fifth argument to the waitid system call to fill in rusage.
waitid is the name of the POSIX function with four arguments. It might
make sense to rename the system call `waitsys' to follow SGI's system call
with the same arguments, or `wait5' in the mindless tradition. But, feh.
I just added the argument to sys_waitid, rather than worrying about
changing the name in all the tables (and choosing a new stupid name).
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This adds a new state TASK_TRACED that is used in place of TASK_STOPPED
when a thread stops because it is ptraced. Now ptrace operations are only
permitted when the target is in TASK_TRACED state, not in TASK_STOPPED.
This means that if a process is stopped normally by a job control signal
and then you PTRACE_ATTACH to it, you will have to send it a SIGCONT before
you can do any ptrace operations on it. (The SIGCONT will be reported to
ptrace and then you can discard it instead of passing it through when you
call PTRACE_CONT et al.)
If a traced child gets orphaned while in TASK_TRACED state, it morphs into
TASK_STOPPED state. This makes it again possible to resume or destroy the
process with SIGCONT or SIGKILL.
All non-signal tracing stops should now be done via ptrace_notify. I've
updated the syscall tracing code in several architectures to do this
instead of replicating the work by hand. I also fixed several that were
unnecessarily repeating some of the checks in ptrace_check_attach. Calling
ptrace_check_attach alone is sufficient, and the old checks repeated before
are now incorrect, not just superfluous.
I've closed a race in ptrace_check_attach. With this, we should have a
robust guarantee that when ptrace starts operating, the task will be in
TASK_TRACED state and won't come out of it. This is because the only way
to resume from TASK_TRACED is via ptrace operations, and only the one
parent thread attached as the tracer can do those.
This patch also cleans up the do_notify_parent and do_notify_parent_cldstop
code so that the dead and stopped cases are completely disjoint. The
notify_parent function is gone.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes strange and obscure pid implementation in current kernels:
- it removes calling of put_task_struct() from detach_pid()
under tasklist_lock. This allows to use blocking calls
in security_task_free() hooks (in __put_task_struct()).
- it saves some space = 5*5 ints = 100 bytes in task_struct
- it's smaller and tidier, more straightforward and doesn't use
any knowledge about pid usage and assignment.
- it removes pid_links, and pid_struct no longer holds reference
counters on task_struct. Instead, new pid_structs are linked
together and only one of them is inserted in the hash list.
Signed-off-by: Kirill Korotaev (kksx@mail.ru)
Signed-off-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
This patch changes the rusage bookkeeping and the semantics of the
getrusage and times calls in a couple of ways.
The first change is in the c* fields counting dead child processes. POSIX
requires that children that have died be counted in these fields when they
are reaped by a wait* call, and that if they are never reaped (e.g.
because of ignoring SIGCHLD, or exiting yourself first) then they are
never counted. These were counted in release_task for all threads. I've
changed it so they are counted in wait_task_zombie, i.e. exactly when
being reaped.
POSIX also specifies for RUSAGE_CHILDREN that the report include the reaped
child processes of the calling process, i.e. whole thread group in Linux,
not just ones forked by the calling thread. POSIX specifies tms_c[us]time
fields in the times call the same way. I've moved the c* fields that
contain this information into signal_struct, where the single set of
counters accumulates data from any thread in the group that calls wait*.
Finally, POSIX specifies getrusage and times as returning cumulative totals
for the whole process (aka thread group), not just the calling thread.
I've added fields in signal_struct to accumulate the stats of detached
threads as they die. The process stats are the sums of these records plus
the stats of each remaining live/zombie thread. The times and getrusage
calls, and the internal uses for filling in wait4 results and siginfo_t,
now iterate over the threads in the thread group and sum up their stats
along with the stats recorded for threads already dead and gone.
I added a new value RUSAGE_GROUP (-3) for the getrusage system call rather
than changing the behavior of the old RUSAGE_SELF (0). POSIX specifies
RUSAGE_SELF to mean all threads, so the glibc getrusage call will just
translate it to RUSAGE_GROUP for new kernels. I did this thinking that
someone somewhere might want the old behavior with an old glibc and a new
kernel (it is only different if they are using CLONE_THREAD anyway).
However, I've changed the times system call to conform to POSIX as well and
did not provide any backward compatibility there. In that case there is
nothing easy like a parameter value to use, it would have to be a new
system call number. That seems pretty pointless. Given that, I wonder if
it is worth bothering to preserve the compatible RUSAGE_SELF behavior by
introducing RUSAGE_GROUP instead of just changing RUSAGE_SELF's meaning.
Comments?
I've done some basic testing on x86 and x86-64, and all the numbers come
out right after these fixes. (I have a test program that shows a few
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a new system call `waitid'. This is a new POSIX call that
subsumes the rest of the wait* family and can do some things the older
calls cannot. A minor addition is the ability to select what kinds of
status to check for with a mask of independent bits, so you can wait for
just stops and not terminations, for example. A more significant
improvement is the WNOWAIT flag, which allows for polling child status
without reaping. This interface fills in a siginfo_t with the same details
that a SIGCHLD for the status change has; some of that info (e.g. si_uid)
is not available via wait4 or other calls.
I've added a new system call that has the parameter conventions of the
POSIX function because that seems like the cleanest thing. This patch
includes the actual system call table additions for i386 and x86-64; other
architectures will need to assign the system call number, and 64-bit ones
may need to implement 32-bit compat support for it as I did for x86-64.
The new features could instead be provided by some new kludge inventions in
the wait4 system call interface (that's what BSD did). If kludges are
preferable to adding a system call, I can work up something different.
I added a struct rusage field si_rusage to siginfo_t in the SIGCHLD case
(this does not affect the size or layout of the struct). This is not part
of the POSIX interface, but it makes it so that `waitid' subsumes all the
functionality of `wait4'. Future kernel ABIs (new arch's or whatnot) can
have only the `waitid' system call and the rest of the wait* family
including wait3 and wait4 can be implemented in user space using waitid.
There is nothing in user space as yet that would make use of the new field.
Most of the new functionality is implemented purely in the waitid system
call itself. POSIX also provides for the WCONTINUED flag to report when a
child process had been stopped by job control and then resumed with
SIGCONT. Corresponding to this, a SIGCHLD is now generated when a child
resumes (unless SA_NOCLDSTOP is set), with the value CLD_CONTINUED in
siginfo_t.si_code. To implement this, some additional bookkeeping is
required in the signal code handling job control stops.
The motivation for this work is to make it possible to implement the POSIX
semantics of the `waitid' function in glibc completely and correctly. If
changing either the system call interface used to accomplish that, or any
details of the kernel implementation work, would improve the chances of
getting this incorporated, I am more than happy to work through any issues.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Anton prompted me to get this patch merged. It changes the core buffer
sync algorithm of OProfile to avoid global locks wherever possible. Anton
tested an earlier version of this patch with some success. I've lightly
tested this applied against 2.6.8.1-mm3 on my two-way machine.
The changes also have the happy side-effect of losing less samples after
munmap operations, and removing the blind spot of tasks exiting inside the
kernel.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When the initial thread in a multi-threaded program dies (the thread group
leader), its child processes are wrongly orphaned, and thereafter when
other threads die their child processes are also orphaned even though live
threads remain in the parent process that can call wait. I have a small
(under 100 lines), POSIX-compliant test program that demonstrates this
using -lpthread (NPTL) if anyone is interested in seeing it.
The bug is that forget_original_parent moves children to the dead parent's
group leader if it's alive, but if not it orphans them. I've changed it so
it instead reparents children to any other live thread in the dead parent's
group (not even preferring the group leader). Children go to init only if
there are no live threads in the parent's group at all. These are the
correct semantics for fork children of POSIX threads.
The second part of the change is to do the CLONE_PARENT behavior always for
CLONE_THREAD, i.e. make sure that each new thread's parent link points to
the real parent of the process and never another thread in its own group.
Without this, when the group leader dies leaving a sole live thread in the
group, forget_original_parent will try to reparent that thread to itself
because it's a child of the dying group leader. Rather than handling this
case specially to reparent to the group leader's parent, it's more
efficient just to make sure that no one ever has a parent link to inside his
own thread group. Now the reparenting work never needs to be done for
threads created in the same group when their creator thread dies. The only
change from losing the who-created-whom information is when you look at
"PPid:" in /proc/PID/task/TID/status. For purposes of all direct system
calls, it was already as if CLONE_THREAD threads had the parent of the
group leader. (POSIX provides no way to keep track of which thread created
which other thread with pthread_create.)
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|