user/sven/linux.git - Linux Kernel

Age	Commit message	Author
2006-07-14	[PATCH] per-task-delay-accounting: /proc export of aggregated block I/O delays	Shailabh Nagar
	Export I/O delays seen by a task through /proc/<tgid>/stats for use in top etc. Note that delays for I/O done for swapping in pages (swapin I/O) are clubbed together with all other I/O here (this is not the case in the netlink interface, where the swapin I/O is kept distinct). [akpm@osdl.org: printk warning fix] Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30	Remove obsolete #include <linux/config.h>	Jörn Engel
	Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-26	[PATCH] hrtimers: remove it_real_value calculation from proc/*/stat	Roman Zippel
	Remove the it_real_value from /proc/*/stat, during 1.2.x was the last time it returned useful data (as it was directly maintained by the scheduler), now it's only a waste of time to calculate it. Return 0 instead. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10	[PATCH] hrtimer: switch itimers to hrtimer	Thomas Gleixner
	switch itimers to a hrtimers-based implementation Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06	[PATCH] s390: cleanup Kconfig	Martin Schwidefsky
	Sanitize some s390 Kconfig options. We have ARCH_S390, ARCH_S390X, ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT. Replace these 6 options by S390, 64BIT and COMPAT. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: rss = file_rss + anon_rss	Hugh Dickins
	I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
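A minimal user-space sketch of the accounting scheme described above, assuming C11 atomics stand in for the kernel's counters. The names `struct mm_counters`, `account_anon_page`, `account_file_page` and `mm_total_rss` are hypothetical and not taken from the patch; only the idea (update one counter, add two on read) comes from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters kept per address space. */
struct mm_counters {
    atomic_long file_rss;  /* resident pages backed by files */
    atomic_long anon_rss;  /* resident anonymous pages */
};

/* Each newly mapped page updates exactly one atomic counter. */
static void account_anon_page(struct mm_counters *mm)
{
    assert(mm != NULL);
    atomic_fetch_add(&mm->anon_rss, 1);
}

static void account_file_page(struct mm_counters *mm)
{
    assert(mm != NULL);
    atomic_fetch_add(&mm->file_rss, 1);
}

/* The total is derived on demand from two reads, in the spirit of get_mm_rss. */
static long mm_total_rss(struct mm_counters *mm)
{
    assert(mm != NULL);
    return atomic_load(&mm->file_rss) + atomic_load(&mm->anon_rss);
}
```

The trade-off is the one the message names: readers pay for two loads, while the hot path pays for one atomic update instead of two.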
2005-09-17	[PATCH] files: fix preemption issues	Dipankar Sarma
	With the new fdtable locking rules, you have to protect fdtable with either ->file_lock or rcu_read_lock/unlock(). There are some places where we aren't doing either. This patch fixes those places. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09	[PATCH] files: break up files struct	Dipankar Sarma
	In order for the RCU to work, the file table array, sets and their sizes must be updated atomically. Instead of ensuring this through too many memory barriers, we put the arrays and their sizes in a separate structure. This patch takes the first step of putting the file table elements in a separate structure fdtable that is embedded within files_struct. It also changes all the users to refer to the file table using files_fdtable() macro. Subsequent application of RCU becomes easier after this. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
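The layout described above can be sketched roughly as follows. This is a simplified user-space illustration, not the kernel's definition: the field names `max_fds`, `fd` and `fdtab`, the accessor body, and the helper `lookup_fd` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct file;  /* opaque for the purposes of this sketch */

/* The array and its size live together so they can later be swapped as one unit. */
struct fdtable {
    unsigned int max_fds;
    struct file **fd;
};

struct files_struct {
    struct fdtable fdtab;  /* embedded table, as in the first step described */
};

/* Users go through the accessor instead of touching the fields directly. */
#define files_fdtable(files) (&(files)->fdtab)

/* Hypothetical helper: bounds-checked lookup through the accessor. */
static struct file *lookup_fd(struct files_struct *files, unsigned int fd)
{
    struct fdtable *fdt;

    assert(files != NULL);
    fdt = files_fdtable(files);
    return fd < fdt->max_fds ? fdt->fd[fd] : NULL;
}
```

Routing every access through one macro is what makes a later switch to an RCU-protected pointer a local change.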
2005-03-28	[PATCH] mm counter operations through macros	Christoph Lameter
	This patch extracts all the operations on counters protected by the page table lock (currently rss and anon_rss) into definitions in include/linux/sched.h. All rss operations are performed through the following macros:
	get_mm_counter(mm, member) -> Obtain the value of a counter
	set_mm_counter(mm, member, value) -> Set the value of a counter
	update_mm_counter(mm, member, value) -> Add to a counter
	inc_mm_counter(mm, member) -> Increment a counter
	dec_mm_counter(mm, member) -> Decrement a counter
	With this patch it becomes easier to add new counters and it is possible to redefine the method of counter handling. The counters are an issue for scalability since they are used in frequently used code paths and may cause cache line bouncing. For example, one may not use counters at all and count the pages when needed, switch to atomic operations if the mm_struct locking changes, or split the rss into counters that can be locally incremented. The relevant fields of the task_struct are renamed with a leading underscore to catch out people who are not using the accessor macros. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
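A minimal sketch of how such accessor macros might be written, assuming plain (non-atomic) counters and token pasting onto underscore-prefixed fields. The macro names come from the commit message; the macro bodies, the struct layout and the helper `charge_anon_page` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fields carry a leading underscore so that direct users fail to compile. */
struct mm_struct {
    unsigned long _rss;
    unsigned long _anon_rss;
};

#define get_mm_counter(mm, member)           ((mm)->_##member)
#define set_mm_counter(mm, member, value)    ((mm)->_##member = (value))
#define update_mm_counter(mm, member, value) ((mm)->_##member += (value))
#define inc_mm_counter(mm, member)           ((mm)->_##member++)
#define dec_mm_counter(mm, member)           ((mm)->_##member--)

/* Hypothetical caller: an anonymous page counted in both rss and anon_rss. */
static void charge_anon_page(struct mm_struct *mm)
{
    assert(mm != NULL);
    inc_mm_counter(mm, rss);
    inc_mm_counter(mm, anon_rss);
}
```

Because callers only name the counter, the storage behind the macros could be changed (for example to atomics) without touching the call sites.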
2005-03-09	[PATCH] cpusets - big numa cpu and memory placement	Paul Jackson
	This is my cpuset patch, with the following changes in the last two weeks:
	1) Updated to 2.6.8.1-mm1
	2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty, not copied from parent. Needed to avoid breaking exclusive property.
	3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top cpuset from cpu_possible_map after smp_init() called.
	4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages() if the current task's cpuset mems_allowed has changed. Use a cpuset generation number, bumped on any cpuset memory placement change, to make this check efficient. Update the task's mems_allowed from its cpuset, if the cpuset has changed.
	5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset, then update its cpus_allowed, using set_cpus_allowed().
	6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to reflect above changes (4) and (5).
	I continue to recommend the following patch for inclusion in your 2.6.9-*mm series, when that opens. It provides an important facility for high performance computing on large systems. Simon Derr of Bull (France) and myself are the primary authors. Erich Focht has indicated that NEC is also a potential user of this patch on the TX-7 NUMA machines, and that he "would very much welcome the inclusion of cpusets." I offer this update to lkml, in order to invite continued feedback. The one prerequisite patch for this cpuset patch was just posted before this one. That was a patch to provide a new bitmap list format, of which cpusets is the first user. This patch has been built on top of 2.6.8.1-mm1, for the arch's: i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64 with and without CONFIG_CPUSET. It has been booted and tested on ia64 (sn2_defconfig, SN2 hardware). The 'alpha' arch also built, except for what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in the final link step. === Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. 
Cpusets constrain the CPU and Memory placement of tasks to only the processor and memory resources within a task's current cpuset. They form a nested hierarchy visible in a virtual file system. These are the essential hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets require small kernel hooks in init, exit, fork, mempolicy, sched_setaffinity, page_alloc and vmscan. And they require a "struct cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t (to go along with the "cpus_allowed" cpumask_t that's already there) in each task struct. These hooks: 1) establish and propagate cpusets, 2) enforce CPU placement in sched_setaffinity, 3) enforce Memory placement in mbind and sys_set_mempolicy, 4) restrict page allocation and scanning to mems_allowed, and 5) restrict migration and set_cpus_allowed to cpus_allowed. The other required hook, restricting task scheduling to CPUs in a task's cpus_allowed mask, is already present. Cpusets extend the usefulness of the existing placement support that was added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and mbind() and set_mempolicy() for memory placement. On smaller or dedicated use systems, the existing calls are often sufficient. On larger NUMA systems running more than one performance-critical job, it is necessary to be able to manage jobs in their entirety. This includes providing a job with exclusive CPU and memory that no other job can use, and being able to list all tasks currently in a cpuset. A given job running within a cpuset would likely use the existing placement calls to manage its CPU and memory placement in more detail. Cpusets are named, nested sets of CPUs and Memory Nodes. Each cpuset is represented by a directory in the cpuset virtual file system, normally mounted at /dev/cpuset. Each cpuset directory provides the following files, which can be read and written: cpus: List of CPUs allowed to tasks in that cpuset. 
mems: List of Memory Nodes allowed to tasks in that cpuset. tasks: List of pid's of tasks in that cpuset. cpu_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of its CPUs (no sibling or cousin cpuset may overlap CPUs). mem_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of its Memory Nodes (no sibling or cousin may overlap). notify_on_release: Flag (0 or 1) - if set, then /sbin/cpuset_release_agent will be invoked, with the name (/dev/cpuset relative path) of that cpuset in argv[1], when the last user of it (task or child cpuset) goes away. This supports automatic cleanup of abandoned cpusets. In addition one new filetype is added to the /proc file system: /proc/<pid>/cpuset: For each task (pid), list its cpuset path, relative to the root of the cpuset file system. This file is read-only. New cpusets are created using 'mkdir' (at the shell or in C). Old ones are removed using 'rmdir'. The above files are accessed using read(2) and write(2) system calls, or shell commands such as 'cat' and 'echo'. The CPUs and Memory Nodes in a given cpuset are always a subset of its parent. The root cpuset has all possible CPUs and Memory Nodes in the system. A cpuset may be exclusive (cpu or memory) only if its parent is similarly exclusive. See further Documentation/cpusets.txt, at the top of the following patch. /proc interface: It is useful, when learning and making new uses of cpusets and placement to be able to see what are the current value of a tasks cpus_allowed and mems_allowed, which are the actual placement used by the kernel scheduler and memory allocator. The cpus_allowed and mems_allowed values are needed by user space apps that are micromanaging placement, such as when moving an app to a obtained by that app within its cpuset using sched_setaffinity, mbind and set_mempolicy. The cpus_allowed value is also available via the sched_getaffinity system call. 
But since the entire rest of the cpuset API, including the display of mems_allowed added here, is via an ascii style presentation in /proc and /dev/cpuset, it is worth the extra couple lines of code to display cpus_allowed in the same way. This patch adds the display of these two fields to the 'status' file in the /proc/<pid> directory of each task. The fields are only added if CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed field of each task). The new output lines look like: $ tail -2 /proc/1/status Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff Mems_allowed: ffffffff,ffffffff Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Simon Derr <simon.derr@bull.net> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
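As an illustration of consuming the Cpus_allowed and Mems_allowed lines shown above, here is a small sketch that counts the set bits in a comma-separated hex mask. The function name `count_allowed` is hypothetical, and the sketch assumes the value has already been split off from its label and trailing newline.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Count set bits in a comma-separated hex mask such as "ffffffff,0000000f".
 * Returns -1 if the string contains anything other than hex digits and commas.
 */
static int count_allowed(const char *mask)
{
    int bits = 0;

    assert(mask != NULL);
    for (; *mask != '\0'; mask++) {
        unsigned char c = (unsigned char)*mask;
        int v;

        if (c == ',')
            continue;               /* separator between mask words */
        if (!isxdigit(c))
            return -1;              /* not a valid mask character */
        v = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        for (; v != 0; v >>= 1)
            bits += v & 1;          /* add the bits of this hex digit */
    }
    return bits;
}
```

For the sample output in the message, the Cpus_allowed value has four fully set 32-bit words, so the function would report 128 allowed CPUs.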
2005-03-07	[PATCH] show RLIMIT_SIGPENDING usage in /proc/PID/status	Roland McGrath
	Jeremy mentioned the aggravation of not being able to tell when your processes are using up signal queue entries and hitting the RLIMIT_SIGPENDING limit. This patch adds a line to /proc/PID/status showing how many queue items are in use, and allowed, for your uid. I can certainly see the appeal of having a display of the number of queued items specific to each process, and even the items within the process broken down per signal number. However, those are not things that are directly counted, and ascertaining them requires iterating through the queue. This patch instead gives what can be readily determined in constant time using the accounting already done. I'm not sure something more complex is warranted just to facilitate one particular debugging need. With this, you can see quickly that this particular problem has come up. Then examination of each process's SigPnd/ShdPnd lines ought to give you an indication of which processes have any queued RT signals sitting around for a long time, and you can then attack those programs directly, though there is no way after the fact to determine how many queued signals with the same number a given process has (short of killing it and seeing the usage drop). Note you may still have a mystery if the leaking programs are not leaving pending RT signals queued, but rather preallocating queue items via timer_create. That usage is not readily apparent in any /proc information. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
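A sketch of reading the in-use and allowed counts from one line of /proc/PID/status. The commit message does not give the line's label or layout; the form `SigQ:` followed by `used/limit` is an assumption here, as is the helper name `parse_sigq`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse a status line of the assumed form "SigQ:\t<in use>/<limit>".
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_sigq(const char *line, unsigned long *used, unsigned long *limit)
{
    assert(line != NULL && used != NULL && limit != NULL);
    /* A space in the format matches any run of whitespace, including a tab. */
    return sscanf(line, "SigQ: %lu/%lu", used, limit) == 2 ? 0 : -1;
}
```

A monitoring tool could call this on each line of the status file and warn when the in-use count approaches the limit.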
2005-03-07	[PATCH] make ITIMER_REAL per-process	Roland McGrath
	POSIX requires that setitimer, getitimer, and alarm work on a per-process basis. Currently, Linux implements these for individual threads. This patch fixes these semantics for the ITIMER_REAL timer (which generates SIGALRM), making it shared by all threads in a process (thread group). Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11	[PATCH] cputime: introduce cputime	Martin Schwidefsky
	This patch introduces the concept of (virtual) cputime. Each architecture can define its method to measure cputime. The main idea is to define a cputime_t type and a set of operations on it (see asm-generic/cputime.h). Then use the type for utime, stime, cutime, cstime, it_virt_value, it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations for each access to these variables. The default implementation is jiffies based and the effect of this patch for architectures which use the default implementation should be negligible. There is a second type cputime64_t which is necessary for the kernel_stat cpu statistics. The default cputime_t is 32 bit and based on HZ, this will overflow after 49.7 days. This is not enough for kernel_stat (imho not enough for a process either), so it is necessary to have a 64 bit type. The third thing that gets introduced by this patch is an additional field for the /proc/stat interface: cpu steal time. An architecture can account cpu steal time by calls to the account_stealtime function. The cpu which backs a virtual processor doesn't spend all of its time for the virtual cpu. To get meaningful cpu usage numbers this involuntary wait time needs to be accounted and exported to user space. From: Hugh Dickins <hugh@veritas.com> The p->signal check in account_system_time is insufficient. If the timer interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set, another cpu may release_task (NULLifying p->signal) in between account_system_time's check and check_rlimit's dereference. Nor should account_it_prof risk send_sig. But surely account_user_time is safe? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
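The 49.7-day figure quoted above can be checked with a little arithmetic. The sketch below assumes a tick rate of HZ = 1000, which the message does not state but which is the rate that yields that number; `days_until_wrap` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Days until a 32-bit tick counter wraps at the given tick rate (ticks per second). */
static double days_until_wrap(unsigned int hz)
{
    const double ticks = (double)UINT32_MAX + 1.0;  /* 2^32 distinct values */

    assert(hz > 0);
    return ticks / hz / 86400.0;                    /* 86400 seconds per day */
}
```

With 1000 ticks per second the counter wraps after roughly 49.7 days; at 100 ticks per second it would last ten times longer, which is still short for system-wide statistics and motivates the 64-bit type.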
2005-01-04	[PATCH] FRV: procfs changes for nommu changes	David Howells
	The attached patch splits some memory-related procfs files into MMU and !MMU versions and places them in separate conditionally-compiled files. A header file local to the fs/proc/ directory is used to declare functions and the like. Additionally, a !MMU-only proc file (/proc/maps) is provided so that master VMA list in a uClinux kernel is viewable. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-12	[PATCH] do_task_stat() use pid_alive()	Andrew Morton
	Use pid_alive() rather than testing for a zero value of ->pid. This is the right thing to do and addresses an oops dereferencing real_parent which one person reported. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-02	[PATCH] proc_pid_status() oops fix	Manfred Spraul
	proc_pid_status dereferences pointers in the task structure even if the task is already dead. This is probably the reason for the oops described in http://bugme.osdl.org/show_bug.cgi?id=3812 The attached patch removes the pointer dereferences by using pid_alive() for testing that the task structure contents is still valid before dereferencing them. The task structure itself is guaranteed to be valid - we hold a reference count. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-20	[PATCH] stat shows wrong ppid	Dinakar Guniguntala
	One more place in fs/proc/array.c where ppid is wrong, which I missed in my previous mail to lkml. Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] ps shows wrong ppid	Dinakar Guniguntala
	/proc shows the wrong PID as parent in the following case: Process A creates Threads 1 & 2 (using pthread_create). Thread 2 then forks and execs process B. getppid() for Process B shows Process A (rightly) as parent, however /proc/B/status shows Thread 2 as PPid (incorrect). Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] fix & clean up zombie/dead task handling & preemption	Ingo Molnar
	This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we had. Right now, the moment procfs does a down() that sleeps in proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be back to TASK_RUNNING, and we trigger this assert: schedule(); BUG(); /* Avoid "noreturn function does return". */ for (;;) ; I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state field, to allow the detaching of exit-signal/parent/wait-handling from descheduling a dead task. Dead-task freeing is done via PF_DEAD. Tested the patch on x86 SMP and UP, but all architectures should work fine. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] show aggregate per-process counters in /proc/PID/stat 2	Lev Makhlis
	Add up resource usage counters for live and dead threads to show aggregate per-process usage in /proc/<pid>/stat. This mirrors the new getrusage() semantics. /proc/<pid>/task/<tid>/stat still has the per-thread usage. After moving the counter aggregation loop inside a task->sighand lock to avoid nasty race conditions, it has survived stress-testing with '(while true; do sleep 1 & done) & top -d 0.1' Signed-off-by: Lev Makhlis <mlev@despammed.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] distinct tgid/tid CPU usage	Albert Cahalan
	This patch adjusts /proc/*/stat to have distinct per-process and per-thread CPU usage, faults, and wchan. Signed-off-by: Albert Cahalan <albert@users.sf.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] make rlimit settings per-process instead of per-thread	Roland McGrath
	POSIX specifies that the limit settings provided by getrlimit/setrlimit are shared by the whole process, not specific to individual threads. This patch changes the behavior of those calls to comply with POSIX. I've moved the struct rlimit array from task_struct to signal_struct, as it has the correct sharing properties. (This reduces kernel memory usage per thread in multithreaded processes by around 100/200 bytes for 32/64 machines respectively.) I took a fairly minimal approach to the locking issues with the newly shared struct rlimit array. It turns out that all the code that is checking limits really just needs to look at one word at a time (one rlim_cur field, usually). It's only the few places like getrlimit itself (and fork), that require atomicity in accessing a whole struct rlimit, so I just used a spin lock for them and no locking for most of the checks. If it turns out that readers of struct rlimit need more atomicity where they are now cheap, or less overhead where they are now atomic (e.g. fork), then seqcount is certainly the right thing to use for them instead of readers using the spin lock. Though it's in signal_struct, I didn't use siglock since the access to rlimits never needs to disable irqs and doesn't overlap with other siglock uses. Instead of adding something new, I overloaded task_lock(task->group_leader) for this; it is used for other things that are not likely to happen simultaneously with limit tweaking. To me that seems preferable to adding a word, but it would be trivial (and arguably cleaner) to add a separate lock for these users (or e.g. just use seqlock, which adds two words but is optimal for readers). Most of the changes here are just the trivial s/->rlim/->signal->rlim/. I stumbled across what must be a long-standing bug, in reparent_to_init. 
It does: memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim))); when surely it was intended to be: memcpy(current->rlim, init_task.rlim, sizeof(current->rlim)); As rlim is an array, the * in the sizeof expression gets the size of the first element, so this just changes the first limit (RLIMIT_CPU). This is for kernel threads, where it's clear that resetting all the rlimits is what you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is superfluous since it will now already have been reset to RLIM_INFINITY. The other subtlety is removing: tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY; in exit_notify, which was to avoid a race signalling during self-reaping exit. As the limit is now shared, a dying thread should not change it for others. Instead, I avoid that race by checking current->state before the RLIMIT_CPU check. (Adding one new conditional in that path is now required one way or another, since if not for this check there would also be a new race with self-reaping exit later on clearing current->signal that would have to be checked for.) The one loose end left by this patch is with process accounting. do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the accounting record. I left this as it was, but it is now changing a limit that might be shared by other threads still running. I left this in a dubious state because it seems to me that process accounting may already be more generally in a dubious state when it comes to NPTL threads. I would think you would want one record per process, with aggregate data about all threads that ever lived in it, not a separate record for each thread. I don't use process accounting myself, but if anyone is interested in testing it out I could provide a patch to change it this way. One final note, this is not 100% to POSIX compliance in regards to rlimits. POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not to each individual thread. 
I will provide patches later on to achieve that change, assuming this patch goes in first. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
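The reparent_to_init bug discussed in this message comes down to the difference between the size of one array element and the size of the whole array. The sketch below reproduces that difference in user space; `NLIMITS`, `struct rlimit_sketch`, `struct task_sketch` and both helper functions are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NLIMITS 4  /* hypothetical array length for the sketch */

struct rlimit_sketch { unsigned long rlim_cur, rlim_max; };

struct task_sketch { struct rlimit_sketch rlim[NLIMITS]; };

/* Buggy form: dereferencing the array inside sizeof yields one element's size. */
static size_t copy_first_only(struct task_sketch *dst, const struct task_sketch *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst->rlim, src->rlim, sizeof(*(dst->rlim)));
    return sizeof(*(dst->rlim));
}

/* Intended form: sizeof applied to the array itself covers every element. */
static size_t copy_all(struct task_sketch *dst, const struct task_sketch *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst->rlim, src->rlim, sizeof(dst->rlim));
    return sizeof(dst->rlim);
}
```

Only the first entry is reset by the buggy form, which matches the observation that just the first limit was being changed.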
2004-10-13	[PATCH] Fix reporting of process start times	Tim Schmielau
	Derive process start times from the posix_clock_monotonic notion of uptime instead of "jiffies", consistent with the earlier change to /proc/uptime itself. (http://linus.bkbits.net:8080/linux-2.5/cset@3ef4851dGg0fxX58R9Zv8SIq9fzNmQ?na%0Av=index.html\|src/.\|src/fs\|src/fs/proc\|related/fs/proc/proc_misc.c) Process start times are reported to userspace in units of 1/USER_HZ since boot, thus applications such as procps need the value of "uptime" to convert them into absolute time. Currently "uptime" is derived from an ntp-corrected time base, but process start time is derived from the free-running "jiffies" counter. This results in inaccurate, drifting process start times as seen by the user, even if the exported number stays constant, because the user's notion of "jiffies" changes in time. It's John Stultz's patch anyways, which I only messed up a bit, but since people started trading signed-off lines on lkml: Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
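The conversion that user-space tools perform can be sketched as below, assuming the caller already has the start time in ticks, the tick rate, the current wall-clock time and the uptime in seconds. The function name `start_to_wallclock` and its parameters are hypothetical; in a real program the tick rate would typically come from `sysconf(_SC_CLK_TCK)`.

```c
#include <assert.h>
#include <time.h>

/*
 * Convert a process start time, given in 1/USER_HZ ticks since boot,
 * into seconds on the wall clock. Boot time is derived from "now - uptime",
 * so any drift between the uptime base and the tick base shows up here.
 */
static time_t start_to_wallclock(unsigned long long start_ticks, long user_hz,
                                 time_t now, long uptime_secs)
{
    time_t boot;

    assert(user_hz > 0);
    boot = now - uptime_secs;
    return boot + (time_t)(start_ticks / (unsigned long long)user_hz);
}
```

Since the result depends on both the tick count and the uptime, the two must be derived from the same time base for the computed start time to stay constant.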
2004-09-07	[PATCH] cleanup ptrace stops and remove notify_parent	Roland McGrath
	This adds a new state TASK_TRACED that is used in place of TASK_STOPPED when a thread stops because it is ptraced. Now ptrace operations are only permitted when the target is in TASK_TRACED state, not in TASK_STOPPED. This means that if a process is stopped normally by a job control signal and then you PTRACE_ATTACH to it, you will have to send it a SIGCONT before you can do any ptrace operations on it. (The SIGCONT will be reported to ptrace and then you can discard it instead of passing it through when you call PTRACE_CONT et al.) If a traced child gets orphaned while in TASK_TRACED state, it morphs into TASK_STOPPED state. This makes it again possible to resume or destroy the process with SIGCONT or SIGKILL. All non-signal tracing stops should now be done via ptrace_notify. I've updated the syscall tracing code in several architectures to do this instead of replicating the work by hand. I also fixed several that were unnecessarily repeating some of the checks in ptrace_check_attach. Calling ptrace_check_attach alone is sufficient, and the old checks repeated before are now incorrect, not just superfluous. I've closed a race in ptrace_check_attach. With this, we should have a robust guarantee that when ptrace starts operating, the task will be in TASK_TRACED state and won't come out of it. This is because the only way to resume from TASK_TRACED is via ptrace operations, and only the one parent thread attached as the tracer can do those. This patch also cleans up the do_notify_parent and do_notify_parent_cldstop code so that the dead and stopped cases are completely disjoint. The notify_parent function is gone. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-30	[PATCH] fix rusage semantics	Roland McGrath
	This patch changes the rusage bookkeeping and the semantics of the getrusage and times calls in a couple of ways. The first change is in the c* fields counting dead child processes. POSIX requires that children that have died be counted in these fields when they are reaped by a wait* call, and that if they are never reaped (e.g. because of ignoring SIGCHLD, or exiting yourself first) then they are never counted. These were counted in release_task for all threads. I've changed it so they are counted in wait_task_zombie, i.e. exactly when being reaped. POSIX also specifies for RUSAGE_CHILDREN that the report include the reaped child processes of the calling process, i.e. whole thread group in Linux, not just ones forked by the calling thread. POSIX specifies tms_c[us]time fields in the times call the same way. I've moved the c* fields that contain this information into signal_struct, where the single set of counters accumulates data from any thread in the group that calls wait*. Finally, POSIX specifies getrusage and times as returning cumulative totals for the whole process (aka thread group), not just the calling thread. I've added fields in signal_struct to accumulate the stats of detached threads as they die. The process stats are the sums of these records plus the stats of each remaining live/zombie thread. The times and getrusage calls, and the internal uses for filling in wait4 results and siginfo_t, now iterate over the threads in the thread group and sum up their stats along with the stats recorded for threads already dead and gone. I added a new value RUSAGE_GROUP (-3) for the getrusage system call rather than changing the behavior of the old RUSAGE_SELF (0). POSIX specifies RUSAGE_SELF to mean all threads, so the glibc getrusage call will just translate it to RUSAGE_GROUP for new kernels. I did this thinking that someone somewhere might want the old behavior with an old glibc and a new kernel (it is only different if they are using CLONE_THREAD anyway). 
However, I've changed the times system call to conform to POSIX as well and did not provide any backward compatibility there. In that case there is nothing easy like a parameter value to use, it would have to be a new system call number. That seems pretty pointless. Given that, I wonder if it is worth bothering to preserve the compatible RUSAGE_SELF behavior by introducing RUSAGE_GROUP instead of just changing RUSAGE_SELF's meaning. Comments? I've done some basic testing on x86 and x86-64, and all the numbers come out right after these fixes. (I have a test program that shows a few Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
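The summation described in this message (totals for dead threads plus the counters of each remaining thread) might be sketched as follows. The struct and function names are hypothetical, and the sketch ignores the locking the kernel needs while walking the thread group.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread counters for threads that are still live or zombie. */
struct thread_sketch { unsigned long utime, stime; };

/* Process-wide record accumulating the counters of threads already gone. */
struct signal_sketch { unsigned long utime, stime; };

/* Process totals = accumulated dead-thread stats + stats of remaining threads. */
static void process_times(const struct signal_sketch *sig,
                          const struct thread_sketch *threads, size_t nthreads,
                          unsigned long *utime, unsigned long *stime)
{
    size_t i;

    assert(sig != NULL && utime != NULL && stime != NULL);
    *utime = sig->utime;
    *stime = sig->stime;
    for (i = 0; i < nthreads; i++) {
        *utime += threads[i].utime;
        *stime += threads[i].stime;
    }
}
```

Keeping the dead-thread totals in a shared record means a thread's usage is not lost when it exits before the process does.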
2004-08-26	[PATCH] O(1) proc_pid_statm()	William Lee Irwin III
	Merely removing down_read(&mm->mmap_sem) from task_vsize() is too half-assed to let stand. The following patch removes the vma iteration as well as the down_read(&mm->mmap_sem) from both task_mem() and task_statm() and callers for the CONFIG_MMU=y case in favor of accounting the various stats reported at the times of vma creation, destruction, and modification. Unlike the 2.4.x patches of the same name, this has no per-pte-modification overhead whatsoever. This patch quashes end user complaints of top(1) being slow as well as kernel hacker complaints of per-pte accounting overhead simultaneously. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-26	[PATCH] task_vsize() locking cleanup	William Lee Irwin III
	task_vsize() doesn't need mm->mmap_sem for the CONFIG_MMU case; the semaphore doesn't prevent mm->total_vm from going stale or getting inconsistent with other numbers regardless. Also, KSTK_EIP() and KSTK_ESP() don't want or need protection from mm->mmap_sem either. So this pushes mm->mmap_sem to task_vsize() in the CONFIG_MMU=n task_vsize(). Also, hoist the prototype of task_vsize() into proc_fs.h The net result of this is a small speedup of procps for CONFIG_MMU. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23	[PATCH] clarify get_task_mm (mmgrab)	Hugh Dickins
	Clarify mmgrab by collapsing it into get_task_mm (in fork.c not inline), and commenting on the special case it is guarding against: when use_mm in an AIO daemon temporarily adopts the mm while it's on its way out. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] proc fs task name locking fix	Mike Kravetz
	Races have been observed between exec-time overwriting of task->comm and /proc accesses to the same data. This causes environment string information to appear in /proc. Fix that up by taking task_lock() around updates to and accesses to task->comm. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-05	[PATCH] fix /proc printing of TASK_DEAD state	Roland McGrath
	I just stumbled across this patch that's been sitting in my tree for ages. I thought I'd sent this in before. It's a trivial fix for the printing of task state in /proc and sysrq dumps and such, so that TASK_DEAD shows up correctly. This state is pretty much only ever there to be seen when there are exit/reaping bugs, but it's not like that hasn't come up. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-24	[PATCH] Fix race condition with current->group_info	Andrew Morton
	From: Olaf Kirch <okir@suse.de> I have been chasing a corruption of current->group_info on PPC during NFS stress tests. The problem seems to be that nfsd is messing with its group_info quite a bit, while some monitoring processes look at /proc/<pid>/status and do a get_group_info/put_group_info without any locking. This problem can be reproduced on ppc platforms within a few seconds if you generate some NFS load and do a "cat /proc/XXX/status" of an nfsd thread in a tight loop. I therefore think changes to current->group_info, and querying it from a different process, needs to be protected using the task_lock. (akpm: task->group_info here is safe against exit() because the task holds a ref on group_info which is released in __put_task_struct, and the /proc file has a ref on the task_struct).
2004-04-26	[PATCH] fs/proc/array.c: workaround for gcc-2.96	Andrew Morton
	From: Alan Stern <stern@rowland.harvard.edu> This patch is needed to work around gcc-2.96's limited ability to cope with long long intermediate expression types. I don't know why the code compiled okay earlier and failed now.
2004-04-11	[PATCH] procfs LoadAVG/load_avg scaling fix	Andrew Morton
	From: Ingo Molnar <mingo@elte.hu> Dave reported that /proc/*/status sometimes shows 101% as LoadAVG, which makes no sense. The reason for the bug is slightly incorrect scaling of the load_avg value. The patch below fixes this.
2004-04-11	[PATCH] eliminate nswap and cnswap	Andrew Morton
	From: Matt Mackall <mpm@selenic.com> The nswap and cnswap counters have never been incremented as Linux doesn't do task swapping.
2004-04-11	[PATCH] move job control fields from task_struct to signal_struct	Andrew Morton
	From: Roland McGrath <roland@redhat.com> This patch moves all the fields relating to job control from task_struct to signal_struct, so that all this info is properly per-process rather than being per-thread.
2004-02-18	[PATCH] NGROUPS 2.6.2rc2 + fixups	Andrew Morton
	From: Tim Hockin <thockin@sun.com>, Neil Brown <neilb@cse.unsw.edu.au>, me New groups infrastructure. task->groups and task->ngroups are replaced by task->group_info. group_info is a refcounted, dynamic struct with an array of pages. This allows for large numbers of groups. The current limit of 32 groups has been raised to 64k groups. It can be raised more by changing the NGROUPS_MAX constant in limits.h.
2003-10-14	[PATCH] number of threads in /proc	Albert Cahalan
	Having the number-of-threads value easily available turns out to be very important for procps performance. The /proc/*/stat thing getting reused has been zero since the 2.2.xx days, and was the seldom-used timeout value before that.
2003-10-09	Revert the process group accessor functions. They are buggy, and	Linus Torvalds
	cause NULL pointer references in /proc. Moreover, it's questionable whether the whole thing makes sense at all. Per-thread state is good. Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097 Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200 Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
2003-10-04	[PATCH] move job control fields from task_struct to	Andrew Morton
	From: Roland McGrath <roland@redhat.com> This patch completes what was started with the `process_group' accessor function, moving all the job control-related fields from task_struct into signal_struct and using process_foo accessor functions to read them. All these things are per-process in POSIX, none per-thread. Off hand it's hard to come up with the hairy MT scenarios in which the existing code would do insane things, but trust me, they're there. At any rate, all the uses being done via inline accessor functions now has got to be all good. I did a "make allyesconfig" build and caught the few random drivers and whatnot that referred to these fields. I was surprised to find how few references to ->tty there really were to fix up. I'm sure there will be a few more fixups needed in non-x86 code. The only actual testing of a running kernel with these patches I've done is on my normal minimal x86 config. Everything works fine as it did before as far as I can tell. One issue that may be of concern is the lack of any locking on multiple threads diddling these fields. I don't think it really matters, though there might be some obscure races that could produce inconsistent job control results. Nothing shattering, I'm sure; probably only something like a multi-threaded program calling setsid while its other threads do tty i/o, which never happens in reality. This is the same situation we get by using ->group_leader->foo without other synchronization, which seemed to be the trend and no one was worried about it.
2003-09-23	[PATCH] 32-bit dev_t fixups	Alexander Viro
	Argh. A couple of places where we needed ..._encode_dev() had been lost in reordering the patchset - the most notable being ctty number in /proc/<pid>/stat. Fix follows:
2003-09-22	[PATCH] prepare for 32-bit dev_t: tty usage	Alexander Viro
	tty->device had been used only in a couple of places and can be calculated by tty->index and tty->driver. Field removed, its users switched to static inline dev_t tty_devnum(tty).
2003-09-21	[PATCH] scheduler infrastructure	Andrew Morton
	From: Ingo Molnar <mingo@elte.hu> the attached scheduler patch (against test2-mm2) adds the scheduling infrastructure items discussed on lkml. I got good feedback - and while i dont expect it to solve all problems, it does solve a number of bad ones: - test_starve.c code from David Mosberger - thud.c making the system unusable due to unfairness - fair/accurate sleep average based on a finegrained clock - audio skipping way too easily other changes in sched-test2-mm2-A3: - ia64 sched_clock() code, from David Mosberger. - migration thread startup without relying on implicit scheduling behavior. While the current 2.6 code is correct (due to the cpu-up code adding CPUs one by one), it's also fragile - and this code cannot be carried over into the 2.4 backports. So adding this method would clean up the startup and would make it easier to have 2.4 backports. and here's the original changelog for the scheduler changes: - cycle accuracy (nanosec resolution) timekeeping within the scheduler. This fixes a number of audio artifacts (skipping) i've reproduced. I dont think we can get away without going cycle accuracy - reading the cycle counter adds some overhead, but it's acceptable. The first nanosec-accuracy patch was done by Mike Galbraith - this patch is different but similar in nature. I went further in also changing the sleep_avg to be of nanosec resolution. - more finegrained timeslices: there's now a timeslice 'sub unit' of 50 usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level will roundrobin with this unit. This change is intended to make gaming latencies shorter. - include scheduling latency in sleep bonus calculation. This change extends the sleep-average calculation to the period of time a task spends on the runqueue but doesnt get scheduled yet, right after wakeup. Note that tasks that were preempted (ie. not woken up) and are still on the runqueue do not get this benefit. 
This change closes one of the last holes in the dynamic priority estimation, it should result in interactive tasks getting more priority under heavy load. This change also fixes the test-starve.c testcase from David Mosberger. The TSC-based scheduler clock is disabled on ia32 NUMA platforms. (ie. platforms that have unsynched TSC for sure.) Those platforms should provide the proper code to rely on the TSC in a global way. (no such infrastructure exists at the moment - the monotonic TSC-based clock doesnt deal with TSC offsets either, as far as i can tell.)
2003-09-21	[PATCH] Fix setpgid and threads	Andrew Morton
	From: Jeremy Fitzhardinge <jeremy@goop.org> I'm resending my patch to fix this problem. To recap: every task_struct has its own copy of the thread group's pgrp. Only the thread group leader is allowed to change the tgrp's pgrp, but it only updates its own copy of pgrp, while all the other threads in the tgrp use the old value they inherited on creation. This patch simply updates all the other threads' pgrp when the tgrp leader changes pgrp. Ulrich has already expressed reservations about this patch since it is (1) incomplete (it doesn't cover the case of other ids which have similar problems), (2) racy (it doesn't synchronize with other threads looking at the task pgrp, so they could see an inconsistent view) and (3) slow (it takes linear time with respect to the number of threads in the tgrp). My reaction is that (1) it fixes the actual bug I'm encountering in a real program. (2) doesn't really matter for pgrp, since it is mostly an issue with respect to the terminal job-control code (which is even more broken without this patch). Regarding (3), I think there are very few programs which have a large number of threads which change process group id on a regular basis (a heavily multi-threaded job-control shell?). Ulrich also said he has a (proposed?) much better fix, which I've been looking forward to. I'm submitting this patch as a stop-gap fix for a real bug, and perhaps to prompt the improved patch. An alternative fix, at least for pgrp, is to change all references to ->pgrp to group_leader->pgrp. This may be sufficient on its own, but it would be a reasonably intrusive patch (I count 95 instances in 32 files in the 2.6.0-test3-mm3 tree).
2003-08-20	[PATCH] fix /proc mm_struct refcounting bug	Andrew Morton
	From: Suparna Bhattacharya <suparna@in.ibm.com> The /proc code's bare atomic_inc(&mm->count) is racy against __exit_mm()'s mmput() on another CPU: it calls mmput() outside task_lock(tsk), and task_lock() isn't appropriate locking anyway. So what happens is: (CPU0) mmput() ->atomic_dec_and_lock(mm->mm_users); (CPU1) atomic_inc(mm->mm_users); (CPU0) ->list_del(mm->mmlist); (CPU1) mmput() ->atomic_dec_and_lock(mm->mm_users) ->list_del(mm->mmlist). And the double list_del() of course goes splat. So we use mmlist_lock to synchronise these steps. The patch implements a new mmgrab() routine which increments mm_users only if the mm isn't already going away. Changes get_task_mm() and proc_pid_stat() to call mmgrab() instead of a direct atomic_inc(&mm->mm_users). Hugh, there's some cruft in swapoff which looks like it should be using mmgrab()...
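The rule "take a reference only if the object isn't already going away" can be sketched in portable C11. This is an illustration under stated assumptions, not the patch itself: the message says the kernel fix serialises on mmlist_lock, whereas this sketch swaps in a compare-and-swap loop to show the same invariant without a lock. The names toy_ref, toy_grab and toy_put are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for an mm and its user count. */
struct toy_ref {
    atomic_int users;
};

/*
 * Take a reference only while the count is still nonzero.
 * A count of zero means the last user is already tearing the object down,
 * so a late increment would resurrect it and lead to a double teardown.
 */
static bool toy_grab(struct toy_ref *r)
{
    int old = atomic_load(&r->users);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&r->users, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if the caller released the last one. */
static bool toy_put(struct toy_ref *r)
{
    return atomic_fetch_sub(&r->users, 1) == 1;
}
```

A caller that gets false from toy_grab() must treat the object as gone, just as a caller of the mmgrab() described above has to cope with an mm that is on its way out.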
2003-04-23	[PATCH] tty cleanups (11/12)	Alexander Viro
	tty->device switched to dev_t There are very few uses of tty->device left by now; most of them actually want dev_t (process accounting, proc/<pid>/stat, several ioctls, slip.c logics, etc.) and the rest will go away shortly.
2003-02-24	[PATCH] make jiffies wrap 5 min after boot	Andrew Morton
	From Tim Schmielau <tim@physik3.uni-rostock.de> Force jiffies to start out at five-minutes-before-wrap. To find jiffy-wrapping bugs.
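A small sketch of why an early wrap flushes out bugs, assuming a 32-bit tick counter and a tick rate of 1000 per second. The TOY_* macros and toy_* helpers are hypothetical names for illustration; the signed-difference comparison is the usual wrap-safe idiom, and the cast of an out-of-range value to int32_t relies on the two's-complement behaviour of common compilers.

```c
#include <stdint.h>

#define TOY_HZ 1000 /* assumption: 1000 ticks per second */

/* Start the counter five minutes (300 s) before it wraps to zero. */
#define TOY_INITIAL_JIFFIES ((uint32_t)(-300 * TOY_HZ))

/* Naive comparison: gives the wrong answer once the counter has wrapped. */
static int toy_naive_after(uint32_t a, uint32_t b)
{
    return a > b;
}

/* Wrap-safe comparison: true if a is later than b, judged by signed distance. */
static int toy_time_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}
```

With this starting value, a deadline 301 seconds after boot is numerically smaller than the boot-time counter, so code using the naive comparison misbehaves within minutes instead of after weeks of uptime.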
2003-02-16	It's usually considered stupid to lock the same spinlock twice in	Linus Torvalds
	close succession. However, for this once we'll just call it "inspired". But let's decide to pair the lock with an unlock anyway, even if it is boring and "square".
2003-02-16	Do proper signal locking for the old-style /proc/stat too.	Linus Torvalds

2003-02-16	Clean up and fix locking around signal rendering	Linus Torvalds

2003-02-10	Report shared pending signals in /proc/<pid>/status	Linus Torvalds
	Patch from Roland McGrath.