user/sven/linux.git - Linux Kernel

Age	Commit message (Collapse)	Author
2006-02-28	[PATCH] Add mm->task_size and fix powerpc vdso	Benjamin Herrenschmidt
	This patch adds mm->task_size to keep track of the task size of a given mm and uses that to fix the powerpc vdso so that it uses the mm task size to decide what pages to fault in instead of the current thread flags (which broke when ptracing). (akpm: I expect that mm_struct.task_size will become the way in which we finally sort out the confusion between 32-bit processes and 32-bit mm's. It may need tweaks, but at this stage this patch is powerpc-only.) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15	[PATCH] fix zap_thread's ptrace related problems	Oleg Nesterov
	1. The tracee can go from ptrace_stop() to do_signal_stop() after __ptrace_unlink(p). 2. It is unsafe to __ptrace_unlink(p) while p->parent may wait for tasklist_lock in ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@lst.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18	[PATCH] vfs: *at functions: core	Ulrich Drepper
	Here is a series of patches which introduce in total 13 new system calls which take a file descriptor/filename pair instead of a single file name. These functions, openat etc, have been discussed on numerous occasions. They are needed to implement race-free filesystem traversal, they are necessary to implement a virtual per-thread current working directory (think multi-threaded backup software), etc. We have in glibc today implementations of the interfaces which use the /proc/self/fd magic. But this code is rather expensive. Here are some results (similar to what Jim Meyering posted before). The test creates a deep directory hierarchy on a tmpfs filesystem. Then rm -fr is used to remove all directories. Without syscall support I get this: real 0m31.921s user 0m0.688s sys 0m31.234s With syscall support the results are much better: real 0m20.699s user 0m0.536s sys 0m20.149s The interfaces are for obvious reasons currently not much used. But they'll be used. coreutils (and Jeff's posixutils) are already using them. Furthermore, code like ftw/fts in libc (maybe even glob) will also start using them. I expect a patch to make follow soon. Every program which is walking the filesystem tree will benefit. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14	[PATCH] Unlinline a bunch of other functions	Arjan van de Ven
	Remove the "inline" keyword from a bunch of big functions in the kernel with the goal of shrinking it by 30kb to 40kb Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10	[PATCH] hrtimer: switch itimers to hrtimer	Thomas Gleixner
	switch itimers to a hrtimers-based implementation Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08	[PATCH] do_coredump() should reset group_stop_count earlier	Oleg Nesterov
	__group_complete_signal() sets ->group_stop_count in sig_kernel_coredump() path and marks the target thread as ->group_exit_task. So any thread except group_exit_task will go to handle_group_stop()->finish_stop(). However, when group_exit_task actually starts do_coredump(), it sets SIGNAL_GROUP_EXIT, but does not reset ->group_stop_count while killing other threads. If we have not yet stopped threads in the same thread group, they all will spin in kernel mode until group_exit_task sends them SIGKILL, because ->group_stop_count > 0 means: recalc_sigpending_tsk() never clears TIF_SIGPENDING get_signal_to_deliver() goes to handle_group_stop() handle_group_stop() returns when SIGNAL_GROUP_EXIT set syscall_exit/resume_userspace notice TIF_SIGPENDING, call get_signal_to_deliver() again. So we are wasting cpu cycles, and if one of these threads is rt_task() this may be a serious problem. NOTE: do_coredump() holds ->mmap_sem, so not stopped threads can't escape coredumping after clearing ->group_stop_count. See also this thread: http://marc.theaimsgroup.com/?t=112739139900002 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08	[PATCH] Fix some problems with truncate and mtime semantics.	NeilBrown
	SUS requires that when truncating a file to the size that it currently is: truncate and ftruncate should NOT modify ctime or mtime O_TRUNC SHOULD modify ctime and mtime. Currently mtime and ctime are always modified on most local filesystems (side effect of ->truncate) or never modified (on NFS). With this patch: ATTR_CTIME\|ATTR_MTIME are sent with ATTR_SIZE precisely when an update of these times is required whether size changes or not (via a new argument to do_truncate). This allows NFS to do the right thing for O_TRUNC. inode_setattr nolonger forces ATTR_MTIME\|ATTR_CTIME when the ATTR_SIZE sets the size to it's current value. This allows local filesystems to do the right thing for f?truncate. Also, the logic in inode_setattr is changed a bit so there are two return points. One returns the error from vmtruncate if it failed, the other returns 0 (there can be no other failure). Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change requested, we now fall-through and mark_inode_dirty. If a filesystem did not have a ->truncate function, then vmtruncate will have changed i_size, without marking the inode as 'dirty', and I think this is wrong. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08	[PATCH] RCU signal handling	Ingo Molnar
	RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked instead of tasklist_lock read-locked. This is a scalability improvement on SMP and a preemption-latency improvement under PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: William Irwin <wli@holomorphy.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06	[PATCH] mm: rmap optimisation	Nick Piggin
	Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29	VM: add common helper function to create the page tables	Linus Torvalds
	This logic was duplicated four times, for no good reason. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23	[PATCH] fix do_wait() vs exec() race	Oleg Nesterov
	When non-leader thread does exec, de_thread adds old leader to the init's ->children list in EXIT_ZOMBIE state and drops tasklist_lock. This means that release_task(leader) in de_thread() is racy vs do_wait() from init task. I think de_thread() should set old leader's state to EXIT_DEAD instead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: george anzinger <george@mvista.com> Cc: Roland Dreier <rolandd@cisco.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09	[PATCH] add a file_permission helper	Christoph Hellwig
	A few more callers of permission() just want to check for a different access pattern on an already open file. This patch adds a wrapper for permission() that takes a file in preparation of per-mount read-only support and to clean up the callers a little. The helper is not intended for new code, everything without the interface set in stone should use vfs_permission() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09	[PATCH] add a vfs_permission helper	Christoph Hellwig
	Most permission() calls have a struct nameidata * available. This helper takes that as an argument and thus makes sure we pass it down for lookup intents and prepares for per-mount read-only support where we need a struct vfsmount for checking whether a file is writeable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-08	[PATCH] fix de_thread() vs send_group_sigqueue() race	Oleg Nesterov
	When non-leader thread does exec, de_thread calls release_task(leader) before calling exit_itimers(). If local timer interrupt happens in between, it can oops in send_group_sigqueue() while taking ->sighand->siglock == NULL. However, we can't change send_group_sigqueue() to check p->signal != NULL, because sys_timer_create() does get_task_struct() only in SIGEV_THREAD_ID case. So it is possible that this task_struct was already freed and we can't trust p->signal. This patch changes de_thread() so that leader released after exit_itimers() call. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07	[PATCH] VFS: pass file pointer to filesystem from ftruncate()	Miklos Szeredi
	This patch extends the iattr structure with a file pointer memeber, and adds an ATTR_FILE validity flag for this member. This is set if do_truncate() is invoked from ftruncate() or from do_coredump(). The change is source and binary compatible. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07	[PATCH] Process Events Connector	Matt Helsley
	This patch adds a connector that reports fork, exec, id change, and exit events for all processes to userspace. It replaces the fork_advisor patch that ELSA is currently using. Applications that may find these events useful include accounting/auditing (e.g. ELSA), system activity monitoring (e.g. top), security, and resource management (e.g. CKRM). Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] fix de_thread() vs do_coredump() deadlock	Oleg Nesterov
	de_thread() sends SIGKILL to all sub-threads and waits them to die in 'D' state. It is possible that one of the threads already dequeued coredump signal. When de_thread() unlocks ->sighand->lock that thread can enter do_coredump()->coredump_wait() and cause a deadlock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] coredump_wait() cleanup	Oleg Nesterov
	This patch deletes pointless code from coredump_wait(). 1. It does useless mm->core_waiters inc/dec under mm->mmap_sem, but any changes to ->core_waiters have no effect until we drop ->mmap_sem. 2. It calls yield() for absolutely unknown reason. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] fix de_thread vs it_real_fn() deadlock	Oleg Nesterov
	de_thread() calls del_timer_sync(->real_timer) under ->sighand->siglock. This is deadlockable, it_real_fn sends a signal and needs this lock too. Also, delete unneeded ->real_timer.data assignment. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30	[PATCH] little de_thread() cleanup	Oleg Nesterov
	Trivial, saves one 'if' branch in de_thread(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: ptd_alloc take ptlock	Hugh Dickins
	Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: update_hiwaters just in time	Hugh Dickins
	update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: rss = file_rss + anon_rss	Hugh Dickins
	I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-18	VFS: Allow the filesystem to return a full file pointer on open intent	Trond Myklebust
	This is needed by NFSv4 for atomicity reasons: our open command is in fact a lookup+open, so we need to be able to propagate open context information from lookup() into the resulting struct file's private_data field. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2005-09-14	[PATCH] error path in setup_arg_pages() misses vm_unacct_memory()	Hugh Dickins
	Pavel Emelianov and Kirill Korotaev observe that fs and arch users of security_vm_enough_memory tend to forget to vm_unacct_memory when a failure occurs further down (typically in setup_arg_pages variants). These are all users of insert_vm_struct, and that reservation will only be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't. So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into Committed_AS each time they're run. But don't add VM_ACCOUNT to them, it's inappropriate to reserve against the very unlikely case that gdb be used to COW a vDSO page - we ought to do something about that in do_wp_page, but there are yet other inconsistencies to be resolved. The safe and economical way to fix this is to let insert_vm_struct do the security_vm_enough_memory check when it finds VM_ACCOUNT is set. And the MIPS irix_brk has been calling security_vm_enough_memory before calling do_brk which repeats it, doubly accounting and so also leaking. Remove that, and all the fs and arch calls to security_vm_enough_memory: give it a less misleading name later on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-14	[PATCH] Fix fs/exec.c:788 (de_thread()) BUG_ON	Alexander Nyberg
	It turns out that the BUG_ON() in fs/exec.c: de_thread() is unreliable and can trigger due to the test itself being racy. de_thread() does while (atomic_read(&sig->count) > count) { } ..... ..... BUG_ON(!thread_group_empty(current)); but release_task does write_lock_irq(&tasklist_lock) __exit_signal (this is where atomic_dec(&sig->count) is run) __exit_sighand __unhash_process takes write lock on tasklist_lock remove itself out of PIDTYPE_TGID list write_unlock_irq(&tasklist_lock) so there's a clear (although small) window between the atomic_dec(&sig->count) and the actual PIDTYPE_TGID unhashing of the thread. And actually there is no need for all threads to have exited at this point, so we simply kill the BUG_ON. Big thanks to Marc Lehmann who provided the test-case. Fixes Bug 5170 (http://bugme.osdl.org/show_bug.cgi?id=5170) Signed-off-by: Alexander Nyberg <alexn@telia.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@osdl.org> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09	[PATCH] files: break up files struct	Dipankar Sarma
	In order for the RCU to work, the file table array, sets and their sizes must be updated atomically. Instead of ensuring this through too many memory barriers, we put the arrays and their sizes in a separate structure. This patch takes the first step of putting the file table elements in a separate structure fdtable that is embedded withing files_struct. It also changes all the users to refer to the file table using files_fdtable() macro. Subsequent applciation of RCU becomes easier after this. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12	[PATCH] reset real_timer target on exec leader change	Roland McGrath
	When a noninitial thread does exec, it becomes the new group leader. If there is a ITIMER_REAL timer running, it points at the old group leader and when it fires it can follow a stale pointer. The timer data needs to be reset to point at the exec'ing thread that is becoming the group leader. This has to synchronize with any concurrent firing of the timer to make sure that it_real_fn can never run when the data points to a thread that might have been reaped already. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23	[PATCH] setuid core dump	Alan Cox
	Add a new `suid_dumpable' sysctl: This value can be used to query and set the core dump mode for setuid or otherwise protected/tainted binaries. The modes are 0 - (default) - traditional behaviour. Any process which has changed privilege levels or is execute only will not be dumped 1 - (debug) - all processes dump core when possible. The core dump is owned by the current user and no security is applied. This is intended for system debugging situations only. Ptrace is unchecked. 2 - (suidsafe) - any binary which normally would not be dumped is dumped readable by root only. This allows the end user to remove such a dump but not access it directly. For security reasons core dumps in this mode will not overwrite one another or other files. This mode is appropriate when adminstrators are attempting to debug problems in a normal environment. (akpm: > > +EXPORT_SYMBOL(suid_dumpable); > > EXPORT_SYMBOL_GPL? No problem to me. > > if (current->euid == current->uid && current->egid == current->gid) > > current->mm->dumpable = 1; > > Should this be SUID_DUMP_USER? Actually the feedback I had from last time was that the SUID_ defines should go because its clearer to follow the numbers. They can go everywhere (and there are lots of places where dumpable is tested/used as a bool in untouched code) > Maybe this should be renamed to `dump_policy' or something. Doing that > would help us catch any code which isn't using the #defines, too. Fair comment. The patch was designed to be easy to maintain for Red Hat rather than for merging. Changing that field would create a gigantic diff because it is used all over the place. ) Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-18	Clean up subthread exec	Linus Torvalds
	Make sure we re-parent itimers, and use BUG_ON() instead of an explicit conditional BUG().
2005-05-05	[PATCH] comments on locking of task->comm	Paolo 'Blaisorblade' Giarrusso
	Add some comments about task->comm, to explain what it is near its definition and provide some important pointers to its uses. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05	[PATCH] make some things static	Adrian Bunk
	This patch makes some needlessly global identifiers static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Arjan van de Ven <arjanv@infradead.org> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-28	[PATCH] mm counter operations through macros	Christoph Lameter
	This patch extracts all the operations on counters protected by the page table lock (currently rss and anon_rss) into definitions in include/linux/sched.h. All rss operations are performed through the following macros: get_mm_counter(mm, member) -> Obtain the value of a counter set_mm_counter(mm, member, value) -> Set the value of a counter update_mm_counter(mm, member, value) -> Add to a counter inc_mm_counter(mm, member) -> Increment a counter dec_mm_counter(mm, member) -> Decrement a counter With this patch it becomes easier to add new counters and it is possible to redefine the method of counter handling. The counters are an issue for scalability since they are used in frequently used code paths and may cause cache line bouncing. F.e. One may not use counters at all and count the pages when needed, switch to atomic operations if the mm_struct locking changes or split the rss into counters that can be locally incremented. The relevant fields of the task_struct are renamed with a leading underscore to catch out people who are not using the acceessor macros. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-15	[PATCH] use strncpy in get_task_comm	Prasanna Meda
	Set_task_comm uses strlcpy, so get_task_comm must use strncpy. Signed-Off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-05	Merge davem@nuts:/disk1/BK/set_pte-2.6	David S. Miller
	into picasso.davemloft.net:/home/davem/src/BK/set_pte-2.6
2005-03-04	[PATCH] Move accounting function calls out of critical vm code paths	Christoph Lameter
	In the 2.6.11 development cycle function calls have been added to lots of hot vm paths to do accounting. I think these should not go into the final 2.6.1 release because these statistics can be collected in a different way that does not require the updating of counters from frequently used vm code paths and is consistent with the methods use elsewhere in the kernel to obtain statistics. These function calls are acct_update_integrals -> Account for processes based on stime changes update_mem_hiwater -> takes rss and total_vm hiwater marks. acct_update_integrals is only useful to call if stime changes otherwise it will simply return. It is therefore best to relocate the function call to acct_update_integral into the function that updates stime which is account_system_time and remove it from the vm code paths. update_mem_hiwater finds the rss hiwater mark. We call that from timer context as well. This means that processes' high-water marks are now sampled statistically, at timer-interrupt time rather than deterministically. This may or may not be a problem.. This means that the rss limit is not always updated if rss is increased and thus not as accurate. But the benefit is that the rss checks do no pollute the vm paths and that it is consistent with the rss limit check. The following patch removes acct_update_integrals and update_mem_hiwater from the hot vm paths. Signed-off-by: Christoph Lameter <clameter@sgi.com> From: Jay Lan <jlan@sgi.com> The new "move-accounting-function-calls-out-of-critical-vm-code-paths" patch in 2.6.11-rc3-mm2 was different from the code i tested. In particular, it mistakenly dropped the accounting routine calls in fs/exec.c. The calls in do_execve() are needed to properly initialize accounting fields. Specifically, the tsk->acct_stimexpd needs to be initialized to tsk->stime. I have discussed this with Christoph Lameter and he gave me full blessings to bring the calls back. Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-04	[PATCH] Randomisation: top-of-stack randomization	Arjan van de Ven
	In addition to randomisation of the stack pointer within the stack, the stack itself should be randomized too. We need both approaches, we can only randomize the stack itself in pagesize increments. However randomizing large ranges with the stackpointer runs into the situation where a huge chunk of the stack rlimit is used by the randomisation; this is undesirable so we need to do both. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-04	[PATCH] Randomisation: stack randomisation	Arjan van de Ven
	The patch below replaces the existing 8Kb randomisation of the userspace stack pointer (which is currently only done for Hyperthreaded P-IVs) with a more general randomisation over a 64Kb range. 64Kb is not a lot, but it's a start and once the dust settles we can increase this value to a more agressive value. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-04	[PATCH] Randomisation: add PF_RANDOMIZE	Arjan van de Ven
	Even though there is a global flag to disable randomisation, it's useful to have a per process flag too; the patch below introduces this per process flag and automatically sets it for "new" binaries. Eventually we will want to tie this to the legacy-va-space personality Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-02-22	[MM]: Add set_pte_at() which takes 'mm' and 'addr' args.	David S. Miller
	I'm taking a slightly different approach this time around so things are easier to integrate. Here is the first patch which builds the infrastructure. Basically: 1) Add set_pte_at() which is set_pte() with 'mm' and 'addr' arguments added. All generic code uses set_pte_at(). Most platforms simply get this define: #define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval) I chose this method over simply changing all set_pte() call sites because many platforms implement this in assembler and it would take forever to preserve the build and stabilize things if modifying that was necessary. Soon, with platform maintainer's help, we can kill of set_pte() entirely. To be honest, there are only a handful of set_pte() call sites in the arch specific code. Actually, in this patch ppc64 is completely set_pte() free and does not define it. 2) pte_clear() gets 'mm' and 'addr' arguments now. This had a cascading effect on many ptep_test_and_*() routines. Specifically: a) ptep_test_and_clear_{young,dirty}() now take 'vma' and 'address' args. b) ptep_get_and_clear now take 'mm' and 'address' args. c) ptep_mkdirty was deleted, unused by any code. d) ptep_set_wrprotect now takes 'mm' and 'address' args. I've tested this patch as follows: 1) compile and run tested on sparc64/SMP 2) compile tested on: a) ppc64/SMP b) i386 both with and without PAE enabled Signed-off-by: David S. Miller <davem@davemloft.net>
2005-01-20	[PATCH] Lock initializer cleanup: Filesystems	Thomas Gleixner
	Use the new lock initializers DEFINE_SPIN_LOCK and DEFINE_RW_LOCK Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] clear false pending signal indication in core dump	Roland McGrath
	When kill is used to force a core dump, __group_complete_signal uses the group_stop_count machinery to stop other threads from doing anything more before the signal-taking thread starts the coredump synchronization. This intentionally results in group_stop_count always still being > 0 when the signal-taking thread gets into do_coredump. However, that has the unintended effect that signal_pending can return true when called from the filesystem code while writing the core dump file. For NFS mounts using the "intr" option, this results in NFS operations bailing out before they even try, so core files never get successfully dumped on such a filesystem when the crash was induced by an asynchronous process-wide signal. This patch fixes the problem by clearing group_stop_count after the coredump synchronization is complete. The locking I threw in is not directly related, but always should have been there and may avoid some potential races with kill. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] fix exec deadlock when ptrace used inside the thread group	Roland McGrath
	If one thread uses ptrace on another thread in the same thread group, there can be a deadlock when calling exec. The ptrace_stop change ensures that no tracing stop can be entered for a queued signal, or exit tracing, if the tracer is part of the same dying group. The exit_notify change prevents a ptrace zombie from sticking around if its tracer is in the midst of a group exit (which an exec fakes), so these zombies don't hold up de_thread's synchronization. The de_thread change ensures the new thread group leader doesn't wind up ptracing itself, which would produce its own deadlocks. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] fix race between core dumping and exec with shared mm	Roland McGrath
	When threads are sharing mm via CLONE_VM (linuxthreads, vfork), there is a race condition where one thread doing a core dump and synchronizing all mm-sharing threads for it can deadlock waiting for another thread that just did an exec and will never synchronize. This patch makes the exec_mmap check for a pending core dump and punt the exec to synchronize with that, as if the core dump had struck before entering the execve system call at all. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] fix coredump_wait deadlock with ptracer & tracee on shared mm	Roland McGrath
	In the oddball situation where one thread is using ptrace on another thread sharing the same mm, and then someone sharing that mm causes a coredump, there is a deadlock possible if the traced thread is in TASK_TRACED state. It leaves all the threads sharing that mm wedged and permanently unkillable. This patch checks for that situation and brings a thread out of TASK_TRACED if its tracer is part of the same coredump (i.e. shares the same mm). It's not pretty, but it does the job. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-10	[PATCH] split bprm_apply_creds into two functions	Serge Hallyn
	The following patch splits bprm_apply_creds into two functions, bprm_apply_creds and bprm_post_apply_creds. The latter is called after the task_lock has been dropped. Without this patch, SELinux must drop the task_lock and re-acquire it during apply_creds, making the 'unsafe' flag meaningless to any later security modules. Please apply. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Stephen Smalley <sds@epoch.ncsc.mil> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: fix scheduling latencies for !PREEMPT kernels	Ingo Molnar
	This patch adds a handful of cond_resched() points to a number of key, scheduling-latency related non-inlined functions. This reduces preemption latency for !PREEMPT kernels. These are scheduling points complementary to PREEMPT_VOLUNTARY scheduling points (might_sleep() places) - i.e. these are all points where an explicit cond_resched() had to be added. Has been tested as part of the -VP patchset. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] don't hide thread_group_leader() from grep	Oleg Nesterov
	Replace open-coded thread_group_leader() calls. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] move group_exit flag into signal_struct.flags word	Roland McGrath
	After my last change, there are plenty of unused bits available in the new flags word in signal_struct. This patch moves the `group_exit' flag into one of those bits, saving a word in signal_struct. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] enhanced Memory accounting data collection	Jay Lan
	This patch is to offer common accounting data collection method at memory usage for various accounting packages including BSD accounting, ELSA, CSA and any other acct packages that use a common layer of data collection. New struct fields are added to mm_struct to save high watermarks of rss usage as well as virtual memory usage. New struct fields are added to task_struct to collect accumulated rss usage and vm usages. These data are collected on per process basis. Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>