user/sven/linux.git - Linux Kernel

Age	Commit message (Collapse)	Author
2005-09-27	[PATCH] Make POSIX message queue sys_mq_open() honor umask	Krzysztof Benedyczak
	We ignored umask when creating new queues via mq_open (when creating with open() on mqueue fs it is ok of course). According to the specification this a bug. This trivial patch fixes this. Signed-off-by: Krzysztof Benedyczak <golbi@mat.uni.torun.pl> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10	[PATCH] merge some from Rusty's trivial patches	Adrian Bunk
	This patch contains the most trivial from Rusty's trivial patches: - spelling fixes - remove duplicate includes Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] ipc: convert /proc/sysvipc/* to generic seq_file interface	Mike Waychison
	Change the /proc/sysvipc/shm\|sem\|msg files to use the generic seq_file implementation for struct ipc_ids. Signed-off-by: Mike Waychison <mikew@google.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] ipc: add generic struct ipc_ids seq_file iteration	Mike Waychison
	The following two patches convert /proc/sysvipc/* to use seq_file. This gives us the following: - Self-consistent IPC records in proc. - O(n) reading of the files themselves. This patch: Add a generic method for ipc types to be displayed using seq_file. This patch abstracts out seq_file iterating over struct ipc_ids into ipc/util.c Signed-off-by: Mike Waychison <mikew@google.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07	[PATCH] compat: be more consistent about [ug]id_t	Stephen Rothwell
	When I first wrote the compat layer patches, I was somewhat cavalier about the definition of compat_uid_t and compat_gid_t (or maybe I just misunderstood :-)). This patch makes the compat types much more consistent with the types we are being compatible with and hopefully will fix a few bugs along the way. compat type type in compat arch __compat_[ug]id_t __kernel_[ug]id_t __compat_[ug]id32_t __kernel_[ug]id32_t compat_[ug]id_t [ug]id_t The difference is that compat_uid_t is always 32 bits (for the archs we care about) but __compat_uid_t may be 16 bits on some. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-05	[PATCH] Fix semundo lock leakage	Ingo Molnar
	semundo->lock can leak if semundo->refcount goes from 2 to 1 while another thread has it locked. This causes major problems for PREEMPT kernels. The simplest fix for now is to undo the single-thread optimization. This bug was found via relentless testing by Dominik Karall. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01	[PATCH] shm: CONFIG_SHMEM=n build fix	Andrew Morton
	Fix bug found by Grant Coady <lkml@dodo.com.au>'s autobuild setup. shmem_set_policy() and shmem_get_policy() are macros if !CONFIG_SHMEM, so this doesn't work. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12	[PATCH] xtensa: remove old syscalls	Chris Zankel
	This patch fixes some minor bugs introduced by the previous patch (remove old syscalls). Both patches remove the obsolete syscalls. The changes in this patch were suggested by Arnd Bergmann. The vmlinux.lds.S changes are required for the latest gcc/binutils. Signed-off-by: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07	[PATCH] put_compat_shminfo() warning fix	Jesse Millan
	GCC 4 complains because the function put_compat_shminfo() can't get to its return statement if there is no error... If the function does not return -EFAULT, it doesn't return anything at all. Looks like a typo. Signed-off-by: Jesse Millan <jessem@cs.pdx.edu> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23	[PATCH] ipcsem: remove superflous decrease variable from sys_semtimedop	Manfred Spraul
	Patrick noticed that the initial scan of the semaphore operations logs decrease and increase operations seperately, but then both cases are or'ed together and decrease is never used. The attached patch removes the decrease parameter - it shrinks sys_semtimedop() by 56 bytes. Signed-Of-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01	[PATCH] convert that currently tests _NSIG directly to use valid_signal()	Jesper Juhl
	Convert most of the current code that uses _NSIG directly to instead use valid_signal(). This avoids gcc -W warnings and off-by-one errors. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01	[PATCH] consolidate sys_shmat	Stephen Rothwell
	Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01	[PATCH] use smp_mb/wmb/rmb where possible	akpm@osdl.org
	Replace a number of memory barriers with smp_ variants. This means we won't take the unnecessary hit on UP machines. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-16	Merge	Linus Torvalds

2005-03-13	[PATCH] verify_area cleanup : i386 and misc.	Jesper Juhl
	This patch converts verify_area to access_ok in arch/i386, fs/, kernel/ and a few other bits that didn't fit in the other patches or that I actually was able to test on my hardware - this is by far the best tested of all the patches. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-09	[PATCH] consolidate the last of the compat sigevent structs	Stephen Rothwell
	This patch pulls together the compat_sigevent structs. It also consolidates the copying of these structures into the kernel. The only part of the second union in sigevent that the kernel looks at currently is the _tid, so that is the only bit we copy. This patch depends on my previous two patches "add and use COMPAT_SIGEV_PAD_SIZE" and "Consolidate the last compat sigvals". Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-01	Audit IPC object owner/permission changes.	David Woodhouse
	Add linked list of auxiliary data to audit_context Add callbacks in IPC_SET functions to record requested changes. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-01-04	[PATCH] fix missing wakeup in ipc/sem	Manfred Spraul
	My patch that removed the spin_lock calls from the tail of sys_semtimedop introduced a bug: Before my patch was merged, every operation that altered an array called update_queue. That call woke up threads that were waiting until a semaphore value becomes 0. I've accidentially removed that call. The attached patch fixes that by modifying update_queue: the function now loops internally and wakes up all threads. The patch also removes update_queue calls from the error path of sys_semtimedop: failed operations do not modify the array, no need to rescan the list of waiting threads. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-12	[PATCH] shmctl SHM_LOCK perms	Hugh Dickins
	Michael Kerrisk has observed that at present any process can SHM_LOCK any shm segment of size within process RLIMIT_MEMLOCK, despite having no permissions on the segment: surprising, though not obviously evil. And any process can SHM_UNLOCK any shm segment, despite no permissions on it: that is surely wrong. Unless CAP_IPC_LOCK, restrict both SHM_LOCK and SHM_UNLOCK to when the process euid matches the shm owner or creator: that seems the least surprising behaviour, which could be relaxed if a need appears later. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] handle posix message queues with /proc/sys disabled	Manfred Spraul
	register_sysctl_table() fails if sysctl support is not compiled into the kernel. The POSIX message queue subsystem aborted it's initialization if register_sysctl_table() fails, and that causes an oops in sys_mq_open(). The patch fixes that by ignoring failures from register_sysctl_table(). Signed-off-by; Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] Lock initializer unifying (Core)	Thomas Gleixner
	To make spinlock/rwlock initialization consistent all over the kernel, this patch converts explicit lock-initializers into spin_lock_init() and rwlock_init() calls. Currently, spinlocks and rwlocks are initialized in two different ways: lock = SPIN_LOCK_UNLOCKED spin_lock_init(&lock) rwlock = RW_LOCK_UNLOCKED rwlock_init(&rwlock) this patch converts all explicit lock initializations to spin_lock_init() or rwlock_init(). (Besides consistency this also helps automatic lock validators and debugging code.) The conversion was done with a script, it was verified manually and it was reviewed, compiled and tested as far as possible on x86, ARM, PPC. There is no runtime overhead or actual code change resulting out of this patch, because spin_lock_init() and rwlock_init() are macros and are thus equivalent to the explicit initialization method. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] RCU: eliminating explicit memory barriers from SysV IPC	Paul E. McKenney
	This patch uses the rcu_assign_pointer() API to eliminate a number of explicit memory barriers from the SysV IPC code that uses RCU. It also restructures the ipc_ids structure so that the array size is stored in the same memory block as the array itself (see the new struct ipc_id_ary). This prevents the race that the earlier code was subject to, where a reader could see a mismatch between the size and the actual array. With the size stored with the array, the possibility of mismatch is eliminated -- with out the need for careful ordering and explicit memory barriers. This has been tested successfully on i386 and ppc64. Signed-off-by: <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] add missing linux/syscalls.h includes	Arnd Bergmann
	I found that the prototypes for sys_waitid and sys_fcntl in <linux/syscalls.h> don't match the implementation. In order to keep all prototypes in sync in the future, now include the header from each file implementing any syscall. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] make rlimit settings per-process instead of per-thread	Roland McGrath
	POSIX specifies that the limit settings provided by getrlimit/setrlimit are shared by the whole process, not specific to individual threads. This patch changes the behavior of those calls to comply with POSIX. I've moved the struct rlimit array from task_struct to signal_struct, as it has the correct sharing properties. (This reduces kernel memory usage per thread in multithreaded processes by around 100/200 bytes for 32/64 machines respectively.) I took a fairly minimal approach to the locking issues with the newly shared struct rlimit array. It turns out that all the code that is checking limits really just needs to look at one word at a time (one rlim_cur field, usually). It's only the few places like getrlimit itself (and fork), that require atomicity in accessing a whole struct rlimit, so I just used a spin lock for them and no locking for most of the checks. If it turns out that readers of struct rlimit need more atomicity where they are now cheap, or less overhead where they are now atomic (e.g. fork), then seqcount is certainly the right thing to use for them instead of readers using the spin lock. Though it's in signal_struct, I didn't use siglock since the access to rlimits never needs to disable irqs and doesn't overlap with other siglock uses. Instead of adding something new, I overloaded task_lock(task->group_leader) for this; it is used for other things that are not likely to happen simultaneously with limit tweaking. To me that seems preferable to adding a word, but it would be trivial (and arguably cleaner) to add a separate lock for these users (or e.g. just use seqlock, which adds two words but is optimal for readers). Most of the changes here are just the trivial s/->rlim/->signal->rlim/. I stumbled across what must be a long-standing bug, in reparent_to_init. It does: memcpy(current->rlim, init_task.rlim, sizeof((current->rlim))); when surely it was intended to be: memcpy(current->rlim, init_task.rlim, sizeof(current->rlim)); As rlim is an array, the in the sizeof expression gets the size of the first element, so this just changes the first limit (RLIMIT_CPU). This is for kernel threads, where it's clear that resetting all the rlimits is what you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is superfluous since it will now already have been reset to RLIM_INFINITY. The other subtlety is removing: tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY; in exit_notify, which was to avoid a race signalling during self-reaping exit. As the limit is now shared, a dying thread should not change it for others. Instead, I avoid that race by checking current->state before the RLIMIT_CPU check. (Adding one new conditional in that path is now required one way or another, since if not for this check there would also be a new race with self-reaping exit later on clearing current->signal that would have to be checked for.) The one loose end left by this patch is with process accounting. do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the accounting record. I left this as it was, but it is now changing a limit that might be shared by other threads still running. I left this in a dubious state because it seems to me that processing accounting may already be more generally a dubious state when it comes to NPTL threads. I would think you would want one record per process, with aggregate data about all threads that ever lived in it, not a separate record for each thread. I don't use process accounting myself, but if anyone is interested in testing it out I could provide a patch to change it this way. One final note, this is not 100% to POSIX compliance in regards to rlimits. POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not to each individual thread. I will provide patches later on to achieve that change, assuming this patch goes in first. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23	[PATCH] hugetlb: permit executable mappings	William Lee Irwin III
	During the kernel summit, some discussion was had about the support requirements for a userspace program loader that loads executables into hugetlb on behalf of a major application (Oracle). In order to support this in a robust fashion, the cleanup of the hugetlb must be robust in the presence of disorderly termination of the programs (e.g. kill -9). Hence, the cleanup semantics are those of System V shared memory, but Linux' System V shared memory needs one critical extension for this use: executability. The following microscopic patch enables this major application to provide robust hugetlb cleanup. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23	[PATCH] remove magic +1 from shm segment count	Manfred Spraul
	Michael Kerrisk found a bug in the shm accounting code: sysv shm allows to create SHMMNI+1 shared memory segments, instead of SHMMNI segments. The +1 is probably from the first shared anonymous mapping implementation that used the sysv code to implement shared anon mappings. The implementation got replaced, it's now the other way around (sysv uses the shared anon code), but the +1 remained. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] rlimit-based mlocks for unprivileged users	Rik van Riel
	Here is the last agreed-on patch that lets normal users mlock pages up to their rlimit. This patch addresses all the issues brought up by Chris and Andrea. From: Chris Wright <chrisw@osdl.org> Couple more nits. The default lockable amount is one page now (first patch is was 0). Why don't we keep it as 0, with the CAP_IPC_LOCK overrides in place? That way nothing is changed from user perspective, and the rest of the policy can be done by userspace as it should. This patch breaks in one scenario. When ulimit == 0, process has CAP_IPC_LOCK, and does SHM_LOCK. The subsequent unlock or destroy will corrupt the locked_shm count. It's also inconsistent in handling user_can_mlock/CAP_IPC_LOCK interaction betwen shm_lock and shm_hugetlb. SHM_HUGETLB can now only be done by the shm_group or CAP_IPC_LOCK. Not any can_do_mlock() user. Double check of can_do_mlock isn't needed in SHM_LOCK path. Interface names user_can_mlock and user_substract_mlock could be better. Incremental update below. Ran some simple sanity tests on this plus my patch below and didn't find any problems. * Make default RLIM_MEMLOCK limit 0. * Move CAP_IPC_LOCK check into user_can_mlock to be consistent and fix but with ulimit == 0 && CAP_IPC_LOCK with SHM_LOCK. * Allow can_do_mlock() user to try SHM_HUGETLB setup. * Remove unecessary extra can_do_mlock() test in shmem_lock(). * Rename user_can_mlock to user_shm_lock and user_subtract_mlock to user_shm_unlock. * Use user instead of current->user to fit in 80 cols on SHM_LOCK. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] IS_ERR() unlikeliness cleanup	Andrew Morton
	Remove now-unneeded open-coded unlikelies around IS_ERR(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] rcu: abstracted RCU dereferencing	Dipankar Sarma
	Use abstracted RCU API to dereference RCU protected data. Hides barrier details. Patch from Paul McKenney. This patch introduced an rcu_dereference() macro that replaces most uses of smp_read_barrier_depends(). The new macro has the advantage of explicitly documenting which pointers are protected by RCU -- in contrast, it is sometimes difficult to figure out which pointer is being protected by a given smp_read_barrier_depends() call. Signed-off-by: Paul McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] cleanup of ipc/msg.c	Manfred Spraul
	Attached is a cleanup of the main loops in sys_msgrcv and sys_msgsnd, based on ipc_lock_by_ptr(). Most backward gotos are gone, instead normal "for(;;)" loops until a suitable message is found. Description: - General cleanup of sys_msgrcv and sys_msgsnd: the function were too convoluted. - Enable lockless receive, update comments. - Use ipc_getref for sys_msgsnd(), it's better than rechecking that the msqid is still valid. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] ipc: enforce SEMVMX limit for undo	Manfred Spraul
	Independent from the other patches: undo operations should not result in out of range semaphore values. The test for newval > SEMVMX is missing. The attached patch adds the test and a comment. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] ipc: remove sem_revalidate	Manfred Spraul
	The attached patch removes sem_revalidate and replaces it with ipc_rcu_getref() calls followed by ipc_lock_by_ptr(). Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22	[PATCH] ipc: Add refcount to ipc_rcu_alloc	Manfred Spraul
	The lifetime of the ipc objects (sem array, msg queue, shm mapping) is controlled by kern_ipc_perms->lock - a spinlock. There is no simple way to reacquire this spinlock after it was dropped to schedule()/kmalloc/copy_{to,from}_user/whatever. The attached patch adds a reference count as a preparation to get rid of sem_revalidate(). Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-12	[PATCH] sparse: ipc compat annotations and cleanups	Alexander Viro
	ipc compat code switched to compat_alloc_user_space() and annotated.
2004-06-30	[PATCH] sparse: NULL vs 0 - the rest of it	Mika Kukkonen

2004-06-23	[PATCH] rcu: avoid passing an argument to the callback function	Andrew Morton
	From: Dipankar Sarma <dipankar@in.ibm.com> This patch changes the call_rcu() API and avoids passing an argument to the callback function as suggested by Rusty. Instead, it is assumed that the user has embedded the rcu head into a structure that is useful in the callback and the rcu_head pointer is passed to the callback. The callback can use container_of() to get the pointer to its structure and work with it. Together with the rcu-singly-link patch, it reduces the rcu_head size by 50%. Considering that we use these in things like struct dentry and struct dst_entry, this is good savings in space. An example : struct my_struct { struct rcu_head rcu; int x; int y; }; void my_rcu_callback(struct rcu_head head) { struct my_struct p = container_of(head, struct my_struct, rcu); free(p); } void my_delete(struct my_struct *p) { ... call_rcu(&p->rcu, my_rcu_callback); ... } Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17	[PATCH] RLIM: adjust default mqueue sizes	Chris Wright
	Lower default sizes for POSIX mqueue allocation now that rlimits are in place. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17	[PATCH] RLIM: enforce rlimits for POSIX mqueue allocation	Chris Wright
	Add a user_struct to the mq_inode_info structure. Charge the maximum number of bytes that could be allocated to a mqueue to the user who creates the mqueue. This is checked against the per user rlimit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17	[PATCH] RLIM: add mq_attr_ok() helper	Chris Wright
	Add helper function mq_attr_ok() to do mq_attr sanity checking, and do some extra overlow checking. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-28	[PATCH] sparse: ipc __user annotation	Alexander Viro

2004-05-22	[PATCH] numa api: Add shared memory support	Andrew Morton
	From: Andi Kleen <ak@suse.de> Add support to tmpfs and hugetlbfs to support NUMA API. Shared memory is a bit of a special case for NUMA policy. Normally policy is associated to VMAs or to processes, but for a shared memory segment you really want to share the policy. The core NUMA API has code for that, this patch adds the necessary changes to tmpfs and hugetlbfs. First it changes the custom swapping code in tmpfs to follow the policy set via VMAs. It is also useful to have a "backing store" of policy that saves the policy even when nobody has the shared memory segment mapped. This allows command line tools to pre configure policy, which is then later used by programs. Note that hugetlbfs needs more changes - it is also required to switch it to lazy allocation, otherwise the prefault prevents mbind() from working.
2004-05-21	[PATCH] Sanitise handling of unneeded syscall stubs	Andrew Morton
	From: David Mosberger <davidm@napali.hpl.hp.com> Below is a patch that tries to sanitize the dropping of unneeded system-call stubs in generic code. In some instances, it would be possible to move the optional system-call stubs into a library routine which would avoid the need for #ifdefs, but in many cases, doing so would require making several functions global (and possibly exporting additional data-structures in header-files). Furthermore, it would inhibit (automatic) inlining in the cases in the cases where the stubs are needed. For these reasons, the patch keeps the #ifdef-approach. This has been tested on ia64 and there were no objections from the arch-maintainers (and one positive response). The patch should be safe but arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo macros should be removed for their architecture (I'm quite sure that's the case, but I wanted to play it safe and only preserved the status-quo in that regard).
2004-05-10	[PATCH] simplify mqueue_inode_info->messages allocation	Andrew Morton
	From: Chris Wright <chrisw@osdl.org> Currently, if a user creates an mqueue and passes an mq_attr, the info->messages will be created twice (and the extra one is properly freed). This patch simply delays the allocation so that it only ever happens once. The relevant mq_attr data is passed to lower levels via the dentry->d_fsdata fs private data. This also helps isolate the areas we'd need to touch to do rlimits on mqueues.
2004-05-04	[PATCH] fix queues_count accounting in mqueue_delete_inode()	Chris Wright
	During mqueue_get_inode(), it's possible that kmalloc() of the info->messages array will fail. This failure mode will cause the queues_count to be (incorrectly) decremented twice. This patch uses info->messages on mqueue_delete_inode() to determine whether the mqueue was every truly created, and hence proper accounting is needed on destruction.
2004-05-04	[PATCH] fix memleak in sys_mq_timedsend	Chris Wright
	Move error handling to capture all three possible error conditions on sending to a full queue. Without this fix any unprivileged user can leak arbitrary amounts of kernel memory.
2004-04-17	[PATCH] mqueue permission fix	Andrew Morton
	From: Manfred Spraul <manfred@colorfullife.com> Any user can delete any entries in a mqueue mounted filesystem. The attached patch prevents that. - remove the writable test from mq_unlink. - set the sticky bit in the root inode. This affects both mq_unlink and sys_unlink: only the owner (and root) should be allowed to remove queues.
2004-04-14	[PATCH] mq_open() and close_on_exec	Andrew Morton
	From: Chris Wright <chrisw@osdl.org> SUSv3 doesn't seem to specify one way or the other. I don't have the POSIX specs, and the old docs I have suggest that mq_open() creates an object which is to be closed upon exec. Jakub said: I think it is valid and required: http://www.opengroup.org/onlinepubs/007904975/functions/exec.html All open message queue descriptors in the calling process shall be closed, as described in mq_close() I'll add a new test for this into glibc testsuite.
2004-04-14	[PATCH] Fix mq_notify with SIGEV_NONE notification	Andrew Morton
	From: Jakub Jelinek <jakub@redhat.com> mq_notify (q, NULL) and struct sigevent ev = { .sigev_notify = SIGEV_NONE }; mq_notify (q, &ev) are not the same thing in POSIX, yet the kernel treats them the same. Only the former makes the notification available to other processes immediately, see http://www.opengroup.org/onlinepubs/007904975/functions/mq_notify.html Without the patch below, http://sources.redhat.com/ml/libc-hacker/2004-04/msg00028.html glibc test fails. I looked at mq in Solaris and they behave the same in this regard as Linux with this patch. Kernel with this patch passes both Intel POSIX testsuite (with testsuite fixes from Ulrich) and glibc mq testsuite.
2004-04-11	[PATCH] make the pagecache lock irq-safe.	Andrew Morton
	Intro to these patches: - Major surgery against the pagecache, radix-tree and writeback code. This work is to address the O_DIRECT-vs-buffered data exposure horrors which we've been struggling with for months. As a side-effect, 32 bytes are saved from struct inode and eight bytes are removed from struct page. At a cost of approximately 2.5 bits per page in the radix tree nodes on 4k pagesize, assuming the pagecache is densely populated. Not all pages are pagecache; other pages gain the full 8 byte saving. This change will break any arch code which is using page->list and will also break any arch code which is using page->lru of memory which was obtained from slab. The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file. It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat. Daniel pretty much has the problem plugged but I suspect that's just because we don't have testcases to trigger the remaining problems. The complexity and additional locking which those patches add is worrisome. So the approach taken here is to remove the page lists altogether and replace the list-based writeback and wait operations with in-order radix-tree walks. The radix-tree code has been enhanced to support "tagging" of pages, for later searches for pages which have a particular tag set. This means that we can ask the radix tree code "find me the next 16 dirty pages starting at pagecache index N" and it will do that in O(log64(N)) time. This affects I/O scheduling potentially quite significantly. It is no longer the case that the kernel will submit pages for I/O in the order in which the application dirtied them. We instead submit them in file-offset order all the time. This is likely to be advantageous when applications are seeking all over a large file randomly writing small amounts of data. I haven't performed much benchmarking, but tiobench random write throughput seems to be increased by 30%. Other tests appear to be unaltered. dbench may have got 10-20% quicker, but it's variable. There is one large file which everyone seeks all over randomly writing small amounts of data: the blockdev mapping which caches filesystem metadata. The kernel's IO submission patterns for this are now ideal. Because writeback and wait-for-writeback use a tree walk instead of a list walk they are no longer livelockable. This probably means that we no longer need to hold i_sem across O_SYNC writes and perhaps fsync() and fdatasync(). This may be beneficial for databases: multiple processes writing and syncing different parts of the same file at the same time can now all submit and wait upon writes to just their own little bit of the file, so we can get a lot more data into the queues. It is trivial to implement a part-file-fdatasync() as well, so applications can say "sync the file from byte N to byte M", and multiple applications can do this concurrently. This is easy for ext2 filesystems, but probably needs lots of work for data-journalled filesystems and XFS and it probably doesn't offer much benefit over an i_semless O_SYNC write. These patches can end up making ext3 (even) slower: for i in 1 2 3 4 do dd if=/dev/zero of=$i bs=1M count=2000 & done runs awfully slow on SMP. This is, yet again, because all the file blocks are jumbled up and the per-file linear writeout causes tons of seeking. The above test runs sweetly on UP because the on UP we don't allocate blocks to different files in parallel. Mingming and Badari are working on getting block reservation working for ext3 (preallocation on steroids). That should fix ext3 up. This patch: - Later, we'll need to access the radix trees from inside disk I/O completion handlers. So make mapping->page_lock irq-safe. And rename it to tree_lock to reliably break any missed conversions.
2004-04-11	[PATCH] compat emulation for posix message queues	Andrew Morton
	From: Arnd Bergmann <arnd@arndb.de> I have tested the code with the open posix test suite and found the same four failures for both 64-bit and compat mode, most tests pass. The patch is against -mc1, but I guess it also applies to the other trees around. What worries me more than mq_attr compatibility is the conversion of struct sigevent, which might turn out really hard when more fields in there are used. AFAICS, the only other part in the kernel ABI is sys_timer_create(), so maybe it's not too late to deprecate the current structure and create a structure that can be used properly for compat syscalls.