<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux-bitkeeper.git/ipc/shm.c, branch master</title>
<subtitle>Linux Kernel BitKeeper History</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/atom?h=master</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/'/>
<updated>2005-03-01T09:15:37Z</updated>
<entry>
<title>Audit IPC object owner/permission changes.</title>
<updated>2005-03-01T09:15:37Z</updated>
<author>
<name>David Woodhouse</name>
<email>dwmw2@shinybook.infradead.org</email>
</author>
<published>2005-03-01T09:15:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=e20ffd76fc5bdccf79223667a615dd4c820947ab'/>
<id>urn:sha1:e20ffd76fc5bdccf79223667a615dd4c820947ab</id>
<content type='text'>
Add linked list of auxiliary data to audit_context
Add callbacks in IPC_SET functions to record requested changes.

Signed-off-by: David Woodhouse &lt;dwmw2@infradead.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] shmctl SHM_LOCK perms</title>
<updated>2004-12-13T00:30:17Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hugh@veritas.com</email>
</author>
<published>2004-12-13T00:30:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=2637792e3d9ae50079238615fd16384a0d393b30'/>
<id>urn:sha1:2637792e3d9ae50079238615fd16384a0d393b30</id>
<content type='text'>
Michael Kerrisk has observed that at present any process can SHM_LOCK any
shm segment whose size is within the process's RLIMIT_MEMLOCK, despite having
no permissions on the segment: surprising, though not obviously evil.  And any
process can SHM_UNLOCK any shm segment, despite having no permissions on it:
that is surely wrong.

Unless the process has CAP_IPC_LOCK, restrict both SHM_LOCK and SHM_UNLOCK to
the case where the process euid matches the shm owner or creator: that seems
the least surprising behaviour, and it could be relaxed if a need appears later.

Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] add missing linux/syscalls.h includes</title>
<updated>2004-10-18T15:54:02Z</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2004-10-18T15:54:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=09b9135c6e9950c0f12e3e6993ae52ab1baf0476'/>
<id>urn:sha1:09b9135c6e9950c0f12e3e6993ae52ab1baf0476</id>
<content type='text'>
I found that the prototypes for sys_waitid and sys_fcntl in
&lt;linux/syscalls.h&gt; don't match the implementation.  To keep all
prototypes in sync in the future, the header is now included from each file
that implements a syscall.

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] hugetlb: permit executable mappings</title>
<updated>2004-08-24T04:28:18Z</updated>
<author>
<name>William Lee Irwin III</name>
<email>wli@holomorphy.com</email>
</author>
<published>2004-08-24T04:28:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=b60e5e711ad490216cc39f0cdfac91a789d85694'/>
<id>urn:sha1:b60e5e711ad490216cc39f0cdfac91a789d85694</id>
<content type='text'>
During the kernel summit, there was some discussion about the support
requirements for a userspace program loader that loads executables into
hugetlb on behalf of a major application (Oracle).  To support this
robustly, hugetlb cleanup must survive disorderly termination of the
programs (e.g.  kill -9).  Hence,
the cleanup semantics are those of System V shared memory, but Linux'
System V shared memory needs one critical extension for this use:
executability.

The following microscopic patch enables this major application to provide
robust hugetlb cleanup.

Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] remove magic +1 from shm segment count</title>
<updated>2004-08-24T04:27:43Z</updated>
<author>
<name>Manfred Spraul</name>
<email>manfred@colorfullife.com</email>
</author>
<published>2004-08-24T04:27:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=fefd81e1513d6871ffe209a53003c06be6e760da'/>
<id>urn:sha1:fefd81e1513d6871ffe209a53003c06be6e760da</id>
<content type='text'>
Michael Kerrisk found a bug in the shm accounting code: sysv shm allows
creating SHMMNI+1 shared memory segments, instead of SHMMNI segments.  The +1
probably dates from the first shared anonymous mapping implementation, which
used the sysv code to implement shared anon mappings.

The implementation got replaced and it's now the other way around (sysv uses
the shared anon code), but the +1 remained.

Signed-off-by: Manfred Spraul &lt;manfred@colorfullife.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] rlimit-based mlocks for unprivileged users</title>
<updated>2004-08-23T06:06:46Z</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2004-08-23T06:06:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=16698c49bbb42567c0bbc528d3820d18885e4642'/>
<id>urn:sha1:16698c49bbb42567c0bbc528d3820d18885e4642</id>
<content type='text'>
Here is the last agreed-on patch that lets normal users mlock pages up to
their rlimit.  This patch addresses all the issues brought up by Chris and
Andrea.

From: Chris Wright &lt;chrisw@osdl.org&gt;

Couple more nits.

The default lockable amount is now one page (in the first patch it was 0).  Why
don't we keep it as 0, with the CAP_IPC_LOCK overrides in place?  That way
nothing is changed from the user's perspective, and the rest of the policy can
be done by userspace, as it should be.

This patch breaks in one scenario: when ulimit == 0, the process has
CAP_IPC_LOCK, and it does SHM_LOCK.  The subsequent unlock or destroy will
corrupt the locked_shm count.

It's also inconsistent in handling the user_can_mlock/CAP_IPC_LOCK interaction
between shm_lock and shm_hugetlb.

SHM_HUGETLB can now only be done by the shm_group or with CAP_IPC_LOCK,
not by any can_do_mlock() user.

The double check of can_do_mlock() isn't needed in the SHM_LOCK path.

Interface names user_can_mlock and user_substract_mlock could be better.

Incremental update below.  Ran some simple sanity tests on this plus my
patch below and didn't find any problems.

* Make the default RLIMIT_MEMLOCK limit 0.
* Move the CAP_IPC_LOCK check into user_can_mlock to be consistent
  and fix the bug with ulimit == 0 &amp;&amp; CAP_IPC_LOCK with SHM_LOCK.
* Allow any can_do_mlock() user to try SHM_HUGETLB setup.
* Remove the unnecessary extra can_do_mlock() test in shmem_lock().
* Rename user_can_mlock to user_shm_lock and user_subtract_mlock
  to user_shm_unlock.
* Use user instead of current-&gt;user to fit in 80 cols on SHM_LOCK.

Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] ipc: Add refcount to ipc_rcu_alloc</title>
<updated>2004-08-23T05:40:37Z</updated>
<author>
<name>Manfred Spraul</name>
<email>manfred@colorfullife.com</email>
</author>
<published>2004-08-23T05:40:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=3a4262a016085f178811ec2459b93f63127e6280'/>
<id>urn:sha1:3a4262a016085f178811ec2459b93f63127e6280</id>
<content type='text'>
The lifetime of the ipc objects (sem array, msg queue, shm mapping) is
controlled by kern_ipc_perms-&gt;lock, a spinlock.  There is no simple way to
reacquire this spinlock after it has been dropped in order to call
schedule()/kmalloc()/copy_{to,from}_user()/whatever.

The attached patch adds a reference count as a preparation to get rid of
sem_revalidate().

Signed-off-by: Manfred Spraul &lt;manfred@colorfullife.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] sparse: NULL vs 0 - the rest of it</title>
<updated>2004-06-30T08:52:08Z</updated>
<author>
<name>Mika Kukkonen</name>
<email>mika@osdl.org</email>
</author>
<published>2004-06-30T08:52:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=6079e24889b7c8bdf1cca284cb9fe721e0a70ca3'/>
<id>urn:sha1:6079e24889b7c8bdf1cca284cb9fe721e0a70ca3</id>
<content type='text'>
</content>
</entry>
<entry>
<title>[PATCH] numa api: Add shared memory support</title>
<updated>2004-05-22T15:04:40Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@osdl.org</email>
</author>
<published>2004-05-22T15:04:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=d31d7a1874c710a5c1b589807d53f32d8e7df397'/>
<id>urn:sha1:d31d7a1874c710a5c1b589807d53f32d8e7df397</id>
<content type='text'>
From: Andi Kleen &lt;ak@suse.de&gt;

Add NUMA API support to tmpfs and hugetlbfs.  Shared memory is a
bit of a special case for NUMA policy.  Normally policy is associated with VMAs
or with processes, but for a shared memory segment you really want to share the
policy.  The core NUMA API has code for that; this patch adds the necessary
changes to tmpfs and hugetlbfs.

First it changes the custom swapping code in tmpfs to follow the policy set
via VMAs.

It is also useful to have a "backing store" of policy that preserves the policy
even when nobody has the shared memory segment mapped.  This allows command
line tools to pre-configure policy, which is then used later by programs.

Note that hugetlbfs needs more changes - it is also required to switch it to
lazy allocation, otherwise the prefault prevents mbind() from working.
</content>
</entry>
<entry>
<title>[PATCH] make the pagecache lock irq-safe.</title>
<updated>2004-04-12T06:10:41Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@osdl.org</email>
</author>
<published>2004-04-12T06:10:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux-bitkeeper.git/commit/?id=89261aab0c7064ca9766bc79e7867b6104274f56'/>
<id>urn:sha1:89261aab0c7064ca9766bc79e7867b6104274f56</id>
<content type='text'>
Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code.  This
  work is to address the O_DIRECT-vs-buffered data exposure horrors which
  we've been struggling with for months.

  As a side-effect, 32 bytes are saved from struct inode and eight bytes
  are removed from struct page, at a cost of approximately 2.5 bits per
  page in the radix tree nodes on 4k pagesize, assuming the pagecache is
  densely populated.  Not all pages are pagecache; other pages gain the
  full 8-byte saving.

  This change will break any arch code which is using page-&gt;list and will
  also break any arch code which is using page-&gt;lru of memory which was
  obtained from slab.

  The basic problem which we (mainly Daniel McNeil) have been struggling
  with is in getting a really reliable fsync() across the page lists while
  other processes are performing writeback against the same file.  It's like
  juggling four bars of wet soap with your eyes shut while someone is
  whacking you with a baseball bat.  Daniel pretty much has the problem
  plugged but I suspect that's just because we don't have testcases to
  trigger the remaining problems.  The complexity and additional locking
  which those patches add is worrisome.

  So the approach taken here is to remove the page lists altogether and
  replace the list-based writeback and wait operations with in-order
  radix-tree walks.

  The radix-tree code has been enhanced to support "tagging" of pages, for
  later searches for pages which have a particular tag set.  This means that
  we can ask the radix tree code "find me the next 16 dirty pages starting at
  pagecache index N" and it will do that in O(log64(N)) time.

  This affects I/O scheduling potentially quite significantly.  It is no
  longer the case that the kernel will submit pages for I/O in the order in
  which the application dirtied them.  We instead submit them in file-offset
  order all the time.

  This is likely to be advantageous when applications are seeking all over
  a large file randomly writing small amounts of data.  I haven't performed
  much benchmarking, but tiobench random write throughput seems to be
  increased by 30%.  Other tests appear to be unaltered.  dbench may have got
  10-20% quicker, but it's variable.

  There is one large file which everyone seeks all over randomly writing
  small amounts of data: the blockdev mapping which caches filesystem
  metadata.  The kernel's IO submission patterns for this are now ideal.


  Because writeback and wait-for-writeback use a tree walk instead of a
  list walk they are no longer livelockable.  This probably means that we no
  longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
  fdatasync().  This may be beneficial for databases: multiple processes
  writing and syncing different parts of the same file at the same time can
  now all submit and wait upon writes to just their own little bit of the
  file, so we can get a lot more data into the queues.

  It is trivial to implement a part-file-fdatasync() as well, so
  applications can say "sync the file from byte N to byte M", and multiple
  applications can do this concurrently.  This is easy for ext2 filesystems,
  but probably needs lots of work for data-journalled filesystems and XFS and
  it probably doesn't offer much benefit over an i_semless O_SYNC write.


  These patches can end up making ext3 (even) slower:

	for i in 1 2 3 4
	do
		dd if=/dev/zero of=$i bs=1M count=2000 &amp;
	done          

  runs awfully slowly on SMP.  This is, yet again, because all the file
  blocks are jumbled up and the per-file linear writeout causes tons of
  seeking.  The above test runs sweetly on UP because on UP we don't
  allocate blocks to different files in parallel.

  Mingming and Badari are working on getting block reservation working for
  ext3 (preallocation on steroids).  That should fix ext3 up.


This patch:

- Later, we'll need to access the radix trees from inside disk I/O
  completion handlers.  So make mapping-&gt;page_lock irq-safe.  And rename it
  to tree_lock to reliably break any missed conversions.
</content>
</entry>
</feed>
