|
Add linked list of auxiliary data to audit_context
Add callbacks in IPC_SET functions to record requested changes.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
|
Michael Kerrisk has observed that at present any process can SHM_LOCK any
shm segment whose size is within the process RLIMIT_MEMLOCK, despite having
no permissions on the segment: surprising, though not obviously evil. And any
process can SHM_UNLOCK any shm segment, despite no permissions on it: that
is surely wrong.
Unless CAP_IPC_LOCK, restrict both SHM_LOCK and SHM_UNLOCK to when the
process euid matches the shm owner or creator: that seems the least
surprising behaviour, which could be relaxed if a need appears later.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
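The policy described above (restrict SHM_LOCK/SHM_UNLOCK to CAP_IPC_LOCK or a matching owner/creator euid) can be sketched as follows; the struct and function names are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the check: locking is allowed with CAP_IPC_LOCK,
 * or when the caller's euid matches the segment's owner (uid) or
 * creator (cuid). */
struct shm_perm_sketch {
    unsigned uid;   /* owner euid */
    unsigned cuid;  /* creator euid */
};

static bool may_lock_shm(const struct shm_perm_sketch *p,
                         unsigned euid, bool cap_ipc_lock)
{
    if (cap_ipc_lock)
        return true;                  /* capability overrides */
    return euid == p->uid || euid == p->cuid;
}
```

Before this change only the RLIMIT_MEMLOCK size check (for SHM_LOCK) stood in the way; the euid comparison is the new, least-surprising restriction.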
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
During the kernel summit, some discussion was had about the support
requirements for a userspace program loader that loads executables into
hugetlb on behalf of a major application (Oracle). In order to support
this in a robust fashion, the cleanup of the hugetlb must be robust in the
presence of disorderly termination of the programs (e.g. kill -9). Hence,
the cleanup semantics are those of System V shared memory, but Linux's
System V shared memory needs one critical extension for this use:
executability.
The following microscopic patch enables this major application to provide
robust hugetlb cleanup.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Michael Kerrisk found a bug in the shm accounting code: sysv shm allows the
creation of SHMMNI+1 shared memory segments instead of SHMMNI segments. The +1
is probably from the first shared anonymous mapping implementation that
used the sysv code to implement shared anon mappings.
The implementation got replaced, it's now the other way around (sysv uses
the shared anon code), but the +1 remained.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Here is the last agreed-on patch that lets normal users mlock pages up to
their rlimit. This patch addresses all the issues brought up by Chris and
Andrea.
From: Chris Wright <chrisw@osdl.org>
Couple more nits.
The default lockable amount is one page now (in the first patch it was 0). Why
don't we keep it as 0, with the CAP_IPC_LOCK overrides in place? That way
nothing is changed from user perspective, and the rest of the policy can be
done by userspace as it should.
This patch breaks in one scenario: when ulimit == 0, the process has
CAP_IPC_LOCK, and does SHM_LOCK. The subsequent unlock or destroy will
corrupt the locked_shm count.
It's also inconsistent in handling the user_can_mlock/CAP_IPC_LOCK interaction
between shm_lock and shm_hugetlb.
SHM_HUGETLB can now only be done by the shm_group or with CAP_IPC_LOCK,
not by any can_do_mlock() user.
The double check of can_do_mlock isn't needed in the SHM_LOCK path.
The interface names user_can_mlock and user_substract_mlock could be better.
Incremental update below. Ran some simple sanity tests on this plus my
patch below and didn't find any problems.
* Make default RLIM_MEMLOCK limit 0.
* Move CAP_IPC_LOCK check into user_can_mlock to be consistent
and fix bug with ulimit == 0 && CAP_IPC_LOCK with SHM_LOCK.
* Allow can_do_mlock() user to try SHM_HUGETLB setup.
* Remove unnecessary extra can_do_mlock() test in shmem_lock().
* Rename user_can_mlock to user_shm_lock and user_subtract_mlock
to user_shm_unlock.
* Use user instead of current->user to fit in 80 cols on SHM_LOCK.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
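The accounting fix described in the bullets above, moving the CAP_IPC_LOCK check inside the lock helper so that charges and uncharges always balance, might be sketched like this (all `_sk` names are hypothetical, not the kernel's):

```c
#include <assert.h>

/* Per-user SHM_LOCK accounting sketch. CAP_IPC_LOCK bypasses the
 * RLIMIT_MEMLOCK check but the pages are still accounted, so a later
 * unlock cannot drive locked_shm negative (the bug with ulimit == 0
 * plus CAP_IPC_LOCK described above). */
struct user_sk { unsigned long locked_shm; };

static int user_shm_lock_sk(struct user_sk *u, unsigned long pages,
                            unsigned long rlim_pages, int cap_ipc_lock)
{
    if (!cap_ipc_lock && u->locked_shm + pages > rlim_pages)
        return 0;                 /* over the limit, no capability */
    u->locked_shm += pages;       /* always charged, so unlock balances */
    return 1;
}

static void user_shm_unlock_sk(struct user_sk *u, unsigned long pages)
{
    u->locked_shm -= pages;
}
```

With the capability check inside the helper, the success/charge decision is made in one place, which is the consistency the incremental update aims for.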
|
|
The lifetime of the ipc objects (sem array, msg queue, shm mapping) is
controlled by kern_ipc_perm->lock - a spinlock. There is no simple way to
reacquire this spinlock after it was dropped to
schedule()/kmalloc/copy_{to,from}_user/whatever.
The attached patch adds a reference count as a preparation to get rid of
sem_revalidate().
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: Andi Kleen <ak@suse.de>
Add support for the NUMA API to tmpfs and hugetlbfs. Shared memory is a
bit of a special case for NUMA policy. Normally policy is associated with VMAs
or with processes, but for a shared memory segment you really want to share the
policy. The core NUMA API has code for that, this patch adds the necessary
changes to tmpfs and hugetlbfs.
First it changes the custom swapping code in tmpfs to follow the policy set
via VMAs.
It is also useful to have a "backing store" of policy that saves the policy
even when nobody has the shared memory segment mapped. This allows command-line
tools to preconfigure policy, which is then later used by programs.
Note that hugetlbfs needs more changes - it is also required to switch it to
lazy allocation, otherwise the prefault prevents mbind() from working.
|
|
Intro to these patches:
- Major surgery against the pagecache, radix-tree and writeback code. This
work is to address the O_DIRECT-vs-buffered data exposure horrors which
we've been struggling with for months.
As a side-effect, 32 bytes are saved from struct inode and eight bytes
are removed from struct page, at a cost of approximately 2.5 bits per page
in the radix-tree nodes on 4k pagesize, assuming the pagecache is densely
populated. Not all pages are pagecache; other pages gain the full 8-byte
saving.
This change will break any arch code which is using page->list and will
also break any arch code which is using page->lru of memory which was
obtained from slab.
The basic problem which we (mainly Daniel McNeil) have been struggling
with is in getting a really reliable fsync() across the page lists while
other processes are performing writeback against the same file. It's like
juggling four bars of wet soap with your eyes shut while someone is
whacking you with a baseball bat. Daniel pretty much has the problem
plugged but I suspect that's just because we don't have testcases to
trigger the remaining problems. The complexity and additional locking
which those patches add is worrisome.
So the approach taken here is to remove the page lists altogether and
replace the list-based writeback and wait operations with in-order
radix-tree walks.
The radix-tree code has been enhanced to support "tagging" of pages, for
later searches for pages which have a particular tag set. This means that
we can ask the radix tree code "find me the next 16 dirty pages starting at
pagecache index N" and it will do that in O(log64(N)) time.
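The tagged-search interface can be modeled with a toy linear version. The real radix tree stores tags in interior nodes so whole clean subtrees are skipped in O(log) time; this sketch only shows the "next N dirty pages from index start, in order" contract, not the efficient search:

```c
#include <assert.h>

#define NR_SLOTS 64   /* toy pagecache size; illustrative only */

/* Return up to max_pages dirty slot indices >= start, in index order,
 * storing them into out[] and returning how many were found. */
static int find_get_dirty(const unsigned char *dirty, unsigned start,
                          unsigned *out, int max_pages)
{
    int n = 0;
    for (unsigned i = start; i < NR_SLOTS && n < max_pages; i++)
        if (dirty[i])
            out[n++] = i;
    return n;
}
```

Writeback built on such a walk submits pages in file-offset order regardless of dirtying order, which is the I/O-scheduling change the paragraph above describes.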
This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.
This is likely to be advantageous when applications are seeking all over
a large file randomly writing small amounts of data. I haven't performed
much benchmarking, but tiobench random write throughput seems to be
increased by 30%. Other tests appear to be unaltered. dbench may have got
10-20% quicker, but it's variable.
There is one large file which everyone seeks all over randomly writing
small amounts of data: the blockdev mapping which caches filesystem
metadata. The kernel's IO submission patterns for this are now ideal.
Because writeback and wait-for-writeback use a tree walk instead of a
list walk they are no longer livelockable. This probably means that we no
longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
fdatasync(). This may be beneficial for databases: multiple processes
writing and syncing different parts of the same file at the same time can
now all submit and wait upon writes to just their own little bit of the
file, so we can get a lot more data into the queues.
It is trivial to implement a part-file-fdatasync() as well, so
applications can say "sync the file from byte N to byte M", and multiple
applications can do this concurrently. This is easy for ext2 filesystems,
but probably needs lots of work for data-journalled filesystems and XFS and
it probably doesn't offer much benefit over an i_sem-less O_SYNC write.
These patches can end up making ext3 (even) slower:
for i in 1 2 3 4
do
dd if=/dev/zero of=$i bs=1M count=2000 &
done
runs awfully slow on SMP. This is, yet again, because all the file
blocks are jumbled up and the per-file linear writeout causes tons of
seeking. The above test runs sweetly on UP because on UP we don't
allocate blocks to different files in parallel.
Mingming and Badari are working on getting block reservation working for
ext3 (preallocation on steroids). That should fix ext3 up.
This patch:
- Later, we'll need to access the radix trees from inside disk I/O
completion handlers. So make mapping->page_lock irq-safe. And rename it
to tree_lock to reliably break any missed conversions.
|
|
New inlined helper - file_accessed(file) (wrapper for update_atime())
|
|
From: Arun Sharma <arun.sharma@intel.com>
The current Linux implementation of shmat() insists on SHMLBA alignment even
when (shmflg & SHM_RND) == 0. This is not consistent with the man pages and
the Single UNIX Specification, which require only a page-aligned address.
However, some architectures require a SHMLBA alignment for correctness in all
cases. Such architectures use __ARCH_FORCE_SHMLBA.
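A rough model of the alignment rule being fixed, with assumed page and SHMLBA sizes (both are arch-dependent in reality, and the `_SK` names are illustrative):

```c
#include <assert.h>

#define PAGE_SIZE_SK 4096UL
#define SHMLBA_SK    (4 * PAGE_SIZE_SK)  /* assumed arch SHMLBA */
#define SHM_RND_SK   0x2000              /* stand-in for SHM_RND */

/* Returns the attach address, or -1 for EINVAL. With SHM_RND the
 * address is rounded down to SHMLBA; otherwise only page alignment is
 * required, unless the arch defines __ARCH_FORCE_SHMLBA (modeled by
 * force_shmlba). */
static long choose_addr(unsigned long shmaddr, int shmflg, int force_shmlba)
{
    if (shmflg & SHM_RND_SK)
        return shmaddr & ~(SHMLBA_SK - 1);   /* round down to SHMLBA */
    if (force_shmlba && (shmaddr & (SHMLBA_SK - 1)))
        return -1;                           /* arch requires SHMLBA */
    if (shmaddr & (PAGE_SIZE_SK - 1))
        return -1;                           /* must be page-aligned */
    return shmaddr;
}
```

The pre-patch behavior corresponds to force_shmlba always being set; the fix makes the page-aligned case succeed on architectures that don't need SHMLBA.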
|
|
From: Manfred Spraul <manfred@colorfullife.com>
There are a few unchecked do_munmap()s in the shm code. Manfred's comment
explains why they are OK.
|
|
This renames sys_shmat to do_shmat. Additionally, I've replaced the
cond_syscall with a conditional inline function.
It touches all archs - only i386 is tested.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
sys_shmat() needs to be declared asmlinkage. This causes breakage when we
actually get the proper prototypes into caller's scope.
|
|
One more overlooked area where the proper process ID has to be used:
SysV IPC "pid" values should use the thread group ID, not the per-thread
one.
|
|
From: Daniel McNeil <daniel@osdl.org>
This adds i_seqcount to the inode structure and then uses i_size_read() and
i_size_write() to provide atomic access to i_size. This is a port of
Andrea Arcangeli's i_size atomic access patch from 2.4. This only uses the
generic reader/writer consistent mechanism.
Before:
mnm:/usr/src/25> size vmlinux
text data bss dec hex filename
2229582 1027683 162436 3419701 342e35 vmlinux
After:
mnm:/usr/src/25> size vmlinux
text data bss dec hex filename
2225642 1027655 162436 3415733 341eb5 vmlinux
3.9k more text, a lot of it fastpath :(
It's a very minor bug, and the fix has a fairly non-minor cost. The most
compelling reason for fixing this is that writepage() checks i_size. If it
sees a transient value it may decide that page is outside i_size and will
refuse to write it. Lost user data.
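The seqcount idea behind i_size_read()/i_size_write() can be sketched as below. This single-threaded model shows only the retry protocol; the real kernel code additionally needs memory barriers, and the writer must be serialized externally:

```c
#include <assert.h>

/* Sketch of seqcount-protected i_size: the sequence number is odd while
 * a write is in progress, and readers retry until they see a stable,
 * even sequence before and after reading the value. */
struct inode_sketch {
    unsigned seq;          /* even = stable, odd = writer active */
    long long i_size;
};

static void i_size_write_sk(struct inode_sketch *in, long long v)
{
    in->seq++;             /* now odd: readers will retry */
    in->i_size = v;
    in->seq++;             /* even again: value is stable */
}

static long long i_size_read_sk(const struct inode_sketch *in)
{
    unsigned s;
    long long v;
    do {
        s = in->seq;
        v = in->i_size;
    } while (s != in->seq || (s & 1));   /* retry on torn/odd read */
    return v;
}
```

This is what prevents writepage() from acting on a transient i_size half-update on 32-bit SMP, at the text-size cost lamented above.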
|
|
|
|
From: Stewart Smith <stewartsmith@mac.com>
Remove the UPDATE_ATIME() macro, use update_atime() directly.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
shm_get_stat() didn't know about hugetlbpage-backed shm.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
Micro-optimize sys_shmdt(). There are methods of exploiting knowledge
of the vmas being searched to restrict the search space. These are:
(1) shm mappings always start their lives at file offset 0, so only
vmas above shmaddr need be considered. find_vma() can be used
to seek to the proper position in mm->mmap in O(lg(n)) time.
(2) The search is for a vma which could be a fragment of a broken-up
shm mapping, which would have been created starting at shmaddr
with vm_pgoff 0 and then continued no further into userspace
than shmaddr + size. So after having found an initial vma, find
the size of the shm segment it maps to calculate an upper bound
to the virtualspace that needs to be searched.
(3) mremap() would have caused the original checks to miss vmas mapping
the shm segment if shmaddr were the original address at which
the shm segments were attached. This does no better and no worse
than the original code in that situation.
(4) If the chain of references in vma->vm_file->f_dentry->d_inode->i_size
is not guaranteed by refcounting and/or the shm code then this is
oopsable; AFAICT an inode is always allocated.
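Point (1)'s O(lg(n)) positioning can be illustrated with a toy binary search over sorted vma start addresses; in the kernel, find_vma() plays this role over mm->mmap, so the helper below is purely illustrative:

```c
#include <assert.h>

/* Return the index of the first vma whose start address is >= addr,
 * or n if none. Equivalent in spirit to using find_vma() to skip all
 * vmas below shmaddr instead of scanning the whole list. */
static int first_at_or_above(const unsigned long *starts, int n,
                             unsigned long addr)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (starts[mid] < addr)
            lo = mid + 1;        /* everything here is below addr */
        else
            hi = mid;
    }
    return lo;
}
```

Combined with point (2)'s upper bound derived from the segment size, only the vmas inside [shmaddr, shmaddr + size) need to be examined.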
|
|
This patch adds the remaining System V IPC hooks, including the inline
documentation for them in security.h. This includes a restored
sem_semop hook, as it does seem to be necessary to support fine-grained
access.
All of these System V IPC hooks are used by SELinux. The SELinux System
V IPC access controls were originally described in the technical report
available from http://www.nsa.gov/selinux/slinux-abs.html, and the
LSM-based implementation is described in the technical report available
from http://www.nsa.gov/selinux/module-abs.html.
|
|
From Rohit Seth
Attached is a patch that passes the correct information back to user
land for the number of attachments to a shared memory segment. I could have
made a few more changes to the way nattach is set for the regular cases,
but I want to limit the scope at this point.
|
|
and net/* files.
|
|
|
|
Patch from Hugh Dickins <hugh@veritas.com>
Fixes the Oracle startup problem reported by Alessandro Suardi.
Reverts a "simplification" to shmdt() which was wrong if subsequent
mprotects broke up the original VMA, or if parts of it were munmapped.
|
|
stat64 has been changed to return jiffies granularity as nsec in previously
unused fields. This allows make(1) to make better decisions about when
to recompile a file. Loosely follows the Solaris API.
CURRENT_TIME has been redefined to return struct timespec. The users
who don't use it in an inode/attr context have been changed to use a new
get_seconds() function. CURRENT_TIME is implemented by an out-of-line
function.
There is a small performance penalty in this patch. The previous
filemap code had an optimization to flush atime only once a second.
This is currently gone, which will increase flushes a bit. I believe
the correct solution, if this becomes a problem, is to have per-superblock
fields that give an arbitrary atime flush granularity - so that you
can set it to be flushed only once an hour if you prefer that. I will
work on that later in separate patches if the need should arise.
struct inode and the attr struct have been changed to store struct
timespec instead of time_t for [cma]time. Not all file systems support
this granularity, but some like XFS, NFSv3, CIFS and JFS do. The others will
currently truncate the nsec part on flushing to disk. There was some
discussion of this rounding on l-k previously. I went for simple
truncation because there is not much evidence IMHO that the more
complicated roundings have any advantages. In practice applications will
be rather unlikely to notice the rounding anyway - they can only see a
difference when an inode is flushed from memory and reloaded in less than
a second, which is rather unlikely.
|
|
Patch from Mingming, Rusty, Hugh, Dipankar, me:
- It greatly reduces the lock contention by having one lock per id.
The global spinlock is removed and a spinlock is added in
kern_ipc_perm structure.
- Uses Read-Copy Update (RCU) in grow_ary() for lock-free resizing.
- In the places where ipc_rmid() is called, delay calling ipc_free()
to RCU callbacks. This is to prevent ipc_lock() returning an invalid
pointer after ipc_rmid(). In addition, use the workqueue to enable
RCU freeing vmalloced entries.
Also some other changes:
- Remove redundant ipc_lockall/ipc_unlockall
- Now ipc_unlock() directly takes the IPC ID pointer as argument, avoiding
an extra lookup of the array.
The changes are made based on the input from Hugh Dickins, Manfred
Spraul and Dipankar Sarma. In addition, Cliff White has run OSDL's
dbt1 test on a 2-way against the earlier version of this patch.
Results show about 2-6% improvement in the average number of
transactions per second. Here is the summary of his tests:
2.5.42-mm2 2.5.42-mm2-ipclock
-----------------------------
Average over 5 runs 85.0 BT 89.8 BT
Std Deviation 5 runs 7.4 BT 1.0 BT
Average over 4 best 88.15 BT 90.2 BT
Std Deviation 4 best 2.8 BT 0.5 BT
Also, another test today from Bill Hartner:
I tested Mingming's RCU ipc lock patch using a *new* microbenchmark - semopbench.
semopbench was written to test the performance of Mingming's patch.
I also ran a 3 hour stress and it completed successfully.
Explanation of the microbenchmark is below the results.
Here is a link to the microbenchmark source.
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/semopbench.c
SUT : 8-way 700 MHz PIII
I tested 2.5.44-mm2 and 2.5.44-mm2 + RCU ipc patch
>semopbench -g 64 -s 16 -n 16384 -r > sem.results.out
>readprofile -m /boot/System.map | sort -n +0 -r > sem.profile.out
The metric is seconds per repetition. Lower is better.
kernel run 1 run 2
seconds seconds
================== ======= =======
2.5.44-mm2 515.1 515.4
2.5.44-mm2+rcu-ipc 46.7 46.7
With Mingming's patch, the test completes 10X faster.
|
|
From Bill Irwin
Optionally back privileged processes' shm with hugetlbfs.
One of the more common requests for and/or users of hugetlb interfaces
in general are databases using shm. This patch exports functionality
mostly equivalent to tmpfs, adds the calling sequence to ipc/shm.c, and
hashes out a small support function in fs/hugetlbfs/inode.c so that shm
segments may be hugetlbpage-backed if userspace passes a flag to
shmget().
Access to this resource requires CAP_IPC_LOCK.
|
|
|
|
The patch below adds the base set of LSM hooks for System V IPC to the
2.5.41 kernel. These hooks permit a security module to label
semaphore sets, message queues, and shared memory segments and to
perform security checks on these objects that parallel the existing
IPC access checks. Additional LSM hooks for labeling and controlling
individual messages sent on a single message queue and for providing
fine-grained distinctions among IPC operations will be submitted
separately after this base set of LSM IPC hooks has been accepted.
|
|
The old form of designated initializers are obsolete: we need to
replace them with the ISO C forms before 2.6. Gcc has always supported
both forms anyway.
|
|
An acct flag was added to do_munmap, true everywhere but in mremap's
move_vma: instead of updating the arch and driver sources, revert that
change and temporarily mask VM_ACCOUNT around that one do_munmap.
Also, noticed that do_mremap fails needlessly if both shrinking _and_
moving a mapping: update old_len to pass vm area boundaries test.
|
|
If we support mmap MAP_NORESERVE, we should support it on shared
anonymous objects: too bad that needs a few changes. do_mmap_pgoff passes
VM_ACCOUNT (or not) down to shmem_file_setup; the flag is stored in the
shmem info for use by shmem_delete_inode later. Also removed a harmless
but pointless call to shmem_truncate.
|
|
Alan's overcommit patch, brought to 2.5 by Robert Love.
Can't say I've tested its functionality at all, but it doesn't crash,
it has been in -ac and RH kernels for some time, and I haven't observed
any of its functions in profiles.
"So what is strict VM overcommit? We introduce new overcommit
policies that attempt to never succeed an allocation that can not be
fulfilled by the backing store and consequently never OOM. This is
achieved through strict accounting of the committed address space and
a policy to allow/refuse allocations based on that accounting.
In the strictest of modes, it should be impossible to allocate more
memory than available and impossible to OOM. All memory failures
should be pushed down to the allocation routines -- malloc, mmap, etc.
The new modes are available via sysctl (same as before). See
Documentation/vm/overcommit-accounting for more information."
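The strict accounting quoted above reduces to charging committed address space up front and refusing any charge that would exceed the backing store. A toy sketch of that decision (names, units, and the two-mode split are illustrative):

```c
#include <assert.h>

/* Pages of address space currently committed (toy global state). */
static long committed_sk;

/* In strict mode, refuse a commitment that the backing store could not
 * fulfil; in the permissive mode modeled here, always succeed. Returns
 * 0 on success, -1 on refusal (would map to ENOMEM). */
static int vm_enough_memory_sk(long pages, long limit_pages, int strict)
{
    if (!strict) {
        committed_sk += pages;       /* heuristic mode: just account */
        return 0;
    }
    if (committed_sk + pages > limit_pages)
        return -1;                   /* would overcommit: refuse now */
    committed_sk += pages;
    return 0;
}
```

Pushing the failure to this accounting point is what lets malloc/mmap report the error instead of the OOM killer firing later.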
|
|
Martin Schwidefsky <schwidefsky@de.ibm.com> reported "Bug with shared
memory" to LKML 14 May: hang due to schedule in truncate_list_pages
called from .... shm_destroy holding shm_lock spinlock. shm_destroy
needs that lock for shm_rmid, but it can be safely unlocked once link
from id to shp has been removed.
|
|
Also move where we set sma->sem_perm.mode and .key to before ipc_addid() gets called.
|
|
We always returned success even when we had no ->vm_ops
|
|
Separates shmem_sb_info from struct super_block.
|
|
- Jens Axboe: more bio updates, fix some request list bogosity under load
- Al Viro: export seq_xxx functions
- Manfred Spraul: include file cleanups, pc110pad compile fix
- David Woodhouse: fix JFFS2 write error handling
- Dave Jones: start merging up with 2.4.x patches
- Manfred Spraul: coredump fixes, FS event counter cleanups
- me: fix SCSI CD-ROM sectorsize BIO breakage
|
|
- Greg KH: USB updates
- Jens Axboe: more bio updates
- Christoph Rohland: fix up proper shmat semantics
|
|
- Al Viro: mnt_list init
- Jeff Garzik: network driver update (license tags, tulip driver)
- David Miller: sparc, net updates
- Ben Collins: firewire update
- Gerd Knorr: btaudio/bttv update
- Tim Hockin: MD cleanups
- Greg KH, Petko Manolov: USB updates
- Leonard Zubkoff: DAC960 driver update
|
|
- Alan Cox: continued merging
- Mingming Cao: make msgrcv/shmat check the queue/segment ID's properly
- Greg KH: USB serial init failure fix, Xircom serial converter driver
- Neil Brown: nfsd/raid/md/lockd cleanups
- Ingo Molnar: multipath RAID personality, raid xor update
- Hugh Dickins/Marcelo Tosatti: swapin read-ahead race fix
- Vojtech Pavlik: fix up some of the infrastructure for x86-64
- Robert Love: AMD 761 AGP GART support
- Jens Axboe: fix SCSI-generic queue handling race
- me: be sane about page reference bits
|
|
- Russell King: ARM updates
- Al Viro: more init cleanups
- Cort Dougan: more PPC updates
- David Miller: cleanups, pci mmap updates
- Neil Brown: raid resync by sector
- Alan Cox: more merging with -ac
- Johannes Erdfelt: USB updates
- Kai Germaschewski: ISDN updates
- Tobias Ringstrom: dmfe.c network driver update
- Trond Myklebust: NFS client updates and cleanups
|
|
- Rik van Riel and others: mm rw-semaphore (ps/top ok when swapping)
- IDE: 256 sectors at a time is legal, but apparently confuses some
drives. Max out at 255 sectors instead.
- Petko Manolov: USB pegasus driver update
- make the boottime memory map printout at least almost readable.
- USB driver updates
- pte_alloc()/pmd_alloc() need page_table_lock.
|
|
- Jens: better ordering of requests when unable to merge
- Neil Brown: make md work as a module again (we cannot autodetect
in modules, not enough background information)
- Neil Brown: raid5 SMP locking cleanups
- Neil Brown: nfsd: handle Irix NFS clients named pipe behavior and
dentry leak fix
- maestro3 shutdown fix
- fix dcache hash calculation that could cause bad hashes under certain
circumstances (Dean Gaudet)
- David Miller: networking and sparc updates
- Jeff Garzik: include file cleanups
- Andy Grover: ACPI update
- Coda-fs error return fixes
- rth: alpha Jensen update
|
|
- ReiserFS merge
- fix DRM R128/AGP dependency
|
|
|