| Age | Commit message | Author |
|
We ignored umask when creating new queues via mq_open (when creating
with open() on mqueue fs it is ok of course). According to the
specification this is a bug. This trivial patch fixes this.
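A minimal userspace sketch of the rule this fix restores, assuming the usual POSIX convention that the requested creation mode is masked with the process umask. effective_create_mode() is a hypothetical helper for illustration, not the kernel code touched by the patch.

```c
#include <sys/types.h>

/*
 * Hypothetical helper: the permission bits a newly created queue ends up
 * with once the process umask has been applied to the requested mode.
 */
static mode_t effective_create_mode(mode_t requested, mode_t mask)
{
	return requested & ~mask;
}
```

With a umask of 022, a queue requested with mode 0666 would therefore be created with mode 0644.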
Signed-off-by: Krzysztof Benedyczak <golbi@mat.uni.torun.pl>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch contains the most trivial from Rusty's trivial patches:
- spelling fixes
- remove duplicate includes
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Change the /proc/sysvipc/shm|sem|msg files to use the generic seq_file
implementation for struct ipc_ids.
Signed-off-by: Mike Waychison <mikew@google.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The following two patches convert /proc/sysvipc/* to use seq_file.
This gives us the following:
- Self-consistent IPC records in proc.
- O(n) reading of the files themselves.
This patch:
Add a generic method for ipc types to be displayed using seq_file. This
patch abstracts out seq_file iterating over struct ipc_ids into ipc/util.c
Signed-off-by: Mike Waychison <mikew@google.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When I first wrote the compat layer patches, I was somewhat cavalier about
the definition of compat_uid_t and compat_gid_t (or maybe I just
misunderstood :-)). This patch makes the compat types much more consistent
with the types we are being compatible with and hopefully will fix a few
bugs along the way.
    compat type            type in compat arch
    __compat_[ug]id_t      __kernel_[ug]id_t
    __compat_[ug]id32_t    __kernel_[ug]id32_t
    compat_[ug]id_t        [ug]id_t
The difference is that compat_uid_t is always 32 bits (for the archs we
care about) but __compat_uid_t may be 16 bits on some.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
semundo->lock can leak if semundo->refcount goes from 2 to 1 while
another thread has it locked. This causes major problems for PREEMPT
kernels.
The simplest fix for now is to undo the single-thread optimization.
This bug was found via relentless testing by Dominik Karall.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix bug found by Grant Coady <lkml@dodo.com.au>'s autobuild setup.
shmem_set_policy() and shmem_get_policy() are macros if !CONFIG_SHMEM, so this
doesn't work.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes some minor bugs introduced by the previous patch (remove
old syscalls). Both patches remove the obsolete syscalls. The changes in
this patch were suggested by Arnd Bergmann. The vmlinux.lds.S changes are
required for the latest gcc/binutils.
Signed-off-by: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
GCC 4 complains because the function put_compat_shminfo() can't get to its
return statement if there is no error... If the function does not return
-EFAULT, it doesn't return anything at all. Looks like a typo.
Signed-off-by: Jesse Millan <jessem@cs.pdx.edu>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Patrick noticed that the initial scan of the semaphore operations logs
decrease and increase operations separately, but then both cases are or'ed
together and decrease is never used. The attached patch removes the
decrease parameter - it shrinks sys_semtimedop() by 56 bytes.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
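A rough sketch of the idea, assuming the helper reduces to a single unsigned range check. SKETCH_NSIG and valid_signal_sketch() are illustrative names, and 64 is only an example value. Taking the argument as unsigned means negative inputs fail the same comparison, so callers need no separate "sig < 0" test (one source of gcc -W warnings) and cannot get the boundary wrong.

```c
/* Example value only; the real limit is architecture dependent. */
#define SKETCH_NSIG 64

/*
 * Signal numbers up to and including SKETCH_NSIG are accepted.
 * A negative int converted to unsigned long becomes a huge value
 * and is rejected by the same test.
 */
static int valid_signal_sketch(unsigned long sig)
{
	return sig <= SKETCH_NSIG ? 1 : 0;
}
```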
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
This patch converts verify_area to access_ok in arch/i386, fs/, kernel/ and a
few other bits that didn't fit in the other patches or that I actually was
able to test on my hardware - this is by far the best tested of all the
patches.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch pulls together the compat_sigevent structs. It also
consolidates the copying of these structures into the kernel.
The only part of the second union in sigevent that the kernel looks at
currently is the _tid, so that is the only bit we copy.
This patch depends on my previous two patches "add and use
COMPAT_SIGEV_PAD_SIZE" and "Consolidate the last compat sigvals".
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add linked list of auxiliary data to audit_context
Add callbacks in IPC_SET functions to record requested changes.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
|
My patch that removed the spin_lock calls from the tail of sys_semtimedop
introduced a bug:
Before my patch was merged, every operation that altered an array called
update_queue. That call woke up threads that were waiting until a
semaphore value becomes 0. I've accidentally removed that call.
The attached patch fixes that by modifying update_queue: the function now
loops internally and wakes up all threads. The patch also removes
update_queue calls from the error path of sys_semtimedop: failed operations
do not modify the array, no need to rescan the list of waiting threads.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Michael Kerrisk has observed that at present any process can SHM_LOCK any
shm segment of size within process RLIMIT_MEMLOCK, despite having no
permissions on the segment: surprising, though not obviously evil. And any
process can SHM_UNLOCK any shm segment, despite no permissions on it: that
is surely wrong.
Unless CAP_IPC_LOCK, restrict both SHM_LOCK and SHM_UNLOCK to when the
process euid matches the shm owner or creator: that seems the least
surprising behaviour, which could be relaxed if a need appears later.
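The rule above can be sketched as a small predicate. This is an illustrative userspace version with hypothetical names (may_lock_segment(), has_cap_ipc_lock); it is not the code added to ipc/shm.c.

```c
#include <errno.h>
#include <sys/types.h>

/*
 * Returns 0 if the caller may SHM_LOCK/SHM_UNLOCK the segment, -EPERM
 * otherwise. has_cap_ipc_lock stands in for a CAP_IPC_LOCK capability
 * check; owner and creator are the segment's uid and cuid.
 */
static int may_lock_segment(int has_cap_ipc_lock, uid_t euid,
			    uid_t owner, uid_t creator)
{
	if (has_cap_ipc_lock)
		return 0;
	if (euid == owner || euid == creator)
		return 0;
	return -EPERM;
}
```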
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
register_sysctl_table() fails if sysctl support is not compiled into the
kernel. The POSIX message queue subsystem aborted its initialization if
register_sysctl_table() failed, and that caused an oops in sys_mq_open().
The patch fixes that by ignoring failures from register_sysctl_table().
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
To make spinlock/rwlock initialization consistent all over the kernel,
this patch converts explicit lock-initializers into spin_lock_init() and
rwlock_init() calls.
Currently, spinlocks and rwlocks are initialized in two different ways:
    lock = SPIN_LOCK_UNLOCKED
    spin_lock_init(&lock)

    rwlock = RW_LOCK_UNLOCKED
    rwlock_init(&rwlock)
this patch converts all explicit lock initializations to
spin_lock_init() or rwlock_init(). (Besides consistency this also helps
automatic lock validators and debugging code.)
The conversion was done with a script, verified manually, and then
reviewed, compiled and tested as far as possible on x86, ARM, PPC.
There is no runtime overhead or actual code change resulting out of this
patch, because spin_lock_init() and rwlock_init() are macros and are
thus equivalent to the explicit initialization method.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch uses the rcu_assign_pointer() API to eliminate a number of explicit
memory barriers from the SysV IPC code that uses RCU. It also restructures
the ipc_ids structure so that the array size is stored in the same memory
block as the array itself (see the new struct ipc_id_ary). This prevents the
race that the earlier code was subject to, where a reader could see a mismatch
between the size and the actual array. With the size stored with the array,
the possibility of mismatch is eliminated -- without the need for careful
ordering and explicit memory barriers. This has been tested successfully on
i386 and ppc64.
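A minimal userspace sketch of the layout idea, assuming a flexible array member. id_ary_sketch, id_ary_alloc() and id_ary_get() are illustrative names. Because the size lives in the same allocation as the slots, a reader that obtains one pointer to the block always sees a size that matches that array.

```c
#include <stdlib.h>

/* Size and slots share one memory block. */
struct id_ary_sketch {
	int size;
	void *p[];	/* flexible array member, 'size' entries */
};

/* Allocate a block with 'size' empty slots; returns NULL on failure. */
static struct id_ary_sketch *id_ary_alloc(int size)
{
	struct id_ary_sketch *a;

	if (size < 0)
		return NULL;
	a = calloc(1, sizeof(*a) + (size_t)size * sizeof(a->p[0]));
	if (a)
		a->size = size;
	return a;
}

/* Bounds-checked lookup against the size stored with the array itself. */
static void *id_ary_get(const struct id_ary_sketch *a, int idx)
{
	if (idx < 0 || idx >= a->size)
		return NULL;
	return a->p[idx];
}
```

Growing the table then means building a new, larger block and publishing the single pointer to it, rather than updating a size and an array pointer separately.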
Signed-off-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32/64
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
    memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
    memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
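The sizeof difference can be reproduced in a few lines of userspace C. This is a sketch with made-up types (limit_sketch, task_sketch) and an arbitrary array length, not the kernel structures.

```c
#include <string.h>

#define SKETCH_NLIMITS 4	/* arbitrary length for illustration */

struct limit_sketch {
	unsigned long cur, max;
};

struct task_sketch {
	struct limit_sketch rlim[SKETCH_NLIMITS];
};

/* Buggy form: sizeof(*(array)) is the size of ONE element. */
static void reset_first_only(struct task_sketch *t,
			     const struct task_sketch *init)
{
	memcpy(t->rlim, init->rlim, sizeof(*(t->rlim)));
}

/* Intended form: sizeof(array) covers every element. */
static void reset_all(struct task_sketch *t, const struct task_sketch *init)
{
	memcpy(t->rlim, init->rlim, sizeof(t->rlim));
}
```

Note that this only works because rlim is a true array in scope; with a pointer, sizeof would give the pointer size in both forms.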
The other subtlety is removing:
    tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be more generally in a dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note: this does not achieve 100% POSIX compliance with regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
During the kernel summit, some discussion was had about the support
requirements for a userspace program loader that loads executables into
hugetlb on behalf of a major application (Oracle). In order to support
this in a robust fashion, the cleanup of the hugetlb must be robust in the
presence of disorderly termination of the programs (e.g. kill -9). Hence,
the cleanup semantics are those of System V shared memory, but Linux'
System V shared memory needs one critical extension for this use:
executability.
The following microscopic patch enables this major application to provide
robust hugetlb cleanup.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Michael Kerrisk found a bug in the shm accounting code: sysv shm allows creating
create SHMMNI+1 shared memory segments, instead of SHMMNI segments. The +1
is probably from the first shared anonymous mapping implementation that
used the sysv code to implement shared anon mappings.
The implementation got replaced, it's now the other way around (sysv uses
the shared anon code), but the +1 remained.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Here is the last agreed-on patch that lets normal users mlock pages up to
their rlimit. This patch addresses all the issues brought up by Chris and
Andrea.
From: Chris Wright <chrisw@osdl.org>
Couple more nits.
The default lockable amount is one page now (in the first patch it was 0). Why
don't we keep it as 0, with the CAP_IPC_LOCK overrides in place? That way
nothing is changed from user perspective, and the rest of the policy can be
done by userspace as it should.
This patch breaks in one scenario. When ulimit == 0, process has
CAP_IPC_LOCK, and does SHM_LOCK. The subsequent unlock or destroy will
corrupt the locked_shm count.
It's also inconsistent in handling user_can_mlock/CAP_IPC_LOCK interaction
between shm_lock and shm_hugetlb.
SHM_HUGETLB can now only be done by the shm_group or CAP_IPC_LOCK.
Not any can_do_mlock() user.
Double check of can_do_mlock isn't needed in SHM_LOCK path.
Interface names user_can_mlock and user_substract_mlock could be better.
Incremental update below. Ran some simple sanity tests on this plus my
patch below and didn't find any problems.
* Make default RLIM_MEMLOCK limit 0.
* Move CAP_IPC_LOCK check into user_can_mlock to be consistent
and fix bug with ulimit == 0 && CAP_IPC_LOCK with SHM_LOCK.
* Allow can_do_mlock() user to try SHM_HUGETLB setup.
* Remove unnecessary extra can_do_mlock() test in shmem_lock().
* Rename user_can_mlock to user_shm_lock and user_subtract_mlock
to user_shm_unlock.
* Use user instead of current->user to fit in 80 cols on SHM_LOCK.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Remove now-unneeded open-coded unlikelies around IS_ERR().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use abstracted RCU API to dereference RCU protected data. Hides barrier
details. Patch from Paul McKenney.
This patch introduced an rcu_dereference() macro that replaces most uses of
smp_read_barrier_depends(). The new macro has the advantage of explicitly
documenting which pointers are protected by RCU -- in contrast, it is
sometimes difficult to figure out which pointer is being protected by a given
smp_read_barrier_depends() call.
Signed-off-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Attached is a cleanup of the main loops in sys_msgrcv and sys_msgsnd, based on
ipc_lock_by_ptr(). Most backward gotos are gone; instead, normal "for(;;)"
loops run until a suitable message is found.
Description:
- General cleanup of sys_msgrcv and sys_msgsnd: the functions were too
convoluted.
- Enable lockless receive, update comments.
- Use ipc_getref for sys_msgsnd(), it's better than rechecking that the
msqid is still valid.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Independent from the other patches:
undo operations should not result in out of range semaphore values. The test
for newval > SEMVMX is missing. The attached patch adds the test and a
comment.
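A sketch of the range check, assuming the out-of-range result is clamped rather than rejected (the message only says a test is added). SKETCH_SEMVMX and apply_undo_adjust() are illustrative names; 32767 is used as an example maximum.

```c
#define SKETCH_SEMVMX 32767	/* example maximum semaphore value */

/*
 * Apply an undo adjustment to a semaphore value and keep the result
 * inside [0, SKETCH_SEMVMX]. The upper-bound test is the one described
 * as missing above.
 */
static int apply_undo_adjust(int semval, int adj)
{
	int newval = semval + adj;

	if (newval < 0)
		newval = 0;
	if (newval > SKETCH_SEMVMX)
		newval = SKETCH_SEMVMX;
	return newval;
}
```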
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch removes sem_revalidate and replaces it with
ipc_rcu_getref() calls followed by ipc_lock_by_ptr().
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The lifetime of the ipc objects (sem array, msg queue, shm mapping) is
controlled by kern_ipc_perms->lock - a spinlock. There is no simple way to
reacquire this spinlock after it was dropped to
schedule()/kmalloc/copy_{to,from}_user/whatever.
The attached patch adds a reference count as a preparation to get rid of
sem_revalidate().
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
ipc compat code switched to compat_alloc_user_space() and annotated.
|
|
|
|
From: Dipankar Sarma <dipankar@in.ibm.com>
This patch changes the call_rcu() API and avoids passing an argument to the
callback function as suggested by Rusty. Instead, it is assumed that the
user has embedded the rcu head into a structure that is useful in the
callback and the rcu_head pointer is passed to the callback. The callback
can use container_of() to get the pointer to its structure and work with
it. Together with the rcu-singly-link patch, it reduces the rcu_head size
by 50%. Considering that we use these in things like struct dentry and
struct dst_entry, this is good savings in space.
An example :
    struct my_struct {
        struct rcu_head rcu;
        int x;
        int y;
    };

    void my_rcu_callback(struct rcu_head *head)
    {
        struct my_struct *p = container_of(head, struct my_struct, rcu);
        free(p);
    }

    void my_delete(struct my_struct *p)
    {
        ...
        call_rcu(&p->rcu, my_rcu_callback);
        ...
    }
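The same pattern can be tried in userspace. The sketch below uses stand-in names (cb_head_sketch, container_of_sketch, call_later_sketch, run_callback_sketch) in place of the kernel's rcu_head, container_of() and call_rcu(). There is no grace period here; the callback is simply invoked by hand and sets a flag instead of freeing.

```c
#include <stddef.h>

/* Stand-in for struct rcu_head: just remembers the callback. */
struct cb_head_sketch {
	void (*func)(struct cb_head_sketch *head);
};

/* Recover the enclosing structure from a pointer to its member. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct item_sketch {
	int x;
	struct cb_head_sketch head;	/* embedded, deliberately not first */
	int reclaimed;
};

/* Callback: receives only the embedded head, no separate argument. */
static void item_reclaim(struct cb_head_sketch *head)
{
	struct item_sketch *p =
		container_of_sketch(head, struct item_sketch, head);

	p->reclaimed = 1;
}

/* Register the callback on the embedded head. */
static void call_later_sketch(struct cb_head_sketch *head,
			      void (*func)(struct cb_head_sketch *head))
{
	head->func = func;
}

/* Invoke the registered callback, passing the head itself. */
static void run_callback_sketch(struct cb_head_sketch *head)
{
	head->func(head);
}
```

Dropping the separate argument pointer is what lets the head shrink: the callback can always find its data from the head's own address.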
Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Lower default sizes for POSIX mqueue allocation now that rlimits are in place.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a user_struct to the mq_inode_info structure. Charge the maximum number
of bytes that could be allocated to a mqueue to the user who creates the
mqueue. This is checked against the per user rlimit.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add helper function mq_attr_ok() to do mq_attr sanity checking, and do some
extra overflow checking.
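A sketch of what such overflow checking can look like, assuming the sizes of interest are the message payload (mq_maxmsg * mq_msgsize) plus one pointer per message. attr_sizes_ok() is a hypothetical userspace stand-in, not the helper added by the patch, and it omits any limit or privilege checks.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Returns 1 if the queue dimensions are positive and the total
 * allocation size can be computed without wrapping, 0 otherwise.
 */
static int attr_sizes_ok(long maxmsg, long msgsize)
{
	unsigned long n, sz, payload, ptrs;

	if (maxmsg <= 0 || msgsize <= 0)
		return 0;
	n = (unsigned long)maxmsg;
	sz = (unsigned long)msgsize;

	/* n * sz must not wrap */
	if (sz > ULONG_MAX / n)
		return 0;
	payload = n * sz;

	/* one pointer per message must not wrap either */
	if (n > ULONG_MAX / sizeof(void *))
		return 0;
	ptrs = n * sizeof(void *);

	/* and neither must the sum */
	if (payload + ptrs < payload)
		return 0;
	return 1;
}
```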
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: Andi Kleen <ak@suse.de>
Add support to tmpfs and hugetlbfs to support NUMA API. Shared memory is a
bit of a special case for NUMA policy. Normally policy is associated to VMAs
or to processes, but for a shared memory segment you really want to share the
policy. The core NUMA API has code for that, this patch adds the necessary
changes to tmpfs and hugetlbfs.
First it changes the custom swapping code in tmpfs to follow the policy set
via VMAs.
It is also useful to have a "backing store" of policy that saves the policy
even when nobody has the shared memory segment mapped. This allows command
line tools to pre configure policy, which is then later used by programs.
Note that hugetlbfs needs more changes - it is also required to switch it to
lazy allocation, otherwise the prefault prevents mbind() from working.
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Below is a patch that tries to sanitize the dropping of unneeded system-call
stubs in generic code. In some instances, it would be possible to move the
optional system-call stubs into a library routine which would avoid the need
for #ifdefs, but in many cases, doing so would require making several
functions global (and possibly exporting additional data-structures in
header-files). Furthermore, it would inhibit (automatic) inlining in the
cases where the stubs are needed. For these reasons, the patch
keeps the #ifdef-approach.
This has been tested on ia64 and there were no objections from the
arch-maintainers (and one positive response). The patch should be safe but
arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo
macros should be removed for their architecture (I'm quite sure that's the
case, but I wanted to play it safe and only preserved the status-quo in that
regard).
|
|
From: Chris Wright <chrisw@osdl.org>
Currently, if a user creates an mqueue and passes an mq_attr, the
info->messages will be created twice (and the extra one is properly freed).
This patch simply delays the allocation so that it only ever happens once.
The relevant mq_attr data is passed to lower levels via the dentry->d_fsdata
fs private data. This also helps isolate the areas we'd need to touch to do
rlimits on mqueues.
|
|
During mqueue_get_inode(), it's possible that kmalloc() of the
info->messages array will fail. This failure mode will cause the
queues_count to be (incorrectly) decremented twice. This patch uses
info->messages on mqueue_delete_inode() to determine whether the
mqueue was ever truly created, and hence proper accounting is needed
on destruction.
|
|
Move error handling to capture all three possible error conditions on
sending to a full queue. Without this fix any unprivileged user can
leak arbitrary amounts of kernel memory.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Any user can delete any entries in a mqueue mounted filesystem. The attached
patch prevents that.
- remove the writable test from mq_unlink.
- set the sticky bit in the root inode. This affects both mq_unlink and
sys_unlink: only the owner (and root) should be allowed to remove queues.
|
|
From: Chris Wright <chrisw@osdl.org>
SUSv3 doesn't seem to specify one way or the other. I don't have the POSIX
specs, and the old docs I have suggest that mq_open() creates an object
which is to be closed upon exec.
Jakub said:
I think it is valid and required:
http://www.opengroup.org/onlinepubs/007904975/functions/exec.html
All open message queue descriptors in the calling process shall be
closed, as described in mq_close()
I'll add a new test for this into glibc testsuite.
|
|
From: Jakub Jelinek <jakub@redhat.com>
    mq_notify (q, NULL)
and
    struct sigevent ev = { .sigev_notify = SIGEV_NONE };
    mq_notify (q, &ev)
are not the same thing in POSIX, yet the kernel treats them the same. Only
the former makes the notification available to other processes immediately,
see
http://www.opengroup.org/onlinepubs/007904975/functions/mq_notify.html
Without the patch below,
http://sources.redhat.com/ml/libc-hacker/2004-04/msg00028.html
glibc test fails.
I looked at mq in Solaris and they behave the same in this regard as Linux
with this patch. Kernel with this patch passes both Intel POSIX testsuite
(with testsuite fixes from Ulrich) and glibc mq testsuite.
|
|
Intro to these patches:
- Major surgery against the pagecache, radix-tree and writeback code. This
work is to address the O_DIRECT-vs-buffered data exposure horrors which
we've been struggling with for months.
As a side-effect, 32 bytes are saved from struct inode and eight bytes
are removed from struct page. At a cost of approximately 2.5 bits per page
in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
populated. Not all pages are pagecache; other pages gain the full 8 byte
saving.
This change will break any arch code which is using page->list and will
also break any arch code which is using page->lru of memory which was
obtained from slab.
The basic problem which we (mainly Daniel McNeil) have been struggling
with is in getting a really reliable fsync() across the page lists while
other processes are performing writeback against the same file. It's like
juggling four bars of wet soap with your eyes shut while someone is
whacking you with a baseball bat. Daniel pretty much has the problem
plugged but I suspect that's just because we don't have testcases to
trigger the remaining problems. The complexity and additional locking
which those patches add is worrisome.
So the approach taken here is to remove the page lists altogether and
replace the list-based writeback and wait operations with in-order
radix-tree walks.
The radix-tree code has been enhanced to support "tagging" of pages, for
later searches for pages which have a particular tag set. This means that
we can ask the radix tree code "find me the next 16 dirty pages starting at
pagecache index N" and it will do that in O(log64(N)) time.
This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.
This is likely to be advantageous when applications are seeking all over
a large file randomly writing small amounts of data. I haven't performed
much benchmarking, but tiobench random write throughput seems to be
increased by 30%. Other tests appear to be unaltered. dbench may have got
10-20% quicker, but it's variable.
There is one large file which everyone seeks all over randomly writing
small amounts of data: the blockdev mapping which caches filesystem
metadata. The kernel's IO submission patterns for this are now ideal.
Because writeback and wait-for-writeback use a tree walk instead of a
list walk they are no longer livelockable. This probably means that we no
longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
fdatasync(). This may be beneficial for databases: multiple processes
writing and syncing different parts of the same file at the same time can
now all submit and wait upon writes to just their own little bit of the
file, so we can get a lot more data into the queues.
It is trivial to implement a part-file-fdatasync() as well, so
applications can say "sync the file from byte N to byte M", and multiple
applications can do this concurrently. This is easy for ext2 filesystems,
but probably needs lots of work for data-journalled filesystems and XFS and
it probably doesn't offer much benefit over an i_semless O_SYNC write.
These patches can end up making ext3 (even) slower:
    for i in 1 2 3 4
    do
        dd if=/dev/zero of=$i bs=1M count=2000 &
    done
runs awfully slow on SMP. This is, yet again, because all the file
blocks are jumbled up and the per-file linear writeout causes tons of
seeking. The above test runs sweetly on UP because on UP we don't
allocate blocks to different files in parallel.
Mingming and Badari are working on getting block reservation working for
ext3 (preallocation on steroids). That should fix ext3 up.
This patch:
- Later, we'll need to access the radix trees from inside disk I/O
completion handlers. So make mapping->page_lock irq-safe. And rename it
to tree_lock to reliably break any missed conversions.
|
|
From: Arnd Bergmann <arnd@arndb.de>
I have tested the code with the open posix test suite and found the same
four failures for both 64-bit and compat mode; most tests pass. The patch
is against -mc1, but I guess it also applies to the other trees around.
What worries me more than mq_attr compatibility is the conversion of struct
sigevent, which might turn out really hard when more fields in there are
used. AFAICS, the only other part in the kernel ABI is sys_timer_create(),
so maybe it's not too late to deprecate the current structure and create a
structure that can be used properly for compat syscalls.
|