|
Pass the POSIX lock owner ID to the flush operation.
This is useful for filesystems which don't want to store any locking state
in inode->i_flock but want to handle locking/unlocking POSIX locks
internally. FUSE is one such filesystem, but I think it is possible that some
network filesystems would need this as well.
Also add a flag to indicate that a POSIX locking request was generated by
close(), so filesystems using the above feature won't send an extra locking
request in this case.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience functions is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees, which results in
dentries being left unculled.
However, with the way NFS superblock sharing is currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
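A minimal userspace model of the new get_sb() convention may help. It assumes heavily simplified stand-ins for struct dentry, struct super_block and struct vfsmount, reduces dget() to a counter increment, and drops the other get_sb() arguments (filesystem type, flags, device name, data). foo_get_sb() is a hypothetical filesystem, following the foo_* naming the documentation now uses.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins (assumption: only the fields needed here). */
struct dentry { int d_count; };
struct super_block { struct dentry *s_root; };
struct vfsmount { struct super_block *mnt_sb; struct dentry *mnt_root; };

/* Stand-in for dget(): take a reference on the dentry. */
static struct dentry *dget(struct dentry *d)
{
	if (d)
		d->d_count++;
	return d;
}

/* Models the described simple_set_mnt(): set the superblock pointer,
 * root the mount at s_root, and always return 0. */
static int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
	mnt->mnt_sb = sb;
	mnt->mnt_root = dget(sb->s_root);
	return 0;
}

/* Hypothetical filesystem: once it has a superblock (passed in
 * directly here), it tail-calls simple_set_mnt(). */
static int foo_get_sb(struct super_block *sb, struct vfsmount *mnt)
{
	if (sb == NULL || sb->s_root == NULL)
		return -EINVAL;
	return simple_set_mnt(mnt, sb);
}
```

In the shared-superblock case described above, a filesystem would skip simple_set_mnt() and assign mnt_sb and mnt_root itself, rooting the mount at some other dentry.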
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds audit support to POSIX message queues. It applies cleanly to
the lspp.b15 branch of Al Viro's git tree. There are new auxiliary data
structures, and collection and emission routines in kernel/auditsc.c. New hooks
in ipc/mqueue.c collect arguments from the syscalls.
I tested the patch by building the examples from the POSIX MQ library tarball.
Build them with -lrt, not against the old MQ library in the tarball. Here's the URL:
http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz
Run auditctl -a exit,always -S for each of mq_open, mq_timedsend, mq_timedreceive,
mq_notify and mq_getsetattr. mq_unlink has no new hooks. Please see the
corresponding userspace patch to get correct output from auditd for the new
record types.
[fixes folded]
Signed-off-by: George Wilson <ltcgcw@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
(akpm: I don't do comment typos patches. This one snuck through by accident)
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Ingo's sem2mutex patch incorrectly replaced one reference to ipc/sem.c
with ipc/mutex.c in a comment.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
NOTIFY_COOKIE_LEN is defined in mqueue.h as well as in mqueue.c.
This patch removes the redundant definition from mqueue.c.
Signed-off-by: Michal Wronski <Michal.Wronski@motorola.com>
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Netlink overrun handling was broken during the recent netlink improvements.
The destination socket is used where the source socket was meant to be, so
overrun is now never reported to user netlink sockets when it should be, and
it can even be set on a kernel socket, which results in a complete deadlock
of rtnetlink.
The suggested fix restores the status quo by passing the source socket as an
additional argument to netlink_attachskb().
A little explanation: overrun is set on a socket when it failed to receive
some message and the sender of that message does not, or even has no way
to, handle this error. This happens in two cases:
1. when the kernel sends something. The kernel never retransmits and cannot
wait for buffer space.
2. when a user sends a broadcast and the message was not delivered
to some recipients.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixed the refcounting on failure exits in sys_mq_open() and
cleaned up the logic. The rules are actually pretty simple: dentry_open()
expects the vfsmount and dentry to be pinned down, and it either transfers
them into the created struct file or drops them. The old code had been very
confused in that area: if dentry_open() failed in either do_open()
or do_create(), we ended up with the dentry and mqueue_mnt dropped twice, once
by the dentry_open() cleanup and then again by sys_mq_open().
The fix makes the rules for do_create() and do_open() the
same as for dentry_open() and updates sys_mq_open() accordingly;
that actually leads to more straightforward code and less work on the
normal path.
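The "transfer or drop" rule can be modelled in a few lines. This is a userspace sketch under stated assumptions: struct ref is a hypothetical stand-in for a pinned dentry or vfsmount, and open_consume() plays the role of dentry_open(). Because the callee releases the reference itself on failure, the caller must not release it again, which is exactly the double drop the old sys_mq_open() performed.

```c
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a pinned dentry. */
struct ref { int count; };

static void ref_put(struct ref *r) { r->count--; }

/* Hypothetical file object that owns the reference it was handed. */
struct file_model { struct ref *owned; };

/* Consumes the caller's reference in every case: on success it moves
 * into the new file, on failure it is dropped here. */
static struct file_model *open_consume(struct ref *r, int fail)
{
	struct file_model *f;

	if (fail || (f = malloc(sizeof(*f))) == NULL) {
		ref_put(r);
		return NULL;
	}
	f->owned = r;
	return f;
}

/* Caller following the fixed rule: pin, hand over, and do no
 * cleanup of its own on failure. */
static struct file_model *caller(struct ref *r, int fail)
{
	r->count++;
	return open_consume(r, fail);
}
```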
Signed-off-by: Al Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
- Move capable() from sched.h to capability.h;
- Use <linux/capability.h> where capable() is used
(in include/, block/, ipc/, kernel/, a few drivers/,
mm/, security/, & sound/;
many more drivers/ to go)
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one reasonably can on ia64. Your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
|
|
We ignored the umask when creating new queues via mq_open() (creating them
with open() on a mounted mqueue fs is fine, of course). According to the
specification this is a bug. This trivial patch fixes it.
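Applying the umask is a single mask operation: bits set in the umask are cleared from the requested mode. A minimal sketch of the arithmetic; effective_mode() is a hypothetical helper, since the kernel applies the caller's umask internally when mq_open() creates the queue.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Permission bits a newly created queue should end up with. */
static mode_t effective_mode(mode_t requested, mode_t umask_bits)
{
	return requested & ~umask_bits;
}
```

For example, mq_open(name, O_CREAT | O_RDWR, 0666, NULL) under a umask of 022 should create a queue with mode 0644.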
Signed-off-by: Krzysztof Benedyczak <golbi@mat.uni.torun.pl>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch contains the most trivial from Rusty's trivial patches:
- spelling fixes
- remove duplicate includes
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
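A sketch of why one helper removes both problems, assuming a hypothetical MODEL_NSIG of 64 in place of the architecture-dependent _NSIG. Signals are numbered 1 to _NSIG and 0 is the "null signal" used by kill(pid, 0), so an upper bound is the only test needed; taking the argument as unsigned long makes a negative value wrap to a huge number that fails that same comparison, with no separate `< 0` test for gcc -W to flag.

```c
/* Assumption: 64 signals; the kernel uses the per-arch _NSIG. */
#define MODEL_NSIG 64

/* Same shape as the kernel helper: 0..NSIG inclusive is valid. A
 * hand-written "sig >= NSIG means invalid" check would wrongly reject
 * the highest-numbered signal, which is the off-by-one being avoided. */
static inline int valid_signal(unsigned long sig)
{
	return sig <= MODEL_NSIG ? 1 : 0;
}
```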
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
register_sysctl_table() fails if sysctl support is not compiled into the
kernel. The POSIX message queue subsystem aborted its initialization if
register_sysctl_table() failed, and that caused an oops in sys_mq_open().
The patch fixes that by ignoring failures from register_sysctl_table().
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, the header is now included from every file
that implements a syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes on 32-bit/64-bit
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
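The sizeof difference is easy to reproduce outside the kernel. In this sketch, struct task_model and struct rlimit_model are hypothetical stand-ins, with 4 limits instead of the real RLIM_NLIMITS. Because rlim is an array member rather than a pointer, sizeof(t->rlim) is the size of the whole array, while sizeof(*(t->rlim)) is the size of element 0 only.

```c
#include <string.h>

/* Hypothetical stand-ins: 4 limits instead of RLIM_NLIMITS. */
#define NLIMITS 4
struct rlimit_model { unsigned long rlim_cur, rlim_max; };
struct task_model { struct rlimit_model rlim[NLIMITS]; };

/* Buggy form: copies only the first element (the RLIMIT_CPU slot). */
static void reset_buggy(struct task_model *t, const struct task_model *init)
{
	memcpy(t->rlim, init->rlim, sizeof(*(t->rlim)));
}

/* Fixed form: copies every element of the array. */
static void reset_fixed(struct task_model *t, const struct task_model *init)
{
	memcpy(t->rlim, init->rlim, sizeof(t->rlim));
}
```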
The other subtlety is removing:
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be in a generally dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note: this patch does not achieve 100% POSIX compliance with regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Remove now-unneeded open-coded unlikelies around IS_ERR().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Lower default sizes for POSIX mqueue allocation now that rlimits are in place.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a user_struct to the mq_inode_info structure. Charge the maximum number
of bytes that could be allocated to a mqueue to the user who creates the
mqueue. This is checked against the per user rlimit.
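A sketch of the up-front charging, under the assumption that the worst case is one pointer-sized slot per message plus mq_maxmsg payloads of mq_msgsize bytes, checked against a RLIMIT_MSGQUEUE-style limit. struct user_model, queue_cost() and charge_queue() are hypothetical names; the real charge lives in user_struct and the kernel's own bookkeeping sizes.

```c
#include <stddef.h>

/* Hypothetical per-user accounting record (cf. user_struct). */
struct user_model { unsigned long mq_bytes; };

/* Worst-case bytes one queue can pin (assumed formula). */
static unsigned long queue_cost(long maxmsg, long msgsize)
{
	return (unsigned long)maxmsg * sizeof(void *) +
	       (unsigned long)maxmsg * (unsigned long)msgsize;
}

/* Charge the creator up front. Refuse if the running total would wrap
 * around or exceed the per-user limit. Returns 0 on success. */
static int charge_queue(struct user_model *u, unsigned long cost,
			unsigned long limit)
{
	if (u->mq_bytes + cost < u->mq_bytes || /* unsigned wraparound */
	    u->mq_bytes + cost > limit)
		return -1;
	u->mq_bytes += cost;
	return 0;
}
```

Charging the maximum at creation time, rather than per message, means a send can never fail later because the user ran out of quota.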
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add helper function mq_attr_ok() to do mq_attr sanity checking, and do some
extra overflow checking.
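A sketch of the kind of validation described, using a hypothetical struct attr_model and passing the limits as parameters (the real helper also treats privileged callers differently). The overflow checks divide before multiplying, so a product is only formed once it is known to fit, and the final sum is checked for unsigned wraparound.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical subset of struct mq_attr. */
struct attr_model { long mq_maxmsg; long mq_msgsize; };

/* Returns 1 if the attributes are sane, 0 otherwise. */
static int attr_ok(const struct attr_model *a, long max_msgs, long max_size)
{
	unsigned long maxmsg, msgsize, payload, table;

	if (a->mq_maxmsg <= 0 || a->mq_msgsize <= 0)
		return 0;
	if (a->mq_maxmsg > max_msgs || a->mq_msgsize > max_size)
		return 0;
	maxmsg = (unsigned long)a->mq_maxmsg;
	msgsize = (unsigned long)a->mq_msgsize;
	/* Multiplication overflow: divide first, then multiply. */
	if (msgsize > ULONG_MAX / maxmsg || maxmsg > ULONG_MAX / sizeof(void *))
		return 0;
	payload = maxmsg * msgsize;
	table = maxmsg * sizeof(void *);
	/* Addition overflow: a wrapped unsigned sum is smaller than
	 * either operand. */
	if (payload + table < payload)
		return 0;
	return 1;
}
```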
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: Chris Wright <chrisw@osdl.org>
Currently, if a user creates an mqueue and passes an mq_attr, the
info->messages will be created twice (and the extra one is properly freed).
This patch simply delays the allocation so that it only ever happens once.
The relevant mq_attr data is passed to lower levels via the dentry->d_fsdata
fs private data. This also helps isolate the areas we'd need to touch to do
rlimits on mqueues.
|
|
During mqueue_get_inode(), it's possible that kmalloc() of the
info->messages array will fail. This failure mode will cause the
queues_count to be (incorrectly) decremented twice. This patch uses
info->messages in mqueue_delete_inode() to determine whether the
mqueue was ever truly created, and hence whether proper accounting is needed
on destruction.
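The idea of using the allocation itself as the "fully created" marker can be sketched in userspace. Everything here is hypothetical: queues_count mirrors the kernel counter, and force_fail simulates the kmalloc() failure. The error path undoes the count once and leaves messages NULL, so the later delete skips the second decrement.

```c
#include <stdlib.h>

/* Hypothetical global mirroring the kernel's queues_count. */
static int queues_count;

struct info_model { void **messages; };

/* Create: count the queue, then allocate; on failure undo the count
 * once and leave messages NULL. */
static int get_inode_model(struct info_model *info, size_t maxmsg,
			   int force_fail)
{
	queues_count++;
	info->messages = force_fail ? NULL : calloc(maxmsg, sizeof(void *));
	if (info->messages == NULL) {
		queues_count--;
		return -1;
	}
	return 0;
}

/* Delete: only a queue whose array was allocated was truly created,
 * so only then is the count dropped. */
static void delete_inode_model(struct info_model *info)
{
	if (info->messages == NULL)
		return;
	free(info->messages);
	info->messages = NULL;
	queues_count--;
}
```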
|
|
Move error handling to capture all three possible error conditions on
sending to a full queue. Without this fix any unprivileged user can
leak arbitrary amounts of kernel memory.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Any user can delete any entry in a mounted mqueue filesystem. The attached
patch prevents that.
- remove the writable test from mq_unlink.
- set the sticky bit in the root inode. This affects both mq_unlink and
sys_unlink: only the owner (and root) should be allowed to remove queues.
|
|
From: Chris Wright <chrisw@osdl.org>
SUSv3 doesn't seem to specify one way or the other. I don't have the POSIX
specs, and the old docs I have suggest that mq_open() creates an object
which is to be closed upon exec.
Jakub said:
I think it is valid and required:
http://www.opengroup.org/onlinepubs/007904975/functions/exec.html
All open message queue descriptors in the calling process shall be
closed, as described in mq_close()
I'll add a new test for this into glibc testsuite.
|
|
From: Jakub Jelinek <jakub@redhat.com>
mq_notify (q, NULL)
and
struct sigevent ev = { .sigev_notify = SIGEV_NONE };
mq_notify (q, &ev)
are not the same thing in POSIX, yet the kernel treats them the same. Only
the former makes the notification available to other processes immediately,
see
http://www.opengroup.org/onlinepubs/007904975/functions/mq_notify.html
Without the patch below,
http://sources.redhat.com/ml/libc-hacker/2004-04/msg00028.html
glibc test fails.
I looked at mq in Solaris and they behave the same in this regard as Linux
with this patch. Kernel with this patch passes both Intel POSIX testsuite
(with testsuite fixes from Ulrich) and glibc mq testsuite.
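The distinction can be modelled as a single registration slot per queue. In this sketch, struct mq_model, notify_model() and the KIND_* values are hypothetical: a NULL request releases the slot if the caller holds it, while a request with a "none" notification type still occupies the slot, so a second process gets EBUSY.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical queue state: pid holding the notification slot, or 0. */
struct mq_model { int notify_owner; int notify_kind; };

enum { KIND_NONE = 0, KIND_SIGNAL = 1 }; /* stand-ins for SIGEV_* */

/* `kind` == NULL models mq_notify(q, NULL). */
static int notify_model(struct mq_model *q, int caller, const int *kind)
{
	if (kind == NULL) {
		/* Deregistration: the slot becomes free for others. */
		if (q->notify_owner == caller)
			q->notify_owner = 0;
		return 0;
	}
	if (q->notify_owner != 0)
		return -EBUSY;
	/* Registration, even with a "none" type, claims the slot. */
	q->notify_owner = caller;
	q->notify_kind = *kind;
	return 0;
}
```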
|
|
From: Manfred Spraul <manfred@colorfullife.com>
SIGEV_THREAD means that a given callback should be called in the context of a
new thread. This must be done by the C library. The kernel must deliver a
notice of the event to the C library when the callback should be called.
This patch switches to a new, simpler interface: User space creates a socket
with socket(PF_NETLINK, SOCK_RAW,0) and passes the fd to the mq_notify call
together with a cookie. When the mq_notify() condition is satisfied, the
kernel "writes" the cookie to the socket. User space then reads the cookie
and calls the appropriate callback.
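The user-space half of this protocol is "read a fixed-size cookie, look up the callback". In the sketch below, an ordinary pipe stands in for the PF_NETLINK socket so it runs anywhere. The 32-byte length matches NOTIFY_COOKIE_LEN, but the convention that the cookie's first byte indexes a callback table is purely hypothetical; the cookie contents are up to the C library.

```c
#include <unistd.h>

#define COOKIE_LEN 32 /* same value as NOTIFY_COOKIE_LEN */
#define NCALLBACKS 4

typedef void (*callback_fn)(void);

/* Hypothetical registry: cookie[0] selects an entry. */
static callback_fn callbacks[NCALLBACKS];
static int fired;
static void on_event(void) { fired++; }

/* Kernel side of the model: "write" the cookie to the descriptor. */
static int deliver(int fd, const unsigned char cookie[COOKIE_LEN])
{
	return write(fd, cookie, COOKIE_LEN) == COOKIE_LEN ? 0 : -1;
}

/* User-space side: read one cookie and run the matching callback. */
static int dispatch(int fd)
{
	unsigned char cookie[COOKIE_LEN];

	if (read(fd, cookie, COOKIE_LEN) != COOKIE_LEN)
		return -1;
	if (cookie[0] >= NCALLBACKS || callbacks[cookie[0]] == NULL)
		return -1;
	callbacks[cookie[0]]();
	return 0;
}
```

In a real library the dispatch loop would run in a helper thread that then starts the SIGEV_THREAD callback on a new thread.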
|
|
From: Manfred Spraul <manfred@colorfullife.com>
I found a security bug in the new mqueue code: a process that has only
write permissions to a message queue could call mq_notify(SIGEV_THREAD) and
use the returned notification file descriptor to read from the message
queue.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
My discussion with Ulrich had one result:
- mq_setattr can accept implementation-defined flags. Right now we have
none, but we might add some later (e.g. switch to CLOCK_MONOTONIC for
mq_timed{send,receive} or something similar). When we add flags, we
might need the fields for additional information. And they don't hurt.
Therefore add four __reserved fields to mq_attr.
- fail mq_setattr if we get unknown flags - otherwise glibc can't detect
if it's running on a future kernel that supports new features.
- use memset to initialize the mq_attr structure - theoretically we could
leak kernel memory.
- Only set O_NONBLOCK in mq_attr; explicitly clear O_RDWR & friends.
openposix uses getattr, attr |= O_NONBLOCK, setattr, which is a sane approach.
Without clearing O_RDWR, this fails.
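A sketch of the getattr/setattr rules listed above. struct attr_flags, getattr_model() and setattr_model() are hypothetical simplifications of struct mq_attr and the kernel paths: getattr zeroes the structure and reports only O_NONBLOCK, and setattr rejects unknown bits. Together they make the openposix read-modify-write pattern succeed.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>

/* Hypothetical subset of struct mq_attr with reserved fields. */
struct attr_flags { long mq_flags; long reserved[4]; };

/* getattr: zero everything so no stale memory leaks out, then report
 * only O_NONBLOCK from the descriptor's flags. */
static void getattr_model(int fd_flags, struct attr_flags *out)
{
	memset(out, 0, sizeof(*out));
	out->mq_flags = fd_flags & O_NONBLOCK;
}

/* setattr: unknown flags fail, so a library can detect that a newer
 * feature is missing; otherwise update only O_NONBLOCK. */
static int setattr_model(int *fd_flags, const struct attr_flags *in)
{
	if (in->mq_flags & ~(long)O_NONBLOCK)
		return -EINVAL;
	*fd_flags = (*fd_flags & ~O_NONBLOCK) | (int)in->mq_flags;
	return 0;
}
```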
I've retested all openposix conformance tests with the new patch - the two
new FAILED tests check undefined behavior. Note that I won't have net
access until Sunday - if the message queue patch breaks something important
either ask Krzysztof or drop it.
Ulrich had another good idea for SIGEV_THREAD, but I must think about it.
It would mean less complexity in glibc, but more code in the kernel. I'm
not yet convinced that it's overall better.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Make the posix message queue mountable by the user. This replaces ipcs and
ipcrm for posix message queue: The admin can check which queues exist with ls
and remove stale queues with rm.
I'd like a final confirmation from Ulrich that our SIGEV_THREAD approach is
the right thing(tm): He's aware of the design and didn't object, but I think
he hasn't seen the final API yet.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Linux specific extension: make the message queue identifiers pollable. It's
simple and could be useful.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Actual implementation of the posix message queues, written by Krzysztof
Benedyczak and Michal Wronski. The complete implementation is dependent on
CONFIG_POSIX_MQUEUE.
It passed the openposix test suite with two exceptions: one mq_unlink test
was bad and tested undefined behavior, and on Linux mq_close(open(,,,))
succeeds. The spec mandates EBADF, but we have decided to ignore
that: we would have to add a new syscall just for the right error code.
The patch intentionally doesn't use all helpers from fs/libfs for kernel-only
filesystems: step 5 allows user space mounts of the file system.
Signal changes:
The patch redefines SI_MESGQ using __SI_CODE: The generic Linux ABI uses
a negative value (i.e. from user) for SI_MESGQ, but the kernel internal
value must be positive to pass check_kill_value. Additionally, the patch
adds support into copy_siginfo_to_user to copy the "new" signal type to
user space.
Changes in signal code caused by POSIX message queues patch:
General & rationale:
Signals generated by mqueues (only upon notification) must have si_code
== SI_MESGQ. Such a signal is sent from the process that
caused the notification (i.e. sent a message to an empty message queue) to
the process that requested it. The two processes can, of course, be unrelated
in terms of uids/euids. So SI_MESGQ signals must be classified as
SI_FROMKERNEL to pass check_kill_permissions (needless to say, these
signals ARE from the kernel).
Signals generated by message queues notification need the same
fields in siginfo struct's union _sifields as POSIX.1b signals and we
can reuse its union entry.
SI_MESGQ was previously defined as -3 in the kernel and also in glibc,
so in userspace SI_MESGQ must still be visible as -3.
Solution:
SI_MESGQ is defined in the same style as SI_TIMER using __SI_CODE macro.
Details:
Fortunately, copy_siginfo_to_user copies si_code as a short, so we
can use the remaining part of the int value freely. __SI_CODE does the
work. Inside the kernel, SI_MESGQ is:
6<<16 | (-3 & 0xffff), which is > 0
but what is copied to userspace is:
(short) SI_MESGQ == -3
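The encoding can be checked with plain arithmetic. This sketch mirrors __SI_CODE and __SI_MESGQ under hypothetical names (identifiers beginning with `__` are reserved outside the implementation). The conversion of an out-of-range int to short is implementation-defined in C; gcc and clang keep the low 16 bits, which is the behaviour the scheme relies on.

```c
/* Hypothetical mirrors of __SI_CODE and __SI_MESGQ: a class tag in the
 * upper 16 bits, the user-visible code in the lower 16 bits. */
#define SI_CODE_ENC(T, N) ((T) | ((N) & 0xffff))
#define SI_MESGQ_CLASS (6 << 16)
#define SI_MESGQ_KERNEL SI_CODE_ENC(SI_MESGQ_CLASS, -3)

/* Models copying si_code to user space as a short. */
static short si_code_for_user(int kernel_code)
{
	return (short)kernel_code;
}
```

With these definitions SI_MESGQ_KERNEL is 0x6fffd, positive inside the "kernel", and si_code_for_user() turns it back into -3.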
Actual changes:
Changes in include/asm-generic/siginfo.h
__SI_MESGQ added in signal.h to represent inside-kernel prefix of
SI_MESGQ. SI_MESGQ is redefined from -3 to __SI_CODE(__SI_MESGQ, -3)
Except on the mips architecture, these changes should be arch-independent
(asm-generic/siginfo.h is included by the arch versions). On mips,
SI_MESGQ is redefined to -4 in order to be compatible with IRIX, but
the same scheme can be used.
Change in copy_siginfo_to_user: we add only one line, to get the
same copy semantics as for _SI_RT.
This change isn't very portable: some architectures have their own
copy_siginfo_to_user. All of those should get a similar change (but
possibly not a one-line one, as the _SI_RT case was sometimes ignored because it
wasn't used yet; e.g. see ia64 signal.c).
Update:
mq: only fail with invalid timespec if mq_timed{send,receive} needs to block
From: Jakub Jelinek <jakub@redhat.com>
POSIX requires EINVAL to be set if:
"The process or thread would have blocked, and the abs_timeout parameter
specified a nanoseconds field value less than zero or greater than or equal
to 1000 million."
but 2.6.5-mm3 returns -EINVAL even if the process or thread would not block
(if the queue is not empty for timedreceive or not full for timedsend).
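The corrected ordering, sketched for the send side: the timeout is validated only once it is known that the call would block. check_timed_send() is a hypothetical helper; the receive side is symmetric, with "queue empty" in place of "queue full".

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Returns -EINVAL only if the caller would block AND tv_nsec is
 * outside [0, 1000 million); otherwise the operation may proceed. */
static int check_timed_send(bool queue_full, const struct timespec *abs_timeout)
{
	if (!queue_full)
		return 0; /* would not block: timeout is never examined */
	if (abs_timeout->tv_nsec < 0 || abs_timeout->tv_nsec >= 1000000000L)
		return -EINVAL;
	return 0; /* would block, with a valid deadline */
}
```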
|