Add linked list of auxiliary data to audit_context
Add callbacks in IPC_SET functions to record requested changes.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
|
My patch that removed the spin_lock calls from the tail of sys_semtimedop
introduced a bug:
Before my patch was merged, every operation that altered an array called
update_queue(). That call woke up threads that were waiting for a
semaphore value to become 0. I accidentally removed that call.
The attached patch fixes that by modifying update_queue: the function now
loops internally and wakes up all threads. The patch also removes
update_queue calls from the error path of sys_semtimedop: failed operations
do not modify the array, so there is no need to rescan the list of waiting threads.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
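A minimal user-space sketch of the looping update_queue described above. It assumes a simplified model rather than the kernel's structures: each hypothetical `struct waiter` carries a single operation on one semaphore, and "waking" a sleeper only sets a flag. When an operation that changes a value completes, the scan restarts from the head, because an earlier wait-for-zero sleeper may now be able to proceed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSEMS 4

/* Hypothetical, simplified sleeper: one operation on one semaphore. */
struct waiter {
    int sem_num;
    int sem_op;             /* <0 decrement, 0 wait-for-zero, >0 increment */
    bool woken;             /* stands in for wake_up_process() */
    struct waiter *next;
};

struct sem_set {
    int semval[NSEMS];
    struct waiter *pending; /* FIFO list of sleeping waiters */
};

/* Perform w's operation if it can complete now. */
static bool try_op(struct sem_set *s, const struct waiter *w)
{
    int *v = &s->semval[w->sem_num];
    if (w->sem_op == 0)
        return *v == 0;
    if (*v + w->sem_op < 0)
        return false;
    *v += w->sem_op;
    return true;
}

static void remove_from_queue(struct sem_set *s, struct waiter *w)
{
    struct waiter **pp = &s->pending;
    while (*pp != w) {
        assert(*pp != NULL);  /* w must be on the queue */
        pp = &(*pp)->next;
    }
    *pp = w->next;
}

/* Keep scanning until no sleeper can make progress. */
static void update_queue(struct sem_set *s)
{
    struct waiter *q = s->pending;
    while (q) {
        if (try_op(s, q)) {
            struct waiter *next = q->next;
            remove_from_queue(s, q);
            q->woken = true;
            /* A value changed: earlier sleepers may now proceed, so
             * restart from the head. A completed wait-for-zero changed
             * nothing, so just continue with the next entry. */
            q = (q->sem_op != 0) ? s->pending : next;
        } else {
            q = q->next;
        }
    }
}
```

With semval[0] == 1 and a wait-for-zero sleeper queued ahead of a decrementer, one call wakes both: the decrement succeeds and triggers the rescan that finds the value now at 0.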
|
|
This patch uses the rcu_assign_pointer() API to eliminate a number of explicit
memory barriers from the SysV IPC code that uses RCU. It also restructures
the ipc_ids structure so that the array size is stored in the same memory
block as the array itself (see the new struct ipc_id_ary). This prevents the
race that the earlier code was subject to, where a reader could see a mismatch
between the size and the actual array. With the size stored with the array,
the possibility of mismatch is eliminated, without the need for careful
ordering and explicit memory barriers. This has been tested successfully on
i386 and ppc64.
Signed-off-by: <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
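The layout change can be sketched in user-space C11. This is an illustration under assumptions: `struct id_ary`, `grow_ary()` and `lookup()` here are simplified, hypothetical stand-ins, a release store approximates rcu_assign_pointer(), and an acquire load approximates rcu_dereference(). Because the size travels in the same allocation as the entries, one pointer load gives the reader a consistent pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct ipc_id_ary: size and entries share one
 * allocation, so a reader can never pair an old size with a new array. */
struct id_ary {
    int size;
    void *p[];                      /* flexible array member */
};

struct ids {
    _Atomic(struct id_ary *) entries;
};

static struct id_ary *alloc_ary(int size)
{
    struct id_ary *a = calloc(1, sizeof(*a) + (size_t)size * sizeof(void *));
    if (a)
        a->size = size;
    return a;
}

static int ids_init(struct ids *ids, int size)
{
    struct id_ary *a = alloc_ary(size);
    if (!a)
        return -1;
    atomic_init(&ids->entries, a);
    return 0;
}

/* Writer side (serialized by a lock in the real code). */
static int grow_ary(struct ids *ids, int newsize)
{
    struct id_ary *old = atomic_load_explicit(&ids->entries, memory_order_relaxed);
    assert(old != NULL);
    if (newsize <= old->size)
        return old->size;
    struct id_ary *bigger = alloc_ary(newsize);
    if (!bigger)
        return old->size;
    memcpy(bigger->p, old->p, (size_t)old->size * sizeof(void *));
    /* Release store: the copied entries and the size become visible
     * before the new pointer does (the rcu_assign_pointer() role). */
    atomic_store_explicit(&ids->entries, bigger, memory_order_release);
    /* The kernel frees 'old' after an RCU grace period; this sketch has
     * no concurrent readers, so it frees it at once. */
    free(old);
    return newsize;
}

/* Reader side: one acquire load yields a consistent (size, array) pair. */
static void *lookup(struct ids *ids, int id)
{
    struct id_ary *a = atomic_load_explicit(&ids->entries, memory_order_acquire);
    return (id >= 0 && id < a->size) ? a->p[id] : NULL;
}
```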
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. To keep all
prototypes in sync in the future, this patch includes the header in every file
that implements a syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Independent of the other patches:
undo operations should not result in out-of-range semaphore values. The test
for newval > SEMVMX is missing. The attached patch adds the test and a
comment.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
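A sketch of the range check when an undo adjustment is applied at exit. SEMVMX is 32767 on Linux; clamping the result at both 0 and SEMVMX is how this sketch handles out-of-range values, and `apply_undo()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

#define SEMVMX 32767   /* Linux's maximum semaphore value */

/* Apply a semadj value and keep the result within [0, SEMVMX]. */
static int apply_undo(int semval, int semadj)
{
    assert(semval >= 0 && semval <= SEMVMX);
    long newval = (long)semval + semadj;
    if (newval < 0)
        newval = 0;        /* never go negative */
    if (newval > SEMVMX)
        newval = SEMVMX;   /* the upper-bound test this patch adds */
    return (int)newval;
}
```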
|
|
The attached patch removes sem_revalidate and replaces it with
ipc_rcu_getref() calls followed by ipc_lock_by_ptr().
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The lifetime of the ipc objects (sem array, msg queue, shm mapping) is
controlled by kern_ipc_perms->lock, a spinlock. There is no simple way to
reacquire this spinlock after it was dropped to call
schedule()/kmalloc/copy_{to,from}_user/whatever.
The attached patch adds a reference count as a preparation to get rid of
sem_revalidate().
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
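The pattern that the reference count enables (pin the object, drop the spinlock for work that may sleep, relock by pointer, then revalidate) can be sketched as follows. All names are hypothetical: `obj_getref()` and the relock only mirror the roles the commit messages give ipc_rcu_getref() and ipc_lock_by_ptr(), and a C11 `atomic_flag` spinlock replaces the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Minimal user-space spinlock standing in for the kernel's. */
typedef struct { atomic_flag held; } spinlock_t;
static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin */
}
static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical IPC object: the lock protects 'deleted'; the reference
 * count keeps the memory alive while the lock is dropped. */
struct ipc_obj {
    spinlock_t lock;
    atomic_int refcount;
    bool deleted;
};

static struct ipc_obj *obj_new(void)
{
    struct ipc_obj *o = malloc(sizeof(*o));
    if (!o)
        return NULL;
    atomic_flag_clear(&o->lock.held);
    atomic_init(&o->refcount, 1);   /* reference held by the ID table */
    o->deleted = false;
    return o;
}

static void obj_getref(struct ipc_obj *o)
{
    int old = atomic_fetch_add(&o->refcount, 1);
    assert(old > 0);
}

static void obj_putref(struct ipc_obj *o)
{
    if (atomic_fetch_sub(&o->refcount, 1) == 1)
        free(o);
}

/* Removal: mark deleted under the lock, then drop the table's reference. */
static void obj_remove(struct ipc_obj *o)
{
    spin_lock(&o->lock);
    o->deleted = true;
    spin_unlock(&o->lock);
    obj_putref(o);
}

/* Called with o->lock held. Drops it for work that may sleep, then
 * relocks by pointer (no ID lookup). Returns true with the lock held,
 * or false WITHOUT the lock if the object was removed meanwhile. */
static bool alloc_unlocked(struct ipc_obj *o, size_t len, void **out)
{
    obj_getref(o);                 /* pin the memory */
    spin_unlock(&o->lock);
    *out = malloc(len);            /* kmalloc() may sleep in the kernel */
    spin_lock(&o->lock);           /* relock by pointer */
    if (o->deleted) {
        spin_unlock(&o->lock);
        obj_putref(o);             /* may free o: do not touch it after */
        free(*out);
        *out = NULL;
        return false;
    }
    obj_putref(o);                 /* the table still holds a reference */
    return true;
}
```

Note the failure path: it returns without the lock held, so a caller must not unlock again, which is exactly the kind of mistake discussed in the find_undo() fixes below.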
|
|
From: badari <pbadari@us.ibm.com>
I ran into an ipc hang while trying to shut down a database. The problem is
due to a missing sem_unlock() in find_undo().
|
|
From: Manfred Spraul <manfred@colorfullife.com>
sem_revalidate checks that a semaphore array didn't disappear while the
code was running without the semaphore array spinlock. If the array
disappeared, then it will return without holding a lock. find_undo calls
sem_revalidate and then sem_unlock, even if sem_revalidate failed. The
sem_unlock call must be removed.
Mingming Cao reported a spinlock deadlock with sysv semaphores. A
superfluous unlock doesn't explain the deadlock, but it's obviously a bug.
|
|
From: "Randy.Dunlap" <rddunlap@osdl.org>
Add syscalls.h, which contains prototypes for the kernel's system calls.
Replace open-coded declarations all over the place. This patch found a
couple of prior bugs. It appears to be more important with -mregparm=3 as we
discover more asmlinkage mismatches.
Some syscalls have arch-dependent arguments, so their prototypes are in the
arch-specific unistd.h. Maybe it should have been asm/syscalls.h, but there
were already arch-specific syscall prototypes in asm/unistd.h...
Tested on x86, ia64, x86_64, ppc64, s390 and sparc64. May cause
trivial-to-fix build breakage on other architectures.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Attached is the lockless semop patch. I did another test run with
idle=poll on a Pentium III, and the result remained unchanged: 99.9% direct
fast path, 0.1% race with wakeup against writing the final result code:
http://khack.osdl.org/stp/282936/environment/proc/slabinfo
That means there is no immediate need to add the two-stage
implementation to finish_wait.
It reduces the spinlock operations on the semaphore array spinlock by 1/3.
|
|
From: Anton Blanchard <anton@samba.org>
I saw a lockup where 2 CPUs were stuck in sem_lock(). It seems we can
loop back to retry_undos with the lock held. That path takes the lock again,
so we deadlock.
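The lock discipline this fix restores can be sketched as below, assuming a simplified single-threaded model in which taking a lock that is already held triggers an assertion instead of spinning forever. `semop_with_retry()` and its `stale_rounds` parameter are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Single-threaded stand-in for the semaphore array spinlock: taking a
 * lock this thread already holds would spin forever, so assert instead. */
static atomic_flag sem_array_lock = ATOMIC_FLAG_INIT;

static void sem_lock(void)
{
    bool already_held = atomic_flag_test_and_set(&sem_array_lock);
    assert(!already_held && "self-deadlock");
}

static void sem_unlock(void)
{
    atomic_flag_clear(&sem_array_lock);
}

/* Hypothetical semop skeleton: 'stale_rounds' models how many times the
 * undo lookup has to be redone. Returns the number of attempts. */
static int semop_with_retry(int stale_rounds)
{
    int attempts = 0;
retry_undos:
    attempts++;
    /* ... look up or allocate the undo structure without the lock ... */
    sem_lock();                    /* this path takes the lock */
    if (stale_rounds-- > 0) {
        sem_unlock();              /* so drop it before looping back */
        goto retry_undos;
    }
    /* ... perform the operation under the lock ... */
    sem_unlock();
    return attempts;
}
```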
|
|
One more overlooked area where the proper process ID has to be used:
SysV IPC "pid" values should use the thread group ID, not the per-thread
one.
|
|
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
This patch proposes a performance fix for the current IPC semaphore
implementation.
There are two shortcomings in the current implementation:
try_atomic_semop() was called twice to wake up a blocked process,
once from update_queue() (executed by the process that wakes up
the sleeping process) and once in the retry part of the blocked process
(executed by the blocked process after it gets woken up).
A second issue is that when several sleeping processes are eligible
for wakeup, they wake up in daisy-chain formation, each one in turn
waking up the next process in line. However, every time a process
wakes up, it starts scanning the wait queue from the beginning, not from
where it was last scanned. This causes a large amount of unnecessary
scanning when the wait queue is deep.
Blocked processes come and go, but chances are there are still quite a
few blocked processes sitting at the beginning of that queue.
What we are proposing here is to merge the portion of the code in the
bottom part of sys_semtimedop() (code that gets executed when a sleeping
process gets woken up) into the update_queue() function. The benefit is
twofold: (1) it reduces redundant calls to try_atomic_semop() and (2) it
increases the efficiency of finding eligible processes to wake up and
allows higher concurrency for multiple wake-ups.
We have measured that this patch improves throughput for a large
application significantly on an industry-standard benchmark.
This patch is relative to 2.5.72. Any feedback is very much
appreciated.
Some kernel profile data attached:
Kernel profile before optimization:
-----------------------------------------------
                0.05    0.14    40805/529060     sys_semop [133]
                0.55    1.73    488255/529060    ia64_ret_from_syscall [2]
[52]     2.5    0.59    1.88    529060           sys_semtimedop [52]
                0.05    0.83    477766/817966    schedule_timeout [62]
                0.34    0.46    529064/989340    update_queue [61]
                0.14    0.00    1006740/6473086  try_atomic_semop [75]
                0.06    0.00    529060/989336    ipcperms [149]
-----------------------------------------------
                0.30    0.40    460276/989340    semctl_main [68]
                0.34    0.46    529064/989340    sys_semtimedop [52]
[61]     1.5    0.64    0.87    989340           update_queue [61]
                0.75    0.00    5466346/6473086  try_atomic_semop [75]
                0.01    0.11    477676/576698    wake_up_process [146]
-----------------------------------------------
                0.14    0.00    1006740/6473086  sys_semtimedop [52]
                0.75    0.00    5466346/6473086  update_queue [61]
[75]     0.9    0.89    0.00    6473086          try_atomic_semop [75]
-----------------------------------------------
Kernel profile with optimization:
-----------------------------------------------
                0.03    0.05    26139/503178     sys_semop [155]
                0.46    0.92    477039/503178    ia64_ret_from_syscall [2]
[61]     1.2    0.48    0.97    503178           sys_semtimedop [61]
                0.04    0.79    470724/784394    schedule_timeout [62]
                0.05    0.00    503178/3301773   try_atomic_semop [109]
                0.05    0.00    503178/930934    ipcperms [149]
                0.00    0.03    32454/460210     update_queue [99]
-----------------------------------------------
                0.00    0.03    32454/460210     sys_semtimedop [61]
                0.06    0.36    427756/460210    semctl_main [75]
[99]     0.4    0.06    0.39    460210           update_queue [99]
                0.30    0.00    2798595/3301773  try_atomic_semop [109]
                0.00    0.09    470630/614097    wake_up_process [146]
-----------------------------------------------
                0.05    0.00    503178/3301773   sys_semtimedop [61]
                0.30    0.00    2798595/3301773  update_queue [99]
[109]    0.3    0.35    0.00    3301773          try_atomic_semop [109]
-----------------------------------------------
Both the number of calls to try_atomic_semop() and the number of calls to
update_queue() are reduced by roughly 50% as a result of the merge. The
execution time of sys_semtimedop is reduced because less time is spent in
these lower-level functions.
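A sketch of the merged design, assuming a simplified model with one semaphore and single-operation waiters (the struct layouts are hypothetical, not the 2.5.72 code): the waker runs try_atomic_semop() on the sleeper's behalf, stores the result in `q->status`, and keeps scanning from its current position, so each eligible waiter costs one try_atomic_semop() call.

```c
#include <assert.h>
#include <stddef.h>

#define STILL_WAITING 1   /* hypothetical "not woken yet" marker */

struct sem_queue {
    int sem_op;               /* one operation on a single semaphore */
    int status;               /* written by the waker, read by the sleeper */
    struct sem_queue *next;
};

struct sem_array {
    int semval;
    struct sem_queue *pending;
    unsigned long tries;      /* counts try_atomic_semop() calls */
};

/* Returns 0 if the operation was applied, 1 if it must keep waiting. */
static int try_atomic_semop(struct sem_array *sma, int sem_op)
{
    sma->tries++;
    if (sma->semval + sem_op < 0)
        return 1;
    sma->semval += sem_op;
    return 0;
}

/* Single pass: the waker completes each eligible operation itself and
 * hands the result over in q->status, so the woken sleeper neither calls
 * try_atomic_semop() again nor rescans the queue from the head. */
static void update_queue(struct sem_array *sma)
{
    struct sem_queue **pp = &sma->pending;
    while (*pp) {
        struct sem_queue *q = *pp;
        assert(q->status == STILL_WAITING);
        int error = try_atomic_semop(sma, q->sem_op);
        if (error <= 0) {         /* success (0) or a hard error (< 0) */
            q->status = error;    /* result for the sleeper */
            *pp = q->next;        /* remove_from_queue() */
            /* wake_up_process(q->sleeper) would go here */
        } else {
            pp = &q->next;
        }
    }
}
```

With semval == 2 and three queued decrements, one pass wakes the first two, leaves the third queued, and makes exactly three try_atomic_semop() calls.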
|
|
From: Manfred Spraul <manfred@colorfullife.com>
The CLONE_SYSVSEM implementation is racy: in the exit handling it checks
(atomic_read(->refcnt) == 1) instead of calling atomic_dec_and_test(). The
patch fixes that.
Additionally, the patch contains the following changes:
- lock_undo() locks the list of undo structures. The lock is held
throughout the semop() syscall, but that's unnecessary - we can drop it
immediately after the lookup.
- undo structures are only allocated when necessary. The need for undo
structures is only noticed in the middle of the semop operation, while
holding the semaphore array spinlock. The result is a convoluted
unlock-and-revalidate implementation. I've reordered the code, and now the
undo allocation can happen before acquiring the semaphore array spinlock.
As a bonus, less code runs under the semaphore array spinlock.
- sysvsem.sleep_list looks like code to handle oopses: if an oops kills a
thread that sleeps in sys_semtimedop(), then sem_exit tries to recover.
I've removed that: it is too fragile.
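The difference between testing the count and decrementing it atomically can be sketched with C11 atomics. `struct undo_list` and its helpers are hypothetical stand-ins; `atomic_fetch_sub()` returning 1 plays the role of atomic_dec_and_test() returning true.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the undo list shared via CLONE_SYSVSEM. */
struct undo_list {
    atomic_int refcnt;
};

static struct undo_list *undo_list_new(void)
{
    struct undo_list *ul = malloc(sizeof(*ul));
    if (ul)
        atomic_init(&ul->refcnt, 1);
    return ul;
}

/* A task created with CLONE_SYSVSEM shares the parent's list. */
static void undo_list_get(struct undo_list *ul)
{
    int old = atomic_fetch_add(&ul->refcnt, 1);
    assert(old > 0);
}

/* Exit path. A racy pattern of the kind described would be
 *     if (atomic_load(&ul->refcnt) == 1) free(ul); else decrement;
 * where two exiting tasks can both read 2, both decrement, and nobody
 * frees. Decrementing and testing in one atomic step guarantees that
 * exactly one task sees the last reference. Returns true if this call
 * freed the list. */
static bool undo_list_put(struct undo_list *ul)
{
    if (atomic_fetch_sub(&ul->refcnt, 1) == 1) {
        free(ul);
        return true;
    }
    return false;
}
```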
|
|
From: Manfred Spraul <manfred@colorfullife.com>
SysV sem operations that involve multiple semaphores can fail in the
middle, and then sempid (pid of the last successful operation) must be
restored. This was done with "sempid >>= 16", which is broken now that pid
values are 32 bits wide. The attached patch fixes that by reordering the
updates of the semaphore fields.
Additionally, the patch fixes the corruption of the sempid value that occurs
if a wait-for-zero operation fails.
The patch is more than two years old, and was in -dj and -ak kernels.
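A sketch of the reordered update, under simplifying assumptions (hypothetical `struct sem` and `struct sembuf_lite`, no SEMVMX or undo handling). semval is changed and rolled back as before, but sempid is written only after every operation has succeeded, so nothing has to be restored and 32-bit pids such as 70000 survive intact. A failing wait-for-zero operation leaves sempid untouched.

```c
#include <assert.h>
#include <sys/types.h>

struct sem {
    int semval;
    pid_t sempid;             /* pid of the last successful operation */
};

struct sembuf_lite {          /* hypothetical subset of struct sembuf */
    unsigned short sem_num;
    short sem_op;
};

/* Returns 0 if all operations were applied, 1 if the caller must block. */
static int try_atomic_semop(struct sem *base, const struct sembuf_lite *sops,
                            int nsops, pid_t pid)
{
    int i;
    assert(nsops > 0);
    for (i = 0; i < nsops; i++) {
        struct sem *curr = &base[sops[i].sem_num];
        int result = curr->semval + sops[i].sem_op;
        if (sops[i].sem_op == 0 && curr->semval != 0)
            goto would_block;     /* wait-for-zero fails: sempid untouched */
        if (result < 0)
            goto would_block;
        curr->semval = result;    /* sempid is NOT written yet */
    }
    /* Every operation succeeded: only now record the caller's pid. */
    for (i = 0; i < nsops; i++)
        base[sops[i].sem_num].sempid = pid;
    return 0;

would_block:
    /* Roll back operations 0..i-1; no sempid value needs restoring. */
    while (--i >= 0)
        base[sops[i].sem_num].semval -= sops[i].sem_op;
    return 1;
}
```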
|
|
From: Mingming Cao <cmm@us.ibm.com>
Basically, freeary() is called with the spinlock for that semaphore set
held. But after the semaphore set is removed from the ID array by
calling sem_rmid(), there is no lock to protect the waiting queue for
that semaphore set. So, if a waiter is woken up by a signal (not by the
wakeup from freeary()), it will check the q->status and q->prev fields.
At that moment, freeary() may not yet have had a chance to update those
fields.
static void freeary (int id)
{
	.......
	sma = sem_rmid(id);
	......
	/* Wake up all pending processes and let them fail with EIDRM.*/
	for (q = sma->sem_pending; q; q = q->next) {
		q->status = -EIDRM;
		q->prev = NULL;
		wake_up_process(q->sleeper); /* doesn't sleep */
	}
	sem_unlock(sma);
	......
}
So I propose moving sem_rmid() after the loop that wakes up all the waiters.
That guarantees that when the waiters are woken up, the updates to
q->status and q->prev have already been done. The same applies to the
message queue case. The patch is attached below. Comments are very welcome.
I have tested this patch on the 2.5.68 kernel with LTP tests, and it seems
fine to me. Paul, could you run the DOTS test on this again? Thanks!
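The proposed ordering can be sketched in a single-threaded model, which shows only the order of the steps, not the race itself. `sem_ids[]` and the `woken` flag are hypothetical stand-ins for the ID array and wake_up_process().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sem_queue {
    struct sem_queue *next;
    struct sem_queue **prev;   /* NULL once dequeued */
    int status;
    bool woken;                /* stands in for wake_up_process() */
};

struct sem_array {
    struct sem_queue *sem_pending;
};

#define SEM_IDS 4
static struct sem_array *sem_ids[SEM_IDS];   /* hypothetical ID array */

static struct sem_array *sem_rmid(int id)
{
    struct sem_array *sma = sem_ids[id];
    sem_ids[id] = NULL;
    return sma;
}

/* Proposed order: while the array is still registered (and, in the
 * kernel, its spinlock is held), finish every waiter's status/prev
 * update first, and only then remove the ID. */
static struct sem_array *freeary(int id)
{
    assert(id >= 0 && id < SEM_IDS && sem_ids[id] != NULL);
    struct sem_array *sma = sem_ids[id];
    for (struct sem_queue *q = sma->sem_pending; q; q = q->next) {
        q->status = -EIDRM;
        q->prev = NULL;
        q->woken = true;
    }
    return sem_rmid(id);       /* sem_unlock() and freeing would follow */
}
```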
|
|
This patch adds the remaining System V IPC hooks, including the inline
documentation for them in security.h. This includes a restored
sem_semop hook, as it does seem to be necessary to support fine-grained
access.
All of these System V IPC hooks are used by SELinux. The SELinux System
V IPC access controls were originally described in the technical report
available from http://www.nsa.gov/selinux/slinux-abs.html, and the
LSM-based implementation is described in the technical report available
from http://www.nsa.gov/selinux/module-abs.html.
|
|
Patch from Mark Fasheh <mark.fasheh@oracle.com> (plus a few cleanups
and a speedup from yours truly)
Adds the semtimedop() function - semop with a timeout. Solaris has
this. It's apparently worth a couple of percent to Oracle throughput
and given the simplicity, that is sufficient benefit for inclusion IMO.
This patch hooks up semtimedop() only for ia64 and ia32.
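For illustration, a user-space call to semtimedop() as glibc exposes it today (it needs `_GNU_SOURCE`). The helper `down_with_timeout()` is hypothetical: it creates a private one-semaphore set, tries to decrement it with a 10 ms timeout, and removes the set again. On Linux the caller must define `union semun` itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

/* Linux leaves the definition of union semun to the caller. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Returns 0 if the decrement succeeded, otherwise an errno value
 * (EAGAIN means the timeout expired). */
static int down_with_timeout(int initval)
{
    assert(initval >= 0);
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return errno;

    int rc = 0;
    union semun arg = { .val = initval };
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        rc = errno;
    } else {
        struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
        struct timespec timeout = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
        if (semtimedop(semid, &op, 1, &timeout) < 0)
            rc = errno;           /* EAGAIN: still 0 after 10 ms */
    }
    semctl(semid, 0, IPC_RMID);   /* remove the private set */
    return rc;
}
```

Unlike semop(), a call that would block forever on a zero-valued semaphore here returns EAGAIN once the timeout expires.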
|
|
and net/* files.
|
|
stat64 has been changed to return jiffies-granularity timestamps, as nsec
values in previously unused fields. This allows make to make better
decisions on when to recompile a file. This loosely follows the Solaris API.
CURRENT_TIME has been redefined to return struct timespec. The users
who don't use it in an inode/attr context have been changed to use a new
get_seconds() function. CURRENT_TIME is implemented by an out-of-line
function.
There is a small performance penalty in this patch. The previous
filemap code had an optimization to flush atime only once a second.
This is currently gone, which will increase flushes a bit. I believe
the correct solution, if it should become a problem, is to have per-superblock
fields that give an arbitrary atime flush granularity, so that you
can set it to be flushed only once an hour if you prefer that. I will
work on that later in separate patches if the need should arise.
struct inode and the attr struct have been changed to store struct
timespec instead of time_t for [cma]time. Not all file systems support
this granularity, but some, like XFS, NFSv3, CIFS and JFS, do. The others
will currently truncate the nsec part on flushing to disk. There was some
discussion on this rounding on l-k previously. I went for simple
truncation because there is not much evidence IMHO that the more
complicated roundings have any advantages. In practice, applications are
rather unlikely to notice the rounding anyway: they can only see a
difference when an inode is flushed from memory and reloaded in less than
a second, which is rather unlikely.
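The truncation described above can be sketched as a small helper, assuming a per-filesystem granularity expressed in nanoseconds (1 keeps full precision, 1000000000 keeps whole seconds). `timespec_trunc()` and its `gran` parameter are hypothetical names for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Truncate (never round) a timestamp to a granularity given in ns. */
static struct timespec timespec_trunc(struct timespec t, long gran)
{
    assert(gran >= 1 && gran <= NSEC_PER_SEC);
    if (gran == NSEC_PER_SEC)
        t.tv_nsec = 0;                   /* seconds-only filesystem */
    else if (gran > 1)
        t.tv_nsec -= t.tv_nsec % gran;   /* drop the sub-granularity part */
    return t;
}
```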
|
|
Patch from Mingming, Rusty, Hugh, Dipankar, me:
- It greatly reduces the lock contention by having one lock per id.
The global spinlock is removed and a spinlock is added in
kern_ipc_perm structure.
- Uses Read-Copy-Update (RCU) in grow_ary() for lock-free resizing.
- In the places where ipc_rmid() is called, delay calling ipc_free()
to RCU callbacks. This is to prevent ipc_lock() returning an invalid
pointer after ipc_rmid(). In addition, use a workqueue so that
vmalloc()ed entries can also be freed via RCU.
Also some other changes:
- Remove redundant ipc_lockall/ipc_unlockall
- Now ipc_unlock() directly takes the IPC ID pointer as argument, avoiding
an extra lookup in the array.
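The per-object locking can be sketched as below. This is a simplified, hypothetical model: the lookup here is a plain array read rather than an RCU-protected one, the real code must also recheck for deletion after locking, and a C11 `atomic_flag` spinlock stands in for the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_flag held; } spinlock_t;
static void spin_lock_init(spinlock_t *l) { atomic_flag_clear(&l->held); }
static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin */
}
static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Simplified: each object carries its own lock instead of one global one. */
struct kern_ipc_perm {
    spinlock_t lock;
    int key;
};

#define IPC_MAX 8
struct ipc_ids {
    struct kern_ipc_perm *entries[IPC_MAX];  /* RCU-protected in the kernel */
};

/* Returns the object locked, or NULL. Only this object's lock is taken,
 * so operations on different IDs do not contend with each other. */
static struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
{
    assert(ids != NULL);
    if (id < 0 || id >= IPC_MAX)
        return NULL;
    struct kern_ipc_perm *p = ids->entries[id];
    if (!p)
        return NULL;
    spin_lock(&p->lock);
    return p;
}

/* Takes the object pointer directly: no second lookup in the array. */
static void ipc_unlock(struct kern_ipc_perm *p)
{
    spin_unlock(&p->lock);
}
```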
The changes are made based on input from Hugh Dickins, Manfred
Spraul and Dipankar Sarma. In addition, Cliff White has run OSDL's
dbt1 test on a 2-way system against the earlier version of this patch.
The results show about 2-6% improvement in the average number of
transactions per second. Here is the summary of his tests:
                         2.5.42-mm2    2.5.42-mm2-ipclock
------------------------------------------------------
Average over 5 runs      85.0 BT       89.8 BT
Std Deviation 5 runs     7.4 BT        1.0 BT
Average over 4 best      88.15 BT      90.2 BT
Std Deviation 4 best     2.8 BT        0.5 BT
Also, another test today from Bill Hartner:
I tested Mingming's RCU ipc lock patch using a *new* microbenchmark: semopbench.
semopbench was written to test the performance of Mingming's patch.
I also ran a 3 hour stress and it completed successfully.
Explanation of the microbenchmark is below the results.
Here is a link to the microbenchmark source.
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/semopbench.c
SUT : 8-way 700 MHz PIII
I tested 2.5.44-mm2 and 2.5.44-mm2 + RCU ipc patch
>semopbench -g 64 -s 16 -n 16384 -r > sem.results.out
>readprofile -m /boot/System.map | sort -n +0 -r > sem.profile.out
The metric is seconds per repetition. Lower is better.
kernel                 run 1      run 2
                       seconds    seconds
==================     =======    =======
2.5.44-mm2             515.1      515.4
2.5.44-mm2+rcu-ipc     46.7       46.7
With Mingming's patch, the test completes 10X faster.
|
|
The patch below adds the base set of LSM hooks for System V IPC to the
2.5.41 kernel. These hooks permit a security module to label
semaphore sets, message queues, and shared memory segments and to
perform security checks on these objects that parallel the existing
IPC access checks. Additional LSM hooks for labeling and controlling
individual messages sent on a single message queue and for providing
fine-grained distinctions among IPC operations will be submitted
separately after this base set of LSM IPC hooks has been accepted.
|
|
into kroah.com:/home/greg/linux/BK/lsm-2.5
|
|
This patch just makes some stuff in ipc/ static.
|
|
Also move where we set sma->sem_perm.mode and .key to before ipc_addid() gets called.
|
|
Christopher Yeoh <cyeoh@samba.org>: (Made -p1 compliant by rusty) SUSv2 semctl compliance:
The semctl call with SETVAL currently does not set sempid (at the
moment sempid is only set during a successful semop call). An
explanation from Geoff Clare of the Open Group regarding why sempid
should be set during the semctl call:
"The spec isn't very clear, but there is a statement on the semget()
page which I think justifies the assumption made by the test. It says
that upon creation, the data structure associated with each semaphore
in the set is not initialised, and that the semctl() function with
SETVAL or SETALL can be used to initialise each semaphore.
Therefore semctl() with SETVAL has to set sempid to *something*, and
since sempid contains the "process ID of the last operation", setting
it to anything other than the pid of the calling process would mean
that sempid contained misleading information. It could be argued that
setting it to zero would not be misleading, but zero cannot be the
process ID of a process, and so is not a valid value for sempid anyway."
The following patch changes semctl so that, when it is called with SETVAL,
sempid is set to the pid of the calling process:
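A minimal user-space sketch of the semantics described above, not the patch itself: `struct sem` and `semctl_setval()` are hypothetical, and getpid() stands in for the caller's pid (in user space it returns the thread group ID).

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define SEMVMX 32767

struct sem {
    int semval;
    pid_t sempid;
};

/* SETVAL sketch: initialise the value and record the caller as the
 * process that last operated on the semaphore. */
static int semctl_setval(struct sem *s, int val)
{
    assert(s != NULL);
    if (val < 0 || val > SEMVMX)
        return -ERANGE;     /* out-of-range values are rejected */
    s->semval = val;
    s->sempid = getpid();   /* the change described above */
    return 0;
}
```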
|
|
The patch below fixes sem_exit() so that the BKL is always released.
|
|
As we discussed some time ago, here is a patch for the SEM_UNDO change
that can be applied to linux-2.5.9.
|
|
Push BKL down to the (few) routines that actually need it,
remove it from the do_exit() path.
|
|
- me/Al Viro: fix bdget() oops with block device modules that don't
clean up after they exit
- Alan Cox: continued merging (drivers, license tags)
- David Miller: sparc update, network fixes
- Christoph Hellwig: work around broken drivers that add a gendisk more
than once
- Jakub Jelinek: handle more ELF loading special cases
- Trond Myklebust: NFS client and lockd reclaimer cleanups/fixes
- Greg KH: USB updates
- Mikael Pettersson: separate out local APIC / IO-APIC config options
|
|
- sync up more with Alan
- Urban Widmark: smbfs and HIGHMEM fix
- Chris Mason: reiserfs tail unpacking fix ("null bytes in reiserfs files")
- Adam Richter: new cpia usb ID
- Hugh Dickins: misc small sysv ipc fixes
- Andries Brouwer: remove overly restrictive sector size check for
SCSI cd-roms
|
|
- Jens: better ordering of requests when unable to merge
- Neil Brown: make md work as a module again (we cannot autodetect
in modules, not enough background information)
- Neil Brown: raid5 SMP locking cleanups
- Neil Brown: nfsd: handle Irix NFS clients named pipe behavior and
dentry leak fix
- maestro3 shutdown fix
- fix dcache hash calculation that could cause bad hashes under certain
circumstances (Dean Gaudet)
- David Miller: networking and sparc updates
- Jeff Garzik: include file cleanups
- Andy Grover: ACPI update
- Coda-fs error return fixes
- rth: alpha Jensen update
|