| Age | Commit message (Collapse) | Author |
|
As per http://www.nist.gov/dads/HTML/shellsort.html, this should be
referred to as a Shell sort. Shell-Metzner is a misnomer.
Signed-off-by: Daniel Dickman <didickman@yahoo.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
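A minimal sketch of the kind of range check that valid_signal() centralizes, so that callers stop hand-writing comparisons against _NSIG. The names demo_valid_signal and DEMO_NSIG and the value 64 are assumptions for illustration; the real _NSIG is architecture-specific and the kernel helper may differ in detail.

```c
#include <stdbool.h>

/* Assumed value for illustration only; the real _NSIG depends on the arch. */
#define DEMO_NSIG 64

/*
 * Signal numbers run from 1 to DEMO_NSIG inclusive, and 0 is accepted as the
 * "null signal" used for permission checks. Taking an unsigned argument means
 * a negative input wraps to a huge value and is rejected without a separate
 * "sig < 0" test, which is the comparison that tends to trigger -W warnings.
 */
static inline bool demo_valid_signal(unsigned long sig)
{
    return sig <= DEMO_NSIG;
}
```

Keeping the comparison in one helper means the inclusive upper bound is decided once, instead of each caller choosing between `>` and `>=`.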
|
|
Add a pair of rlimits for allowing non-root tasks to raise nice and rt
priorities. Defaults to traditional behavior. Originally written by
Chris Wright.
The patch implements a simple rlimit ceiling for the RT (and nice) priorities
a task can set. The rlimit defaults to 0, meaning no change in behavior by
default. A value of 50 means RT priority levels 1-50 are allowed. A value of
100 means all 99 priority levels from 1 to 99 are allowed. CAP_SYS_NICE is
blanket permission.
(akpm: see http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1921.html for
tips on integrating this with PAM).
Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
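The ceiling rule described above can be sketched as a small predicate. This is an illustration of the stated semantics only, not the kernel code; demo_rtprio_allowed and its parameters are hypothetical names.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: may a task request RT priority `prio`?
 *   rlim_rtprio       the task's rlimit ceiling (0 = traditional behavior)
 *   has_cap_sys_nice  true if the task holds CAP_SYS_NICE
 */
static bool demo_rtprio_allowed(unsigned long rlim_rtprio, int prio,
                                bool has_cap_sys_nice)
{
    /* RT priorities outside 1..99 are never valid. */
    if (prio < 1 || prio > 99)
        return false;

    /* CAP_SYS_NICE is blanket permission. */
    if (has_cap_sys_nice)
        return true;

    /* Otherwise the rlimit is a simple ceiling: 50 allows levels 1-50. */
    return (unsigned long)prio <= rlim_rtprio;
}
```

With the default limit of 0 the predicate rejects every RT priority for an unprivileged task, which matches the "no change in behavior by default" statement.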
|
|
Replace a number of memory barriers with smp_ variants. This means we won't
take the unnecessary hit on UP machines.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is a megarollup of ~60 patches which give various things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX requires that the RLIMIT_CPU resource limit that generates SIGXCPU be
counted on a per-process basis. Currently, Linux implements this for
individual threads. This patch fixes the semantics to conform with POSIX.
The essential machinery for the process CPU limit is tied into the new
posix-timers code for process CPU clocks and timers.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
What _is_ inconsistent is kernel/sys.c's setpriority()/set_one_prio().
It checks current->euid|uid against p->uid, which makes little sense, but
is how we've been doing it ever since. It's a Linux quirk documented in
the manpage. To make things funnier, SuS requires current->euid|uid match
against p->euid.
The patch below fixes it (and brings the logic in line with what
setscheduler()/setaffinity() does), but if we do it then it should be done
only in 2.6.12 or later, after good exposure in -mm.
(Worst case, this could break an application, but I highly doubt it: at
most it could deny renicing another task to positive (or in very rare cases,
to negative) nice values, and no application should normally crash on
something like that.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based and the effect of this patch for architectures which use the default
implementation should be negligible.
There is a second type cputime64_t which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ; this will
overflow after 49.7 days. This is not enough for kernel_stat (imho not
enough for a process either), so it is necessary to have a 64 bit type.
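The 49.7-day figure can be reproduced with a short calculation, assuming HZ=1000 (the message only says "based on HZ"; the tick rate is an assumption that happens to match the quoted number). The function name is hypothetical.

```c
/*
 * Days until an unsigned 32-bit tick counter wraps, given `hz` ticks per
 * second. 2^32 ticks / hz = seconds; 86400 seconds per day.
 */
static double demo_days_until_wrap32(unsigned int hz)
{
    const double ticks = 4294967296.0; /* 2^32 */
    return ticks / (double)hz / 86400.0;
}
```

At an assumed HZ=1000 this gives roughly 49.7 days; at HZ=100 the same counter would last about ten times longer.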
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Kernel core files converted to use the new lock initializers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
On x86-64, the attached patch is required to fix
> kernel/sys.c: In function `sys_setsid':
> kernel/sys.c:1078: error: `tty_sem' undeclared (first use in this function)
> kernel/sys.c:1078: error: (Each undeclared identifier is reported only once
> kernel/sys.c:1078: error: for each function it appears in.)
kernel/sys.c needs the tty_sem declaration from linux/tty.h.
|
|
Use the existing "tty_sem" to protect against the process tty changes
too.
|
|
A while back we added the PR_SET_NAME prctl, but no PR_GET_NAME. I guess
we should add this, if only to enable testing of PR_SET_NAME.
Signed-off-by: Prasanna Meda <pmeda@akamai.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
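A small user-space sketch of the round trip this enables: set the name with PR_SET_NAME and read it back with PR_GET_NAME. It is Linux-specific, and demo_set_and_get_name is a hypothetical helper name.

```c
#include <string.h>
#include <sys/prctl.h>

/*
 * Set the calling thread's comm name, then read it back into `out`.
 * `out` must have room for 16 bytes (15 characters plus the terminator).
 * Returns 0 on success, -1 if either prctl() call fails.
 */
static int demo_set_and_get_name(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;

    memset(out, 0, 16);
    if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
        return -1;

    return 0;
}
```

Reading the value back is exactly the kind of test the message has in mind: without PR_GET_NAME, a program would have to parse /proc to check that PR_SET_NAME took effect.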
|
|
This change makes the semantics equivalent to 2.4 and to what the man
page says. It also optimises by avoiding an unneeded lookup in the uid cache
when who is the same as current->uid.
sys_set/getpriority was rewritten in 2.5/2.6, perhaps while transitioning to
the pid maps. It now has a semantic bug when uid is zero. Note that akpm
also fixed a refcount leak and locking in the new functions in changeset
http://linus.bkbits.net:8080/linux-2.5/cset@1.1608.10.84
Signed-off-by: <pmeda@akamai.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Sticking the not-implemented syscall stuff in sys.c is a pain because the
cond_syscall()s explode when certain prototypes are in scope. And we need
those prototypes' header files for the C code in sys.c.
Fix all that up by moving all the sys_ni_syscall code into its own .c file.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix name, and make sure that it's listed as a conditional
system call so that we stub it out to ENOSYS if the kernel
isn't compiled with key management support.
|
|
|
|
The feature set the patch includes:
- Key attributes:
- Key type
- Description (by which a key of a particular type can be selected)
- Payload
- UID, GID and permissions mask
- Expiry time
- Keyrings (just a type of key that holds links to other keys)
- User-defined keys
- Key revocation
- Access controls
- Per user key-count and key-memory consumption quota
- Three std keyrings per task: per-thread, per-process, session
- Two std keyrings per user: per-user and default-user-session
- prctl() functions for key and keyring creation and management
- Kernel interfaces for filesystem, blockdev, net stack access
- JIT key creation by usermode helper
There are also two utility programs available:
(*) http://people.redhat.com/~dhowells/keys/keyctl.c
A comprehensive key management tool, permitting all the interfaces
available to userspace to be exercised.
(*) http://people.redhat.com/~dhowells/keys/request-key
An example shell script (to be installed in /sbin) for instantiating a
key.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found that the prototypes for sys_waitid and sys_fcntl in
<linux/syscalls.h> don't match the implementation. In order to keep all
prototypes in sync in the future, now include the header from each file
implementing any syscall.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes on 32/64-bit
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
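The sizeof pitfall described above can be shown in isolation. The struct layout, the array length, and all demo_ names are assumptions for illustration; only the two sizeof expressions mirror the ones quoted in the message.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real types and limit count differ. */
struct demo_rlimit {
    unsigned long rlim_cur;
    unsigned long rlim_max;
};

#define DEMO_RLIM_NLIMITS 8

struct demo_task {
    struct demo_rlimit rlim[DEMO_RLIM_NLIMITS];
};

/* sizeof(*(array)) is the size of the first element only. */
static size_t demo_size_buggy(const struct demo_task *t)
{
    return sizeof(*(t->rlim));
}

/* sizeof(array) is the size of the whole array, which is what was intended. */
static size_t demo_size_fixed(const struct demo_task *t)
{
    return sizeof(t->rlim);
}
```

A memcpy using the first size copies one element, so only index 0 (RLIMIT_CPU in the message) is reset; the second size covers every limit.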
The other subtlety is removing:
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be more generally in a dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note: this does not reach 100% POSIX compliance with regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
This patch adds a prctl to modify current->comm as shown in /proc. This
feature was requested by KDE developers. In KDE most programs are started by
forking from a kdeinit program that already has the libraries loaded and some
other state.
Problem is to give these forked programs the proper name. It already writes
the command line in the environment (as seen in ps), but top uses a different
field in /proc/pid/status that reports current->comm. And that was always
"kdeinit" instead of the real command name. So you ended up with lots of
kdeinits in your top listing, which was not very useful.
This patch adds a new prctl PR_SET_NAME to allow a program to change its comm
field.
I considered the potential security issues of a program obscuring itself with
this interface, but I don't think it matters much because a program can
already obscure itself when the admin uses ps instead of top. In case of a
KDE desktop calling everything kdeinit is much more obfuscation than the
alternative.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Plumb the NUMA API syscalls into ppc64. Also add some missing cond_syscalls
so we still link with NUMA API disabled.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
After my cleanup of the rusage semantics was so quickly taken in by Andrew
and Linus without comment, I wonder if I should not have tried to be so
accommodating of potential objections as I was. :-)
In my original posting, I solicited comment on whether introducing
RUSAGE_GROUP as distinct from RUSAGE_SELF was warranted. Note that we've
now changed the behavior of the times system call when using CLONE_THREAD,
so changing getrusage RUSAGE_SELF to match would be consistent. I think
that changing the meaning of the old RUSAGE_SELF value is preferable to
introducing the new value for the proper POSIX getrusage behavior. This
patch against Linus's current tree dumps RUSAGE_GROUP and makes RUSAGE_SELF
have the fixed behavior.
If there is interest in having a new explicit interface to sample a single
thread's stats alone, then I think that would be better done by introducing
a new value for RUSAGE_THREAD. This is trivial to implement, but I won't
offer patches bloating the interface if no one is actually interested in
using it.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes the strange and obscure pid implementation in current kernels:
- it removes the call to put_task_struct() from detach_pid()
under tasklist_lock. This allows blocking calls to be used
in security_task_free() hooks (in __put_task_struct()).
- it saves some space = 5*5 ints = 100 bytes in task_struct
- it's smaller and tidier, more straightforward, and doesn't rely on
any knowledge about how pids are used and assigned.
- it removes pid_links, and pid_struct no longer holds reference counters
on task_struct. Instead, new pid_structs are linked together and
only one of them is inserted in hash_list.
Signed-off-by: Kirill Korotaev (kksx@mail.ru)
Signed-off-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch changes the rusage bookkeeping and the semantics of the
getrusage and times calls in a couple of ways.
The first change is in the c* fields counting dead child processes. POSIX
requires that children that have died be counted in these fields when they
are reaped by a wait* call, and that if they are never reaped (e.g.
because of ignoring SIGCHLD, or exiting yourself first) then they are
never counted. These were counted in release_task for all threads. I've
changed it so they are counted in wait_task_zombie, i.e. exactly when
being reaped.
POSIX also specifies for RUSAGE_CHILDREN that the report include the reaped
child processes of the calling process, i.e. whole thread group in Linux,
not just ones forked by the calling thread. POSIX specifies tms_c[us]time
fields in the times call the same way. I've moved the c* fields that
contain this information into signal_struct, where the single set of
counters accumulates data from any thread in the group that calls wait*.
Finally, POSIX specifies getrusage and times as returning cumulative totals
for the whole process (aka thread group), not just the calling thread.
I've added fields in signal_struct to accumulate the stats of detached
threads as they die. The process stats are the sums of these records plus
the stats of each remaining live/zombie thread. The times and getrusage
calls, and the internal uses for filling in wait4 results and siginfo_t,
now iterate over the threads in the thread group and sum up their stats
along with the stats recorded for threads already dead and gone.
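The summation described in this paragraph can be sketched as follows: the process total is the record accumulated from dead threads plus the stats of each thread still present. The struct and function names are hypothetical simplifications, not the kernel's data structures, and locking is ignored.

```c
#include <stddef.h>

/* Hypothetical, simplified per-thread counters. */
struct demo_times {
    unsigned long utime;
    unsigned long stime;
};

/*
 * Process-wide totals = stats recorded for threads already dead and gone
 * (dead_total) + the stats of every remaining live/zombie thread.
 */
static struct demo_times demo_group_times(struct demo_times dead_total,
                                          const struct demo_times *live,
                                          size_t nlive)
{
    struct demo_times sum = dead_total;

    for (size_t i = 0; i < nlive; i++) {
        sum.utime += live[i].utime;
        sum.stime += live[i].stime;
    }
    return sum;
}
```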
I added a new value RUSAGE_GROUP (-3) for the getrusage system call rather
than changing the behavior of the old RUSAGE_SELF (0). POSIX specifies
RUSAGE_SELF to mean all threads, so the glibc getrusage call will just
translate it to RUSAGE_GROUP for new kernels. I did this thinking that
someone somewhere might want the old behavior with an old glibc and a new
kernel (it is only different if they are using CLONE_THREAD anyway).
However, I've changed the times system call to conform to POSIX as well and
did not provide any backward compatibility there. In that case there is
nothing easy like a parameter value to use, it would have to be a new
system call number. That seems pretty pointless. Given that, I wonder if
it is worth bothering to preserve the compatible RUSAGE_SELF behavior by
introducing RUSAGE_GROUP instead of just changing RUSAGE_SELF's meaning.
Comments?
I've done some basic testing on x86 and x86-64, and all the numbers come
out right after these fixes. (I have a test program that shows a few
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a new system call `waitid'. This is a new POSIX call that
subsumes the rest of the wait* family and can do some things the older
calls cannot. A minor addition is the ability to select what kinds of
status to check for with a mask of independent bits, so you can wait for
just stops and not terminations, for example. A more significant
improvement is the WNOWAIT flag, which allows for polling child status
without reaping. This interface fills in a siginfo_t with the same details
that a SIGCHLD for the status change has; some of that info (e.g. si_uid)
is not available via wait4 or other calls.
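A user-space sketch of the WNOWAIT behavior described above: the first waitid() call peeks at the child's status and leaves it waitable, and the second call reaps it. The helper name is hypothetical, and the example assumes a POSIX system where waitid() is available.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, poll its status without reaping
 * (WNOWAIT), then reap it with a second waitid() call.
 * Returns the exit status seen by the peek, or -1 on any error.
 */
static int demo_peek_then_reap(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    siginfo_t info;

    /* WEXITED selects terminations; WNOWAIT leaves the child in place. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;
    if (info.si_code != CLD_EXITED || info.si_pid != pid)
        return -1;
    int peeked = info.si_status;

    /* Without WNOWAIT the same status is returned and the zombie is reaped. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
        return -1;

    return info.si_status == peeked ? peeked : -1;
}
```

The siginfo_t fields read here (si_code, si_pid, si_status) are the same details a SIGCHLD handler would receive for that status change.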
I've added a new system call that has the parameter conventions of the
POSIX function because that seems like the cleanest thing. This patch
includes the actual system call table additions for i386 and x86-64; other
architectures will need to assign the system call number, and 64-bit ones
may need to implement 32-bit compat support for it as I did for x86-64.
The new features could instead be provided by some new kludge inventions in
the wait4 system call interface (that's what BSD did). If kludges are
preferable to adding a system call, I can work up something different.
I added a struct rusage field si_rusage to siginfo_t in the SIGCHLD case
(this does not affect the size or layout of the struct). This is not part
of the POSIX interface, but it makes it so that `waitid' subsumes all the
functionality of `wait4'. Future kernel ABIs (new arch's or whatnot) can
have only the `waitid' system call and the rest of the wait* family
including wait3 and wait4 can be implemented in user space using waitid.
There is nothing in user space as yet that would make use of the new field.
Most of the new functionality is implemented purely in the waitid system
call itself. POSIX also provides for the WCONTINUED flag to report when a
child process had been stopped by job control and then resumed with
SIGCONT. Corresponding to this, a SIGCHLD is now generated when a child
resumes (unless SA_NOCLDSTOP is set), with the value CLD_CONTINUED in
siginfo_t.si_code. To implement this, some additional bookkeeping is
required in the signal code handling job control stops.
The motivation for this work is to make it possible to implement the POSIX
semantics of the `waitid' function in glibc completely and correctly. If
changing either the system call interface used to accomplish that, or any
details of the kernel implementation work, would improve the chances of
getting this incorporated, I am more than happy to work through any issues.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This fixes compilation of x86-64 without CONFIG_NUMA again (got broken
by the previous patchkit)
|
|
|
|
From: Olaf Kirch <okir@suse.de>
I have been chasing a corruption of current->group_info on PPC during NFS
stress tests. The problem seems to be that nfsd is messing with its
group_info quite a bit, while some monitoring processes look at
/proc/<pid>/status and do a get_group_info/put_group_info without any locking.
This problem can be reproduced on ppc platforms within a few seconds if you
generate some NFS load and do a "cat /proc/XXX/status" of an nfsd thread in a
tight loop.
I therefore think changes to current->group_info, and querying it from a
different process, needs to be protected using the task_lock.
(akpm: task->group_info here is safe against exit() because the task holds a
ref on group_info which is released in __put_task_struct, and the /proc file
has a ref on the task_struct).
|
|
From: Andi Kleen <ak@suse.de>
The following patches add support for configurable NUMA memory policy
for user processes. It is based on the proposal from last kernel summit
with feedback from various people.
This NUMA API does not attempt to implement page migration or anything
else complicated: all it does is to police the allocation when a page
is first allocated or when a page is reallocated after swapping. Currently
only support for shared memory and anonymous memory is there; policy for
file based mappings is not implemented yet (although they get implicitly
policied by the default process policy).
It adds three new system calls: mbind to change the policy of a VMA,
set_mempolicy to change the policy of a process, get_mempolicy to retrieve
memory policy. User tools (numactl, libnuma, test programs, manpages) can be
found in ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz
For details on the system calls see the manpages
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html
Most user programs should actually not use the system calls directly,
but use the higher level functions in libnuma
(http://www.firstfloor.org/~andi/numa.html) or the command line tools
(http://www.firstfloor.org/~andi/numactl.html).
The system calls allow user programs and administrators to set various NUMA memory
policies for putting memory on specific nodes. Here is a short description
of the policies copied from the kernel patch:
* NUMA policy allows the user to give hints in which node(s) memory should
* be allocated.
*
* Support four policies per VMA and per process:
*
* The VMA policy has priority over the process policy for a page fault.
*
* interleave Allocate memory interleaved over a set of nodes,
* with normal fallback if it fails.
* For VMA based allocations this interleaves based on the
* offset into the backing object or offset into the mapping
* for anonymous memory. For process policy a process counter
* is used.
* bind Only allocate memory on a specific set of nodes,
* no fallback.
* preferred Try a specific node first before normal fallback.
* As a special case node -1 here means do the allocation
* on the local CPU. This is normally identical to default,
* but useful to set in a VMA when you have a non default
* process policy.
* default Allocate on the local node first, or when on a VMA
* use the process policy. This is what Linux always did
* in a NUMA aware kernel and still does by, ahem, default.
*
* The process policy is applied for most non interrupt memory allocations
* in that process' context. Interrupts ignore the policies and always
* try to allocate on the local CPU. The VMA policy is only applied for memory
* allocations for a VMA in the VM.
*
* Currently there are a few corner cases in swapping where the policy
* is not applied, but the majority should be handled. When process policy
* is used it is not remembered over swap outs/swap ins.
*
* Only the highest zone in the zone hierarchy gets policied. Allocations
* requesting a lower zone just use default policy. This implies that
* on systems with highmem kernel lowmem allocation don't get policied.
* Same with GFP_DMA allocations.
*
* For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
* all users and remembered even when nobody has memory mapped.
This patch:
This is the core NUMA API code. This includes NUMA policy aware
wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels
these are defined away.
The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
implemented here.
Adds a vm_policy field to the VMA and to the process. The process
also has a field for interleaving. VMA interleaving uses the offset
into the VMA, but that's not possible for process allocations.
From: Andi Kleen <ak@muc.de>
> Andi, how come policy_vma() calls ->set_policy under i_shared_sem?
I think this can be actually dropped now. In an earlier version I did
walk the vma shared list to change the policies of other mappings to the
same shared memory region. This turned out too complicated with all the
corner cases, so I eventually gave in and added ->get_policy to the fast
path. Also there is still the mmap_sem which prevents races in the same MM.
Patch to remove it attached. Also adds documentation and removes the
bogus __alloc_page_vma() prototype noticed by hch.
From: Andi Kleen <ak@suse.de>
A few incremental fixes for NUMA API.
- Fix a few comments
- Add a compat_ function for get_mem_policy. I considered changing the
ABI to avoid this, but that would have made the API too ugly. I put it
directly into the file because a mm/compat.c didn't seem worth it just for
this.
- Fix the algorithm for VMA interleave.
From: Matthew Dobson <colpatch@us.ibm.com>
1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA.
The only references to the function are in NUMA code in mempolicy.c
2) Remove the definitions of __alloc_page_vma(). They aren't used.
3) Move forward declaration of struct vm_area_struct to top of file.
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Below is a patch that tries to sanitize the dropping of unneeded system-call
stubs in generic code. In some instances, it would be possible to move the
optional system-call stubs into a library routine which would avoid the need
for #ifdefs, but in many cases, doing so would require making several
functions global (and possibly exporting additional data-structures in
header-files). Furthermore, it would inhibit (automatic) inlining in the
cases where the stubs are needed. For these reasons, the patch
keeps the #ifdef-approach.
This has been tested on ia64 and there were no objections from the
arch-maintainers (and one positive response). The patch should be safe but
arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo
macros should be removed for their architecture (I'm quite sure that's the
case, but I wanted to play it safe and only preserved the status-quo in that
regard).
|
|
Split the system_state state `SYSTEM_SHUTDOWN' into SYSTEM_HALT,
SYSTEM_POWER_OFF and SYSTEM_RESTART and export system_state to modules.
This allows driver shutdown routines to know why they are being shut down. The
IDE subsystem wants this so that it knows to not spin the disks down across a
reboot.
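A sketch of how a shutdown routine might use the split states. The enum below is a hypothetical stand-in for the kernel's system_state values (the booting and running entries are assumed), and the spin-down policy is an illustration of the IDE use case mentioned above, not the actual driver code.

```c
#include <stdbool.h>

/* Hypothetical mirror of the states named in the message. */
enum demo_system_state {
    DEMO_SYSTEM_BOOTING,   /* assumed */
    DEMO_SYSTEM_RUNNING,   /* assumed */
    DEMO_SYSTEM_HALT,
    DEMO_SYSTEM_POWER_OFF,
    DEMO_SYSTEM_RESTART
};

/*
 * Illustrative policy: spin the disks down when the machine is halting or
 * powering off, but leave them spinning across a reboot.
 */
static bool demo_should_spin_down(enum demo_system_state state)
{
    return state == DEMO_SYSTEM_HALT || state == DEMO_SYSTEM_POWER_OFF;
}
```

With a single SYSTEM_SHUTDOWN value the routine could not tell a restart from a power-off, which is the distinction the split provides.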
|
|
We need to always allocate at least one indirect block
pointer, since we always fill out blocks[0] even if
we don't have any groups.
|
|
From: Olaf Kirch <okir@suse.de>
Authentication code in net/sunrpc makes frequent use of groups_alloc(0),
which seems to clobber memory past the end of what it allocated.
If called with gidsetsize == 0, groups_alloc will set nblocks = 0,
but still does a
group_info->blocks[0] = group_info->small_block;
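The arithmetic behind this report can be sketched in isolation. The block size and function names are assumptions for illustration; the point is that rounding up 0 entries yields 0 blocks, so writing blocks[0] lands past the allocation unless at least one slot is always reserved.

```c
/* Assumed block size for illustration only. */
#define DEMO_NGROUPS_PER_BLOCK 1024

/* Round-up division: returns 0 when gidsetsize is 0. */
static int demo_nblocks_buggy(int gidsetsize)
{
    return (gidsetsize + DEMO_NGROUPS_PER_BLOCK - 1) / DEMO_NGROUPS_PER_BLOCK;
}

/* Always reserve at least one block pointer, since blocks[0] is written
 * unconditionally. */
static int demo_nblocks_fixed(int gidsetsize)
{
    int nblocks = demo_nblocks_buggy(gidsetsize);
    return nblocks ? nblocks : 1;
}
```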
|
|
find_user() is being called from set/get_priority(), but it doesn't take the
needed lock, and those callers were forgetting to drop the refcount which
find_user() took.
|
|
From: Matt Mackall <mpm@selenic.com>
The nswap and cnswap counters have never been incremented as
Linux doesn't do task swapping.
|
|
From: Roland McGrath <roland@redhat.com>
This patch moves all the fields relating to job control from task_struct to
signal_struct, so that all this info is properly per-process rather than
being per-thread.
|
|
From: Arnd Bergmann <arnd@arndb.de>
I have tested the code with the open posix test suite and found the same
four failures for both 64-bit and compat mode, most tests pass. The patch
is against -mc1, but I guess it also applies to the other trees around.
What worries me more than mq_attr compatibility is the conversion of struct
sigevent, which might turn out really hard when more fields in there are
used. AFAICS, the only other part in the kernel ABI is sys_timer_create(),
so maybe it's not too late to deprecate the current structure and create a
structure that can be used properly for compat syscalls.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Add -ENOSYS stubs for the posix message queue syscalls. The API is a direct
mapping of the api from the unix spec, with two exceptions:
- mq_close() doesn't exist. Message queue file descriptors can be closed
with close().
- mq_notify(SIGEV_THREAD) cannot be implemented in the kernel. The kernel
returns a pollable file descriptor. User space must poll (or read) this
descriptor and call the notifier function if the file descriptor is
signaled.
|
|
From: Olof Johansson <olof@austin.ibm.com>
It's currently a boolean, but that means that system_running goes to zero
again when shutting down. So we then use code (in the page allocator) which
is only designed to be used during bootup - it is marked __init.
So we need to be able to distinguish early boot state from late shutdown
state. Rename system_running to system_state and give it the three
appropriate states.
|
|
This renames sys_shmat to do_shmat. Additionally, I've replaced the
cond_syscall with a conditional inline function.
It touches all archs - only i386 is tested.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Attached is a patch that replaces the #ifndef CONFIG_SYSV syscall stubs
with cond_syscall stubs.
|
|
From: Tim Hockin <thockin@sun.com>,
Neil Brown <neilb@cse.unsw.edu.au>,
me
New groups infrastructure. task->groups and task->ngroups are replaced by
task->group_info. group_info is a refcounted, dynamic struct with an array
of pages. This allows for large numbers of groups. The current limit of
32 groups has been raised to 64k groups. It can be raised more by changing
the NGROUPS_MAX constant in limits.h
|
|
From: Pavel Machek <pavel@ucw.cz>
software_suspend() can fail for quite a lot of reasons (for example not
enough swapspace). However, the current interface returned void, so you could
not propagate error back to userland. This fixes it. Plus
__read_suspend_image() is only done during init time, so we might as well
mark it __init.
|
|
From: Matt Mackall <mpm@selenic.com>
Experimenting with trying to use cond_syscall for a few arch-specific
syscalls, I discovered that it can't actually be used outside the file
in which sys_ni_syscall is declared because the assembler doesn't feel
obliged to output the symbol in that case:
weak.c:
#define cond_syscall(x) asm(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall");
cond_syscall(sys_foo);
$ nm weak.o
U sys_ni_syscall
One arch (PPC) is apparently trying to use cond_syscall this way
anyway, though it's probably never been actually tested as the above
test was done on a PPC.
After trying a bunch of tricks to get it to work nicely, I decided
there are basically two alternatives: make weak versions of
sys_ni_syscall wherever they're wanted or put the arch-specific
cond_syscalls in kernel/sys.c where sys_ni_syscall is defined.
The former approach is a bit crufty and doesn't actually do the right
thing in practice as you'll get multiple copies of sys_ni_syscall in
your final image.
The latter introduces some slight arch-pollution in sys.c, but as
arch-specific cond_syscalls aren't all that frequent, it should be
pretty minor. So here's a patch to move the current offender to sys.c:
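A user-space analogue of the chosen approach: the weak fallback is declared in the same file that defines the "not implemented" stub, so the alias always has a definition to bind to. This sketch uses the GCC/Clang weak-alias attribute on an ELF target instead of the inline-asm macro quoted above, and all demo_ names are hypothetical.

```c
#include <errno.h>

/* Stand-in for the kernel's "not implemented" stub. */
long demo_ni_syscall(void)
{
    return -ENOSYS;
}

/*
 * Weak alias: if no other object file provides a strong demo_sys_foo, calls
 * resolve to demo_ni_syscall. The alias target must be defined in this same
 * translation unit, which is why the fallbacks live next to the stub.
 */
long demo_sys_foo(void) __attribute__((weak, alias("demo_ni_syscall")));
```

When a strong definition of demo_sys_foo is linked in from elsewhere, the linker prefers it over the weak alias, so optional features override the stub without any #ifdef here.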
|
|
From: William Lee Irwin III <wli@holomorphy.com>
Our accounting of minor faults versus major faults is currently quite wrong.
To fix it up we need to propagate the actual fault type back to the
higher-level code. Repurpose the currently-unused third arg to ->nopage
for this.
|
|
There seems to be no header file which declares system_running.
|
|
cause NULL pointer references in /proc.
Moreover, it's questionable whether the whole thing makes sense at all.
Per-thread state is good.
Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097
Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200
Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
|
|
|