path: root/kernel/sys.c
Age  Commit message  Author
2005-05-05  [PATCH] correctly name the Shell sort  Domen Puncer
As per http://www.nist.gov/dads/HTML/shellsort.html, this should be referred to as a Shell sort. Shell-Metzner is a misnomer. Signed-off-by: Daniel Dickman <didickman@yahoo.com> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01  [PATCH] convert that currently tests _NSIG directly to use valid_signal()  Jesper Juhl
Convert most of the current code that uses _NSIG directly to instead use valid_signal(). This avoids gcc -W warnings and off-by-one errors. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
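The off-by-one risk mentioned above comes from each call site choosing between `>` and `>=` by hand. A minimal sketch of a centralised check follows; `DEMO_NSIG` and `demo_valid_signal()` are hypothetical stand-ins, and the assumption that the constant is the highest valid signal number is ours, not something this log states.

```c
/* Hypothetical stand-in for _NSIG; assumed here to be the highest
 * valid signal number, so accepted values run from 0 to DEMO_NSIG. */
#define DEMO_NSIG 64

/* One range check shared by all callers. Taking an unsigned argument
 * means a negative int passed in becomes a huge value and is rejected,
 * without a separate "sig < 0" test that could trigger warnings. */
static inline int demo_valid_signal(unsigned long sig)
{
    return sig <= DEMO_NSIG ? 1 : 0;
}
```

A caller would then write `if (!demo_valid_signal(sig)) return -EINVAL;` instead of repeating the comparison.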
2005-05-01  [PATCH] nice and rt-prio rlimits  Matt Mackall
Add a pair of rlimits for allowing non-root tasks to raise nice and rt priorities. Defaults to traditional behavior. Originally written by Chris Wright. The patch implements a simple rlimit ceiling for the RT (and nice) priorities a task can set. The rlimit defaults to 0, meaning no change in behavior by default. A value of 50 means RT priority levels 1-50 are allowed. A value of 100 means all 99 priority levels from 1 to 99 are allowed. CAP_SYS_NICE is blanket permission. (akpm: see http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1921.html for tips on integrating this with PAM). Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
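The ceiling rule described above can be sketched as a small predicate. This is an illustration of the stated semantics only, not the kernel's code; `demo_rtprio_allowed()` and its parameters are hypothetical names.

```c
/* Returns 1 if a task may set the requested RT priority.
 * rlim_rtprio:      the task's rlimit ceiling (0 = traditional behavior)
 * requested:        RT priority being asked for
 * has_cap_sys_nice: nonzero if the task holds CAP_SYS_NICE */
static int demo_rtprio_allowed(unsigned long rlim_rtprio, int requested,
                               int has_cap_sys_nice)
{
    if (has_cap_sys_nice)
        return 1;               /* blanket permission */
    if (requested < 0)
        return 0;
    return (unsigned long)requested <= rlim_rtprio;
}
```

With a ceiling of 50, priorities 1 through 50 pass and 51 is refused; with the default of 0, an unprivileged task cannot raise its RT priority at all.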
2005-05-01  [PATCH] use smp_mb/wmb/rmb where possible  akpm@osdl.org
Replace a number of memory barriers with smp_ variants. This means we won't take the unnecessary hit on UP machines. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-11  [PATCH] Make lots of things static  Adrian Bunk
This is a megarollup of ~60 patches which give various things static scope. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-07  [PATCH] make RLIMIT_CPU/SIGXCPU per-process  Roland McGrath
POSIX requires that the RLIMIT_CPU resource limit that generates SIGXCPU be counted on a per-process basis. Currently, Linux implements this for individual threads. This patch fixes the semantics to conform with POSIX. The essential machinery for the process CPU limit is tied into the new posix-timers code for process CPU clocks and timers. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-07  [PATCH] sys_setpriority() euid semantics fix  Ingo Molnar
What _is_ inconsistent is kernel/sys.c's setpriority()/set_one_prio(). It checks current->euid|uid against p->uid, which makes little sense, but is how we've been doing it ever since. It's a Linux quirk documented in the manpage. To make things funnier, SuS requires current->euid|uid match against p->euid. The patch below fixes it (and brings the logic in line with what setscheduler()/setaffinity() does), but if we do it then it should be done only in 2.6.12 or later, after good exposure in -mm. (Worst-case this could break an application, but I highly doubt it: at most it could deny renicing another task to positive (or in very rare cases, to negative) nice values, and no application should normally crash on something like that.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11  [PATCH] cputime: introduce cputime  Martin Schwidefsky
This patch introduces the concept of (virtual) cputime. Each architecture can define its method to measure cputime. The main idea is to define a cputime_t type and a set of operations on it (see asm-generic/cputime.h). Then use the type for utime, stime, cutime, cstime, it_virt_value, it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations for each access to these variables. The default implementation is jiffies based and the effect of this patch for architectures which use the default implementation should be negligible. There is a second type cputime64_t which is necessary for the kernel_stat cpu statistics. The default cputime_t is 32 bit and based on HZ, this will overflow after 49.7 days. This is not enough for kernel_stat (imho not enough for a process too), so it is necessary to have a 64 bit type. The third thing that gets introduced by this patch is an additional field for the /proc/stat interface: cpu steal time. An architecture can account cpu steal time by calls to the account_stealtime function. The cpu which backs a virtual processor doesn't spend all of its time for the virtual cpu. To get meaningful cpu usage numbers this involuntary wait time needs to be accounted and exported to user space. From: Hugh Dickins <hugh@veritas.com> The p->signal check in account_system_time is insufficient. If the timer interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set, another cpu may release_task (NULLifying p->signal) in between account_system_time's check and check_rlimit's dereference. Nor should account_it_prof risk send_sig. But surely account_user_time is safe? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
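The 49.7-day figure quoted above is consistent with a 32-bit tick counter running at 1000 ticks per second. That tick rate is an assumption on our part (the message only says "based on HZ"); the arithmetic can be checked with a small helper whose name is hypothetical.

```c
/* Days until a 32-bit tick counter wraps around, for a tick rate in Hz. */
static double demo_days_until_wrap(unsigned int hz)
{
    const double ticks = 4294967296.0;      /* 2^32 possible values */
    const double seconds_per_day = 86400.0;

    return ticks / hz / seconds_per_day;
}
```

Calling `demo_days_until_wrap(1000)` gives roughly 49.7, while a rate of 100 Hz would stretch the wrap to roughly 497 days, which is why the limit depends on HZ.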
2005-01-07  [PATCH] Lock initializer cleanup (Core)  Thomas Gleixner
Kernel core files converted to use the new lock initializers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-06  [PATCH] x86-64: kernel/sys.c build fix  Jeff Garzik
On x86-64, the attached patch is required to fix
> kernel/sys.c: In function `sys_setsid':
> kernel/sys.c:1078: error: `tty_sem' undeclared (first use in this function)
> kernel/sys.c:1078: error: (Each undeclared identifier is reported only once
> kernel/sys.c:1078: error: for each function it appears in.)
kernel/sys.c needs the tty_sem declaration from linux/tty.h.
2005-01-06  [PATCH] First cut at setsid/tty locking  Alan Cox
Use the existing "tty_sem" to protect against the process tty changes too.
2005-01-04  [PATCH] Add PR_GET_NAME  Prasanna Meda
A while back we added the PR_SET_NAME prctl, but no PR_GET_NAME. I guess we should add this, if only to enable testing of PR_SET_NAME. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
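A round trip through both prctl operations might look like the sketch below. `demo_rename_self()` is a hypothetical helper; the 16-byte buffer reflects the size that prctl(2) documents for the name, and the code assumes a Linux system with glibc's `<sys/prctl.h>`.

```c
#include <string.h>
#include <sys/prctl.h>

/* Sets the calling thread's name with PR_SET_NAME, then reads it back
 * with PR_GET_NAME into 'out' (which must hold at least 16 bytes).
 * Returns 0 on success, -1 if either prctl() call fails. */
static int demo_rename_self(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;

    memset(out, 0, 16);
    if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
        return -1;

    return 0;
}
```

Names longer than 15 characters are truncated by the kernel, so a test of PR_SET_NAME should compare against the truncated string.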
2004-12-02  [PATCH] sys_set/getpriority PRIO_USER semantics fix and optimisation  Prasanna Meda
This change makes the semantics equivalent to 2.4 and to what the man page says. It also optimises by avoiding an unneeded lookup in the uid cache when who is the same as current->uid. sys_set/getpriority was rewritten in 2.5/2.6, perhaps while transitioning to the pid maps. It now has a semantic bug when uid is zero. Note that akpm also fixed a refcount leak and locking in the new functions in changeset http://linus.bkbits.net:8080/linux-2.5/cset@1.1608.10.84 Signed-off-by: <pmeda@akamai.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-01  [PATCH] standalone sys_ni.c for not-implemented syscalls  Peter Chubb
Sticking the not-implemented syscall stuff in sys.c is a pain because the cond_syscall()s explode when certain prototypes are in scope. And we need those prototypes' header files for the C code in sys.c. Fix all that up by moving all the sys_ni_syscall code into its own .c file. Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
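The log does not show how cond_syscall() is defined. One common way to get the same effect is a weak symbol that falls back to a shared "not implemented" function, sketched below with GCC/Clang attributes on an ELF target. All names are hypothetical, and this is an illustration of the technique rather than the kernel's macro.

```c
#include <errno.h>

/* Shared fallback for every optional call that is not built in. */
long demo_ni_syscall(void)
{
    return -ENOSYS;
}

/* Optional entry point: a weak alias of the fallback. If another
 * object file provides a strong demo_sys_optional(), the linker picks
 * that one; otherwise callers land in demo_ni_syscall(). */
long demo_sys_optional(void)
    __attribute__((weak, alias("demo_ni_syscall")));
```

Because the weak declaration must agree with any real prototype that is in scope, keeping such stubs in a file that does not include the real headers avoids conflicting declarations, which matches the motivation given in the message.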
2004-10-27  Fix up "compat_sys_keyctl()" system call.  Linus Torvalds
Fix name, and make sure that it's listed as a conditional system call so that we stub it out to ENOSYS if the kernel isn't compiled with key management support.
2004-10-19  Merge  David S. Miller
2004-10-18  [PATCH] implement in-kernel keys & keyring management  David Howells
The feature set the patch includes:
 - Key attributes:
   - Key type
   - Description (by which a key of a particular type can be selected)
   - Payload
   - UID, GID and permissions mask
   - Expiry time
 - Keyrings (just a type of key that holds links to other keys)
 - User-defined keys
 - Key revocation
 - Access controls
 - Per user key-count and key-memory consumption quota
 - Three std keyrings per task: per-thread, per-process, session
 - Two std keyrings per user: per-user and default-user-session
 - prctl() functions for key and keyring creation and management
 - Kernel interfaces for filesystem, blockdev, net stack access
 - JIT key creation by usermode helper
There are also two utility programs available:
 (*) http://people.redhat.com/~dhowells/keys/keyctl.c
     A comprehensive key management tool, permitting all the interfaces available to userspace to be exercised.
 (*) http://people.redhat.com/~dhowells/keys/request-key
     An example shell script (to be installed in /sbin) for instantiating a key.
Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18  [PATCH] add missing linux/syscalls.h includes  Arnd Bergmann
I found that the prototypes for sys_waitid and sys_fcntl in <linux/syscalls.h> don't match the implementation. In order to keep all prototypes in sync in the future, now include the header from each file implementing any syscall. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18  [PATCH] make rlimit settings per-process instead of per-thread  Roland McGrath
POSIX specifies that the limit settings provided by getrlimit/setrlimit are shared by the whole process, not specific to individual threads. This patch changes the behavior of those calls to comply with POSIX. I've moved the struct rlimit array from task_struct to signal_struct, as it has the correct sharing properties. (This reduces kernel memory usage per thread in multithreaded processes by around 100/200 bytes for 32/64 machines respectively.) I took a fairly minimal approach to the locking issues with the newly shared struct rlimit array. It turns out that all the code that is checking limits really just needs to look at one word at a time (one rlim_cur field, usually). It's only the few places like getrlimit itself (and fork), that require atomicity in accessing a whole struct rlimit, so I just used a spin lock for them and no locking for most of the checks. If it turns out that readers of struct rlimit need more atomicity where they are now cheap, or less overhead where they are now atomic (e.g. fork), then seqcount is certainly the right thing to use for them instead of readers using the spin lock. Though it's in signal_struct, I didn't use siglock since the access to rlimits never needs to disable irqs and doesn't overlap with other siglock uses. Instead of adding something new, I overloaded task_lock(task->group_leader) for this; it is used for other things that are not likely to happen simultaneously with limit tweaking. To me that seems preferable to adding a word, but it would be trivial (and arguably cleaner) to add a separate lock for these users (or e.g. just use seqlock, which adds two words but is optimal for readers). Most of the changes here are just the trivial s/->rlim/->signal->rlim/. I stumbled across what must be a long-standing bug, in reparent_to_init. 
It does: memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim))); when surely it was intended to be: memcpy(current->rlim, init_task.rlim, sizeof(current->rlim)); As rlim is an array, the * in the sizeof expression gets the size of the first element, so this just changes the first limit (RLIMIT_CPU). This is for kernel threads, where it's clear that resetting all the rlimits is what you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is superfluous since it will now already have been reset to RLIM_INFINITY. The other subtlety is removing: tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY; in exit_notify, which was to avoid a race signalling during self-reaping exit. As the limit is now shared, a dying thread should not change it for others. Instead, I avoid that race by checking current->state before the RLIMIT_CPU check. (Adding one new conditional in that path is now required one way or another, since if not for this check there would also be a new race with self-reaping exit later on clearing current->signal that would have to be checked for.) The one loose end left by this patch is with process accounting. do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the accounting record. I left this as it was, but it is now changing a limit that might be shared by other threads still running. I left this in a dubious state because it seems to me that process accounting may already be more generally in a dubious state when it comes to NPTL threads. I would think you would want one record per process, with aggregate data about all threads that ever lived in it, not a separate record for each thread. I don't use process accounting myself, but if anyone is interested in testing it out I could provide a patch to change it this way. One final note, this is not 100% to POSIX compliance in regards to rlimits. POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not to each individual thread. 
I will provide patches later on to achieve that change, assuming this patch goes in first. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
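The sizeof mistake described in this message is easy to reproduce outside the kernel. The sketch below uses hypothetical `demo_` types in place of task_struct and struct rlimit; only the two memcpy() size expressions mirror the ones quoted above.

```c
#include <string.h>

#define DEMO_NLIMITS 4

struct demo_rlimit {
    unsigned long cur;
    unsigned long max;
};

struct demo_task {
    struct demo_rlimit rlim[DEMO_NLIMITS];
};

/* Buggy form: sizeof(*(dst->rlim)) is the size of ONE array element,
 * so only rlim[0] is copied. */
static void demo_copy_first_only(struct demo_task *dst,
                                 const struct demo_task *src)
{
    memcpy(dst->rlim, src->rlim, sizeof(*(dst->rlim)));
}

/* Intended form: sizeof(dst->rlim) is the size of the whole array. */
static void demo_copy_all(struct demo_task *dst,
                          const struct demo_task *src)
{
    memcpy(dst->rlim, src->rlim, sizeof(dst->rlim));
}
```

The difference only shows because `rlim` is a true array member; had it been a pointer, `sizeof(dst->rlim)` would give the pointer size instead, so the corrected expression relies on the array declaration staying as it is.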
2004-10-07  [NET]: Allow CONFIG_NET=n on ppc64.  Olaf Hering
Signed-off-by: Olaf Hering <olh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2004-09-27  Add __user annotation to PR_SET_NAME  Linus Torvalds
2004-09-13  [PATCH] Add prctl to modify current->comm  Andi Kleen
This patch adds a prctl to modify current->comm as shown in /proc. This feature was requested by KDE developers. In KDE most programs are started by forking from a kdeinit program that already has the libraries loaded and some other state. Problem is to give these forked programs the proper name. It already writes the command line in the environment (as seen in ps), but top uses a different field in /proc/pid/status that reports current->comm. And that was always "kdeinit" instead of the real command name. So you ended up with lots of kdeinits in your top listing, which was not very useful. This patch adds a new prctl PR_SET_NAME to allow a program to change its comm field. I considered the potential security issues of a program obscuring itself with this interface, but I don't think it matters much because a program can already obscure itself when the admin uses ps instead of top. In case of a KDE desktop calling everything kdeinit is much more obfuscation than the alternative. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-09-13  [PATCH] ppc64: Enable NUMA API  Anton Blanchard
Plumb the NUMA API syscalls into ppc64. Also add some missing cond_syscalls so we still link with NUMA API disabled. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-09-07  [PATCH] Remove RUSAGE_GROUP  Roland McGrath
After my cleanup of the rusage semantics was so quickly taken in by Andrew and Linus without comment, I wonder if I should not have tried to be so accommodating of potential objections as I was. :-) In my original posting, I solicited comment on whether introducing RUSAGE_GROUP as distinct from RUSAGE_SELF was warranted. Note that we've now changed the behavior of the times system call when using CLONE_THREAD, so changing getrusage RUSAGE_SELF to match would be consistent. I think that changing the meaning of the old RUSAGE_SELF value is preferable to introducing the new value for the proper POSIX getrusage behavior. This patch against Linus's current tree dumps RUSAGE_GROUP and makes RUSAGE_SELF have the fixed behavior. If there is interest in having a new explicit interface to sample a single thread's stats alone, then I think that would be better done by introducing a new value for RUSAGE_THREAD. This is trivial to implement, but I won't offer patches bloating the interface if no one is actually interested in using it. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-09-02  [PATCH] fixed pidhashing patch  Kirill Korotaev
This patch fixes the strange and obscure pid implementation in current kernels:
 - it removes the call to put_task_struct() from detach_pid() under tasklist_lock. This allows blocking calls to be used in security_task_free() hooks (in __put_task_struct()).
 - it saves some space = 5*5 ints = 100 bytes in task_struct
 - it's smaller and tidier, more straightforward, and doesn't use any knowledge about how pids are used and assigned.
 - it removes pid_links, and pid_struct doesn't hold reference counters on task_struct. Instead, new pid_structs are linked together and only one of them is inserted in hash_list.
Signed-off-by: Kirill Korotaev (kksx@mail.ru) Signed-off-by: William Irwin <wli@holomorphy.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-30  [PATCH] fix rusage semantics  Roland McGrath
This patch changes the rusage bookkeeping and the semantics of the getrusage and times calls in a couple of ways. The first change is in the c* fields counting dead child processes. POSIX requires that children that have died be counted in these fields when they are reaped by a wait* call, and that if they are never reaped (e.g. because of ignoring SIGCHLD, or exiting yourself first) then they are never counted. These were counted in release_task for all threads. I've changed it so they are counted in wait_task_zombie, i.e. exactly when being reaped. POSIX also specifies for RUSAGE_CHILDREN that the report include the reaped child processes of the calling process, i.e. whole thread group in Linux, not just ones forked by the calling thread. POSIX specifies tms_c[us]time fields in the times call the same way. I've moved the c* fields that contain this information into signal_struct, where the single set of counters accumulates data from any thread in the group that calls wait*. Finally, POSIX specifies getrusage and times as returning cumulative totals for the whole process (aka thread group), not just the calling thread. I've added fields in signal_struct to accumulate the stats of detached threads as they die. The process stats are the sums of these records plus the stats of each remaining live/zombie thread. The times and getrusage calls, and the internal uses for filling in wait4 results and siginfo_t, now iterate over the threads in the thread group and sum up their stats along with the stats recorded for threads already dead and gone. I added a new value RUSAGE_GROUP (-3) for the getrusage system call rather than changing the behavior of the old RUSAGE_SELF (0). POSIX specifies RUSAGE_SELF to mean all threads, so the glibc getrusage call will just translate it to RUSAGE_GROUP for new kernels. I did this thinking that someone somewhere might want the old behavior with an old glibc and a new kernel (it is only different if they are using CLONE_THREAD anyway). 
However, I've changed the times system call to conform to POSIX as well and did not provide any backward compatibility there. In that case there is nothing easy like a parameter value to use, it would have to be a new system call number. That seems pretty pointless. Given that, I wonder if it is worth bothering to preserve the compatible RUSAGE_SELF behavior by introducing RUSAGE_GROUP instead of just changing RUSAGE_SELF's meaning. Comments? I've done some basic testing on x86 and x86-64, and all the numbers come out right after these fixes. (I have a test program that shows a few Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-30  [PATCH] waitid system call  Roland McGrath
This patch adds a new system call `waitid'. This is a new POSIX call that subsumes the rest of the wait* family and can do some things the older calls cannot. A minor addition is the ability to select what kinds of status to check for with a mask of independent bits, so you can wait for just stops and not terminations, for example. A more significant improvement is the WNOWAIT flag, which allows for polling child status without reaping. This interface fills in a siginfo_t with the same details that a SIGCHLD for the status change has; some of that info (e.g. si_uid) is not available via wait4 or other calls. I've added a new system call that has the parameter conventions of the POSIX function because that seems like the cleanest thing. This patch includes the actual system call table additions for i386 and x86-64; other architectures will need to assign the system call number, and 64-bit ones may need to implement 32-bit compat support for it as I did for x86-64. The new features could instead be provided by some new kludge inventions in the wait4 system call interface (that's what BSD did). If kludges are preferable to adding a system call, I can work up something different. I added a struct rusage field si_rusage to siginfo_t in the SIGCHLD case (this does not affect the size or layout of the struct). This is not part of the POSIX interface, but it makes it so that `waitid' subsumes all the functionality of `wait4'. Future kernel ABIs (new arch's or whatnot) can have only the `waitid' system call and the rest of the wait* family including wait3 and wait4 can be implemented in user space using waitid. There is nothing in user space as yet that would make use of the new field. Most of the new functionality is implemented purely in the waitid system call itself. POSIX also provides for the WCONTINUED flag to report when a child process had been stopped by job control and then resumed with SIGCONT. 
Corresponding to this, a SIGCHLD is now generated when a child resumes (unless SA_NOCLDSTOP is set), with the value CLD_CONTINUED in siginfo_t.si_code. To implement this, some additional bookkeeping is required in the signal code handling job control stops. The motivation for this work is to make it possible to implement the POSIX semantics of the `waitid' function in glibc completely and correctly. If changing either the system call interface used to accomplish that, or any details of the kernel implementation work, would improve the chances of getting this incorporated, I am more than happy to work through any issues. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-31  [PATCH] Fix x86-64 compilation without CONFIG_NUMA  Andi Kleen
This fixes compilation of x86-64 without CONFIG_NUMA again (got broken by the previous patchkit)
2004-05-28  [PATCH] sparse: trivial part of kernel/* __user annotation  Alexander Viro
2004-05-24  [PATCH] Fix race condition with current->group_info  Andrew Morton
From: Olaf Kirch <okir@suse.de> I have been chasing a corruption of current->group_info on PPC during NFS stress tests. The problem seems to be that nfsd is messing with its group_info quite a bit, while some monitoring processes look at /proc/<pid>/status and do a get_group_info/put_group_info without any locking. This problem can be reproduced on ppc platforms within a few seconds if you generate some NFS load and do a "cat /proc/XXX/status" of an nfsd thread in a tight loop. I therefore think changes to current->group_info, and querying it from a different process, need to be protected using the task_lock. (akpm: task->group_info here is safe against exit() because the task holds a ref on group_info which is released in __put_task_struct, and the /proc file has a ref on the task_struct).
2004-05-22  [PATCH] numa api: Core NUMA API code  Andrew Morton
From: Andi Kleen <ak@suse.de>
The following patches add support for configurable NUMA memory policy for user processes. It is based on the proposal from last kernel summit with feedback from various people. This NUMA API does not attempt to implement page migration or anything else complicated: all it does is to police the allocation when a page is first allocated or when a page is reallocated after swapping. Currently only support for shared memory and anonymous memory is there; policy for file based mappings is not implemented yet (although they get implicitly policied by the default process policy).
It adds three new system calls: mbind to change the policy of a VMA, set_mempolicy to change the policy of a process, get_mempolicy to retrieve memory policy. User tools (numactl, libnuma, test programs, manpages) can be found in ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz
For details on the system calls see the manpages
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html
Most user programs should actually not use the system calls directly, but use the higher level functions in libnuma (http://www.firstfloor.org/~andi/numa.html) or the command line tools (http://www.firstfloor.org/~andi/numactl.html).
The system calls allow user programs and administrators to set various NUMA memory policies for putting memory on specific nodes. Here is a short description of the policies copied from the kernel patch:
 * NUMA policy allows the user to give hints in which node(s) memory should
 * be allocated.
 *
 * Support four policies per VMA and per process:
 *
 * The VMA policy has priority over the process policy for a page fault.
 *
 * interleave Allocate memory interleaved over a set of nodes,
 *            with normal fallback if it fails.
 *            For VMA based allocations this interleaves based on the
 *            offset into the backing object or offset into the mapping
 *            for anonymous memory. For process policy an process counter
 *            is used.
 * bind       Only allocate memory on a specific set of nodes,
 *            no fallback.
 * preferred  Try a specific node first before normal fallback.
 *            As a special case node -1 here means do the allocation
 *            on the local CPU. This is normally identical to default,
 *            but useful to set in a VMA when you have a non default
 *            process policy.
 * default    Allocate on the local node first, or when on a VMA
 *            use the process policy. This is what Linux always did
 *            in a NUMA aware kernel and still does by, ahem, default.
 *
 * The process policy is applied for most non interrupt memory allocations
 * in that process' context. Interrupts ignore the policies and always
 * try to allocate on the local CPU. The VMA policy is only applied for memory
 * allocations for a VMA in the VM.
 *
 * Currently there are a few corner cases in swapping where the policy
 * is not applied, but the majority should be handled. When process policy
 * is used it is not remembered over swap outs/swap ins.
 *
 * Only the highest zone in the zone hierarchy gets policied. Allocations
 * requesting a lower zone just use default policy. This implies that
 * on systems with highmem kernel lowmem allocation don't get policied.
 * Same with GFP_DMA allocations.
 *
 * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
 * all users and remembered even when nobody has memory mapped.
This patch:
This is the core NUMA API code. This includes NUMA policy aware wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels these are defined away. The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html), get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are implemented here. Adds a vm_policy field to the VMA and to the process. The process also has a field for interleaving. VMA interleaving uses the offset into the VMA, but that's not possible for process allocations.
From: Andi Kleen <ak@muc.de> > Andi, how come policy_vma() calls ->set_policy under i_shared_sem? I think this can be actually dropped now. In an earlier version I did walk the vma shared list to change the policies of other mappings to the same shared memory region. This turned out too complicated with all the corner cases, so I eventually gave in and added ->get_policy to the fast path. Also there is still the mmap_sem which prevents races in the same MM. Patch to remove it attached. Also adds documentation and removes the bogus __alloc_page_vma() prototype noticed by hch. From: Andi Kleen <ak@suse.de> A few incremental fixes for NUMA API. - Fix a few comments - Add a compat_ function for get_mem_policy I considered changing the ABI to avoid this, but that would have made the API too ugly. I put it directly into the file because a mm/compat.c didn't seem worth it just for this. - Fix the algorithm for VMA interleave. From: Matthew Dobson <colpatch@us.ibm.com> 1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA. The only references to the function are in NUMA code in mempolicy.c 2) Remove the definitions of __alloc_page_vma(). They aren't used. 3) Move forward declaration of struct vm_area_struct to top of file.
2004-05-21  [PATCH] Sanitise handling of unneeded syscall stubs  Andrew Morton
From: David Mosberger <davidm@napali.hpl.hp.com> Below is a patch that tries to sanitize the dropping of unneeded system-call stubs in generic code. In some instances, it would be possible to move the optional system-call stubs into a library routine which would avoid the need for #ifdefs, but in many cases, doing so would require making several functions global (and possibly exporting additional data-structures in header-files). Furthermore, it would inhibit (automatic) inlining in the cases where the stubs are needed. For these reasons, the patch keeps the #ifdef-approach. This has been tested on ia64 and there were no objections from the arch-maintainers (and one positive response). The patch should be safe but arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo macros should be removed for their architecture (I'm quite sure that's the case, but I wanted to play it safe and only preserved the status-quo in that regard).
2004-05-19  [PATCH] system_state splitup  Andrew Morton
Split the system_state state `SYSTEM_SHUTDOWN' into SYSTEM_HALT, SYSTEM_POWER_OFF and SYSTEM_RESTART and export system_state to modules. This allows driver shutdown routines to know why they are being shutdown. The IDE subsystem wants this so that it knows to not spin the disks down across a reboot.
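A driver shutdown routine could use the split states along the following lines. This is a sketch under assumptions: the `DEMO_` enum is a hypothetical mirror of the states named above (the booting and running entries are our guess at the remaining states), and treating halt like power-off is an illustrative choice, not something the message specifies.

```c
/* Hypothetical mirror of the system states; only HALT, POWER_OFF and
 * RESTART are named in the commit message. */
enum demo_system_state {
    DEMO_SYSTEM_BOOTING,
    DEMO_SYSTEM_RUNNING,
    DEMO_SYSTEM_HALT,
    DEMO_SYSTEM_POWER_OFF,
    DEMO_SYSTEM_RESTART
};

/* Shutdown hook for an imaginary disk driver: spin the disk down only
 * when the machine is really stopping, not across a reboot. */
static int demo_should_spin_down(enum demo_system_state state)
{
    switch (state) {
    case DEMO_SYSTEM_HALT:
    case DEMO_SYSTEM_POWER_OFF:
        return 1;
    case DEMO_SYSTEM_RESTART:
    default:
        return 0;
    }
}
```

With a single SYSTEM_SHUTDOWN value the routine could not tell these cases apart, which is the limitation the split removes.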
2004-05-14  Fix gidsetsize == 0 for real this time.  Linus Torvalds
We need to always allocate at least one indirect block pointer, since we always fill out blocks[0] even if we don't have any groups.
2004-05-14  [PATCH] groups_alloc(0) clobbers memory past end of block  Andrew Morton
From: Olaf Kirch <okir@suse.de> Authentication code in net/sunrpc makes frequent use of groups_alloc(0), which seems to clobber memory past the end of what it allocated. If called with gidsetsize == 0, groups_alloc will set nblocks = 0, but still does a group_info->blocks[0] = group_info->small_block;
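The zero-size case can be illustrated with the block-count arithmetic alone. The rounding formula and the `DEMO_GROUPS_PER_BLOCK` value below are assumptions made for the sketch; the log only states that nblocks becomes 0 for gidsetsize == 0 and that the follow-up fix always allocates at least one block pointer.

```c
/* Hypothetical number of group IDs stored per indirect block. */
#define DEMO_GROUPS_PER_BLOCK 1024

/* Round-up division: zero groups yields zero blocks, so no slot
 * exists for blocks[0] even though the caller still writes to it. */
static int demo_nblocks_buggy(int gidsetsize)
{
    return (gidsetsize + DEMO_GROUPS_PER_BLOCK - 1) / DEMO_GROUPS_PER_BLOCK;
}

/* Fixed variant: always reserve at least one block pointer so that
 * the unconditional blocks[0] assignment stays inside the allocation. */
static int demo_nblocks_fixed(int gidsetsize)
{
    int nblocks = demo_nblocks_buggy(gidsetsize);

    return nblocks ? nblocks : 1;
}
```

Sizing the pointer array with the fixed count costs one pointer for empty group sets and removes the out-of-bounds write.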
2004-05-10  [PATCH] find_user locking and leak fix  Andrew Morton
find_user() is being called from set/get_priority(), but it doesn't take the needed lock, and those callers were forgetting to drop the refcount which find_user() took.
2004-04-11[PATCH] eliminate nswap and cnswapAndrew Morton
From: Matt Mackall <mpm@selenic.com> The nswap and cnswap counters have never been incremented as Linux doesn't do task swapping.
2004-04-11[PATCH] move job control fields from task_struct to signal_structAndrew Morton
From: Roland McGrath <roland@redhat.com> This patch moves all the fields relating to job control from task_struct to signal_struct, so that all this info is properly per-process rather than being per-thread.
2004-04-11[PATCH] compat emulation for posix message queuesAndrew Morton
From: Arnd Bergmann <arnd@arndb.de> I have tested the code with the open posix test suite and found the same four failures for both 64-bit and compat mode; most tests pass. The patch is against -mc1, but I guess it also applies to the other trees around. What worries me more than mq_attr compatibility is the conversion of struct sigevent, which might turn out really hard when more fields in there are used. AFAICS, the only other part in the kernel ABI is sys_timer_create(), so maybe it's not too late to deprecate the current structure and create a structure that can be used properly for compat syscalls.
2004-04-11[PATCH] posix message queues: syscall stubsAndrew Morton
From: Manfred Spraul <manfred@colorfullife.com> Add -ENOSYS stubs for the posix message queue syscalls. The API is a direct mapping of the API from the unix spec, with two exceptions: - mq_close() doesn't exist. Message queue file descriptors can be closed with close(). - mq_notify(SIGEV_THREAD) cannot be implemented in the kernel. The kernel returns a pollable file descriptor. User space must poll (or read) this descriptor and call the notifier function if the file descriptor is signaled.
2004-04-11[PATCH] generalise system_runningAndrew Morton
From: Olof Johansson <olof@austin.ibm.com> It's currently a boolean, but that means that system_running goes to zero again when shutting down. So we then use code (in the page allocator) which is only designed to be used during bootup - it is marked __init. So we need to be able to distinguish early boot state from late shutdown state. Rename system_running to system_state and give it the three appropriate states.
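The problem with the boolean can be shown in a few lines. This is only a sketch: the state names SYSTEM_BOOTING, SYSTEM_RUNNING and SYSTEM_SHUTDOWN are an assumption about what the "three appropriate states" are called, and both helper functions are hypothetical stand-ins for the check made before using __init-only code.

```c
#include <stdbool.h>

/* Assumed names for the three states; the enum tag is hypothetical. */
enum system_state_sketch {
    SYSTEM_BOOTING,
    SYSTEM_RUNNING,
    SYSTEM_SHUTDOWN,
};

/* Old scheme: a boolean that is false both before boot completes and
 * again during shutdown, so the two phases look identical. */
static bool boot_code_allowed_bool(bool system_running)
{
    return !system_running;
}

/* New scheme: boot-only code is allowed only while actually booting. */
static bool boot_code_allowed_state(enum system_state_sketch state)
{
    return state == SYSTEM_BOOTING;
}
```

With the boolean, late shutdown is indistinguishable from early boot, so boot-only paths could be taken after their __init code is gone; the tri-state check rules that out.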
2004-02-22[PATCH] rename shmat to make it clear it isn't a system call entrypointManfred Spraul
This renames sys_shmat to do_shmat. Additionally, I've replaced the cond_syscall with a conditional inline function. It touches all archs - only i386 is tested.
2004-02-22[PATCH] cleanup condsyscall for sysv ipcAndrew Morton
From: Manfred Spraul <manfred@colorfullife.com> Attached is a patch that replaces the #ifndef CONFIG_SYSV syscall stubs with cond_syscall stubs.
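For readers unfamiliar with cond_syscall, the following sketch shows the general mechanism in a stand-alone file. It assumes an ELF target and a GNU-compatible assembler; the macro body follows the form quoted in the ppc cond_syscall entry of this log (spelled `__asm__` so it builds in strict ISO modes), while sys_foo is a hypothetical optional syscall and the `long (void)` signatures are simplifications.

```c
#include <errno.h>

/* Fallback used for any system call that is not built in. */
long sys_ni_syscall(void)
{
    return -ENOSYS;
}

/*
 * Emit x as a weak symbol whose value is sys_ni_syscall. If another
 * object file provides a strong definition of x, the linker picks
 * that one instead; otherwise calls to x land in sys_ni_syscall.
 */
#define cond_syscall(x) \
    __asm__(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall")

/* Hypothetical optional syscall with no implementation in this build. */
long sys_foo(void);
cond_syscall(sys_foo);
```

Because no strong sys_foo exists here, a call to sys_foo() resolves to sys_ni_syscall and returns -ENOSYS, which is the behaviour the #ifndef stubs used to provide by hand.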
2004-02-18[PATCH] NGROUPS 2.6.2rc2 + fixupsAndrew Morton
From: Tim Hockin <thockin@sun.com>, Neil Brown <neilb@cse.unsw.edu.au>, me New groups infrastructure. task->groups and task->ngroups are replaced by task->group_info. Group_info is a refcounted, dynamic struct with an array of pages. This allows for large numbers of groups. The current limit of 32 groups has been raised to 64k groups. It can be raised more by changing the NGROUPS_MAX constant in limits.h.
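To make the "array of pages" layout concrete, here is a small sketch of how a flat group index could map onto a block and an offset within it. The value of NGROUPS_PER_BLOCK assumes 4 KiB pages holding 4-byte group IDs, and the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: one 4 KiB page of 4-byte gids per block. */
#define NGROUPS_PER_BLOCK 1024

struct group_slot {
    size_t block;   /* which page in the array */
    size_t offset;  /* which entry within that page */
};

/* Number of pages needed to hold ngroups entries (rounded up). */
static size_t blocks_needed(size_t ngroups)
{
    return (ngroups + NGROUPS_PER_BLOCK - 1) / NGROUPS_PER_BLOCK;
}

/* Translate a flat group index into a (block, offset) pair. */
static struct group_slot group_slot_for(size_t index)
{
    struct group_slot slot = {
        index / NGROUPS_PER_BLOCK,
        index % NGROUPS_PER_BLOCK,
    };
    return slot;
}
```

Under these assumptions the 64k limit corresponds to 64 pages, and raising NGROUPS_MAX only lengthens the pointer array rather than requiring one large contiguous allocation.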
2004-02-03[PATCH] Allow software_suspend to failAndrew Morton
From: Pavel Machek <pavel@ucw.cz> software_suspend() can fail for quite a lot of reasons (for example not enough swapspace). However, the current interface returned void, so you could not propagate the error back to userland. This fixes it. Plus __read_suspend_image() is only done during init time, so we might as well mark it __init.
2004-01-19[PATCH] ppc cond_syscall fixAndrew Morton
From: Matt Mackall <mpm@selenic.com> Experimenting with trying to use cond_syscall for a few arch-specific syscalls, I discovered that it can't actually be used outside the file in which sys_ni_syscall is declared because the assembler doesn't feel obliged to output the symbol in that case: weak.c: #define cond_syscall(x) asm(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall"); cond_syscall(sys_foo); $ nm weak.o U sys_ni_syscall One arch (PPC) is apparently trying to use cond_syscall this way anyway, though it's probably never been actually tested as the above test was done on a PPC. After trying a bunch of tricks to get it to work nicely, I decided there are basically two alternatives: make weak versions of sys_ni_syscall wherever they're wanted or put the arch-specific cond_syscalls in kernel/sys.c where sys_ni_syscall is defined. The former approach is a bit crufty and doesn't actually do the right thing in practice as you'll get multiple copies of sys_ni_syscall in your final image. The latter introduces some slight arch-pollution in sys.c, but as arch-specific cond_syscalls aren't all that frequent, it should be pretty minor. So here's a patch to move the current offender to sys.c:
2003-12-29[PATCH] pagefault accounting fixAndrew Morton
From: William Lee Irwin III <wli@holomorphy.com> Our accounting of minor faults versus major faults is currently quite wrong. To fix it up we need to propagate the actual fault type back to the higher-level code. Repurpose the currently-unused third arg to ->nopage for this.
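The idea of reporting the fault type through an extra argument can be sketched in user space as follows. Everything here is hypothetical scaffolding: the enum, the counters struct, the handler signatures and the two example handlers only illustrate the out-parameter pattern and are not the kernel's ->nopage interface.

```c
#include <stddef.h>

enum fault_type_sketch { FAULT_MINOR, FAULT_MAJOR };

struct fault_counters {
    unsigned long min_flt;
    unsigned long maj_flt;
};

/* Hypothetical handler type: returns the page and reports through
 * *type whether satisfying the fault required I/O. */
typedef void *(*nopage_fn)(unsigned long address, int *type);

static char fake_page[1];

/* Page was already in memory: a minor fault. */
static void *cached_nopage(unsigned long address, int *type)
{
    (void)address;
    if (type)
        *type = FAULT_MINOR;
    return fake_page;
}

/* Page had to be read in: a major fault. */
static void *disk_nopage(unsigned long address, int *type)
{
    (void)address;
    if (type)
        *type = FAULT_MAJOR;
    return fake_page;
}

/* Higher-level code does the accounting from the reported type. */
static void handle_fault(struct fault_counters *c, nopage_fn nopage,
                         unsigned long address)
{
    int type = FAULT_MINOR;
    void *page = nopage(address, &type);

    if (!page)
        return;
    if (type == FAULT_MAJOR)
        c->maj_flt++;
    else
        c->min_flt++;
}
```

The point of the design is that only the low-level handler knows whether I/O happened, so it has to hand that fact upward instead of the caller guessing.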
2003-10-21[PATCH] export system_running to other filesAndrew Morton
There seems to be no header file which declares system_running.
2003-10-09Revert the process group accessor functions. They are buggy, andLinus Torvalds
cause NULL pointer references in /proc. Moreover, it's questionable whether the whole thing makes sense at all. Per-thread state is good. Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097 Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200 Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
2003-10-07o kernel/ksyms.c: move remaining EXPORT_SYMBOLs, remove this file from the treeArnaldo Carvalho de Melo