|
POSIX requires that setitimer, getitimer, and alarm work on a per-process
basis. Currently, Linux implements these for individual threads. This patch
fixes these semantics for the ITIMER_REAL timer (which generates SIGALRM),
making it shared by all threads in a process (thread group).
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based, and the effect of this patch for architectures which use the default
implementation should be negligible.
There is a second type, cputime64_t, which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ, so it will
overflow after 49.7 days. This is not enough for kernel_stat (imho not
enough for a process either), so it is necessary to have a 64 bit type.
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers, this involuntary wait time needs
to be accounted and exported to user space.
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch splits some memory-related procfs files into MMU and !MMU
versions and places them in separate conditionally-compiled files. A header
file local to the fs/proc/ directory is used to declare functions and the like.
Additionally, a !MMU-only proc file (/proc/maps) is provided so that the
master VMA list in a uClinux kernel is viewable.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use pid_alive() rather than testing for a zero value of ->pid. This is the
right thing to do, and it addresses an oops dereferencing real_parent which
one person reported.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
proc_pid_status dereferences pointers in the task structure even if the
task is already dead. This is probably the reason for the oops described
in
http://bugme.osdl.org/show_bug.cgi?id=3812
The attached patch removes the pointer dereferences by using pid_alive()
to test that the task structure contents are still valid before
dereferencing them. The task structure itself is guaranteed to be valid -
we hold a reference count.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
One more place in fs/proc/array.c where ppid is wrong, which I missed in my
previous mail to lkml.
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
/proc shows the wrong PID as parent in the following case:
Process A creates Threads 1 & 2 (using pthread_create). Thread 2 then forks
and execs process B. getppid() for Process B shows Process A (rightly) as
parent; however, /proc/B/status shows Thread 3 as PPid (incorrect).
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had. Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and we trigger this assert:
schedule();
BUG();
/* Avoid "noreturn function does return". */
for (;;) ;
I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task. Dead-task freeing is done via PF_DEAD.
Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add up resource usage counters for live and dead threads to show aggregate
per-process usage in /proc/<pid>/stat. This mirrors the new getrusage()
semantics. /proc/<pid>/task/<tid>/stat still has the per-thread usage.
After moving the counter aggregation loop inside a task->sighand lock to
avoid nasty race conditions, it has survived stress-testing with '(while
true; do sleep 1 & done) & top -d 0.1'
Signed-off-by: Lev Makhlis <mlev@despammed.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adjusts /proc/*/stat to have distinct per-process and per-thread
CPU usage, faults, and wchan.
Signed-off-by: Albert Cahalan <albert@users.sf.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.
I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32/64
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
Most of the changes here are just the trivial s/->rlim/->signal->rlim/.
I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
The other subtlety is removing:
tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)
The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be in a generally dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.
One final note: this is not 100% POSIX compliance in regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Derive process start times from the posix_clock_monotonic notion of uptime
instead of "jiffies", consistent with the earlier change to /proc/uptime
itself.
(http://linus.bkbits.net:8080/linux-2.5/cset@3ef4851dGg0fxX58R9Zv8SIq9fzNmQ?nav=index.html|src/.|src/fs|src/fs/proc|related/fs/proc/proc_misc.c)
Process start times are reported to userspace in units of 1/USER_HZ since
boot, so applications such as procps need the value of "uptime" to convert
them into absolute time.
Currently "uptime" is derived from an ntp-corrected time base, but process
start time is derived from the free-running "jiffies" counter. This
results in inaccurate, drifting process start times as seen by the user,
even if the exported number stays constant, because the user's notion of
"jiffies" changes in time.
It's John Stultz's patch anyways, which I only messed up a bit, but since
people started trading signed-off lines on lkml:
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This adds a new state TASK_TRACED that is used in place of TASK_STOPPED
when a thread stops because it is ptraced. Now ptrace operations are only
permitted when the target is in TASK_TRACED state, not in TASK_STOPPED.
This means that if a process is stopped normally by a job control signal
and then you PTRACE_ATTACH to it, you will have to send it a SIGCONT before
you can do any ptrace operations on it. (The SIGCONT will be reported to
ptrace and then you can discard it instead of passing it through when you
call PTRACE_CONT et al.)
If a traced child gets orphaned while in TASK_TRACED state, it morphs into
TASK_STOPPED state. This makes it again possible to resume or destroy the
process with SIGCONT or SIGKILL.
All non-signal tracing stops should now be done via ptrace_notify. I've
updated the syscall tracing code in several architectures to do this
instead of replicating the work by hand. I also fixed several that were
unnecessarily repeating some of the checks in ptrace_check_attach. Calling
ptrace_check_attach alone is sufficient, and the old checks repeated before
are now incorrect, not just superfluous.
I've closed a race in ptrace_check_attach. With this, we should have a
robust guarantee that when ptrace starts operating, the task will be in
TASK_TRACED state and won't come out of it. This is because the only way
to resume from TASK_TRACED is via ptrace operations, and only the one
parent thread attached as the tracer can do those.
This patch also cleans up the do_notify_parent and do_notify_parent_cldstop
code so that the dead and stopped cases are completely disjoint. The
notify_parent function is gone.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch changes the rusage bookkeeping and the semantics of the
getrusage and times calls in a couple of ways.
The first change is in the c* fields counting dead child processes. POSIX
requires that children that have died be counted in these fields when they
are reaped by a wait* call, and that if they are never reaped (e.g.
because of ignoring SIGCHLD, or exiting yourself first) then they are
never counted. These were counted in release_task for all threads. I've
changed it so they are counted in wait_task_zombie, i.e. exactly when
being reaped.
POSIX also specifies for RUSAGE_CHILDREN that the report include the reaped
child processes of the calling process, i.e. the whole thread group in Linux,
not just ones forked by the calling thread. POSIX specifies tms_c[us]time
fields in the times call the same way. I've moved the c* fields that
contain this information into signal_struct, where the single set of
counters accumulates data from any thread in the group that calls wait*.
Finally, POSIX specifies getrusage and times as returning cumulative totals
for the whole process (aka thread group), not just the calling thread.
I've added fields in signal_struct to accumulate the stats of detached
threads as they die. The process stats are the sums of these records plus
the stats of each remaining live/zombie thread. The times and getrusage
calls, and the internal uses for filling in wait4 results and siginfo_t,
now iterate over the threads in the thread group and sum up their stats
along with the stats recorded for threads already dead and gone.
I added a new value RUSAGE_GROUP (-3) for the getrusage system call rather
than changing the behavior of the old RUSAGE_SELF (0). POSIX specifies
RUSAGE_SELF to mean all threads, so the glibc getrusage call will just
translate it to RUSAGE_GROUP for new kernels. I did this thinking that
someone somewhere might want the old behavior with an old glibc and a new
kernel (it is only different if they are using CLONE_THREAD anyway).
However, I've changed the times system call to conform to POSIX as well and
did not provide any backward compatibility there. In that case there is
nothing easy like a parameter value to use, it would have to be a new
system call number. That seems pretty pointless. Given that, I wonder if
it is worth bothering to preserve the compatible RUSAGE_SELF behavior by
introducing RUSAGE_GROUP instead of just changing RUSAGE_SELF's meaning.
Comments?
I've done some basic testing on x86 and x86-64, and all the numbers come
out right after these fixes. (I have a test program that shows a few
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Merely removing down_read(&mm->mmap_sem) from task_vsize() is too
half-assed to let stand. The following patch removes the vma iteration
as well as the down_read(&mm->mmap_sem) from both task_mem() and
task_statm() and callers for the CONFIG_MMU=y case in favor of
accounting the various stats reported at the times of vma creation,
destruction, and modification. Unlike the 2.4.x patches of the same
name, this has no per-pte-modification overhead whatsoever.
This patch quashes end user complaints of top(1) being slow as well as
kernel hacker complaints of per-pte accounting overhead simultaneously.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
task_vsize() doesn't need mm->mmap_sem for the CONFIG_MMU case; the
semaphore doesn't prevent mm->total_vm from going stale or getting
inconsistent with other numbers regardless. Also, KSTK_EIP() and
KSTK_ESP() don't want or need protection from mm->mmap_sem either. So this
pushes taking mm->mmap_sem into the CONFIG_MMU=n task_vsize().
Also, hoist the prototype of task_vsize() into proc_fs.h
The net result of this is a small speedup of procps for CONFIG_MMU.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Clarify mmgrab by collapsing it into get_task_mm (in fork.c not inline),
and commenting on the special case it is guarding against: when use_mm in
an AIO daemon temporarily adopts the mm while it's on its way out.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Races have been observed between exec-time overwriting of task->comm and
/proc accesses to the same data. This causes environment string
information to appear in /proc.
Fix that up by taking task_lock() around updates to and accesses to
task->comm.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I just stumbled across this patch that's been sitting in my tree for ages.
I thought I'd sent this in before. It's a trivial fix for the printing
of task state in /proc and sysrq dumps and such, so that TASK_DEAD shows
up correctly. This state is pretty much only ever there to be seen when
there are exit/reaping bugs, but it's not like that hasn't come up.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Olaf Kirch <okir@suse.de>
I have been chasing a corruption of current->group_info on PPC during NFS
stress tests. The problem seems to be that nfsd is messing with its
group_info quite a bit, while some monitoring processes look at
/proc/<pid>/status and do a get_group_info/put_group_info without any locking.
This problem can be reproduced on ppc platforms within a few seconds if you
generate some NFS load and do a "cat /proc/XXX/status" of an nfsd thread in a
tight loop.
I therefore think changes to current->group_info, and querying it from a
different process, needs to be protected using the task_lock.
(akpm: task->group_info here is safe against exit() because the task holds a
ref on group_info which is released in __put_task_struct, and the /proc file
has a ref on the task_struct).
|
|
From: Alan Stern <stern@rowland.harvard.edu>
This patch is needed to work around gcc-2.96's limited ability to cope with
long long intermediate expression types. I don't know why the code
compiled okay earlier and failed now.
|
|
From: Ingo Molnar <mingo@elte.hu>
Dave reported that /proc/*/status sometimes shows 101% as LoadAVG, which
makes no sense.
The reason for the bug is slightly incorrect scaling of the load_avg value.
The patch below fixes this.
|
|
From: Matt Mackall <mpm@selenic.com>
The nswap and cnswap counters have never been incremented, as Linux
doesn't do task swapping.
|
|
From: Roland McGrath <roland@redhat.com>
This patch moves all the fields relating to job control from task_struct to
signal_struct, so that all this info is properly per-process rather than
being per-thread.
|
|
From: Tim Hockin <thockin@sun.com>,
Neil Brown <neilb@cse.unsw.edu.au>,
me
New groups infrastructure. task->groups and task->ngroups are replaced by
task->group_info. group_info is a refcounted, dynamic struct with an array
of pages. This allows for large numbers of groups. The current limit of
32 groups has been raised to 64k groups. It can be raised further by changing
the NGROUPS_MAX constant in limits.h.
|
|
Having the number-of-threads value easily available turns out to be very
important for procps performance.
The /proc/*/stat field being reused has been zero since the 2.2.xx
days, and was the seldom-used timeout value before that.
|
|
cause NULL pointer references in /proc.
Moreover, it's questionable whether the whole thing makes sense at all.
Per-thread state is good.
Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097
Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200
Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
|
|
From: Roland McGrath <roland@redhat.com>
This patch completes what was started with the `process_group' accessor
function, moving all the job control-related fields from task_struct into
signal_struct and using process_foo accessor functions to read them. All
these things are per-process in POSIX, none per-thread. Off hand it's hard
to come up with the hairy MT scenarios in which the existing code would do
insane things, but trust me, they're there. At any rate, all the uses
being done via inline accessor functions now has got to be all good.
I did a "make allyesconfig" build and caught the few random drivers and
whatnot that referred to these fields. I was surprised to find how few
references to ->tty there really were to fix up. I'm sure there will be a
few more fixups needed in non-x86 code. The only actual testing of a
running kernel with these patches I've done is on my normal minimal x86
config. Everything works fine as it did before as far as I can tell.
One issue that may be of concern is the lack of any locking on multiple
threads diddling these fields. I don't think it really matters, though
there might be some obscure races that could produce inconsistent job
control results. Nothing shattering, I'm sure; probably only something
like a multi-threaded program calling setsid while its other threads do tty
i/o, which never happens in reality. This is the same situation we get by
using ->group_leader->foo without other synchronization, which seemed to be
the trend, and no one was worried about it.
|
|
Argh. A couple of places where we needed ..._encode_dev() had
been lost in reordering the patchset - the most notable being ctty number in
/proc/<pid>/stat. Fix follows:
|
|
tty->device had been used only in a couple of places and can be
calculated by tty->index and tty->driver. Field removed, its users switched
to static inline dev_t tty_devnum(tty).
|
|
From: Ingo Molnar <mingo@elte.hu>
the attached scheduler patch (against test2-mm2) adds the scheduling
infrastructure items discussed on lkml. I got good feedback - and while I
don't expect it to solve all problems, it does solve a number of bad ones:
- test_starve.c code from David Mosberger
- thud.c making the system unusable due to unfairness
- fair/accurate sleep average based on a finegrained clock
- audio skipping way too easily
other changes in sched-test2-mm2-A3:
- ia64 sched_clock() code, from David Mosberger.
- migration thread startup without relying on implicit scheduling
behavior. While the current 2.6 code is correct (due to the cpu-up code
adding CPUs one by one), it's also fragile - and this code cannot
be carried over into the 2.4 backports. So adding this method would
clean up the startup and would make it easier to have 2.4 backports.
and here's the original changelog for the scheduler changes:
- cycle accuracy (nanosec resolution) timekeeping within the scheduler.
This fixes a number of audio artifacts (skipping) I've reproduced. I
don't think we can get away without going cycle accuracy - reading the
cycle counter adds some overhead, but it's acceptable. The first
nanosec-accuracy patch was done by Mike Galbraith - this patch is
different but similar in nature. I went further in also changing the
sleep_avg to be of nanosec resolution.
- more finegrained timeslices: there's now a timeslice 'sub unit' of 50
usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
will roundrobin with this unit. This change is intended to make gaming
latencies shorter.
- include scheduling latency in sleep bonus calculation. This change
extends the sleep-average calculation to the period of time a task
spends on the runqueue but doesn't get scheduled yet, right after
wakeup. Note that tasks that were preempted (ie. not woken up) and are
still on the runqueue do not get this benefit. This change closes one
of the last holes in the dynamic priority estimation; it should result
in interactive tasks getting more priority under heavy load. This
change also fixes the test-starve.c testcase from David Mosberger.
The TSC-based scheduler clock is disabled on ia32 NUMA platforms (ie.
platforms that have unsynched TSCs for sure). Those platforms should provide
the proper code to rely on the TSC in a global way. (No such infrastructure
exists at the moment - the monotonic TSC-based clock doesn't deal with TSC
offsets either, as far as I can tell.)
|
|
From: Jeremy Fitzhardinge <jeremy@goop.org>
I'm resending my patch to fix this problem. To recap: every task_struct
has its own copy of the thread group's pgrp. Only the thread group
leader is allowed to change the tgrp's pgrp, but it only updates its own
copy of pgrp, while all the other threads in the tgrp use the old value
they inherited on creation.
This patch simply updates all the other thread's pgrp when the tgrp
leader changes pgrp. Ulrich has already expressed reservations about
this patch since it is (1) incomplete (it doesn't cover the case of
other ids which have similar problems), (2) racy (it doesn't synchronize
with other threads looking at the task pgrp, so they could see an
inconsistent view) and (3) slow (it takes linear time with respect to
the number of threads in the tgrp).
My reaction is that (1) it fixes the actual bug I'm encountering in a
real program. (2) doesn't really matter for pgrp, since it is mostly an
issue with respect to the terminal job-control code (which is even more
broken without this patch). Regarding (3), I think there are very few
programs which have a large number of threads and change process group
id on a regular basis (a heavily multi-threaded job-control shell?).
Ulrich also said he has a (proposed?) much better fix, which I've been
looking forward to. I'm submitting this patch as a stop-gap fix for a
real bug, and perhaps to prompt the improved patch.
An alternative fix, at least for pgrp, is to change all references to
->pgrp to group_leader->pgrp. This may be sufficient on its own, but it
would be a reasonably intrusive patch (I count 95 instances in 32 files
in the 2.6.0-test3-mm3 tree).
|
|
From: Suparna Bhattacharya <suparna@in.ibm.com>
The /proc code's bare atomic_inc(&mm->mm_users) is racy against __exit_mm()'s
mmput() on another CPU: it calls mmput() outside task_lock(tsk), and
task_lock() isn't appropriate locking anyway.
So what happens is:
CPU0                                      CPU1
mmput()
  ->atomic_dec_and_lock(mm->mm_users)
                                          atomic_inc(mm->mm_users)
  ->list_del(mm->mmlist)
                                          mmput()
                                            ->atomic_dec_and_lock(mm->mm_users)
                                            ->list_del(mm->mmlist)
And the double list_del() of course goes splat.
So we use mmlist_lock to synchronise these steps.
The patch implements a new mmgrab() routine which increments mm_users only if
the mm isn't already going away. Changes get_task_mm() and proc_pid_stat()
to call mmgrab() instead of a direct atomic_inc(&mm->mm_users).
Hugh, there's some cruft in swapoff which looks like it should be using
mmgrab()...
|
|
tty->device switched to dev_t
There are very few uses of tty->device left by now; most of
them actually want dev_t (process accounting, proc/<pid>/stat, several
ioctls, slip.c logics, etc.) and the rest will go away shortly.
|
|
From Tim Schmielau <tim@physik3.uni-rostock.de>
Force jiffies to start out at five-minutes-before-wrap. To find
jiffy-wrapping bugs.
|
|
close succession. However, for this once we'll just call it "inspired".
But let's pair the lock with an unlock anyway, even if it is
boring and "square".
|
|
Patch from Roland McGrath.
|
|
Fix wrong order of process status. It's
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_STOPPED 4
#define TASK_ZOMBIE 8
#define TASK_DEAD 16
but SysRQ printout routines switch stopped and zombie around.
So, for one more time, here's another mailing of the same patch to fix
this brokenness. In addition, fix the wrong comment in fs/proc/array.c
|
|
This is required to make the old LinuxThread semantics work
together with the fixed-for-POSIX full signal sharing. A traditional
CLONE_SIGHAND thread (LinuxThread) will not see any other shared
signal state, while a new-style CLONE_THREAD thread will share all
of it.
This way the two methods don't confuse each other.
|
|
This prevents reporting processes as having started in the future, after
32 bit jiffies wrap.
|
|
New version with all ifdef CONFIG_MMU gone from procfs.
Instead, the conditional code is in either task_mmu.c or task_nommu.c, and
the Makefile will select the proper file for inclusion depending on
CONFIG_MMU.
|
|
Can't remember where this came from, but it's been around
for quite a while. Prints the parent (tracer) pid if
the task is being traced.
|
|
So stub it out, similar to /proc/<pid>/wchan for !CONFIG_KALLSYMS
|
|
Patch from Bill Irwin. It has the potential to break userspace
monitoring tools a little bit, and I'm rather uncertain about
how useful the per-process per-cpu accounting is.
Bill sent this out as an RFC on July 29:
"These statistics severely bloat the task_struct and nothing in
userspace can rely on them as they're conditional on CONFIG_SMP. If
anyone is using them (or just wants them around), please speak up."
And nobody spoke up.
If we apply this, the contents of /proc/783/cpu will go from
cpu 1 1
cpu0 0 0
cpu1 0 0
cpu2 1 1
cpu3 0 0
to
cpu 1 1
And we shall save 256 bytes from the ia32 task_struct.
On my SMP build with NR_CPUS=32:
Without this patch, sizeof(task_struct) is 1824, slab uses a 1-order
allocation and we are getting 2 task_structs per page.
With this patch, sizeof(task_struct) is 1568, slab uses a 2-order
allocation and we are getting 2.5 task_structs per page.
So it seems worthwhile.
(Maybe this highlights a shortcoming in slab. For the 1824-byte case
it could have used a 0-order allocation)
|
|
The i_dev field is deleted and the few uses are replaced by i_sb->s_dev.
There is a single side effect: a stat on a socket now sees a nonzero
st_dev. There is nothing against that - FreeBSD has a nonzero value as
well - but there is at least one utility (fuser) that will need an
update.
|
|
From Bill Irwin
Move hugetlb and hugetlbfs declarations into a dedicated header file.
Hugetlb's big #ifdeffed block in mm.h got a lot bigger with hugetlbfs.
This patch basically attempts to remove the noise from mm.h by simply
rearranging it into a new header, and fixing all users of hugetlb.
|
|
As per CodingStyle
|