[PATCH] synchronize use of mm->core_waiters

From: Roland McGrath <roland@redhat.com>

I believe I have identified a failure mode that Linus saw a couple weeks
back when tracking down some other fork/exit sorts of races.  We saw this
come up on rare occasions with the RHEL3 kernel's backport of the new code
(while trying to track down other race failure modes we have yet to fix,
sigh).  I am talking about the following scenario:

> Btw, even with the fix, doing a "while : ; ./crash t 10 ; done" will
> eventually result in a stuck process:
>
>   1415 tty1     D      0:00 ./crash
>
> This is some kind of deadlock: most of the fifty threads are in "D"
> state, with a trace something like
>
>   [<c011fbe3>] schedule+0x360/0x7f8
>   [<c0120539>] wait_for_completion+0xd4/0x1c3
>   [<c0128c9e>] do_exit+0x627/0x6a4
>   [<c0128ddd>] do_group_exit+0x3d/0x177
>   [<c0130c13>] dequeue_signal+0x2d/0x84
>   [<c0133911>] get_signal_to_deliver+0x390/0x575
>   [<c010a541>] do_signal+0x6c/0xf1
>   [<c01200be>] default_wake_function+0x0/0x12
>   [<c01200be>] default_wake_function+0x0/0x12
>   [<c013d50f>] do_futex+0x6d/0x7d
>   [<c013d635>] sys_futex+0x116/0x12f
>   [<c010a601>] do_notify_resume+0x3b/0x3d
>   [<c010a82e>] work_notifysig+0x13/0x15
>
> except for one that is trying to core-dump:
>
>   [<c0120539>] wait_for_completion+0xd4/0x1c3
>   [<c01200be>] default_wake_function+0x0/0x12
>   [<c01200be>] default_wake_function+0x0/0x12
>   [<c02101aa>] rwsem_wake+0x86/0x12d
>   [<c01738af>] coredump_wait+0xa8/0xaa
>   [<c0173a26>] do_coredump+0x175/0x26c
>
> and three that are just doing a regular "exit()" system call:
>
>   [<c011fbe3>] schedule+0x360/0x7f8
>   [<c011e19a>] recalc_task_prio+0x90/0x1aa
>   [<c0120539>] wait_for_completion+0xd4/0x1c3
>   [<c01200be>] default_wake_function+0x0/0x12
>   [<c01200be>] default_wake_function+0x0/0x12
>   [<c0210207>] rwsem_wake+0xe3/0x12d
>   [<c0128c9e>] do_exit+0x627/0x6a4
>   [<c0128d4d>] next_thread+0x0/0x53
>   [<c010a7e3>] syscall_call+0x7/0xb
>
> However, the rest of the system is totally unaffected by this deadlock:
> it's only deadlocked within the thread group itself, nobody else cares.

What happens here is a race between an exiting thread checking
mm->core_waiters in __exit_mm, and the thread taking the core-dump signal
(in coredump_wait) examining the first thread's ->mm pointer and
incrementing mm->core_waiters to account for it.  There is no
synchronization at all in __exit_mm's use of mm->core_waiters.

If the coredump_wait thread reads tsk->mm while tsk is in __exit_mm, after
tsk has checked mm->core_waiters but before it has cleared tsk->mm, then
coredump_wait will increment mm->core_waiters on behalf of a thread that
will never decrement it.  The total count then exceeds the number of
threads that will ever decrement it and synchronize, so the dumping thread
blocks forever.

The following patch fixes the problem by using mm->mmap_sem in __exit_mm.
The read lock must be held around checking mm->core_waiters and clearing
tsk->mm so that coredump_wait (which gets the write lock) cannot come in
between and do bogus bookkeeping.

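The interleaving described above can be replayed deterministically in a
small userspace model.  This is only an illustrative sketch, not kernel
code: mm_model, task_model, count_waiters and leftover_waiters are
hypothetical names, and the three-task layout is an assumption made for the
example.  It runs the two orderings one step at a time in a single thread
and reports how many increments of core_waiters are never matched by a
decrement.

```c
#include <assert.h>
#include <stddef.h>

#define NTASKS 3	/* 0 = dumper, 1 = exiting thread, 2 = bystander */

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct mm_model   { int core_waiters; };
struct task_model { struct mm_model *mm; };

/* Dumper side: account for every other task still attached to mm. */
static void count_waiters(struct mm_model *mm,
			  const struct task_model *tasks, size_t self)
{
	for (size_t i = 0; i < NTASKS; i++)
		if (i != self && tasks[i].mm == mm)
			mm->core_waiters++;
}

/*
 * Replay one ordering of the exiting thread against the dumper and
 * return the core_waiters value left once every thread that is going
 * to decrement it has done so.  Nonzero means the dumper would wait
 * forever.
 */
int leftover_waiters(int racy)
{
	struct mm_model mm = { 0 };
	struct task_model tasks[NTASKS] = { { &mm }, { &mm }, { &mm } };
	int exiter_saw_waiters;

	assert(mm.core_waiters == 0);

	if (racy) {
		/* Unsynchronized: the dumper runs between check and clear. */
		exiter_saw_waiters = (mm.core_waiters != 0);
		count_waiters(&mm, tasks, 0);
		tasks[1].mm = NULL;
	} else {
		/* Check and clear form one step as seen by the dumper. */
		exiter_saw_waiters = (mm.core_waiters != 0);
		tasks[1].mm = NULL;
		count_waiters(&mm, tasks, 0);
	}

	/* The exiting thread only decrements if its check saw waiters. */
	if (exiter_saw_waiters)
		mm.core_waiters--;

	/* The bystander exits later, sees the count, and decrements it. */
	if (tasks[2].mm == &mm && mm.core_waiters > 0)
		mm.core_waiters--;

	return mm.core_waiters;
}
```

In this model leftover_waiters(1) leaves one unmatched increment, which
corresponds to the dumping thread never being woken, while
leftover_waiters(0) leaves none.  The second ordering is what holding the
read lock across the check and the clear is meant to guarantee.
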
author: Andrew Morton <akpm@osdl.org> 2003-12-29 05:54:13 -0800
committer: Linus Torvalds <torvalds@home.osdl.org> 2003-12-29 05:54:13 -0800
commit: 99365bd4725d431255ff4bdd51fb3dca60c47322 (patch)
tree: c8f549cc714e20270d4f0c096d654ec0d1c3cb22 /kernel
parent: dc942a21e4c8fd1cbc135ad3ca35178c6217c77a (diff)
1 files changed, 9 insertions, 1 deletions
diff --git a/kernel/exit.c b/kernel/exit.c
index 1f7e7545e1d0..749da057424a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -472,21 +472,29 @@ static inline void __exit_mm(struct task_struct * tsk)
 	if (!mm)
 		return;
 	/*
-	 * Serialize with any possible pending coredump:
+	 * Serialize with any possible pending coredump.
+	 * We must hold mmap_sem around checking core_waiters
+	 * and clearing tsk->mm.  The core-inducing thread
+	 * will increment core_waiters for each thread in the
+	 * group with ->mm != NULL.
 	 */
+	down_read(&mm->mmap_sem);
 	if (mm->core_waiters) {
+		up_read(&mm->mmap_sem);
 		down_write(&mm->mmap_sem);
 		if (!--mm->core_waiters)
 			complete(mm->core_startup_done);
 		up_write(&mm->mmap_sem);
 
 		wait_for_completion(&mm->core_done);
+		down_read(&mm->mmap_sem);
 	}
 	atomic_inc(&mm->mm_count);
 	if (mm != tsk->active_mm) BUG();
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
+	up_read(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);
 	task_unlock(tsk);
 	mmput(mm);