user/sven/linux.git/kernel/sched_rt.c, branch v3.2.11

sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW

2012-02-13T19:16:56Z

commit cb297a3e433dbdcf7ad81e0564e7b804c941ff0d upstream.

This issue happens under the following conditions:

 1. preemption is off
 2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
 3. RT scheduling class
 4. SMP system

The sequence is as follows:

 1. Suppose the current task is A. schedule() starts.

 2. Task A is enqueued as a pushable task at the entry of schedule().

      __schedule
       prev = rq->curr;
       ...
       put_prev_task
        put_prev_task_rt
         enqueue_pushable_task

 3. Task B is picked as the next task.

      next = pick_next_task(rq);

 4. rq->curr is set to task B and context_switch() is started.

      rq->curr = next;

 5. At the entry of context_switch(), this CPU's rq->lock is released.

      context_switch
       prepare_task_switch
        prepare_lock_switch
         raw_spin_unlock_irq(&rq->lock);

 6. Shortly after rq->lock is released, an interrupt occurs and IRQ
    context starts.

 7. try_to_wake_up(), called by the ISR, acquires rq->lock.

      try_to_wake_up
       ttwu_remote
        rq = __task_rq_lock(p)
        ttwu_do_wakeup(rq, p, wake_flags);
         task_woken_rt

 8. push_rt_task() picks task A, which was enqueued before.

      task_woken_rt
       push_rt_tasks(rq)
        next_task = pick_next_pushable_task(rq)

 9. In find_lock_lowest_rq(), if double_lock_balance() returns 0,
    lowest_rq can be the remote rq.
    (But if preemption is on, double_lock_balance() always returns 1
    and this does not happen.)

      push_rt_task
       find_lock_lowest_rq
        if (double_lock_balance(rq, lowest_rq))..

10. find_lock_lowest_rq() returns the available rq. Task A is migrated
    to the remote cpu/rq.

      push_rt_task
       ...
       deactivate_task(rq, next_task, 0);
       set_task_cpu(next_task, lowest_rq->cpu);
       activate_task(lowest_rq, next_task, 0);

11. However, task A is still in IRQ context on this CPU, so task A is
    scheduled by two CPUs at the same time until it returns from the
    IRQ. Task A's stack is corrupted.

To fix it, don't migrate an RT task if it's still running.

Signed-off-by: Chanho Min
Signed-off-by: Peter Zijlstra
Acked-by: Steven Rostedt
Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
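The rule stated at the end of the message above ("don't migrate an RT task if it's still running") can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_task, its on_cpu flag and toy_push_task() are hypothetical names, not the kernel's data structures or the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task; not the kernel's task_struct. */
struct toy_task {
    int cpu;        /* CPU whose runqueue currently owns the task */
    bool on_cpu;    /* true while the task is still executing on that CPU */
};

/*
 * Try to push `next` to `target_cpu`. Returns true if the task was migrated.
 * The guard models the idea of the fix: a task that is still running
 * (for example, it is mid context switch and an interrupt handler is using
 * its stack) must not be handed to another CPU.
 */
static bool toy_push_task(struct toy_task *next, int target_cpu)
{
    if (next == NULL)
        return false;
    if (next->on_cpu)               /* still running here: do not migrate */
        return false;
    if (next->cpu == target_cpu)    /* already there: nothing to do */
        return false;
    next->cpu = target_cpu;         /* models deactivate/set_task_cpu/activate */
    return true;
}
```

In this model a task with on_cpu set stays on its CPU, and the push can be retried later once the flag has been cleared.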

sched, rt: Provide means of disabling cross-cpu bandwidth sharing

2011-11-14T11:50:40Z

Normally the RT bandwidth scheme will share bandwidth across the entire root_domain. However, sometimes it's convenient to disable this sharing for debug purposes. Provide a simple feature switch to this end.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar
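As a rough illustration of what such a switch does, the sketch below gates a toy "borrow runtime from another CPU" operation on a boolean flag. All names here (toy_feat_rt_runtime_share, struct toy_rt_rq, toy_borrow_runtime) are assumptions made for this example and are not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical feature switch; the name is an assumption for this sketch. */
static bool toy_feat_rt_runtime_share = true;

/* Hypothetical per-CPU RT runqueue with a runtime budget in arbitrary units. */
struct toy_rt_rq {
    long runtime;
};

/*
 * Move up to `want` units of budget from `src` to `dst`.
 * Returns the amount moved, which is 0 when sharing is switched off.
 */
static long toy_borrow_runtime(struct toy_rt_rq *dst, struct toy_rt_rq *src,
                               long want)
{
    long moved;

    if (!toy_feat_rt_runtime_share)     /* sharing disabled for debugging */
        return 0;
    if (want <= 0 || src->runtime <= 0)
        return 0;

    moved = want < src->runtime ? want : src->runtime;
    src->runtime -= moved;
    dst->runtime += moved;
    return moved;
}
```

With the flag cleared, every CPU keeps exactly its own budget, which makes throttling behaviour easier to reason about while debugging.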

sched: Warn on rt throttling

2011-10-06T10:47:04Z

The default rt-throttling is a source of never-ending questions. Warn once when we go into throttling so folks have that info in dmesg.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1110051331480.18778@ionos
Signed-off-by: Ingo Molnar
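The "warn once" behaviour can be sketched in user space with a static flag. The function name, the counter and the message text below are illustrative assumptions; in the kernel a once-only print helper would normally be used instead of a hand-rolled flag.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings; exists only so this sketch can be checked. */
static int toy_warn_count;

/*
 * Returns true when the consumed RT time exceeds the allowed runtime,
 * i.e. when the runqueue would be throttled. The warning is printed only
 * the first time this happens.
 */
static bool toy_rt_runtime_exceeded(unsigned long long rt_time,
                                    unsigned long long runtime)
{
    static bool warned;

    if (rt_time <= runtime)
        return false;

    if (!warned) {
        warned = true;
        toy_warn_count++;
        fprintf(stderr, "sched: RT throttling activated (illustrative message)\n");
    }
    return true;
}
```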

sched: Unify the ->cpus_allowed mask copy

2011-10-06T10:47:00Z

Currently every sched_class::set_cpus_allowed() implementation has to copy the cpumask into task_struct::cpus_allowed. This is pointless; put this copy in the generic code.

Signed-off-by: Peter Zijlstra
Acked-by: Thomas Gleixner
Link: http://lkml.kernel.org/n/tip-jhl5s9fckd9ptw1fzbqqlrd3@git.kernel.org
Signed-off-by: Ingo Molnar
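A minimal model of this refactoring, assuming a bitmask as the CPU mask and hypothetical toy_* names throughout: the class hook only does its own bookkeeping, and the generic helper performs the copy in a single place.

```c
#include <stddef.h>

typedef unsigned long toy_cpumask;      /* one bit per CPU (illustrative) */

struct toy_task;

struct toy_sched_class {
    /* Optional hook for class-specific bookkeeping only. */
    void (*set_cpus_allowed)(struct toy_task *p, toy_cpumask new_mask);
};

struct toy_task {
    const struct toy_sched_class *sched_class;
    toy_cpumask cpus_allowed;
    int hook_calls;                     /* instrumentation for this sketch */
};

/* Example class hook: does its own accounting, never copies the mask. */
static void toy_set_cpus_allowed_rt(struct toy_task *p, toy_cpumask new_mask)
{
    (void)new_mask;
    p->hook_calls++;
}

static const struct toy_sched_class toy_rt_class = {
    .set_cpus_allowed = toy_set_cpus_allowed_rt,
};

/* A class without a hook still gets its mask updated by the generic path. */
static const struct toy_sched_class toy_plain_class = {
    .set_cpus_allowed = NULL,
};

/* Generic path: call the hook if present, then copy the mask exactly once. */
static void toy_do_set_cpus_allowed(struct toy_task *p, toy_cpumask new_mask)
{
    if (p->sched_class && p->sched_class->set_cpus_allowed)
        p->sched_class->set_cpus_allowed(p, new_mask);
    p->cpus_allowed = new_mask;
}
```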

sched: Wrap scheduler p->cpus_allowed access

2011-10-06T10:46:56Z

This task is preparatory for the migrate_disable() implementation, but stands on its own and provides a cleanup.

It currently only converts those sites required for task-placement. Kosaki-san once mentioned replacing cpus_allowed with a proper cpumask_t instead of the NR_CPUS sized array it currently is; that would also require something like this.

Signed-off-by: Peter Zijlstra
Acked-by: Thomas Gleixner
Cc: KOSAKI Motohiro
Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
Signed-off-by: Ingo Molnar
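The idea of wrapping the field access can be sketched as follows. The accessor and type names are hypothetical, and the mask is modelled as a single unsigned long rather than a real cpumask; the point is only that callers go through one function, so a later change to how the mask is stored or overridden touches one place.

```c
#include <limits.h>
#include <stdbool.h>

typedef unsigned long toy_cpumask;      /* one bit per CPU (illustrative) */

struct toy_task {
    toy_cpumask cpus_allowed;
};

/* Single accessor: callers never read p->cpus_allowed directly. */
static const toy_cpumask *toy_tsk_cpus_allowed(const struct toy_task *p)
{
    return &p->cpus_allowed;
}

/* Example placement check written against the accessor. */
static bool toy_cpu_allowed(const struct toy_task *p, unsigned int cpu)
{
    if (cpu >= sizeof(toy_cpumask) * CHAR_BIT)  /* outside the toy mask */
        return false;
    return ((*toy_tsk_cpus_allowed(p)) >> cpu) & 1UL;
}
```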

Merge branch 'linus' into sched/core

2011-10-04T09:09:08Z

Merge reason: pick up the latest fixes.

Signed-off-by: Ingo Molnar

sched/rt: Migrate equal priority tasks to available CPUs

2011-09-18T11:48:56Z

Commit 43fa5460fe60dea5c610490a1d263415419c60f6 ("sched: Try not to migrate higher priority RT tasks") also introduced a change in behavior which keeps RT tasks on the same CPU if there is an equal priority RT task currently running, even if there are empty CPUs available.

This can cause unnecessary wakeup latencies, and can prevent the scheduler from balancing all RT tasks across available CPUs.

This change causes an RT task to search for a new CPU if an equal priority RT task is already running on wakeup. Lower priority tasks will still have to wait on higher priority tasks, but the system should still balance out, because there is always the possibility that, if there are both high and low priority RT tasks on a given CPU, the high priority task could wake up while the low priority task is running and force it to search for a better runqueue.

Signed-off-by: Shawn Bohrer
Acked-by: Steven Rostedt
Tested-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org # 37+
Link: http://lkml.kernel.org/r/1315837684-18733-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Ingo Molnar
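The behavioural change described above comes down to treating "equal priority" the same as "higher priority" when deciding whether a woken RT task should look for another CPU. The predicate below is a simplified sketch with a hypothetical name; it ignores other conditions the real wakeup path considers and assumes, as the kernel does for RT prio values, that a lower number means a higher priority.

```c
#include <stdbool.h>

/*
 * Decide whether a task being woken should search for a different CPU.
 * curr_prio:  priority of the RT task currently running on this CPU
 * woken_prio: priority of the RT task being woken
 * Lower numeric value means higher priority.
 */
static bool toy_should_search_other_cpu(int curr_prio, int woken_prio,
                                        int woken_nr_cpus_allowed)
{
    if (woken_nr_cpus_allowed < 2)      /* pinned task: nowhere else to go */
        return false;

    /*
     * `<=` rather than `<`: a running task of equal priority also sends
     * the woken task looking for another CPU.
     */
    return curr_prio <= woken_prio;
}
```

When the woken task has the higher priority the predicate is false, since that task can simply preempt the current one locally.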

sched: Implement hierarchical task accounting for SCHED_OTHER

2011-08-14T10:01:13Z

Introduce hierarchical task accounting for the group scheduling case in CFS, as well as promoting the responsibility for maintaining rq->nr_running to the scheduling classes.

The primary motivation for this is that with scheduling classes supporting bandwidth throttling it is possible for entities participating in throttled sub-trees to not have root-visible changes in rq->nr_running across activate and de-activate operations. This in turn leads to incorrect idle and weight-per-task load balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task() under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism. This patch does not address that.

Signed-off-by: Paul Turner
Reviewed-by: Hidetoshi Seto
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar
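One way to picture hierarchical accounting is a chain of groups, each with its own count of runnable tasks beneath it, where an enqueue stops propagating at a throttled group and therefore never reaches the runqueue-wide counter. The sketch below is a toy model under that assumption; the struct layout and function name are hypothetical and do not reproduce the CFS code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task group; parent == NULL marks the top-level group. */
struct toy_group {
    struct toy_group *parent;
    unsigned int h_nr_running;  /* runnable tasks in this group's sub-tree */
    bool throttled;
};

/* Hypothetical runqueue with the root-visible task count. */
struct toy_rq {
    unsigned int nr_running;
};

/*
 * Account one newly runnable task that belongs to group `g`.
 * Every group on the path to the root counts it, but the walk stops at a
 * throttled group, so rq->nr_running only changes when the task is
 * actually visible at the root.
 */
static void toy_enqueue(struct toy_rq *rq, struct toy_group *g)
{
    for (; g != NULL; g = g->parent) {
        g->h_nr_running++;
        if (g->throttled)
            return;
    }
    rq->nr_running++;
}
```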

sched: Use pushable_tasks to determine next highest prio

2011-08-14T10:00:55Z

Hillf Danton proposed a patch (see link) that cleaned up the sched_rt code that calculates the priority of the next highest priority task to be used in finding run queues to pull from.

His patch removed the calculating of the next prio to just use the current prio when determining if we should examine a run queue to pull from. The problem with his patch was that it caused more false checks, because we check a run queue for pushable tasks if the current priority of that run queue is higher in priority than the task about to run on our run queue. But after grabbing the locks and doing the real check, we may find that there is no higher prio task to pull. Thus the locks were taken with nothing to do.

I added some trace_printks() to record when and how many times the run queue locks were taken to check for pullable tasks, compared to how many times we pulled a task.

With the current method, it was:

  3806 locks taken vs 2812 pulled tasks

With Hillf's patch:

  6728 locks taken vs 2804 pulled tasks

The number of times locks were taken to pull a task almost doubled, with no improvement in the success rate.

But his patch did get me thinking. When we look at the priority of the highest task to consider taking the locks to do a pull, a failure to pull can be one of the following (in order of most likely):

 o RT task was pushed off already between the check and taking the lock
 o Waiting RT task can not be migrated
 o RT task's CPU affinity does not include the target run queue's CPU
 o RT task's priority changed between the check and taking the lock

And with Hillf's patch, the thing that caused most of the failures is that the RT task to pull was not at the right priority to pull (not greater than the current RT task priority on the target run queue).

Most of the above cases we can't help. But the current method does not check if the next highest prio RT task can be migrated or not, and if it can not, we still grab the locks to do the test (we don't find out about this fact until after we have the locks). I thought about this case, and realized that the pushable task plist that is maintained only holds RT tasks that can migrate. If we move the calculating of the next highest prio task from the inc/dec_rt_task() functions into the queuing of the pushable tasks, then we only measure the priorities of those tasks that we push, and we get this basically for free.

Not only does this patch make the code a little more efficient, it cleans it up and makes it a little simpler.

Thanks to Hillf Danton for inspiring me on this patch.

Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Hillf Danton
Cc: Gregory Haskins
Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
Signed-off-by: Ingo Molnar
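The approach described above, tracking the "next highest prio" only over tasks that can actually be migrated, can be modelled with a small array standing in for the pushable-task plist. This is a sketch only: the names, the fixed-size array and the sentinel value are assumptions for illustration, and lower numbers mean higher priority.

```c
#include <stddef.h>

#define TOY_MAX_RT_PRIO  100    /* sentinel meaning "no pushable task" */
#define TOY_MAX_PUSHABLE 16

/* Hypothetical RT runqueue: only migratable (pushable) tasks are listed. */
struct toy_rt_rq {
    int pushable[TOY_MAX_PUSHABLE];     /* prios of queued pushable tasks */
    size_t nr_pushable;
    int highest_prio_next;              /* best prio among pushable tasks */
};

static void toy_rt_rq_init(struct toy_rt_rq *rq)
{
    rq->nr_pushable = 0;
    rq->highest_prio_next = TOY_MAX_RT_PRIO;
}

/* Queue a pushable task and update the cached prio. Returns 0 on success. */
static int toy_enqueue_pushable(struct toy_rt_rq *rq, int prio)
{
    if (rq->nr_pushable == TOY_MAX_PUSHABLE)
        return -1;
    rq->pushable[rq->nr_pushable++] = prio;
    if (prio < rq->highest_prio_next)
        rq->highest_prio_next = prio;
    return 0;
}

/* Remove one pushable task of the given prio and recompute the cache. */
static int toy_dequeue_pushable(struct toy_rt_rq *rq, int prio)
{
    size_t i;
    int best = TOY_MAX_RT_PRIO;

    for (i = 0; i < rq->nr_pushable; i++)
        if (rq->pushable[i] == prio)
            break;
    if (i == rq->nr_pushable)
        return -1;                      /* no such task queued */

    rq->pushable[i] = rq->pushable[--rq->nr_pushable];
    for (i = 0; i < rq->nr_pushable; i++)
        if (rq->pushable[i] < best)
            best = rq->pushable[i];
    rq->highest_prio_next = best;
    return 0;
}

/*
 * Cheap, lock-free pre-check before a pull: taking the source runqueue's
 * lock is only worthwhile if it has a migratable task that would beat the
 * prio of what this CPU is about to run.
 */
static int toy_worth_pulling(const struct toy_rt_rq *src, int this_curr_prio)
{
    return src->highest_prio_next < this_curr_prio;
}
```

Because non-migratable tasks never enter the list, they can no longer trigger a lock acquisition that is bound to fail.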

sched: Balance RT tasks when forked as well

2011-08-14T10:00:52Z

When a new task is woken, the code to balance the RT task is currently skipped in the select_task_rq() call. But it will be pushed if the rq is currently overloaded with RT tasks anyway. The issue is that we already queued the task, and if it does get pushed, it will have to be dequeued and requeued on the new run queue. The advantage with pushing it first is that we avoid this requeuing as we are pushing it off before the task is ever queued.

See commit 318e0893ce3f524 ("sched: pre-route RT tasks on wakeup") for more details.

The return of select_task_rq() when it is not a wake up has also been changed to return task_cpu() instead of smp_processor_id(). This is more of a sanity measure, because currently the only other user of select_task_rq() besides wake ups is an exec, where task_cpu() should also be the same as smp_processor_id(). But if it is used for other purposes, let's keep the task on the same CPU. Why would we want to migrate it to the current CPU?

Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Hillf Danton
Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
Signed-off-by: Ingo Molnar