<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/sched/debug.c, branch v6.7.9</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.7.9</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.7.9'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2023-09-29T08:20:21Z</updated>
<entry>
<title>sched/deadline: Make dl_rq-&gt;pushable_dl_tasks update drive dl_rq-&gt;overloaded</title>
<updated>2023-09-29T08:20:21Z</updated>
<author>
<name>Valentin Schneider</name>
<email>vschneid@redhat.com</email>
</author>
<published>2023-09-28T15:02:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5fe7765997b139e2d922b58359dea181efe618f9'/>
<id>urn:sha1:5fe7765997b139e2d922b58359dea181efe618f9</id>
<content type='text'>
dl_rq-&gt;dl_nr_migratory is increased whenever a DL entity is enqueued and it has
nr_cpus_allowed &gt; 1. Unlike the pushable_dl_tasks tree, dl_rq-&gt;dl_nr_migratory
includes a dl_rq's current task. This means a dl_rq can have a migratable
current, N non-migratable queued tasks, and be flagged as overloaded and have
its CPU set in the dlo_mask, despite having an empty pushable_dl_tasks tree.

Make a dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task();
in other words, only flag DL RQs as overloaded if they have at least one
runnable-but-not-current migratable task.

 o push_dl_task() is unaffected, as it is a no-op if there are no pushable
   tasks.

 o pull_dl_task() now no longer scans runqueues whose sole migratable task is
   their current one, which it can't do anything about anyway.
   It may also now pull tasks to a DL RQ with dl_nr_running &gt; 1 if only its
   current task is migratable.

Since dl_rq-&gt;dl_nr_migratory becomes unused, remove it.
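
As a plain-C toy model (names illustrative, not the kernel's), the overload
flag now flips only on the pushable tree's empty/non-empty transitions:

```c
/* Toy model (not kernel code): dl_rq overload driven purely by the
 * pushable_dl_tasks tree. The current task never enters the tree, so a
 * lone migratable current task no longer marks the runqueue overloaded.
 */
struct dl_rq_sim {
	int nr_pushable;   /* size of the pushable_dl_tasks tree */
	int overloaded;    /* mirrors this CPU's bit in dlo_mask */
};

static void sim_enqueue_pushable_dl_task(struct dl_rq_sim *rq)
{
	rq->nr_pushable++;
	if (rq->nr_pushable == 1)   /* empty -> non-empty edge */
		rq->overloaded = 1;
}

static void sim_dequeue_pushable_dl_task(struct dl_rq_sim *rq)
{
	rq->nr_pushable--;
	if (rq->nr_pushable == 0)   /* non-empty -> empty edge */
		rq->overloaded = 0;
}
```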

RT had the exact same mechanism (rt_rq-&gt;rt_nr_migratory) which was dropped
in favour of relying on rt_rq-&gt;pushable_tasks, see:

  612f769edd06 ("sched/rt: Make rt_rq-&gt;pushable_tasks updates drive rto_mask")

Signed-off-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Link: https://lore.kernel.org/r/20230928150251.463109-1-vschneid@redhat.com
</content>
</entry>
<entry>
<title>sched/rt: Make rt_rq-&gt;pushable_tasks updates drive rto_mask</title>
<updated>2023-09-25T08:25:29Z</updated>
<author>
<name>Valentin Schneider</name>
<email>vschneid@redhat.com</email>
</author>
<published>2023-08-11T11:20:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=612f769edd06a6e42f7cd72425488e68ddaeef0a'/>
<id>urn:sha1:612f769edd06a6e42f7cd72425488e68ddaeef0a</id>
<content type='text'>
Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
that has an empty pushable_tasks list, which means nothing useful will be
done in the IPI other than queue the work for the next CPU on the rto_mask.

rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
but the conditions for that irq_work to be queued (and for a CPU to be
added to the rto_mask) rely on rt_rq-&gt;nr_migratory instead.

nr_migratory is increased whenever an RT task entity is enqueued and it has
nr_cpus_allowed &gt; 1. Unlike the pushable_tasks list, nr_migratory includes an
rt_rq's current task. This means an rt_rq can have a migratable current, N
non-migratable queued tasks, and be flagged as overloaded / have its CPU
set in the rto_mask, despite having an empty pushable_tasks list.

Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
Since rt_rq-&gt;{rt_nr_migratory,rt_nr_total} become unused, remove them.

Note that the case where the current task is pushed away to make way for a
migration-disabled task remains unchanged: the migration-disabled task has
to be in the pushable_tasks list in the first place, which means it has
nr_cpus_allowed &gt; 1.
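
The old and new conditions can be contrasted in a plain-C sketch (names
illustrative, not the kernel's):

```c
/* Toy contrast (not kernel code) between the old nr_migratory-based
 * overload condition and the new pushable_tasks-based one. */

/* old: nr_migratory counted the current task, so a migratable current
 * plus any queued (even non-migratable) task flagged the CPU overloaded */
static int rt_overloaded_old(int nr_migratory, int nr_total)
{
	if (nr_migratory == 0)
		return 0;
	return nr_total > 1;
}

/* new: keyed off the pushable_tasks list, which never holds the current
 * task -- analogous to the kernel's has_pushable_tasks() check */
static int rt_overloaded_new(int nr_pushable)
{
	return nr_pushable != 0;
}
```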

Reported-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com
</content>
</entry>
<entry>
<title>sched/debug: Update stale reference to sched_debug.c</title>
<updated>2023-09-21T06:30:19Z</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2023-09-20T13:00:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=622f0a1d544fa88dda10d27727835e825c84ae0f'/>
<id>urn:sha1:622f0a1d544fa88dda10d27727835e825c84ae0f</id>
<content type='text'>
Since commit:

   8a99b6833c884 ("sched: Move SCHED_DEBUG sysctl to debugfs")

the sched_debug interface has moved from /proc to debugfs. The comment
still mentions the outdated /proc interfaces.

Update the comment to point to the current location of the interface.

Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230920130025.412071-3-bigeasy@linutronix.de
</content>
</entry>
<entry>
<title>sched/debug: Remove the /proc/sys/kernel/sched_child_runs_first sysctl</title>
<updated>2023-09-21T06:30:18Z</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2023-09-20T13:00:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=17e7170645e34c519443ba63895264bbdee7beee'/>
<id>urn:sha1:17e7170645e34c519443ba63895264bbdee7beee</id>
<content type='text'>
The /proc/sys/kernel/sched_child_runs_first knob is no longer connected since:

   5e963f2bd4654 ("sched/fair: Commit to EEVDF")

Remove it.

Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230920130025.412071-2-bigeasy@linutronix.de
</content>
</entry>
<entry>
<title>sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice</title>
<updated>2023-07-19T07:43:59Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507'/>
<id>urn:sha1:e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507</id>
<content type='text'>
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
</content>
</entry>
<entry>
<title>sched/fair: Commit to EEVDF</title>
<updated>2023-07-19T07:43:58Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5e963f2bd4654a202a8a05aa3a86cb0300b10e6c'/>
<id>urn:sha1:5e963f2bd4654a202a8a05aa3a86cb0300b10e6c</id>
<content type='text'>
EEVDF is a better-defined scheduling policy; as a result it has fewer
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
</content>
</entry>
<entry>
<title>sched/fair: Implement an EEVDF-like scheduling policy</title>
<updated>2023-07-19T07:43:58Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=147f3efaa24182a21706bca15eab2f3f4630b5fe'/>
<id>urn:sha1:147f3efaa24182a21706bca15eab2f3f4630b5fe</id>
<content type='text'>
CFS is currently a WFQ-based scheduler with only a single knob: the
weight. The addition of a second, latency-oriented parameter makes
something WF2Q- or EEVDF-based a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope: which is mapped to nice just as before

 - request size, or slice length: which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.

Tick-driven preemption is driven by request/slice completion, while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
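
The deadline rule above can be sketched in plain C (illustrative integer
arithmetic only; the kernel computes this in scaled fixed-point, cf.
sched_prio_to_weight[]):

```c
/* vd_i = ve_i + r_i / w_i: the virtual deadline is the eligible time
 * plus the request size scaled down by the weight. A smaller slice r
 * therefore yields an earlier virtual deadline. */
static long long virtual_deadline(long long ve, long long r, long long w)
{
	return ve + r / w;
}

/* wakeup preemption prefers the earlier virtual deadline */
static int preempts(long long vd_new, long long vd_curr)
{
	return vd_curr > vd_new;
}
```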

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
</content>
</entry>
<entry>
<title>sched/fair: Add cfs_rq::avg_vruntime</title>
<updated>2023-07-19T07:43:58Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=af4cf40470c22efa3987200fd19478199e08e103'/>
<id>urn:sha1:af4cf40470c22efa3987200fd19478199e08e103</id>
<content type='text'>
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.

Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically, consider adding a task with a vruntime left of center: in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
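
A direct (non-incremental) computation of that weighted average, in plain C
for illustration -- the kernel instead maintains the sum incrementally,
relative to min_vruntime, rather than recomputing it:

```c
/* Illustrative only: V = (Sum of w_i * v_i) / (Sum of w_i), the
 * weighted average of the individual virtual runtimes. */
static long long avg_vruntime_sim(const long long *v, const long long *w, int n)
{
	long long num = 0, den = 0;
	int i;

	for (i = 0; i != n; i++) {
		num += w[i] * v[i];
		den += w[i];
	}
	return num / den;
}
```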

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
</content>
</entry>
<entry>
<title>sched/debug: Dump domains' sched group flags</title>
<updated>2023-07-13T13:21:53Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-07-07T22:57:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ed74cc4995d314ea6cbf406caf978c442f451fa5'/>
<id>urn:sha1:ed74cc4995d314ea6cbf406caf978c442f451fa5</id>
<content type='text'>
There has been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain was not set and propagated properly when a degenerate
domain was removed.

Add a dump of a CPU's domain sched group flags to make debugging easier
in the future.

Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.

Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
</content>
</entry>
<entry>
<title>sched/debug: Correct printing for rq-&gt;nr_uninterruptible</title>
<updated>2023-05-08T08:58:39Z</updated>
<author>
<name>晏艳(采苓)</name>
<email>yanyan.yan@antgroup.com</email>
</author>
<published>2023-05-06T07:42:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a6fcdd8d95f7486150b3faadfea119fc3dfc3b74'/>
<id>urn:sha1:a6fcdd8d95f7486150b3faadfea119fc3dfc3b74</id>
<content type='text'>
Commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed the type for rq-&gt;nr_uninterruptible from "unsigned
long" to "unsigned int", but left a wrong cast in the print to
/sys/kernel/debug/sched/debug and to the console.

For example, if nr_uninterruptible's value is 0xfffffff7 with type
"unsigned int", then (long)nr_uninterruptible shows 4294967287 while
(int)nr_uninterruptible prints -9. So using an int cast fixes the
wrong printing.
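
The two casts can be demonstrated in a small C sketch (assuming 32-bit int
and two's complement, which is what the kernel targets):

```c
/* The bit pattern 0xfffffff7: widened from unsigned int it reads as
 * 4294967287, but reinterpreted as a 32-bit signed int it is -9, the
 * intended "negative" counter reading. */
static long long widened(unsigned int x)
{
	return (long long)x;   /* value-preserving: 4294967287 */
}

static int as_int(unsigned int x)
{
	return (int)x;         /* wraps to -9 for 0xfffffff7 */
}
```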

Signed-off-by: Yan Yan &lt;yanyan.yan@antgroup.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230506074253.44526-1-yanyan.yan@antgroup.com
</content>
</entry>
</feed>
