From 7d5265ffcd8b41da5e09066360540d6e0716e9cd Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Tue, 12 Nov 2024 10:28:26 -0500 Subject: rseq: Validate read-only fields under DEBUG_RSEQ config The rseq uapi requires cooperation between users of the rseq fields to ensure that all libraries and applications using rseq within a process do not interfere with each other. This is especially important for fields which are meant to be read-only from user-space, as documented in uapi/linux/rseq.h: - cpu_id_start, - cpu_id, - node_id, - mm_cid. Storing to those fields from a user-space library prevents any sharing of the rseq ABI with other libraries and applications, as other users are not aware that the content of those fields has been altered by a third-party library. This is unfortunately the current behavior of tcmalloc: it purposefully overlaps part of a cached value with the cpu_id_start upper bits to get notified about preemption, because the kernel clears those upper bits before returning to user-space. This behavior does not conform to the rseq uapi header ABI. This prevents tcmalloc from using rseq when rseq is registered by the GNU C library 2.35+. It requires tcmalloc users to disable glibc rseq registration with a glibc tunable, which is a sad state of affairs. Considering that tcmalloc and the GNU C library are the two first upstream projects using rseq, and that they are already incompatible due to use of this hack, adding kernel-level validation of all read-only fields content is necessary to ensure future users of rseq abide by the rseq ABI requirements. Validate that user-space does not corrupt the read-only fields and conform to the rseq uapi header ABI when the kernel is built with CONFIG_DEBUG_RSEQ=y. This is done by storing a copy of the read-only fields in the task_struct, and validating the prior values present in user-space before updating them. If the values do not match, print a warning on the console (printk_ratelimited()). This is a first step to identify misuses of the rseq ABI by printing a warning on the console. After a giving some time to userspace to correct its use of rseq, the plan is to eventually terminate offending processes with SIGSEGV. This change is expected to produce warnings for the upstream tcmalloc implementation, but tcmalloc developers mentioned they were open to adapt their implementation to kernel-level change. Signed-off-by: Mathieu Desnoyers Signed-off-by: Peter Zijlstra (Intel) Link: https://github.com/google/tcmalloc/issues/144 --- include/linux/sched.h | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'include/linux/sched.h') diff --git a/include/linux/sched.h b/include/linux/sched.h index d380bffee2ef..b5916be49f62 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1367,6 +1367,15 @@ struct task_struct { * with respect to preemption. */ unsigned long rseq_event_mask; +# ifdef CONFIG_DEBUG_RSEQ + /* + * This is a place holder to save a copy of the rseq fields for + * validation of read-only fields. The struct rseq has a + * variable-length array at the end, so it cannot be used + * directly. Reserve a size large enough for the known fields. + */ + char rseq_fields[sizeof(struct rseq)]; +# endif #endif #ifdef CONFIG_SCHED_MM_CID -- cgit v1.2.3 From b53127db1dbf7f1047cf35c10922d801dcd40324 Mon Sep 17 00:00:00 2001 From: "Vineeth Pillai (Google)" Date: Thu, 12 Dec 2024 22:22:36 -0500 Subject: sched/dlserver: Fix dlserver double enqueue dlserver can get dequeued during a dlserver pick_task due to the delayed deueue feature and this can lead to issues with dlserver logic as it still thinks that dlserver is on the runqueue. The dlserver throttling and replenish logic gets confused and can lead to double enqueue of dlserver. Double enqueue of dlserver could happend due to couple of reasons: Case 1 ------ Delayed dequeue feature[1] can cause dlserver being stopped during a pick initiated by dlserver: __pick_next_task pick_task_dl -> server_pick_task pick_task_fair pick_next_entity (if (sched_delayed)) dequeue_entities dl_server_stop server_pick_task goes ahead with update_curr_dl_se without knowing that dlserver is dequeued and this confuses the logic and may lead to unintended enqueue while the server is stopped. Case 2 ------ A race condition between a task dequeue on one cpu and same task's enqueue on this cpu by a remote cpu while the lock is released causing dlserver double enqueue. One cpu would be in the schedule() and releasing RQ-lock: current->state = TASK_INTERRUPTIBLE(); schedule(); deactivate_task() dl_stop_server(); pick_next_task() pick_next_task_fair() sched_balance_newidle() rq_unlock(this_rq) at which point another CPU can take our RQ-lock and do: try_to_wake_up() ttwu_queue() rq_lock() ... activate_task() dl_server_start() --> first enqueue wakeup_preempt() := check_preempt_wakeup_fair() update_curr() update_curr_task() if (current->dl_server) dl_server_update() enqueue_dl_entity() --> second enqueue This bug was not apparent as the enqueue in dl_server_start doesn't usually happen because of the defer logic. But as a side effect of the first case(dequeue during dlserver pick), dl_throttled and dl_yield will be set and this causes the time accounting of dlserver to messup and then leading to a enqueue in dl_server_start. Have an explicit flag representing the status of dlserver to avoid the confusion. This is set in dl_server_start and reset in dlserver_stop. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Suggested-by: Peter Zijlstra Signed-off-by: "Vineeth Pillai (Google)" Signed-off-by: Peter Zijlstra (Intel) Tested-by: Marcel Ziswiler # ROCK 5B Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org --- include/linux/sched.h | 7 +++++++ kernel/sched/deadline.c | 8 ++++++-- kernel/sched/sched.h | 5 +++++ 3 files changed, 18 insertions(+), 2 deletions(-) (limited to 'include/linux/sched.h') diff --git a/include/linux/sched.h b/include/linux/sched.h index d380bffee2ef..66b311fbd5d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -656,6 +656,12 @@ struct sched_dl_entity { * @dl_defer_armed tells if the deferrable server is waiting * for the replenishment timer to activate it. * + * @dl_server_active tells if the dlserver is active(started). + * dlserver is started on first cfs enqueue on an idle runqueue + * and is stopped when a dequeue results in 0 cfs tasks on the + * runqueue. In other words, dlserver is active only when cpu's + * runqueue has atleast one cfs task. + * * @dl_defer_running tells if the deferrable server is actually * running, skipping the defer phase. */ @@ -664,6 +670,7 @@ struct sched_dl_entity { unsigned int dl_non_contending : 1; unsigned int dl_overrun : 1; unsigned int dl_server : 1; + unsigned int dl_server_active : 1; unsigned int dl_defer : 1; unsigned int dl_defer_armed : 1; unsigned int dl_defer_running : 1; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index db47f33cb7d2..d94f2ed6d1f4 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1647,6 +1647,7 @@ void dl_server_start(struct sched_dl_entity *dl_se) if (!dl_se->dl_runtime) return; + dl_se->dl_server_active = 1; enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP); if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl)) resched_curr(dl_se->rq); @@ -1661,6 +1662,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se) hrtimer_try_to_cancel(&dl_se->dl_timer); dl_se->dl_defer_armed = 0; dl_se->dl_throttled = 0; + dl_se->dl_server_active = 0; } void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, @@ -2421,8 +2423,10 @@ again: if (dl_server(dl_se)) { p = dl_se->server_pick_task(dl_se); if (!p) { - dl_se->dl_yielded = 1; - update_curr_dl_se(rq, dl_se, 0); + if (dl_server_active(dl_se)) { + dl_se->dl_yielded = 1; + update_curr_dl_se(rq, dl_se, 0); + } goto again; } rq->dl_server = dl_se; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1e494af2cd23..c5d67a43fe52 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -398,6 +398,11 @@ extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq extern int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init); +static inline bool dl_server_active(struct sched_dl_entity *dl_se) +{ + return dl_se->dl_server_active; +} + #ifdef CONFIG_CGROUP_SCHED extern struct list_head task_groups; -- cgit v1.2.3 From 3a3f61ce5e0b4bcf730acc09c1af91012d241f85 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Fri, 29 Nov 2024 20:06:55 -0800 Subject: exec: Make sure task->comm is always NUL-terminated Using strscpy() meant that the final character in task->comm may be non-NUL for a moment before the "string too long" truncation happens. Instead of adding a new use of the ambiguous strncpy(), we'd want to use memtostr_pad() which enforces being able to check at compile time that sizes are sensible, but this requires being able to see string buffer lengths. Instead of trying to inline __set_task_comm() (which needs to call trace and perf functions), just open-code it. But to make sure we're always safe, add compile-time checking like we already do for get_task_comm(). Suggested-by: Linus Torvalds Suggested-by: "Eric W. Biederman" Signed-off-by: Kees Cook --- fs/exec.c | 12 ++++++------ include/linux/sched.h | 9 ++++----- io_uring/io-wq.c | 2 +- io_uring/sqpoll.c | 2 +- kernel/kthread.c | 3 ++- 5 files changed, 14 insertions(+), 14 deletions(-) (limited to 'include/linux/sched.h') diff --git a/fs/exec.c b/fs/exec.c index e0435b31a811..5f16500ac325 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1200,16 +1200,16 @@ char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk) EXPORT_SYMBOL_GPL(__get_task_comm); /* - * These functions flushes out all traces of the currently running executable - * so that a new one can be started + * This is unlocked -- the string will always be NUL-terminated, but + * may show overlapping contents if racing concurrent reads. */ - void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) { - task_lock(tsk); + size_t len = min(strlen(buf), sizeof(tsk->comm) - 1); + trace_task_rename(tsk, buf); - strscpy_pad(tsk->comm, buf, sizeof(tsk->comm)); - task_unlock(tsk); + memcpy(tsk->comm, buf, len); + memset(&tsk->comm[len], 0, sizeof(tsk->comm) - len); perf_event_comm(tsk, exec); } diff --git a/include/linux/sched.h b/include/linux/sched.h index e6ee4258169a..ac9f429ddc17 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1932,11 +1932,10 @@ static inline void kick_process(struct task_struct *tsk) { } #endif extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec); - -static inline void set_task_comm(struct task_struct *tsk, const char *from) -{ - __set_task_comm(tsk, from, false); -} +#define set_task_comm(tsk, from) ({ \ + BUILD_BUG_ON(sizeof(from) != TASK_COMM_LEN); \ + __set_task_comm(tsk, from, false); \ +}) extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk); #define get_task_comm(buf, tsk) ({ \ diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c index a38f36b68060..5d0928f37471 100644 --- a/io_uring/io-wq.c +++ b/io_uring/io-wq.c @@ -634,7 +634,7 @@ static int io_wq_worker(void *data) struct io_wq_acct *acct = io_wq_get_acct(worker); struct io_wq *wq = worker->wq; bool exit_mask = false, last_timeout = false; - char buf[TASK_COMM_LEN]; + char buf[TASK_COMM_LEN] = {}; set_mask_bits(&worker->flags, 0, BIT(IO_WORKER_F_UP) | BIT(IO_WORKER_F_RUNNING)); diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index a26593979887..90011f06c7fb 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -271,7 +271,7 @@ static int io_sq_thread(void *data) struct io_ring_ctx *ctx; struct rusage start; unsigned long timeout = 0; - char buf[TASK_COMM_LEN]; + char buf[TASK_COMM_LEN] = {}; DEFINE_WAIT(wait); /* offload context creation failed, just exit */ diff --git a/kernel/kthread.c b/kernel/kthread.c index db4ceb0f503c..162d55811744 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -736,10 +736,11 @@ EXPORT_SYMBOL(kthread_stop_put); int kthreadd(void *unused) { + static const char comm[TASK_COMM_LEN] = "kthreadd"; struct task_struct *tsk = current; /* Setup a clean context for our children to inherit. */ - set_task_comm(tsk, "kthreadd"); + set_task_comm(tsk, comm); ignore_signals(tsk); set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_TYPE_KTHREAD)); set_mems_allowed(node_states[N_MEMORY]); -- cgit v1.2.3 From a430d99e349026d53e2557b7b22bd2ebd61fe12a Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 20 Dec 2024 06:32:19 +0000 Subject: sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat In /proc/schedstat, lb_hot_gained reports the number hot tasks pulled during load balance. This value is incremented in can_migrate_task() if the task is migratable and hot. After incrementing the value, load balancer can still decide not to migrate this task leading to wrong accounting. Fix this by incrementing stats when hot tasks are detached. This issue only exists in detach_tasks() where we can decide to not migrate hot task even if it is migratable. However, in detach_one_task(), we migrate it unconditionally. [Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log] Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead") Signed-off-by: Peter Zijlstra (Intel) Reported-by: "Gautham R. Shenoy" Not-yet-signed-off-by: Peter Zijlstra Signed-off-by: Swapnil Sapkal Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com --- include/linux/sched.h | 1 + kernel/sched/fair.c | 17 +++++++++++++---- 2 files changed, 14 insertions(+), 4 deletions(-) (limited to 'include/linux/sched.h') diff --git a/include/linux/sched.h b/include/linux/sched.h index b5916be49f62..8c6a2ed9f80e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -937,6 +937,7 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; + unsigned sched_task_hot:1; /* Force alignment to the next boundary: */ unsigned :0; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ae8095aa4585..8fc6648a0aa8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9396,6 +9396,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) int tsk_cache_hot; lockdep_assert_rq_held(env->src_rq); + if (p->sched_task_hot) + p->sched_task_hot = 0; /* * We do not migrate tasks that are: @@ -9472,10 +9474,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) if (tsk_cache_hot <= 0 || env->sd->nr_balance_failed > env->sd->cache_nice_tries) { - if (tsk_cache_hot == 1) { - schedstat_inc(env->sd->lb_hot_gained[env->idle]); - schedstat_inc(p->stats.nr_forced_migrations); - } + if (tsk_cache_hot == 1) + p->sched_task_hot = 1; return 1; } @@ -9490,6 +9490,12 @@ static void detach_task(struct task_struct *p, struct lb_env *env) { lockdep_assert_rq_held(env->src_rq); + if (p->sched_task_hot) { + p->sched_task_hot = 0; + schedstat_inc(env->sd->lb_hot_gained[env->idle]); + schedstat_inc(p->stats.nr_forced_migrations); + } + deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); set_task_cpu(p, env->dst_cpu); } @@ -9650,6 +9656,9 @@ static int detach_tasks(struct lb_env *env) continue; next: + if (p->sched_task_hot) + schedstat_inc(p->stats.nr_failed_migrations_hot); + list_move(&p->se.group_node, tasks); } -- cgit v1.2.3 From f718faf3940e95d5d34af9041f279f598396ab7d Mon Sep 17 00:00:00 2001 From: Chen Ridong Date: Tue, 17 Dec 2024 00:48:18 +0000 Subject: freezer, sched: Report frozen tasks as 'D' instead of 'R' MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Before commit: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") the frozen task stat was reported as 'D' in cgroup v1. However, after rewriting the core freezer logic, the frozen task stat is reported as 'R'. This is confusing, especially when a task with stat of 'S' is frozen. This bug can be reproduced with these steps: $ cd /sys/fs/cgroup/freezer/ $ mkdir test $ sleep 1000 & [1] 739 // task whose stat is 'S' $ echo 739 > test/cgroup.procs $ echo FROZEN > test/freezer.state $ ps -aux | grep 739 root 739 0.1 0.0 8376 1812 pts/0 R 10:56 0:00 sleep 1000 As shown above, a task whose stat is 'S' was changed to 'R' when it was frozen. To solve this regression, simply maintain the same reported state as before the rewrite. [ mingo: Enhanced the changelog and comments ] Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Signed-off-by: Chen Ridong Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Tejun Heo Acked-by: Michal Koutný Link: https://lore.kernel.org/r/20241217004818.3200515-1-chenridong@huaweicloud.com --- include/linux/sched.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include/linux/sched.h') diff --git a/include/linux/sched.h b/include/linux/sched.h index 66b311fbd5d6..64934e0830af 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1637,8 +1637,9 @@ static inline unsigned int __task_state_index(unsigned int tsk_state, * We're lying here, but rather than expose a completely new task state * to userspace, we can make this appear as if the task has gone through * a regular rt_mutex_lock() call. + * Report frozen tasks as uninterruptible. */ - if (tsk_state & TASK_RTLOCK_WAIT) + if ((tsk_state & TASK_RTLOCK_WAIT) || (tsk_state & TASK_FROZEN)) state = TASK_UNINTERRUPTIBLE; return fls(state); -- cgit v1.2.3 From 658eb5ab916ddc92f294dbce8e3d449470be9f86 Mon Sep 17 00:00:00 2001 From: Wang Yaxin Date: Tue, 3 Dec 2024 16:48:48 +0800 Subject: delayacct: add delay max to record delay peak Introduce the use cases of delay max, which can help quickly detect potential abnormal delays in the system and record the types and specific details of delay spikes. Problem ======== Delay accounting can track the average delay of processes to show system workload. However, when a process experiences a significant delay, maybe a delay spike, which adversely affects performance, getdelays can only display the average system delay over a period of time. Yet, average delay is unhelpful for diagnosing delay peak. It is not even possible to determine which type of delay has spiked, as this information might be masked by the average delay. Solution ========= the 'delay max' can display delay peak since the system's startup, which can record potential abnormal delays over time, including the type of delay and the maximum delay. This is helpful for quickly identifying crash caused by delay. Use case ========= bash# ./getdelays -d -p 244 print delayacct stats ON PID 244 CPU count real total virtual total delay total delay average delay max 68 192000000 213676651 705643 0.010ms 0.306381ms IO count delay total delay average delay max 0 0 0.000ms 0.000000ms SWAP count delay total delay average delay max 0 0 0.000ms 0.000000ms RECLAIM count delay total delay average delay max 0 0 0.000ms 0.000000ms THRASHING count delay total delay average delay max 0 0 0.000ms 0.000000ms COMPACT count delay total delay average delay max 0 0 0.000ms 0.000000ms WPCOPY count delay total delay average delay max 235 15648284 0.067ms 0.263842ms IRQ count delay total delay average delay max 0 0 0.000ms 0.000000ms [wang.yaxin@zte.com.cn: update docs and fix some spelling errors] Link: https://lkml.kernel.org/r/20241213192700771XKZ8H30OtHSeziGqRVMs0@zte.com.cn Link: https://lkml.kernel.org/r/20241203164848805CS62CQPQWG9GLdQj2_BxS@zte.com.cn Co-developed-by: Wang Yong Signed-off-by: Wang Yong Co-developed-by: xu xin Signed-off-by: xu xin Co-developed-by: Wang Yaxin Signed-off-by: Wang Yaxin Signed-off-by: Kun Jiang Cc: Balbir Singh Cc: David Hildenbrand Cc: Fan Yu Cc: Peilin He Cc: tuqiang Cc: Yang Yang Cc: ye xingchen Cc: Yunkai Zhang Signed-off-by: Andrew Morton --- Documentation/accounting/delay-accounting.rst | 42 +++++++++---------- include/linux/delayacct.h | 7 ++++ include/linux/sched.h | 3 ++ include/uapi/linux/taskstats.h | 9 ++++ kernel/delayacct.c | 37 ++++++++++++----- kernel/sched/stats.h | 5 ++- tools/accounting/getdelays.c | 59 +++++++++++++++------------ 7 files changed, 105 insertions(+), 57 deletions(-) (limited to 'include/linux/sched.h') diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst index f61c01fc376e..8a0277428ccf 100644 --- a/Documentation/accounting/delay-accounting.rst +++ b/Documentation/accounting/delay-accounting.rst @@ -100,29 +100,29 @@ Get delays, since system boot, for pid 10:: # ./getdelays -d -p 10 (output similar to next case) -Get sum of delays, since system boot, for all pids with tgid 5:: +Get sum and peak of delays, since system boot, for all pids with tgid 242:: - # ./getdelays -d -t 5 + bash-4.4# ./getdelays -d -t 242 print delayacct stats ON - TGID 5 - - - CPU count real total virtual total delay total delay average - 8 7000000 6872122 3382277 0.423ms - IO count delay total delay average - 0 0 0.000ms - SWAP count delay total delay average - 0 0 0.000ms - RECLAIM count delay total delay average - 0 0 0.000ms - THRASHING count delay total delay average - 0 0 0.000ms - COMPACT count delay total delay average - 0 0 0.000ms - WPCOPY count delay total delay average - 0 0 0.000ms - IRQ count delay total delay average - 0 0 0.000ms + TGID 242 + + + CPU count real total virtual total delay total delay average delay max + 239 296000000 307724885 1127792 0.005ms 0.238382ms + IO count delay total delay average delay max + 0 0 0.000ms 0.000000ms + SWAP count delay total delay average delay max + 0 0 0.000ms 0.000000ms + RECLAIM count delay total delay average delay max + 0 0 0.000ms 0.000000ms + THRASHING count delay total delay average delay max + 0 0 0.000ms 0.000000ms + COMPACT count delay total delay average delay max + 0 0 0.000ms 0.000000ms + WPCOPY count delay total delay average delay max + 230 19100476 0.083ms 0.383822ms + IRQ count delay total delay average delay max + 0 0 0.000ms 0.000000ms Get IO accounting for pid 1, it works only with -p:: diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h index 6639f48dac36..56fbfa2c2ac5 100644 --- a/include/linux/delayacct.h +++ b/include/linux/delayacct.h @@ -29,25 +29,32 @@ struct task_delay_info { * XXX_delay contains the accumulated delay time in nanoseconds. */ u64 blkio_start; + u64 blkio_delay_max; u64 blkio_delay; /* wait for sync block io completion */ u64 swapin_start; + u64 swapin_delay_max; u64 swapin_delay; /* wait for swapin */ u32 blkio_count; /* total count of the number of sync block */ /* io operations performed */ u32 swapin_count; /* total count of swapin */ u64 freepages_start; + u64 freepages_delay_max; u64 freepages_delay; /* wait for memory reclaim */ u64 thrashing_start; + u64 thrashing_delay_max; u64 thrashing_delay; /* wait for thrashing page */ u64 compact_start; + u64 compact_delay_max; u64 compact_delay; /* wait for memory compact */ u64 wpcopy_start; + u64 wpcopy_delay_max; u64 wpcopy_delay; /* wait for write-protect copy */ + u64 irq_delay_max; u64 irq_delay; /* wait for IRQ/SOFTIRQ */ u32 freepages_count; /* total count of memory reclaim */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 64934e0830af..a0ae3923b41d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -398,6 +398,9 @@ struct sched_info { /* Time spent waiting on a runqueue: */ unsigned long long run_delay; + /* Max time spent waiting on a runqueue: */ + unsigned long long max_run_delay; + /* Timestamps: */ /* When did we last run on a CPU? */ diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h index b50b2eb257a0..e0d1c6fc9f3b 100644 --- a/include/uapi/linux/taskstats.h +++ b/include/uapi/linux/taskstats.h @@ -72,6 +72,7 @@ struct taskstats { */ __u64 cpu_count __attribute__((aligned(8))); __u64 cpu_delay_total; + __u64 cpu_delay_max; /* Following four fields atomically updated using task->delays->lock */ @@ -80,10 +81,12 @@ struct taskstats { */ __u64 blkio_count; __u64 blkio_delay_total; + __u64 blkio_delay_max; /* Delay waiting for page fault I/O (swap in only) */ __u64 swapin_count; __u64 swapin_delay_total; + __u64 swapin_delay_max; /* cpu "wall-clock" running time * On some architectures, value will adjust for cpu time stolen @@ -166,10 +169,12 @@ struct taskstats { /* Delay waiting for memory reclaim */ __u64 freepages_count; __u64 freepages_delay_total; + __u64 freepages_delay_max; /* Delay waiting for thrashing page */ __u64 thrashing_count; __u64 thrashing_delay_total; + __u64 thrashing_delay_max; /* v10: 64-bit btime to avoid overflow */ __u64 ac_btime64; /* 64-bit begin time */ @@ -177,6 +182,7 @@ struct taskstats { /* v11: Delay waiting for memory compact */ __u64 compact_count; __u64 compact_delay_total; + __u64 compact_delay_max; /* v12 begin */ __u32 ac_tgid; /* thread group ID */ @@ -198,10 +204,13 @@ struct taskstats { /* v13: Delay waiting for write-protect copy */ __u64 wpcopy_count; __u64 wpcopy_delay_total; + __u64 wpcopy_delay_max; /* v14: Delay waiting for IRQ/SOFTIRQ */ __u64 irq_count; __u64 irq_delay_total; + __u64 irq_delay_max; + /* v15: add Delay max */ }; diff --git a/kernel/delayacct.c b/kernel/delayacct.c index dead51de8eb5..23212a0c88e4 100644 --- a/kernel/delayacct.c +++ b/kernel/delayacct.c @@ -93,9 +93,9 @@ void __delayacct_tsk_init(struct task_struct *tsk) /* * Finish delay accounting for a statistic using its timestamps (@start), - * accumalator (@total) and @count + * accumulator (@total) and @count */ -static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *count) +static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *count, u64 *max) { s64 ns = local_clock() - *start; unsigned long flags; @@ -104,6 +104,8 @@ static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *cou raw_spin_lock_irqsave(lock, flags); *total += ns; (*count)++; + if (ns > *max) + *max = ns; raw_spin_unlock_irqrestore(lock, flags); } } @@ -122,7 +124,8 @@ void __delayacct_blkio_end(struct task_struct *p) delayacct_end(&p->delays->lock, &p->delays->blkio_start, &p->delays->blkio_delay, - &p->delays->blkio_count); + &p->delays->blkio_count, + &p->delays->blkio_delay_max); } int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) @@ -153,10 +156,11 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) d->cpu_count += t1; + d->cpu_delay_max = tsk->sched_info.max_run_delay; tmp = (s64)d->cpu_delay_total + t2; d->cpu_delay_total = (tmp < (s64)d->cpu_delay_total) ? 0 : tmp; - tmp = (s64)d->cpu_run_virtual_total + t3; + d->cpu_run_virtual_total = (tmp < (s64)d->cpu_run_virtual_total) ? 0 : tmp; @@ -164,20 +168,26 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) return 0; /* zero XXX_total, non-zero XXX_count implies XXX stat overflowed */ - raw_spin_lock_irqsave(&tsk->delays->lock, flags); + d->blkio_delay_max = tsk->delays->blkio_delay_max; tmp = d->blkio_delay_total + tsk->delays->blkio_delay; d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp; + d->swapin_delay_max = tsk->delays->swapin_delay_max; tmp = d->swapin_delay_total + tsk->delays->swapin_delay; d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp; + d->freepages_delay_max = tsk->delays->freepages_delay_max; tmp = d->freepages_delay_total + tsk->delays->freepages_delay; d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp; + d->thrashing_delay_max = tsk->delays->thrashing_delay_max; tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay; d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp; + d->compact_delay_max = tsk->delays->compact_delay_max; tmp = d->compact_delay_total + tsk->delays->compact_delay; d->compact_delay_total = (tmp < d->compact_delay_total) ? 0 : tmp; + d->wpcopy_delay_max = tsk->delays->wpcopy_delay_max; tmp = d->wpcopy_delay_total + tsk->delays->wpcopy_delay; d->wpcopy_delay_total = (tmp < d->wpcopy_delay_total) ? 0 : tmp; + d->irq_delay_max = tsk->delays->irq_delay_max; tmp = d->irq_delay_total + tsk->delays->irq_delay; d->irq_delay_total = (tmp < d->irq_delay_total) ? 0 : tmp; d->blkio_count += tsk->delays->blkio_count; @@ -213,7 +223,8 @@ void __delayacct_freepages_end(void) delayacct_end(¤t->delays->lock, ¤t->delays->freepages_start, ¤t->delays->freepages_delay, - ¤t->delays->freepages_count); + ¤t->delays->freepages_count, + ¤t->delays->freepages_delay_max); } void __delayacct_thrashing_start(bool *in_thrashing) @@ -235,7 +246,8 @@ void __delayacct_thrashing_end(bool *in_thrashing) delayacct_end(¤t->delays->lock, ¤t->delays->thrashing_start, ¤t->delays->thrashing_delay, - ¤t->delays->thrashing_count); + ¤t->delays->thrashing_count, + ¤t->delays->thrashing_delay_max); } void __delayacct_swapin_start(void) @@ -248,7 +260,8 @@ void __delayacct_swapin_end(void) delayacct_end(¤t->delays->lock, ¤t->delays->swapin_start, ¤t->delays->swapin_delay, - ¤t->delays->swapin_count); + ¤t->delays->swapin_count, + ¤t->delays->swapin_delay_max); } void __delayacct_compact_start(void) @@ -261,7 +274,8 @@ void __delayacct_compact_end(void) delayacct_end(¤t->delays->lock, ¤t->delays->compact_start, ¤t->delays->compact_delay, - ¤t->delays->compact_count); + ¤t->delays->compact_count, + ¤t->delays->compact_delay_max); } void __delayacct_wpcopy_start(void) @@ -274,7 +288,8 @@ void __delayacct_wpcopy_end(void) delayacct_end(¤t->delays->lock, ¤t->delays->wpcopy_start, ¤t->delays->wpcopy_delay, - ¤t->delays->wpcopy_count); + ¤t->delays->wpcopy_count, + ¤t->delays->wpcopy_delay_max); } void __delayacct_irq(struct task_struct *task, u32 delta) @@ -284,6 +299,8 @@ void __delayacct_irq(struct task_struct *task, u32 delta) raw_spin_lock_irqsave(&task->delays->lock, flags); task->delays->irq_delay += delta; task->delays->irq_count++; + if (delta > task->delays->irq_delay_max) + task->delays->irq_delay_max = delta; raw_spin_unlock_irqrestore(&task->delays->lock, flags); } diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 8ee0add5a48a..ed72435aef51 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -244,7 +244,8 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t) delta = rq_clock(rq) - t->sched_info.last_queued; t->sched_info.last_queued = 0; t->sched_info.run_delay += delta; - + if (delta > t->sched_info.max_run_delay) + t->sched_info.max_run_delay = delta; rq_sched_info_dequeue(rq, delta); } @@ -266,6 +267,8 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t) t->sched_info.run_delay += delta; t->sched_info.last_arrival = now; t->sched_info.pcount++; + if (delta > t->sched_info.max_run_delay) + t->sched_info.max_run_delay = delta; rq_sched_info_arrive(rq, delta); } diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c index 1334214546d7..e570bcad185d 100644 --- a/tools/accounting/getdelays.c +++ b/tools/accounting/getdelays.c @@ -192,60 +192,69 @@ static int get_family_id(int sd) } #define average_ms(t, c) (t / 1000000ULL / (c ? c : 1)) +#define delay_max_ms(t) (t / 1000000ULL) static void print_delayacct(struct taskstats *t) { - printf("\n\nCPU %15s%15s%15s%15s%15s\n" - " %15llu%15llu%15llu%15llu%15.3fms\n" - "IO %15s%15s%15s\n" - " %15llu%15llu%15.3fms\n" - "SWAP %15s%15s%15s\n" - " %15llu%15llu%15.3fms\n" - "RECLAIM %12s%15s%15s\n" - " %15llu%15llu%15.3fms\n" - "THRASHING%12s%15s%15s\n" - " %15llu%15llu%15.3fms\n" - "COMPACT %12s%15s%15s\n" - " %15llu%15llu%15.3fms\n" - "WPCOPY %12s%15s%15s\n" - " %15llu%15llu%15.3fms\n" - "IRQ %15s%15s%15s\n" - " %15llu%15llu%15.3fms\n", + printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n" + " %15llu%15llu%15llu%15llu%15.3fms%13.6fms\n" + "IO %15s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n" + "SWAP %15s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n" + "RECLAIM %12s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n" + "THRASHING%12s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n" + "COMPACT %12s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n" + "WPCOPY %12s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n" + "IRQ %15s%15s%15s%15s\n" + " %15llu%15llu%15.3fms%13.6fms\n", "count", "real total", "virtual total", - "delay total", "delay average", + "delay total", "delay average", "delay max", (unsigned long long)t->cpu_count, (unsigned long long)t->cpu_run_real_total, (unsigned long long)t->cpu_run_virtual_total, (unsigned long long)t->cpu_delay_total, average_ms((double)t->cpu_delay_total, t->cpu_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->cpu_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->blkio_count, (unsigned long long)t->blkio_delay_total, average_ms((double)t->blkio_delay_total, t->blkio_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->blkio_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->swapin_count, (unsigned long long)t->swapin_delay_total, average_ms((double)t->swapin_delay_total, t->swapin_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->swapin_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->freepages_count, (unsigned long long)t->freepages_delay_total, average_ms((double)t->freepages_delay_total, t->freepages_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->freepages_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->thrashing_count, (unsigned long long)t->thrashing_delay_total, average_ms((double)t->thrashing_delay_total, t->thrashing_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->thrashing_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->compact_count, (unsigned long long)t->compact_delay_total, average_ms((double)t->compact_delay_total, t->compact_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->compact_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->wpcopy_count, (unsigned long long)t->wpcopy_delay_total, average_ms((double)t->wpcopy_delay_total, t->wpcopy_count), - "count", "delay total", "delay average", + delay_max_ms((double)t->wpcopy_delay_max), + "count", "delay total", "delay average", "delay max", (unsigned long long)t->irq_count, (unsigned long long)t->irq_delay_total, - average_ms((double)t->irq_delay_total, t->irq_count)); + average_ms((double)t->irq_delay_total, t->irq_count), + delay_max_ms((double)t->irq_delay_max)); } static void task_context_switch_counts(struct taskstats *t) -- cgit v1.2.3 From f65c64f311ee2f1ddc1eb395ed8b20e6b9d14e85 Mon Sep 17 00:00:00 2001 From: Wang Yaxin Date: Fri, 20 Dec 2024 17:31:05 +0800 Subject: delayacct: add delay min to record delay peak Delay accounting can now calculate the average delay of processes, detect the overall system load, and also record the 'delay max' to identify potential abnormal delays. However, 'delay min' can help us identify another useful delay peak. By comparing the difference between 'delay max' and 'delay min', we can understand the optimization space for latency, providing a reference for the optimization of latency performance. Use case ========= bash-4.4# ./getdelays -d -t 242 print delayacct stats ON TGID 242 CPU count real total virtual total delay total delay average delay max delay min 39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms IO count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms SWAP count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms RECLAIM count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms THRASHING count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms COMPACT count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms WPCOPY count delay total delay average delay max delay min 156 11215873 0.072ms 0.207403ms 0.033913ms IRQ count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms Link: https://lkml.kernel.org/r/20241220173105906EOdsPhzjMLYNJJBqgz1ga@zte.com.cn Co-developed-by: Wang Yong Signed-off-by: Wang Yong Co-developed-by: xu xin Signed-off-by: xu xin Signed-off-by: Wang Yaxin Co-developed-by: Kun Jiang Signed-off-by: Kun Jiang Cc: Balbir Singh Cc: David Hildenbrand Cc: Fan Yu Cc: Peilin He Cc: tuqiang Cc: ye xingchen Cc: Yunkai Zhang Signed-off-by: Andrew Morton --- Documentation/accounting/delay-accounting.rst | 32 ++++++++++---------- include/linux/delayacct.h | 7 +++++ include/linux/sched.h | 3 ++ include/uapi/linux/taskstats.h | 8 +++++ kernel/delayacct.c | 32 +++++++++++++++----- kernel/sched/stats.h | 4 +++ tools/accounting/getdelays.c | 42 ++++++++++++++++----------- 7 files changed, 88 insertions(+), 40 deletions(-) (limited to 'include/linux/sched.h') diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst index 8a0277428ccf..210c194d4a7b 100644 --- a/Documentation/accounting/delay-accounting.rst +++ b/Documentation/accounting/delay-accounting.rst @@ -107,22 +107,22 @@ Get sum and peak of delays, since system boot, for all pids with tgid 242:: TGID 242 - CPU count real total virtual total delay total delay average delay max - 239 296000000 307724885 1127792 0.005ms 0.238382ms - IO count delay total delay average delay max - 0 0 0.000ms 0.000000ms - SWAP count delay total delay average delay max - 0 0 0.000ms 0.000000ms - RECLAIM count delay total delay average delay max - 0 0 0.000ms 0.000000ms - THRASHING count delay total delay average delay max - 0 0 0.000ms 0.000000ms - COMPACT count delay total delay average delay max - 0 0 0.000ms 0.000000ms - WPCOPY count delay total delay average delay max - 230 19100476 0.083ms 0.383822ms - IRQ count delay total delay average delay max - 0 0 0.000ms 0.000000ms + CPU count real total virtual total delay total delay average delay max delay min + 39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms + IO count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + SWAP count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + RECLAIM count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + THRASHING count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + COMPACT count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms + WPCOPY count delay total delay average delay max delay min + 156 11215873 0.072ms 0.207403ms 0.033913ms + IRQ count delay total delay average delay max delay min + 0 0 0.000ms 0.000000ms 0.000000ms Get IO accounting for pid 1, it works only with -p:: diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h index 56fbfa2c2ac5..800dcc360db2 100644 --- a/include/linux/delayacct.h +++ b/include/linux/delayacct.h @@ -30,9 +30,11 @@ struct task_delay_info { */ u64 blkio_start; u64 blkio_delay_max; + u64 blkio_delay_min; u64 blkio_delay; /* wait for sync block io completion */ u64 swapin_start; u64 swapin_delay_max; + u64 swapin_delay_min; u64 swapin_delay; /* wait for swapin */ u32 blkio_count; /* total count of the number of sync block */ /* io operations performed */ @@ -40,21 +42,26 @@ struct task_delay_info { u64 freepages_start; u64 freepages_delay_max; + u64 freepages_delay_min; u64 freepages_delay; /* wait for memory reclaim */ u64 thrashing_start; u64 thrashing_delay_max; + u64 thrashing_delay_min; u64 thrashing_delay; /* wait for thrashing page */ u64 compact_start; u64 compact_delay_max; + u64 compact_delay_min; u64 compact_delay; /* wait for memory compact */ u64 wpcopy_start; u64 wpcopy_delay_max; + u64 wpcopy_delay_min; u64 wpcopy_delay; /* wait for write-protect copy */ u64 irq_delay_max; + u64 irq_delay_min; u64 irq_delay; /* wait for IRQ/SOFTIRQ */ u32 freepages_count; /* total count of memory reclaim */ diff --git a/include/linux/sched.h b/include/linux/sched.h index a0ae3923b41d..155012467b21 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -401,6 +401,9 @@ struct sched_info { /* Max time spent waiting on a runqueue: */ unsigned long long max_run_delay; + /* Min time spent waiting on a runqueue: */ + unsigned long long min_run_delay; + /* Timestamps: */ /* When did we last run on a CPU? */ diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h index e0d1c6fc9f3b..934e20ef7f79 100644 --- a/include/uapi/linux/taskstats.h +++ b/include/uapi/linux/taskstats.h @@ -73,6 +73,7 @@ struct taskstats { __u64 cpu_count __attribute__((aligned(8))); __u64 cpu_delay_total; __u64 cpu_delay_max; + __u64 cpu_delay_min; /* Following four fields atomically updated using task->delays->lock */ @@ -82,11 +83,13 @@ struct taskstats { __u64 blkio_count; __u64 blkio_delay_total; __u64 blkio_delay_max; + __u64 blkio_delay_min; /* Delay waiting for page fault I/O (swap in only) */ __u64 swapin_count; __u64 swapin_delay_total; __u64 swapin_delay_max; + __u64 swapin_delay_min; /* cpu "wall-clock" running time * On some architectures, value will adjust for cpu time stolen @@ -170,11 +173,13 @@ struct taskstats { __u64 freepages_count; __u64 freepages_delay_total; __u64 freepages_delay_max; + __u64 freepages_delay_min; /* Delay waiting for thrashing page */ __u64 thrashing_count; __u64 thrashing_delay_total; __u64 thrashing_delay_max; + __u64 thrashing_delay_min; /* v10: 64-bit btime to avoid overflow */ __u64 ac_btime64; /* 64-bit begin time */ @@ -183,6 +188,7 @@ struct taskstats { __u64 compact_count; __u64 compact_delay_total; __u64 compact_delay_max; + __u64 compact_delay_min; /* v12 begin */ __u32 ac_tgid; /* thread group ID */ @@ -205,11 +211,13 @@ struct taskstats { __u64 wpcopy_count; __u64 wpcopy_delay_total; __u64 wpcopy_delay_max; + __u64 wpcopy_delay_min; /* v14: Delay waiting for IRQ/SOFTIRQ */ __u64 irq_count; __u64 irq_delay_total; __u64 irq_delay_max; + __u64 irq_delay_min; /* v15: add Delay max */ }; diff --git a/kernel/delayacct.c b/kernel/delayacct.c index 23212a0c88e4..b238eb8c6573 100644 --- a/kernel/delayacct.c +++ b/kernel/delayacct.c @@ -95,7 +95,7 @@ void __delayacct_tsk_init(struct task_struct *tsk) * Finish delay accounting for a statistic using its timestamps (@start), * accumulator (@total) and @count */ -static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *count, u64 *max) +static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *count, u64 *max, u64 *min) { s64 ns = local_clock() - *start; unsigned long flags; @@ -106,6 +106,8 @@ static void delayacct_end(raw_spinlock_t *lock, u64 *start, u64 *total, u32 *cou (*count)++; if (ns > *max) *max = ns; + if (*min == 0 || ns < *min) + *min = ns; raw_spin_unlock_irqrestore(lock, flags); } } @@ -125,7 +127,8 @@ void __delayacct_blkio_end(struct task_struct *p) &p->delays->blkio_start, &p->delays->blkio_delay, &p->delays->blkio_count, - &p->delays->blkio_delay_max); + &p->delays->blkio_delay_max, + &p->delays->blkio_delay_min); } int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) @@ -157,6 +160,7 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) d->cpu_count += t1; d->cpu_delay_max = tsk->sched_info.max_run_delay; + d->cpu_delay_min = tsk->sched_info.min_run_delay; tmp = (s64)d->cpu_delay_total + t2; d->cpu_delay_total = (tmp < (s64)d->cpu_delay_total) ? 0 : tmp; tmp = (s64)d->cpu_run_virtual_total + t3; @@ -170,24 +174,31 @@ int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) /* zero XXX_total, non-zero XXX_count implies XXX stat overflowed */ raw_spin_lock_irqsave(&tsk->delays->lock, flags); d->blkio_delay_max = tsk->delays->blkio_delay_max; + d->blkio_delay_min = tsk->delays->blkio_delay_min; tmp = d->blkio_delay_total + tsk->delays->blkio_delay; d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp; d->swapin_delay_max = tsk->delays->swapin_delay_max; + d->swapin_delay_min = tsk->delays->swapin_delay_min; tmp = d->swapin_delay_total + tsk->delays->swapin_delay; d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp; d->freepages_delay_max = tsk->delays->freepages_delay_max; + d->freepages_delay_min = tsk->delays->freepages_delay_min; tmp = d->freepages_delay_total + tsk->delays->freepages_delay; d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp; d->thrashing_delay_max = tsk->delays->thrashing_delay_max; + d->thrashing_delay_min = tsk->delays->thrashing_delay_min; tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay; d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp; d->compact_delay_max = tsk->delays->compact_delay_max; + d->compact_delay_min = tsk->delays->compact_delay_min; tmp = d->compact_delay_total + tsk->delays->compact_delay; d->compact_delay_total = (tmp < d->compact_delay_total) ? 0 : tmp; d->wpcopy_delay_max = tsk->delays->wpcopy_delay_max; + d->wpcopy_delay_min = tsk->delays->wpcopy_delay_min; tmp = d->wpcopy_delay_total + tsk->delays->wpcopy_delay; d->wpcopy_delay_total = (tmp < d->wpcopy_delay_total) ? 0 : tmp; d->irq_delay_max = tsk->delays->irq_delay_max; + d->irq_delay_min = tsk->delays->irq_delay_min; tmp = d->irq_delay_total + tsk->delays->irq_delay; d->irq_delay_total = (tmp < d->irq_delay_total) ? 0 : tmp; d->blkio_count += tsk->delays->blkio_count; @@ -224,7 +235,8 @@ void __delayacct_freepages_end(void) ¤t->delays->freepages_start, ¤t->delays->freepages_delay, ¤t->delays->freepages_count, - ¤t->delays->freepages_delay_max); + ¤t->delays->freepages_delay_max, + ¤t->delays->freepages_delay_min); } void __delayacct_thrashing_start(bool *in_thrashing) @@ -247,7 +259,8 @@ void __delayacct_thrashing_end(bool *in_thrashing) ¤t->delays->thrashing_start, ¤t->delays->thrashing_delay, ¤t->delays->thrashing_count, - ¤t->delays->thrashing_delay_max); + ¤t->delays->thrashing_delay_max, + ¤t->delays->thrashing_delay_min); } void __delayacct_swapin_start(void) @@ -261,7 +274,8 @@ void __delayacct_swapin_end(void) ¤t->delays->swapin_start, ¤t->delays->swapin_delay, ¤t->delays->swapin_count, - ¤t->delays->swapin_delay_max); + ¤t->delays->swapin_delay_max, + ¤t->delays->swapin_delay_min); } void __delayacct_compact_start(void) @@ -275,7 +289,8 @@ void __delayacct_compact_end(void) ¤t->delays->compact_start, ¤t->delays->compact_delay, ¤t->delays->compact_count, - ¤t->delays->compact_delay_max); + ¤t->delays->compact_delay_max, + ¤t->delays->compact_delay_min); } void __delayacct_wpcopy_start(void) @@ -289,7 +304,8 @@ void __delayacct_wpcopy_end(void) ¤t->delays->wpcopy_start, ¤t->delays->wpcopy_delay, ¤t->delays->wpcopy_count, - ¤t->delays->wpcopy_delay_max); + ¤t->delays->wpcopy_delay_max, + ¤t->delays->wpcopy_delay_min); } void __delayacct_irq(struct task_struct *task, u32 delta) @@ -301,6 +317,8 @@ void __delayacct_irq(struct task_struct *task, u32 delta) task->delays->irq_count++; if (delta > task->delays->irq_delay_max) task->delays->irq_delay_max = delta; + if (delta && (!task->delays->irq_delay_min || delta < task->delays->irq_delay_min)) + task->delays->irq_delay_min = delta; raw_spin_unlock_irqrestore(&task->delays->lock, flags); } diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index ed72435aef51..693537b908a1 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -246,6 +246,8 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t) t->sched_info.run_delay += delta; if (delta > t->sched_info.max_run_delay) t->sched_info.max_run_delay = delta; + if (delta && (!t->sched_info.min_run_delay || delta < t->sched_info.min_run_delay)) + t->sched_info.min_run_delay = delta; rq_sched_info_dequeue(rq, delta); } @@ -269,6 +271,8 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t) t->sched_info.pcount++; if (delta > t->sched_info.max_run_delay) t->sched_info.max_run_delay = delta; + if (delta && (!t->sched_info.min_run_delay || delta < t->sched_info.min_run_delay)) + t->sched_info.min_run_delay = delta; rq_sched_info_arrive(rq, delta); } diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c index e570bcad185d..100ad3dc091a 100644 --- a/tools/accounting/getdelays.c +++ b/tools/accounting/getdelays.c @@ -192,7 +192,7 @@ static int get_family_id(int sd) } #define average_ms(t, c) (t / 1000000ULL / (c ? c : 1)) -#define delay_max_ms(t) (t / 1000000ULL) +#define delay_ms(t) (t / 1000000ULL) static void print_delayacct(struct taskstats *t) { @@ -213,48 +213,56 @@ static void print_delayacct(struct taskstats *t) "IRQ %15s%15s%15s%15s\n" " %15llu%15llu%15.3fms%13.6fms\n", "count", "real total", "virtual total", - "delay total", "delay average", "delay max", + "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->cpu_count, (unsigned long long)t->cpu_run_real_total, (unsigned long long)t->cpu_run_virtual_total, (unsigned long long)t->cpu_delay_total, average_ms((double)t->cpu_delay_total, t->cpu_count), - delay_max_ms((double)t->cpu_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->cpu_delay_max), + delay_ms((double)t->cpu_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->blkio_count, (unsigned long long)t->blkio_delay_total, average_ms((double)t->blkio_delay_total, t->blkio_count), - delay_max_ms((double)t->blkio_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->blkio_delay_max), + delay_ms((double)t->blkio_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->swapin_count, (unsigned long long)t->swapin_delay_total, average_ms((double)t->swapin_delay_total, t->swapin_count), - delay_max_ms((double)t->swapin_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->swapin_delay_max), + delay_ms((double)t->swapin_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->freepages_count, (unsigned long long)t->freepages_delay_total, average_ms((double)t->freepages_delay_total, t->freepages_count), - delay_max_ms((double)t->freepages_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->freepages_delay_max), + delay_ms((double)t->freepages_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->thrashing_count, (unsigned long long)t->thrashing_delay_total, average_ms((double)t->thrashing_delay_total, t->thrashing_count), - delay_max_ms((double)t->thrashing_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->thrashing_delay_max), + delay_ms((double)t->thrashing_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->compact_count, (unsigned long long)t->compact_delay_total, average_ms((double)t->compact_delay_total, t->compact_count), - delay_max_ms((double)t->compact_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->compact_delay_max), + delay_ms((double)t->compact_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->wpcopy_count, (unsigned long long)t->wpcopy_delay_total, average_ms((double)t->wpcopy_delay_total, t->wpcopy_count), - delay_max_ms((double)t->wpcopy_delay_max), - "count", "delay total", "delay average", "delay max", + delay_ms((double)t->wpcopy_delay_max), + delay_ms((double)t->wpcopy_delay_min), + "count", "delay total", "delay average", "delay max", "delay min", (unsigned long long)t->irq_count, (unsigned long long)t->irq_delay_total, average_ms((double)t->irq_delay_total, t->irq_count), - delay_max_ms((double)t->irq_delay_max)); + delay_ms((double)t->irq_delay_max), + delay_ms((double)t->irq_delay_min)); } static void task_context_switch_counts(struct taskstats *t) -- cgit v1.2.3