summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-02-19cpuset: Fix missing adaptation for cpuset_is_populatedChen Ridong
Commit b1bcaed1e39a ("cpuset: Treat cpusets in attaching as populated") was backported to the long‑term support (LTS) branches. However, because commit d5cf4d34a333 ("cgroup/cpuset: Don't track # of local child partitions") was not backported, a corresponding adaptation to the backported code is still required. To ensure correct behavior, replace cgroup_is_populated with cpuset_is_populated in the partition_is_populated function. Cc: stable@vger.kernel.org # 6.1+ Fixes: b1bcaed1e39a ("cpuset: Treat cpusets in attaching as populated") Cc: Waiman Long <longman@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11ring-buffer: Avoid softlockup in ring_buffer_resize() during memory freeWupeng Ma
[ Upstream commit 6435ffd6c7fcba330dfa91c58dc30aed2df3d0bf ] When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be freed, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer free as Commit f6bd2c92488c ("ring-buffer: Avoid softlockup in ring_buffer_resize()") does. Detailed call trace as follow: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 24-....: (14837 ticks this GP) idle=521c/1/0x4000000000000000 softirq=230597/230597 fqs=5329 rcu: (t=15004 jiffies g=26003221 q=211022 ncpus=96) CPU: 24 UID: 0 PID: 11253 Comm: bash Kdump: loaded Tainted: G EL 6.18.2+ #278 NONE pc : arch_local_irq_restore+0x8/0x20 arch_local_irq_restore+0x8/0x20 (P) free_frozen_page_commit+0x28c/0x3b0 __free_frozen_pages+0x1c0/0x678 ___free_pages+0xc0/0xe0 free_pages+0x3c/0x50 ring_buffer_resize.part.0+0x6a8/0x880 ring_buffer_resize+0x3c/0x58 __tracing_resize_ring_buffer.part.0+0x34/0xd8 tracing_resize_ring_buffer+0x8c/0xd0 tracing_entries_write+0x74/0xd8 vfs_write+0xcc/0x288 ksys_write+0x74/0x118 __arm64_sys_write+0x24/0x38 Cc: <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251228065008.2396573-1-mawupeng1@huawei.com Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11tracing: Fix ftrace event field alignmentsSteven Rostedt
[ Upstream commit 033c55fe2e326bea022c3cc5178ecf3e0e459b82 ] The fields of ftrace specific events (events used to save ftrace internal events like function traces and trace_printk) are generated similarly to how normal trace event fields are generated. That is, the fields are added to a trace_events_fields array that saves the name, offset, size, alignment and signness of the field. It is used to produce the output in the format file in tracefs so that tooling knows how to parse the binary data of the trace events. The issue is that some of the ftrace event structures are packed. The function graph exit event structures are one of them. The 64 bit calltime and rettime fields end up 4 byte aligned, but the algorithm to show to userspace shows them as 8 byte aligned. The macros that create the ftrace events has one for embedded structure fields. There's two macros for theses fields: __field_desc() and __field_packed() The difference of the latter macro is that it treats the field as packed. Rename that field to __field_desc_packed() and create replace the __field_packed() to be a normal field that is packed and have the calltime and rettime use those. This showed up on 32bit architectures for function graph time fields. It had: ~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format [..] field:unsigned long func; offset:8; size:4; signed:0; field:unsigned int depth; offset:12; size:4; signed:0; field:unsigned int overrun; offset:16; size:4; signed:0; field:unsigned long long calltime; offset:24; size:8; signed:0; field:unsigned long long rettime; offset:32; size:8; signed:0; Notice that overrun is at offset 16 with size 4, where in the structure calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's because it used the alignment of unsigned long long when used as a declaration and not as a member of a structure where it would be aligned by word size (in this case 4). By using the proper structure alignment, the format has it at the correct offset: ~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format [..] field:unsigned long func; offset:8; size:4; signed:0; field:unsigned int depth; offset:12; size:4; signed:0; field:unsigned int overrun; offset:16; size:4; signed:0; field:unsigned long long calltime; offset:20; size:8; signed:0; field:unsigned long long rettime; offset:28; size:8; signed:0; Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reported-by: "jempty.liang" <imntjempty@163.com> Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()") Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/ Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> [ adapted field types and macro arguments ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06perf: sched: Fix perf crash with new is_user_task() helperSteven Rostedt
[ Upstream commit 76ed27608f7dd235b727ebbb12163438c2fbb617 ] In order to do a user space stacktrace the current task needs to be a user task that has executed in user space. It use to be possible to test if a task is a user task or not by simply checking the task_struct mm field. If it was non NULL, it was a user task and if not it was a kernel task. But things have changed over time, and some kernel tasks now have their own mm field. An idea was made to instead test PF_KTHREAD and two functions were used to wrap this check in case it became more complex to test if a task was a user task or not[1]. But this was rejected and the C code simply checked the PF_KTHREAD directly. It was later found that not all kernel threads set PF_KTHREAD. The io-uring helpers instead set PF_USER_WORKER and this needed to be added as well. But checking the flags is still not enough. There's a very small window when a task exits that it frees its mm field and it is set back to NULL. If perf were to trigger at this moment, the flags test would say its a user space task but when perf would read the mm field it would crash with at NULL pointer dereference. Now there are flags that can be used to test if a task is exiting, but they are set in areas that perf may still want to profile the user space task (to see where it exited). The only real test is to check both the flags and the mm field. Instead of making this modification in every location, create a new is_user_task() helper function that does all the tests needed to know if it is safe to read the user space memory or not. [1] https://lore.kernel.org/all/20250425204120.639530125@goodmis.org/ Fixes: 90942f9fac05 ("perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL") Closes: https://lore.kernel.org/all/0d877e6f-41a7-4724-875d-0b0a27b8a545@roeck-us.net/ Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260129102821.46484722@gandalf.local.home Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06perf: Simplify get_perf_callchain() user logicJosh Poimboeuf
[ Upstream commit d77e3319e31098a6cb97b7ce4e71ba676e327fd7 ] Simplify the get_perf_callchain() user logic a bit. task_pt_regs() should never be NULL. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250820180428.760066227@kernel.org Stable-dep-of: 76ed27608f7d ("perf: sched: Fix perf crash with new is_user_task() helper") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06cgroup: Fix kernfs_node UAF in css_free_rwork_fnT.J. Mercier
This fix patch is not upstream, and is applicable only to kernels 6.10 (where the cgroup_rstat_lock tracepoint was added) through 6.15 after which commit 5da3bfa029d6 ("cgroup: use separate rstat trees for each subsystem") reordered cgroup_rstat_flush as part of a new feature addition and inadvertently fixed this UAF. css_free_rwork_fn first releases the last reference on the cgroup's kernfs_node, and then calls cgroup_rstat_exit which attempts to use it in the cgroup_rstat_lock tracepoint: kernfs_put(cgrp->kn); cgroup_rstat_exit cgroup_rstat_flush __cgroup_rstat_lock trace_cgroup_rstat_locked: TP_fast_assign( __entry->root = cgrp->root->hierarchy_id; __entry->id = cgroup_id(cgrp); Where cgroup_id is: static inline u64 cgroup_id(const struct cgroup *cgrp) { return cgrp->kn->id; } Fix this by reordering the kernfs_put after cgroup_rstat_exit. [78782.605161][ T9861] BUG: KASAN: slab-use-after-free in trace_event_raw_event_cgroup_rstat+0x110/0x1dc [78782.605182][ T9861] Read of size 8 at addr ffffff890270e610 by task kworker/6:1/9861 [78782.605199][ T9861] CPU: 6 UID: 0 PID: 9861 Comm: kworker/6:1 Tainted: G W OE 6.12.23-android16-5-gabaf21382e8f-4k #1 0308449da8ad70d2d3649ae989c1d02f0fbf562c [78782.605220][ T9861] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [78782.605226][ T9861] Hardware name: Qualcomm Technologies, Inc. Alor QRD + WCN7750 WLAN + Kundu PD2536F_EX (DT) [78782.605235][ T9861] Workqueue: cgroup_destroy css_free_rwork_fn [78782.605251][ T9861] Call trace: [78782.605254][ T9861] dump_backtrace+0x120/0x170 [78782.605267][ T9861] show_stack+0x2c/0x40 [78782.605276][ T9861] dump_stack_lvl+0x84/0xb4 [78782.605286][ T9861] print_report+0x144/0x7a4 [78782.605301][ T9861] kasan_report+0xe0/0x140 [78782.605315][ T9861] __asan_load8+0x98/0xa0 [78782.605329][ T9861] trace_event_raw_event_cgroup_rstat+0x110/0x1dc [78782.605339][ T9861] __traceiter_cgroup_rstat_locked+0x78/0xc4 [78782.605355][ T9861] __cgroup_rstat_lock+0xe8/0x1dc [78782.605368][ T9861] cgroup_rstat_flush_locked+0x7dc/0xaec [78782.605383][ T9861] cgroup_rstat_flush+0x34/0x108 [78782.605396][ T9861] cgroup_rstat_exit+0x2c/0x120 [78782.605409][ T9861] css_free_rwork_fn+0x504/0xa18 [78782.605421][ T9861] process_scheduled_works+0x378/0x8e0 [78782.605435][ T9861] worker_thread+0x5a8/0x77c [78782.605446][ T9861] kthread+0x1c4/0x270 [78782.605455][ T9861] ret_from_fork+0x10/0x20 [78782.605470][ T9861] Allocated by task 2864 on cpu 7 at 78781.564561s: [78782.605467][ C5] ENHANCE: rpm_suspend+0x93c/0xafc: 0:0:0:0 ret=0 [78782.605481][ T9861] kasan_save_track+0x44/0x9c [78782.605497][ T9861] kasan_save_alloc_info+0x40/0x54 [78782.605507][ T9861] __kasan_slab_alloc+0x70/0x8c [78782.605521][ T9861] kmem_cache_alloc_noprof+0x1a0/0x428 [78782.605534][ T9861] __kernfs_new_node+0xd4/0x3e4 [78782.605545][ T9861] kernfs_new_node+0xbc/0x168 [78782.605554][ T9861] kernfs_create_dir_ns+0x58/0xe8 [78782.605565][ T9861] cgroup_mkdir+0x25c/0xc9c [78782.605576][ T9861] kernfs_iop_mkdir+0x130/0x214 [78782.605586][ T9861] vfs_mkdir+0x290/0x388 [78782.605599][ T9861] do_mkdirat+0xfc/0x27c [78782.605612][ T9861] __arm64_sys_mkdirat+0x5c/0x78 [78782.605625][ T9861] invoke_syscall+0x90/0x1e8 [78782.605634][ T9861] el0_svc_common+0x134/0x168 [78782.605643][ T9861] do_el0_svc+0x34/0x44 [78782.605652][ T9861] el0_svc+0x38/0x84 [78782.605667][ T9861] el0t_64_sync_handler+0x70/0xbc [78782.605681][ T9861] el0t_64_sync+0x19c/0x1a0 [78782.605695][ T9861] Freed by task 69 on cpu 1 at 78782.573275s: [78782.605705][ T9861] kasan_save_track+0x44/0x9c [78782.605719][ T9861] kasan_save_free_info+0x54/0x70 [78782.605729][ T9861] __kasan_slab_free+0x68/0x8c [78782.605743][ T9861] kmem_cache_free+0x118/0x488 [78782.605755][ T9861] kernfs_free_rcu+0xa0/0xb8 [78782.605765][ T9861] rcu_do_batch+0x324/0xaa0 [78782.605775][ T9861] rcu_nocb_cb_kthread+0x388/0x690 [78782.605785][ T9861] kthread+0x1c4/0x270 [78782.605794][ T9861] ret_from_fork+0x10/0x20 [78782.605809][ T9861] Last potentially related work creation: [78782.605814][ T9861] kasan_save_stack+0x40/0x70 [78782.605829][ T9861] __kasan_record_aux_stack+0xb0/0xcc [78782.605839][ T9861] kasan_record_aux_stack_noalloc+0x14/0x24 [78782.605849][ T9861] __call_rcu_common+0x54/0x390 [78782.605863][ T9861] call_rcu+0x18/0x28 [78782.605875][ T9861] kernfs_put+0x17c/0x28c [78782.605884][ T9861] css_free_rwork_fn+0x4f4/0xa18 [78782.605897][ T9861] process_scheduled_works+0x378/0x8e0 [78782.605910][ T9861] worker_thread+0x5a8/0x77c [78782.605923][ T9861] kthread+0x1c4/0x270 [78782.605932][ T9861] ret_from_fork+0x10/0x20 [78782.605947][ T9861] The buggy address belongs to the object at ffffff890270e5b0 [78782.605947][ T9861] which belongs to the cache kernfs_node_cache of size 144 [78782.605957][ T9861] The buggy address is located 96 bytes inside of [78782.605957][ T9861] freed 144-byte region [ffffff890270e5b0, ffffff890270e640) Fixes: fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06sched/deadline: Fix 'stuck' dl_serverPeter Zijlstra
[ Upstream commit 115135422562e2f791e98a6f55ec57b2da3b3a95 ] Andrea reported the dl_server getting stuck for him. He tracked it down to a state where dl_server_start() saw dl_defer_running==1, but the dl_server's job is no longer valid at the time of dl_server_start(). In the state diagram this corresponds to [4] D->A (or dl_server_stop() due to no more runnable tasks) followed by [1], which in case of a lapsed deadline must then be A->B. Now our A has dl_defer_running==1, while B demands dl_defer_running==0, therefore it must get cleared when the CBS wakeup rules demand a replenish. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Reported-by: Andrea Righi arighi@nvidia.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Andrea Righi arighi@nvidia.com Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-06sched/deadline: Document dl_serverPeter Zijlstra
[ Upstream commit 2614069c5912e9d6f1f57c262face1b368fb8c93 ] Place the notes that resulted from going through the dl_server code in a comment. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Stable-dep-of: 115135422562 ("sched/deadline: Fix 'stuck' dl_server") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-06dma/pool: distinguish between missing and exhausted atomic poolsSai Sree Kartheek Adivi
[ Upstream commit 56c430c7f06d838fe3b2077dbbc4cc0bf992312b ] Currently, dma_alloc_from_pool() unconditionally warns and dumps a stack trace when an allocation fails, with the message "Failed to get suitable pool". This conflates two distinct failure modes: 1. Configuration error: No atomic pool is available for the requested DMA mask (a fundamental system setup issue) 2. Resource Exhaustion: A suitable pool exists but is currently full (a recoverable runtime state) This lack of distinction prevents drivers from using __GFP_NOWARN to suppress error messages during temporary pressure spikes, such as when awaiting synchronous reclaim of descriptors. Refactor the error handling to distinguish these cases: - If no suitable pool is found, keep the unconditional WARN regarding the missing pool. - If a pool was found but is exhausted, respect __GFP_NOWARN and update the warning message to explicitly state "DMA pool exhausted". Fixes: 9420139f516d ("dma-pool: fix coherent pool allocations for IOMMU mappings") Signed-off-by: Sai Sree Kartheek Adivi <s-adivi@ti.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260128133554.3056582-1-s-adivi@ti.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30sched_ext: Fix possible deadlock in the deferred_irq_workfn()Zqiang
[ Upstream commit a257e974210320ede524f340ffe16bf4bf0dda1e ] For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in the per-cpu irq_work/* task context and not disable-irq, if the rq returned by container_of() is current CPU's rq, the following scenarios may occur: lock(&rq->__lock); <Interrupt> lock(&rq->__lock); This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn() is always invoked in hard-irq context. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chen Yu <xnguchen@sina.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30tracing: Fix crash on synthetic stacktrace field usageSteven Rostedt
commit 90f9f5d64cae4e72defd96a2a22760173cb3c9ec upstream. When creating a synthetic event based on an existing synthetic event that had a stacktrace field and the new synthetic event used that field a kernel crash occurred: ~# cd /sys/kernel/tracing ~# echo 's:stack unsigned long stack[];' > dynamic_events ~# echo 'hist:keys=prev_pid:s0=common_stacktrace if prev_state & 3' >> events/sched/sched_switch/trigger ~# echo 'hist:keys=next_pid:s1=$s0:onmatch(sched.sched_switch).trace(stack,$s1)' >> events/sched/sched_switch/trigger The above creates a synthetic event that takes a stacktrace when a task schedules out in a non-running state and passes that stacktrace to the sched_switch event when that task schedules back in. It triggers the "stack" synthetic event that has a stacktrace as its field (called "stack"). ~# echo 's:syscall_stack s64 id; unsigned long stack[];' >> dynamic_events ~# echo 'hist:keys=common_pid:s2=stack' >> events/synthetic/stack/trigger ~# echo 'hist:keys=common_pid:s3=$s2,i0=id:onmatch(synthetic.stack).trace(syscall_stack,$i0,$s3)' >> events/raw_syscalls/sys_exit/trigger The above makes another synthetic event called "syscall_stack" that attaches the first synthetic event (stack) to the sys_exit trace event and records the stacktrace from the stack event with the id of the system call that is exiting. When enabling this event (or using it in a historgram): ~# echo 1 > events/synthetic/syscall_stack/enable Produces a kernel crash! BUG: unable to handle page fault for address: 0000000000400010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 6 UID: 0 PID: 1257 Comm: bash Not tainted 6.16.3+deb14-amd64 #1 PREEMPT(lazy) Debian 6.16.3-1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:trace_event_raw_event_synth+0x90/0x380 Code: c5 00 00 00 00 85 d2 0f 84 e1 00 00 00 31 db eb 34 0f 1f 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 <49> 8b 04 24 48 83 c3 01 8d 0c c5 08 00 00 00 01 cd 41 3b 5d 40 0f RSP: 0018:ffffd2670388f958 EFLAGS: 00010202 RAX: ffff8ba1065cc100 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: fffff266ffda7b90 RDI: ffffd2670388f9b0 RBP: 0000000000000010 R08: ffff8ba104e76000 R09: ffffd2670388fa50 R10: ffff8ba102dd42e0 R11: ffffffff9a908970 R12: 0000000000400010 R13: ffff8ba10a246400 R14: ffff8ba10a710220 R15: fffff266ffda7b90 FS: 00007fa3bc63f740(0000) GS:ffff8ba2e0f48000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000400010 CR3: 0000000107f9e003 CR4: 0000000000172ef0 Call Trace: <TASK> ? __tracing_map_insert+0x208/0x3a0 action_trace+0x67/0x70 event_hist_trigger+0x633/0x6d0 event_triggers_call+0x82/0x130 trace_event_buffer_commit+0x19d/0x250 trace_event_raw_event_sys_exit+0x62/0xb0 syscall_exit_work+0x9d/0x140 do_syscall_64+0x20a/0x2f0 ? trace_event_raw_event_sched_switch+0x12b/0x170 ? save_fpregs_to_fpstate+0x3e/0x90 ? _raw_spin_unlock+0xe/0x30 ? finish_task_switch.isra.0+0x97/0x2c0 ? __rseq_handle_notify_resume+0xad/0x4c0 ? __schedule+0x4b8/0xd00 ? restore_fpregs_from_fpstate+0x3c/0x90 ? switch_fpu_return+0x5b/0xe0 ? do_syscall_64+0x1ef/0x2f0 ? do_fault+0x2e9/0x540 ? __handle_mm_fault+0x7d1/0xf70 ? count_memcg_events+0x167/0x1d0 ? handle_mm_fault+0x1d7/0x2e0 ? do_user_addr_fault+0x2c3/0x7f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The reason is that the stacktrace field is not labeled as such, and is treated as a normal field and not as a dynamic event that it is. In trace_event_raw_event_synth() the event is field is still treated as a dynamic array, but the retrieval of the data is considered a normal field, and the reference is just the meta data: // Meta data is retrieved instead of a dynamic array str_val = (char *)(long)var_ref_vals[val_idx]; // Then when it tries to process it: len = *((unsigned long *)str_val) + 1; It triggers a kernel page fault. To fix this, first when defining the fields of the first synthetic event, set the filter type to FILTER_STACKTRACE. This is used later by the second synthetic event to know that this field is a stacktrace. When creating the field of the new synthetic event, have it use this FILTER_STACKTRACE to know to create a stacktrace field to copy the stacktrace into. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260122194824.6905a38e@gandalf.local.home Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30sched/fair: Fix pelt clock sync when entering idleVincent Guittot
[ Upstream commit 98c88dc8a1ace642d9021b103b28cba7b51e3abc ] Samuel and Alex reported regressions of the util_avg of RT rq with commit 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection"). It happens that fair is updating and syncing the pelt clock with task one when pick_next_task_fair() fails to pick a task but before the prev scheduling class got a chance to update its pelt signals. Move update_idle_rq_clock_pelt() in set_next_task_idle() which is called after prev class has been called. Fixes: 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection") Closes: https://lore.kernel.org/all/CAG2KctpO6VKS6GN4QWDji0t92_gNBJ7HjjXrE+6H+RwRXt=iLg@mail.gmail.com/ Closes: https://lore.kernel.org/all/8cf19bf0e0054dcfed70e9935029201694f1bb5a.camel@mediatek.com/ Reported-by: Samuel Wu <wusamuel@google.com> Reported-by: Alex Hoh <Alex.Hoh@mediatek.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Samuel Wu <wusamuel@google.com> Tested-by: Alex Hoh <Alex.Hoh@mediatek.com> Link: https://patch.msgid.link/20260121163317.505635-1-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30clocksource: Reduce watchdog readout delay limit to prevent false positivesThomas Gleixner
[ Upstream commit c06343be0b4e03fe319910dd7a5d5b9929e1c0cb ] The "valid" readout delay between the two reads of the watchdog is larger than the valid delta between the resulting watchdog and clocksource intervals, which results in false positive watchdog results. Assume TSC is the clocksource and HPET is the watchdog and both have a uncertainty margin of 250us (default). The watchdog readout does: 1) wdnow = read(HPET); 2) csnow = read(TSC); 3) wdend = read(HPET); The valid window for the delta between #1 and #3 is calculated by the uncertainty margins of the watchdog and the clocksource: m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin; which results in 750us for the TSC/HPET case. The actual interval comparison uses a smaller margin: m = watchdog.uncertainty_margin + cs.uncertainty margin; which results in 500us for the TSC/HPET case. That means the following scenario will trigger the watchdog: Watchdog cycle N: 1) wdnow[N] = read(HPET); 2) csnow[N] = read(TSC); 3) wdend[N] = read(HPET); Assume the delay between #1 and #2 is 100us and the delay between #1 and Watchdog cycle N + 1: 4) wdnow[N + 1] = read(HPET); 5) csnow[N + 1] = read(TSC); 6) wdend[N + 1] = read(HPET); If the delay between #4 and #6 is within the 750us margin then any delay between #4 and #5 which is larger than 600us will fail the interval check and mark the TSC unstable because the intervals are calculated against the previous value: wd_int = wdnow[N + 1] - wdnow[N]; cs_int = csnow[N + 1] - csnow[N]; Putting the above delays in place this results in: cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us); -> cs_int = wd_int + 510us; which is obviously larger than the allowed 500us margin and results in marking TSC unstable. Fix this by using the same margin as the interval comparison. If the delay between two watchdog reads is larger than that, then the readout was either disturbed by interconnect congestion, NMIs or SMIs. Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin") Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/ Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30ptp: Add PHC file mode checks. Allow RO adjtime() without FMODE_WRITE.Wojtek Wasko
[ Upstream commit b4e53b15c04e3852949003752f48f7a14ae39e86 ] Many devices implement highly accurate clocks, which the kernel manages as PTP Hardware Clocks (PHCs). Userspace applications rely on these clocks to timestamp events, trace workload execution, correlate timescales across devices, and keep various clocks in sync. The kernel’s current implementation of PTP clocks does not enforce file permissions checks for most device operations except for POSIX clock operations, where file mode is verified in the POSIX layer before forwarding the call to the PTP subsystem. Consequently, it is common practice to not give unprivileged userspace applications any access to PTP clocks whatsoever by giving the PTP chardevs 600 permissions. An example of users running into this limitation is documented in [1]. Additionally, POSIX layer requires WRITE permission even for readonly adjtime() calls which are used in PTP layer to return current frequency offset applied to the PHC. Add permission checks for functions that modify the state of a PTP device. Continue enforcing permission checks for POSIX clock operations (settime, adjtime) in the POSIX layer. Only require WRITE access for dynamic clocks adjtime() if any flags are set in the modes field. [1] https://lists.nwtime.org/sympa/arc/linuxptp-users/2024-01/msg00036.html Changes in v4: - Require FMODE_WRITE in ajtime() only for calls modifying the clock in any way. Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30posix-clock: Store file pointer in struct posix_clock_contextWojtek Wasko
[ Upstream commit e859d375d1694488015e6804bfeea527a0b25b9f ] File descriptor based pc_clock_*() operations of dynamic posix clocks have access to the file pointer and implement permission checks in the generic code before invoking the relevant dynamic clock callback. Character device operations (open, read, poll, ioctl) do not implement a generic permission control and the dynamic clock callbacks have no access to the file pointer to implement them. Extend struct posix_clock_context with a struct file pointer and initialize it in posix_clock_open(), so that all dynamic clock callbacks can access it. Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-23bpf: Reject narrower access to pointer ctx fieldsPaul Chaignon
commit e09299225d5ba3916c91ef70565f7d2187e4cca0 upstream. The following BPF program, simplified from a syzkaller repro, causes a kernel warning: r0 = *(u8 *)(r1 + 169); exit; With pointer field sk being at offset 168 in __sk_buff. This access is detected as a narrower read in bpf_skb_is_valid_access because it doesn't match offsetof(struct __sk_buff, sk). It is therefore allowed and later proceeds to bpf_convert_ctx_access. Note that for the "is_narrower_load" case in the convert_ctx_accesses(), the insn->off is aligned, so the cnt may not be 0 because it matches the offsetof(struct __sk_buff, sk) in the bpf_convert_ctx_access. However, the target_size stays 0 and the verifier errors with a kernel warning: verifier bug: error during ctx access conversion(1) This patch fixes that to return a proper "invalid bpf_context access off=X size=Y" error on the load instruction. The same issue affects multiple other fields in context structures that allow narrow access. Some other non-affected fields (for sk_msg, sk_lookup, and sockopt) were also changed to use bpf_ctx_range_ptr for consistency. Note this syzkaller crash was reported in the "Closes" link below, which used to be about a different bug, fixed in commit fce7bd8e385a ("bpf/verifier: Handle BPF_LOAD_ACQ instructions in insn_def_regno()"). Because syzbot somehow confused the two bugs, the new crash and repro didn't get reported to the mailing list. Fixes: f96da09473b52 ("bpf: simplify narrower ctx access") Fixes: 0df1a55afa832 ("bpf: Warn on internal verifier errors") Reported-by: syzbot+0ef84a7bdf5301d4cbec@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0ef84a7bdf5301d4cbec Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/3b8dcee67ff4296903351a974ddd9c4dca768b64.1753194596.git.paul.chaignon@gmail.com Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23hrtimer: Fix softirq base check in update_needs_ipi()Thomas Weißschuh
commit 05dc4a9fc8b36d4c99d76bbc02aa9ec0132de4c2 upstream. The 'clockid' field is not the correct way to check for a softirq base. Fix the check to correctly compare the base type instead of the clockid. Fixes: 1e7f7fbcd40c ("hrtimer: Avoid more SMP function calls in clock_was_set()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260107-hrtimer-clock-base-check-v1-1-afb5dbce94a1@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11sched/fair: Proportional newidle balancePeter Zijlstra
commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream. Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org [ Ajay: Modified to apply on v6.12 ] Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11sched/fair: Small cleanup to update_newidle_cost()Peter Zijlstra
commit 08d473dd8718e4a4d698b1113a14a40ad64a909b upstream. Simplify code by adding a few variables. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.655208666@infradead.org [ Ajay: Modified to apply on v6.12 ] Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11sched/fair: Small cleanup to sched_balance_newidle()Peter Zijlstra
commit e78e70dbf603c1425f15f32b455ca148c932f6c1 upstream. Pull out the !sd check to simplify code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.525916173@infradead.org [ Ajay: Modified to apply on v6.12 ] Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()Tejun Heo
[ Upstream commit f5e1e5ec204da11fa87fdf006d451d80ce06e118 ] move_local_task_to_local_dsq() is used when moving a task from a non-local DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ without going through dispatch_enqueue() and was missing the post-enqueue handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle task is running. The function is used by move_task_between_dsqs() which backs scx_bpf_dsq_move() and may be called while the CPU is busy. Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE early exit to keep consume_dispatch_q() behavior unchanged and avoid triggering unnecessary resched when scx_bpf_dsq_move() is used from the dispatch path. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()Tejun Heo
[ Upstream commit 530b6637c79e728d58f1d9b66bd4acf4b735b86d ] Factor out local_dsq_post_enq() which performs post-enqueue handling for local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if the current CPU is idle. No functional change. This will be used by the next patch to fix move_local_task_to_local_dsq(). Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08sched_ext: Fix incorrect sched_class settings for per-cpu migration tasksZqiang
[ Upstream commit 1dd6c84f1c544e552848a8968599220bd464e338 ] When loading the ebpf scheduler, the tasks in the scx_tasks list will be traversed and invoke __setscheduler_class() to get new sched_class. however, this would also incorrectly set the per-cpu migration task's->sched_class to rt_sched_class, even after unload, the per-cpu migration task's->sched_class remains sched_rt_class. The log for this issue is as follows: ./scx_rustland --stats 1 [ 199.245639][ T630] sched_ext: "rustland" does not implement cgroup cpu.weight [ 199.269213][ T630] sched_ext: BPF scheduler "rustland" enabled 04:25:09 [INFO] RustLand scheduler attached bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/ { printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }' Attaching 1 probe... migration/0:24->rt_sched_class+0x0/0xe0 migration/1:27->rt_sched_class+0x0/0xe0 migration/2:33->rt_sched_class+0x0/0xe0 migration/3:39->rt_sched_class+0x0/0xe0 migration/4:45->rt_sched_class+0x0/0xe0 migration/5:52->rt_sched_class+0x0/0xe0 migration/6:58->rt_sched_class+0x0/0xe0 migration/7:64->rt_sched_class+0x0/0xe0 sched_ext: BPF scheduler "rustland" disabled (unregistered from user space) EXIT: unregistered from user space 04:25:21 [INFO] Unregister RustLand scheduler bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/ { printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }' Attaching 1 probe... migration/0:24->rt_sched_class+0x0/0xe0 migration/1:27->rt_sched_class+0x0/0xe0 migration/2:33->rt_sched_class+0x0/0xe0 migration/3:39->rt_sched_class+0x0/0xe0 migration/4:45->rt_sched_class+0x0/0xe0 migration/5:52->rt_sched_class+0x0/0xe0 migration/6:58->rt_sched_class+0x0/0xe0 migration/7:64->rt_sched_class+0x0/0xe0 This commit therefore generate a new scx_setscheduler_class() and add check for stop_sched_class to replace __setscheduler_class(). Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> [ Adjust context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08sched/eevdf: Fix min_vruntime vs avg_vruntimePeter Zijlstra
[ Upstream commit 79f3f9bedd149ea438aaeb0fb6a083637affe205 ] Basically, from the constraint that the sum of lag is zero, you can infer that the 0-lag point is the weighted average of the individual vruntime, which is what we're trying to compute: \Sum w_i * v_i avg = -------------- \Sum w_i Now, since vruntime takes the whole u64 (worse, it wraps), this multiplication term in the numerator is not something we can compute; instead we do the min_vruntime (v0 henceforth) thing like: v_i = (v_i - v0) + v0 This does two things: - it keeps the key: (v_i - v0) 'small'; - it creates a relative 0-point in the modular space. If you do that subtitution and work it all out, you end up with: \Sum w_i * (v_i - v0) avg = --------------------- + v0 \Sum w_i Since you cannot very well track a ratio like that (and not suffer terrible numerical problems) we simpy track the numerator and denominator individually and only perform the division when strictly needed. Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator lives in cfs_rq->avg_load. The one extra 'funny' is that these numbers track the entities in the tree, and current is typically outside of the tree, so avg_vruntime() adds current when needed before doing the division. (vruntime_eligible() elides the division by cross-wise multiplication) Anyway, as mentioned above, we currently use the CFS era min_vruntime for this purpose. However, this thing can only move forward, while the above avg can in fact move backward (when a non-eligible task leaves, the average becomes smaller), this can cause trouble when through happenstance (or construction) these values drift far enough apart to wreck the game. Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept near/at avg_vruntime, following its motion. The down-side is that this requires computing the avg more often. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-ass.net Cc: stable@vger.kernel.org [ Adjust context in comments + init_cfs_rq ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08fgraph: Check ftrace_pids_enabled on registration for early filteringShengming Hu
commit 1650a1b6cb1ae6cb99bb4fce21b30ebdf9fc238e upstream. When registering ftrace_graph, check if ftrace_pids_enabled is active. If enabled, assign entryfunc to fgraph_pid_func to ensure filtering is performed before executing the saved original entry function. Cc: stable@vger.kernel.org Cc: <wang.yaxin@zte.com.cn> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.yang29@zte.com.cn> Link: https://patch.msgid.link/20251126173331679XGVF98NLhyLJRdtNkVZ6w@zte.com.cn Fixes: df3ec5da6a1e7 ("function_graph: Add pid tracing back to function graph tracer") Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08fgraph: Initialize ftrace_ops->private for function graph opsShengming Hu
commit b5d6d3f73d0bac4a7e3a061372f6da166fc6ee5c upstream. The ftrace_pids_enabled(op) check relies on op->private being properly initialized, but fgraph_ops's underlying ftrace_ops->private was left uninitialized. This caused ftrace_pids_enabled() to always return false, effectively disabling PID filtering for function graph tracing. Fix this by copying src_ops->private to dst_ops->private in fgraph_init_ops(), ensuring PID filter state is correctly propagated. Cc: stable@vger.kernel.org Cc: <wang.yaxin@zte.com.cn> Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.yang29@zte.com.cn> Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Link: https://patch.msgid.link/20251126172926004y3hC8QyU4WFOjBkU_UxLC@zte.com.cn Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08tracing: Fix fixed array of synthetic eventSteven Rostedt
commit 47ef834209e5981f443240d8a8b45bf680df22aa upstream. The commit 4d38328eb442d ("tracing: Fix synth event printk format for str fields") replaced "%.*s" with "%s" but missed removing the number size of the dynamic and static strings. The commit e1a453a57bc7 ("tracing: Do not add length to print format in synthetic events") fixed the dynamic part but did not fix the static part. That is, with the commands: # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger That caused the output of: <idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155 sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58 <idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91 The commit e1a453a57bc7 fixed the part where the synthetic event had "char[] wakee". But if one were to replace that with a static size string: # echo 's:wake_lat char[16] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events Where "wakee" is defined as "char[16]" and not "char[]" making it a static size, the code triggered the "(efaul)" again. Remove the added STR_VAR_LEN_MAX size as the string is still going to be nul terminated. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Link: https://patch.msgid.link/20251204151935.5fa30355@gandalf.local.home Fixes: e1a453a57bc7 ("tracing: Do not add length to print format in synthetic events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08sched/rt: Fix race in push_rt_taskHarshit Agarwal
commit 690e47d1403e90b7f2366f03b52ed3304194c793 upstream. Overview ======== When a CPU chooses to call push_rt_task and picks a task to push to another CPU's runqueue then it will call find_lock_lowest_rq method which would take a double lock on both CPUs' runqueues. If one of the locks aren't readily available, it may lead to dropping the current runqueue lock and reacquiring both the locks at once. During this window it is possible that the task is already migrated and is running on some other CPU. These cases are already handled. However, if the task is migrated and has already been executed and another CPU is now trying to wake it up (ttwu) such that it is queued again on the runqeue (on_rq is 1) and also if the task was run by the same CPU, then the current checks will pass even though the task was migrated out and is no longer in the pushable tasks list. Crashes ======= This bug resulted in quite a few flavors of crashes triggering kernel panics with various crash signatures such as assert failures, page faults, null pointer dereferences, and queue corruption errors all coming from scheduler itself. Some of the crashes: -> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO) Call Trace: ? __die_body+0x1a/0x60 ? die+0x2a/0x50 ? do_trap+0x85/0x100 ? pick_next_task_rt+0x6e/0x1d0 ? do_error_trap+0x64/0xa0 ? pick_next_task_rt+0x6e/0x1d0 ? exc_invalid_op+0x4c/0x60 ? pick_next_task_rt+0x6e/0x1d0 ? asm_exc_invalid_op+0x12/0x20 ? pick_next_task_rt+0x6e/0x1d0 __schedule+0x5cb/0x790 ? update_ts_time_stats+0x55/0x70 schedule_idle+0x1e/0x40 do_idle+0x15e/0x200 cpu_startup_entry+0x19/0x20 start_secondary+0x117/0x160 secondary_startup_64_no_verify+0xb0/0xbb -> BUG: kernel NULL pointer dereference, address: 00000000000000c0 Call Trace: ? __die_body+0x1a/0x60 ? no_context+0x183/0x350 ? __warn+0x8a/0xe0 ? exc_page_fault+0x3d6/0x520 ? asm_exc_page_fault+0x1e/0x30 ? pick_next_task_rt+0xb5/0x1d0 ? pick_next_task_rt+0x8c/0x1d0 __schedule+0x583/0x7e0 ? update_ts_time_stats+0x55/0x70 schedule_idle+0x1e/0x40 do_idle+0x15e/0x200 cpu_startup_entry+0x19/0x20 start_secondary+0x117/0x160 secondary_startup_64_no_verify+0xb0/0xbb -> BUG: unable to handle page fault for address: ffff9464daea5900 kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p)) -> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running) Call Trace: ? __die_body+0x1a/0x60 ? die+0x2a/0x50 ? do_trap+0x85/0x100 ? dequeue_top_rt_rq+0xa2/0xb0 ? do_error_trap+0x64/0xa0 ? dequeue_top_rt_rq+0xa2/0xb0 ? exc_invalid_op+0x4c/0x60 ? dequeue_top_rt_rq+0xa2/0xb0 ? asm_exc_invalid_op+0x12/0x20 ? dequeue_top_rt_rq+0xa2/0xb0 dequeue_rt_entity+0x1f/0x70 dequeue_task_rt+0x2d/0x70 __schedule+0x1a8/0x7e0 ? blk_finish_plug+0x25/0x40 schedule+0x3c/0xb0 futex_wait_queue_me+0xb6/0x120 futex_wait+0xd9/0x240 do_futex+0x344/0xa90 ? get_mm_exe_file+0x30/0x60 ? audit_exe_compare+0x58/0x70 ? audit_filter_rules.constprop.26+0x65e/0x1220 __x64_sys_futex+0x148/0x1f0 do_syscall_64+0x30/0x80 entry_SYSCALL_64_after_hwframe+0x62/0xc7 -> BUG: unable to handle page fault for address: ffff8cf3608bc2c0 Call Trace: ? __die_body+0x1a/0x60 ? no_context+0x183/0x350 ? spurious_kernel_fault+0x171/0x1c0 ? exc_page_fault+0x3b6/0x520 ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40 ? asm_exc_page_fault+0x1e/0x30 ? _cond_resched+0x15/0x30 ? futex_wait_queue_me+0xc8/0x120 ? futex_wait+0xd9/0x240 ? try_to_wake_up+0x1b8/0x490 ? futex_wake+0x78/0x160 ? do_futex+0xcd/0xa90 ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40 ? plist_del+0x6a/0xd0 ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40 ? dequeue_pushable_task+0x20/0x70 ? __schedule+0x382/0x7e0 ? asm_sysvec_reschedule_ipi+0xa/0x20 ? schedule+0x3c/0xb0 ? exit_to_user_mode_prepare+0x9e/0x150 ? irqentry_exit_to_user_mode+0x5/0x30 ? asm_sysvec_reschedule_ipi+0x12/0x20 Above are some of the common examples of the crashes that were observed due to this issue. Details ======= Let's look at the following scenario to understand this race. 1) CPU A enters push_rt_task a) CPU A has chosen next_task = task p. b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq). c) CPU A identifies CPU X as a destination CPU (X < Z). d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq). e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has locked CPU X’s rq, and thus, CPU A must wait. 2) At CPU Z a) Previous task has completed execution and thus, CPU Z enters schedule, locks its own rq after CPU A releases it. b) CPU Z dequeues previous task and begins executing task p. c) CPU Z unlocks its rq. d) Task p yields the CPU (ex. by doing IO or waiting to acquire a lock) which triggers the schedule function on CPU Z. e) CPU Z enters schedule again, locks its own rq, and dequeues task p. f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq. 3) At CPU B a) CPU B enters try_to_wake_up with input task p. b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates B.state = WAKING. c) CPU B via select_task_rq determines CPU Y as the target CPU. 4) The race a) CPU A acquires CPU X’s lock and relocks CPU Z. b) CPU A reads task p.cpu = Z and incorrectly concludes task p is still on CPU Z. c) CPU A failed to notice task p had been dequeued from CPU Z while CPU A was waiting for locks in double_lock_balance. If CPU A knew that task p had been dequeued, it would return NULL forcing push_rt_task to give up the task p's migration. d) CPU B updates task p.cpu = Y and calls ttwu_queue. e) CPU B locks Ys rq. CPU B enqueues task p onto Y and sets task p.on_rq = 1. f) CPU B unlocks CPU Y, triggering memory synchronization. g) CPU A reads task p.on_rq = 1, cementing its assumption that task p has not migrated. h) CPU A decides to migrate p to CPU X. This leads to A dequeuing p from Y's queue and various crashes down the line. Solution ======== The solution here is fairly simple. After obtaining the lock (at 4a), the check is enhanced to make sure that the task is still at the head of the pushable tasks list. If not, then it is anyway not suitable for being pushed out. Testing ======= The fix is tested on a cluster of 3 nodes, where the panics due to this are hit every couple of days. A fix similar to this was deployed on such cluster and was stable for more than 30 days. Co-developed-by: Jon Kohler <jon@nutanix.com> Signed-off-by: Jon Kohler <jon@nutanix.com> Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com> Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com> Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com> Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com> Signed-off-by: Harshit Agarwal <harshit@nutanix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Will Ton <william.ton@nutanix.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com Signed-off-by: Rajani Kantha <681739313@139.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08tracing: Do not register unsupported perf eventsSteven Rostedt
commit ef7f38df890f5dcd2ae62f8dbde191d72f3bebae upstream. Synthetic events currently do not have a function to register perf events. This leads to calling the tracepoint register functions with a NULL function pointer which triggers: ------------[ cut here ]------------ WARNING: kernel/tracepoint.c:175 at tracepoint_add_func+0x357/0x370, CPU#2: perf/2272 Modules linked in: kvm_intel kvm irqbypass CPU: 2 UID: 0 PID: 2272 Comm: perf Not tainted 6.18.0-ftest-11964-ge022764176fc-dirty #323 PREEMPTLAZY Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:tracepoint_add_func+0x357/0x370 Code: 28 9c e8 4c 0b f5 ff eb 0f 4c 89 f7 48 c7 c6 80 4d 28 9c e8 ab 89 f4 ff 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc <0f> 0b 49 c7 c6 ea ff ff ff e9 ee fe ff ff 0f 0b e9 f9 fe ff ff 0f RSP: 0018:ffffabc0c44d3c40 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff9380aa9e4060 RCX: 0000000000000000 RDX: 000000000000000a RSI: ffffffff9e1d4a98 RDI: ffff937fcf5fd6c8 RBP: 0000000000000001 R08: 0000000000000007 R09: ffff937fcf5fc780 R10: 0000000000000003 R11: ffffffff9c193910 R12: 000000000000000a R13: ffffffff9e1e5888 R14: 0000000000000000 R15: ffffabc0c44d3c78 FS: 00007f6202f5f340(0000) GS:ffff93819f00f000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055d3162281a8 CR3: 0000000106a56003 CR4: 0000000000172ef0 Call Trace: <TASK> tracepoint_probe_register+0x5d/0x90 synth_event_reg+0x3c/0x60 perf_trace_event_init+0x204/0x340 perf_trace_init+0x85/0xd0 perf_tp_event_init+0x2e/0x50 perf_try_init_event+0x6f/0x230 ? perf_event_alloc+0x4bb/0xdc0 perf_event_alloc+0x65a/0xdc0 __se_sys_perf_event_open+0x290/0x9f0 do_syscall_64+0x93/0x7b0 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e ? trace_hardirqs_off+0x53/0xc0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Instead, have the code return -ENODEV, which doesn't warn and has perf error out with: # perf record -e synthetic:futex_wait Error: The sys_perf_event_open() syscall returned with 19 (No such device) for event (synthetic:futex_wait). "dmesg | grep -i perf" may provide additional information. Ideally perf should support synthetic events, but for now just fix the warning. The support can come later. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251216182440.147e4453@gandalf.local.home Fixes: 4b147936fa509 ("tracing: Add support for 'synthetic' events") Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08scs: fix a wrong parameter in __scs_magicZhichi Lin
commit 08bd4c46d5e63b78e77f2605283874bbe868ab19 upstream. __scs_magic() needs a 'void *' variable, but a 'struct task_struct *' is given. 'task_scs(tsk)' is the starting address of the task's shadow call stack, and '__scs_magic(task_scs(tsk))' is the end address of the task's shadow call stack. Here should be '__scs_magic(task_scs(tsk))'. The user-visible effect of this bug is that when CONFIG_DEBUG_STACK_USAGE is enabled, the shadow call stack usage checking function (scs_check_usage) would scan an incorrect memory range. This could lead to: 1. **Inaccurate stack usage reporting**: The function would calculate wrong usage statistics for the shadow call stack, potentially showing incorrect value in kmsg. 2. **Potential kernel crash**: If the value of __scs_magic(tsk)is greater than that of __scs_magic(task_scs(tsk)), the for loop may access unmapped memory, potentially causing a kernel panic. However, this scenario is unlikely because task_struct is allocated via the slab allocator (which typically returns lower addresses), while the shadow call stack returned by task_scs(tsk) is allocated via vmalloc(which typically returns higher addresses). However, since this is purely a debugging feature (CONFIG_DEBUG_STACK_USAGE), normal production systems should be not unaffected. The bug only impacts developers and testers who are actively debugging stack usage with this configuration enabled. Link: https://lkml.kernel.org/r/20251011082222.12965-1-zhichi.lin@vivo.com Fixes: 5bbaf9d1fcb9 ("scs: Add support for stack usage debugging") Signed-off-by: Jiyuan Xie <xiejiyuan@vivo.com> Signed-off-by: Zhichi Lin <zhichi.lin@vivo.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yee Lee <yee.lee@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08kallsyms: Fix wrong "big" kernel symbol type read from procfsZheng Yejian
commit f3f9f42232dee596d15491ca3f611d02174db49c upstream. Currently when the length of a symbol is longer than 0x7f characters, its type shown in /proc/kallsyms can be incorrect. I found this issue when reading the code, but it can be reproduced by following steps: 1. Define a function which symbol length is 130 characters: #define X13(x) x##x##x##x##x##x##x##x##x##x##x##x##x static noinline void X13(x123456789)(void) { printk("hello world\n"); } 2. The type in vmlinux is 't': $ nm vmlinux | grep x123456 ffffffff816290f0 t x123456789x123456789x123456789x12[...] 3. Then boot the kernel, the type shown in /proc/kallsyms becomes 'g' instead of the expected 't': # cat /proc/kallsyms | grep x123456 ffffffff816290f0 g x123456789x123456789x123456789x12[...] The root cause is that, after commit 73bbb94466fd ("kallsyms: support "big" kernel symbols"), ULEB128 was used to encode symbol name length. That is, for "big" kernel symbols of which name length is longer than 0x7f characters, the length info is encoded into 2 bytes. kallsyms_get_symbol_type() expects to read the first char of the symbol name which indicates the symbol type. However, due to the "big" symbol case not being handled, the symbol type read from /proc/kallsyms may be wrong, so handle it properly. Cc: stable@vger.kernel.org Fixes: 73bbb94466fd ("kallsyms: support "big" kernel symbols") Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Acked-by: Gary Guo <gary@garyguo.net> Link: https://patch.msgid.link/20241011143853.3022643-1-zhengyejian@huaweicloud.com Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08livepatch: Match old_sympos 0 and 1 in klp_find_func()Song Liu
[ Upstream commit 139560e8b973402140cafeb68c656c1374bd4c20 ] When there is only one function of the same name, old_sympos of 0 and 1 are logically identical. Match them in klp_find_func(). This is to avoid a corner case with different toolchain behavior. In this specific issue, two versions of kpatch-build were used to build livepatch for the same kernel. One assigns old_sympos == 0 for unique local functions, the other assigns old_sympos == 1 for unique local functions. Both versions work fine by themselves. (PS: This behavior change was introduced in a downstream version of kpatch-build. This change does not exist in upstream kpatch-build.) However, during livepatch upgrade (with the replace flag set) from a patch built with one version of kpatch-build to the same fix built with the other version of kpatch-build, livepatching fails with errors like: [ 14.218706] sysfs: cannot create duplicate filename 'xxx/somefunc,1' ... [ 14.219466] Call Trace: [ 14.219468] <TASK> [ 14.219469] dump_stack_lvl+0x47/0x60 [ 14.219474] sysfs_warn_dup.cold+0x17/0x27 [ 14.219476] sysfs_create_dir_ns+0x95/0xb0 [ 14.219479] kobject_add_internal+0x9e/0x260 [ 14.219483] kobject_add+0x68/0x80 [ 14.219485] ? kstrdup+0x3c/0xa0 [ 14.219486] klp_enable_patch+0x320/0x830 [ 14.219488] patch_init+0x443/0x1000 [ccc_0_6] [ 14.219491] ? 0xffffffffa05eb000 [ 14.219492] do_one_initcall+0x2e/0x190 [ 14.219494] do_init_module+0x67/0x270 [ 14.219496] init_module_from_file+0x75/0xa0 [ 14.219499] idempotent_init_module+0x15a/0x240 [ 14.219501] __x64_sys_finit_module+0x61/0xc0 [ 14.219503] do_syscall_64+0x5b/0x160 [ 14.219505] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 14.219507] RIP: 0033:0x7f545a4bd96d ... [ 14.219516] kobject: kobject_add_internal failed for somefunc,1 with -EEXIST, don't try to register things with the same name ... This happens because klp_find_func() thinks somefunc with old_sympos==0 is not the same as somefunc with old_sympos==1, and klp_add_object_nops adds another xxx/func,1 to the list of functions to patch. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> [pmladek@suse.com: Fixed some typos.] Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08sched/fair: Revert max_newidle_lb_cost bumpPeter Zijlstra
[ Upstream commit d206fbad9328ddb68ebabd7cf7413392acd38081 ] Many people reported regressions on their database workloads due to: 155213a2aed4 ("sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails") For instance Adam Li reported a 6% regression on SpecJBB. Conversely this will regress schbench again; on my machine from 2.22 Mrps/s down to 2.04 Mrps/s. Reported-by: Joseph Salisbury <joseph.salisbury@oracle.com> Reported-by: Adam Li <adamli@os.amperecomputing.com> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com Link: https://lkml.kernel.org/r/006c9df2-b691-47f1-82e6-e233c3f91faf@oracle.com Link: https://patch.msgid.link/20251107161739.406147760@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08sched/deadline: only set free_cpus for online runqueuesDoug Berger
[ Upstream commit 382748c05e58a9f1935f5a653c352422375566ea ] Commit 16b269436b72 ("sched/deadline: Modify cpudl::free_cpus to reflect rd->online") introduced the cpudl_set/clear_freecpu functions to allow the cpu_dl::free_cpus mask to be manipulated by the deadline scheduler class rq_on/offline callbacks so the mask would also reflect this state. Commit 9659e1eeee28 ("sched/deadline: Remove cpu_active_mask from cpudl_find()") removed the check of the cpu_active_mask to save some processing on the premise that the cpudl::free_cpus mask already reflected the runqueue online state. Unfortunately, there are cases where it is possible for the cpudl_clear function to set the free_cpus bit for a CPU when the deadline runqueue is offline. When this occurs while a CPU is connected to the default root domain the flag may retain the bad state after the CPU has been unplugged. Later, a different CPU that is transitioning through the default root domain may push a deadline task to the powered down CPU when cpudl_find sees its free_cpus bit is set. If this happens the task will not have the opportunity to run. One example is outlined here: https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com Another occurs when the last deadline task is migrated from a CPU that has an offlined runqueue. The dequeue_task member of the deadline scheduler class will eventually call cpudl_clear and set the free_cpus bit for the CPU. This commit modifies the cpudl_clear function to be aware of the online state of the deadline runqueue so that the free_cpus mask can be updated appropriately. It is no longer necessary to manage the mask outside of the cpudl_set/clear functions so the cpudl_set/clear_freecpu functions are removed. In addition, since the free_cpus mask is now only updated under the cpudl lock the code was changed to use the non-atomic __cpumask functions. Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18dma/pool: eliminate alloc_pages warning in atomic_pool_expandDave Kleikamp
[ Upstream commit 463d439becb81383f3a5a5d840800131f265a09c ] atomic_pool_expand iteratively tries the allocation while decrementing the page order. There is no need to issue a warning if an attempted allocation fails. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Fixes: d7e673ec2c8e ("dma-pool: Only allocate from CMA when in same memory zone") [mszyprow: fixed typo] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251202152810.142370-1-dave.kleikamp@oracle.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the ↵xupengbo
last task migrates out [ Upstream commit ca125231dd29fc0678dd3622e9cdea80a51dffe4 ] When a task is migrated out, there is a probability that the tg->load_avg value will become abnormal. The reason is as follows: 1. Due to the 1ms update period limitation in update_tg_load_avg(), there is a possibility that the reduced load_avg is not updated to tg->load_avg when a task migrates out. 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key function cfs_rq_is_decayed() does not check whether cfs->tg_load_avg_contrib is null. Consequently, in some cases, __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been updated to tg->load_avg. Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(), which fixes the case (2.) mentioned above. Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg") Signed-off-by: xupengbo <xupengbo@oppo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"Ilias Stamatis
[ Upstream commit 6fb3acdebf65a72df0a95f9fd2c901ff2bc9a3a2 ] Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic") removed an optimization introduced by commit 756398750e11 ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That was not called out in the message of the first commit explicitly so it's not entirely clear whether removing the optimization happened inadvertently or not. As the original commit message of the optimization explains there is no point considering the children of a subtree in find_next_iomem_res() if the top level range does not match. Reinstating the optimization results in performance improvements in systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem in such platforms takes 700-1500μs without the optimisation and 10-50μs with the optimisation. Note that even though commit 97523a4edb7b removed the 'sibling_only' parameter from next_resource(), newer kernels have basically reinstated it under the name 'skip_children'. Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic") Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18resource: introduce is_type_match() helper and use itAndy Shevchenko
[ Upstream commit ba1eccc114ffc62c4495a5e15659190fa2c42308 ] There are already a couple of places where we may replace a few lines of code by calling a helper, which increases readability while deduplicating the code. Introduce is_type_match() helper and use it. Link: https://lkml.kernel.org/r/20240925154355.1170859-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 6fb3acdebf65 ("Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18resource: replace open coded resource_intersection()Andy Shevchenko
[ Upstream commit 5c1edea773c98707fbb23d1df168bcff52f61e4b ] Patch series "resource: A couple of cleanups". A couple of ad-hoc cleanups since there was a recent development of the code in question. No functional changes intended. This patch (of 2): __region_intersects() uses open coded resource_intersection(). Replace it with existing API which also make more clear what we are checking. Link: https://lkml.kernel.org/r/20240925154355.1170859-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20240925154355.1170859-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 6fb3acdebf65 ("Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18cpuset: Treat cpusets in attaching as populatedChen Ridong
[ Upstream commit b1bcaed1e39a9e0dfbe324a15d2ca4253deda316 ] Currently, the check for whether a partition is populated does not account for tasks in the cpuset of attaching. This is a corner case that can leave a task stuck in a partition with no effective CPUs. The race condition occurs as follows: cpu0 cpu1 //cpuset A with cpu N migrate task p to A cpuset_can_attach // with effective cpus // check ok // cpuset_mutex is not held // clear cpuset.cpus.exclusive // making effective cpus empty update_exclusive_cpumask // tasks_nocpu_error check ok // empty effective cpus, partition valid cpuset_attach ... // task p stays in A, with non-effective cpus. To fix this issue, this patch introduces cs_is_populated, which considers tasks in the attaching cpuset. This new helper is used in validate_change and partition_is_populated. Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix invalid prog->stats access when update_effective_progs failsPu Lehui
[ Upstream commit 7dc211c1159d991db609bdf4b0fb9033c04adcbc ] Syzkaller triggers an invalid memory access issue following fault injection in update_effective_progs. The issue can be described as follows: __cgroup_bpf_detach update_effective_progs compute_effective_progs bpf_prog_array_alloc <-- fault inject purge_effective_progs /* change to dummy_bpf_prog */ array->items[index] = &dummy_bpf_prog.prog ---softirq start--- __do_softirq ... __cgroup_bpf_run_filter_skb __bpf_prog_run_save_cb bpf_prog_run stats = this_cpu_ptr(prog->stats) /* invalid memory access */ flags = u64_stats_update_begin_irqsave(&stats->syncp) ---softirq end--- static_branch_dec(&cgroup_bpf_enabled_key[atype]) The reason is that fault injection caused update_effective_progs to fail and then changed the original prog into dummy_bpf_prog.prog in purge_effective_progs. Then a softirq came, and accessing the members of dummy_bpf_prog.prog in the softirq triggers invalid mem access. To fix it, skip updating stats when stats is NULL. Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Handle return value of ftrace_set_filter_ip in register_fentryMenglong Dong
[ Upstream commit fea3f5e83c5cd80a76d97343023a2f2e6bd862bf ] The error that returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Free special fields when update [lru_,]percpu_hash mapsLeon Hwang
[ Upstream commit 6af6e49a76c9af7d42eb923703e7648cb2bf401a ] As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed. Fix this by calling 'bpf_obj_free_fields()' after 'copy_map_value[,_long]()' in 'pcpu_copy_value()'. Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18locktorture: Fix memory leak in param_set_cpumask()Wang Liang
[ Upstream commit e52b43883d084a9af263c573f2a1bd1ca5088389 ] With CONFIG_CPUMASK_OFFSTACK=y, the 'bind_writers' buffer is allocated via alloc_cpumask_var() in param_set_cpumask(). But it is not freed, when setting the module parameter multiple times by sysfs interface or removing module. Below kmemleak trace is seen for this issue: unreferenced object 0xffff888100aabff8 (size 8): comm "bash", pid 323, jiffies 4295059233 hex dump (first 8 bytes): 07 00 00 00 00 00 00 00 ........ backtrace (crc ac50919): __kmalloc_node_noprof+0x2e5/0x420 alloc_cpumask_var_node+0x1f/0x30 param_set_cpumask+0x26/0xb0 [locktorture] param_attr_store+0x93/0x100 module_attr_store+0x1b/0x30 kernfs_fop_write_iter+0x114/0x1b0 vfs_write+0x300/0x410 ksys_write+0x60/0xd0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f This issue can be reproduced by: insmod locktorture.ko bind_writers=1 rmmod locktorture or: insmod locktorture.ko bind_writers=1 echo 2 > /sys/module/locktorture/parameters/bind_writers Considering that setting the module parameter 'bind_writers' or 'bind_readers' by sysfs interface has no real effect, set the parameter permissions to 0444. To fix the memory leak when removing module, free 'bind_writers' and 'bind_readers' memory in lock_torture_cleanup(). Fixes: 73e341242483 ("locktorture: Add readers_bind and writers_bind module parameters") Suggested-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18task_work: Fix NMI race conditionPeter Zijlstra
[ Upstream commit ef1ea98c8fffe227e5319215d84a53fa2a4bcebc ] __schedule() // disable irqs <NMI> task_work_add(current, work, TWA_NMI_CURRENT); </NMI> // current = next; // enable irqs <IRQ> task_work_set_notify_irq() test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); // wrong task! </IRQ> // original task skips task work on its next return to user (or exit!) Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.") Reported-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080118.425949403@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix stackmap overflow check in __bpf_get_stackid()Arnaud Lecomte
[ Upstream commit 23f852daa4bab4d579110e034e4d513f7d490846 ] Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid() when copying stack trace data. The issue occurs when the perf trace contains more stack entries than the stack map bucket can hold, leading to an out-of-bounds write in the bucket's data array. Fixes: ee2a098851bf ("bpf: Adjust BPF stack helper functions to accommodate skip > 0") Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192941.1500-1-contact@arnaud-lcm.com Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Refactor stack map trace depth calculation into helper functionArnaud Lecomte
[ Upstream commit e17d62fedd10ae56e2426858bd0757da544dbc73 ] Extract the duplicated maximum allowed depth computation for stack traces stored in BPF stacks from bpf_get_stackid() and __bpf_get_stack() into a dedicated stack_map_calculate_max_depth() helper function. This unifies the logic for: - The max depth computation - Enforcing the sysctl_perf_event_max_stack limit No functional changes for existing code paths. Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192858.31424-1-contact@arnaud-lcm.com Stable-dep-of: 23f852daa4ba ("bpf: Fix stackmap overflow check in __bpf_get_stackid()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18perf: Remove get_perf_callchain() init_nr argumentJosh Poimboeuf
[ Upstream commit e649bcda25b5ae1a30a182cc450f928a0b282c93 ] The 'init_nr' argument has double duty: it's used to initialize both the number of contexts and the number of stack entries. That's confusing and the callers always pass zero anyway. Hard code the zero. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <Namhyung@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org Stable-dep-of: 23f852daa4ba ("bpf: Fix stackmap overflow check in __bpf_get_stackid()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18sched/fair: Forfeit vruntime on yieldFernand Sieber
[ Upstream commit 79104becf42baeeb4a3f2b106f954b9fc7c10a3c ] If a task yields, the scheduler may decide to pick it again. The task in turn may decide to yield immediately or shortly after, leading to a tight loop of yields. If there's another runnable task as this point, the deadline will be increased by the slice at each loop. This can cause the deadline to runaway pretty quickly, and subsequent elevated run delays later on as the task doesn't get picked again. The reason the scheduler can pick the same task again and again despite its deadline increasing is because it may be the only eligible task at that point. Fix this by making the task forfeiting its remaining vruntime and pushing the deadline one slice ahead. This implements yield behavior more authentically. We limit the forfeiting to eligible tasks. This is because core scheduling prefers running ineligible tasks rather than force idling. As such, without the condition, we can end up on a yield loop which makes the vruntime increase rapidly, leading to anomalous run delays later down the line. Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-12ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()Song Liu
[ Upstream commit 3e9a18e1c3e931abecf501cbb23d28d69f85bb56 ] ftrace_hash_ipmodify_enable() checks IPMODIFY and DIRECT ftrace_ops on the same kernel function. When needed, ftrace_hash_ipmodify_enable() calls ops->ops_func() to prepare the direct ftrace (BPF trampoline) to share the same function as the IPMODIFY ftrace (livepatch). ftrace_hash_ipmodify_enable() is called in register_ftrace_direct() path, but not called in modify_ftrace_direct() path. As a result, the following operations will break livepatch: 1. Load livepatch to a kernel function; 2. Attach fentry program to the kernel function; 3. Attach fexit program to the kernel function. After 3, the kernel function being used will not be the livepatched version, but the original version. Fix this by adding __ftrace_hash_update_ipmodify() to __modify_ftrace_direct() and adjust some logic around the call. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20251027175023.1521602-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>