path: root/kernel
Age  Commit message  Author
2025-12-03  gcov: add support for GCC 15  (Peter Oberparleiter)
commit ec4d11fc4b2dd4a2fa8c9d801ee9753b74623554 upstream. Using gcov on kernels compiled with GCC 15 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 15. Tested with GCC 14.3.0 and GCC 15.2.0. Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: https://github.com/linux-test-project/lcov/issues/445 Tested-by: Matthieu Baerts <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-03  tracing: Fix memory leaks in create_field_var()  (Zilin Guan)
[ Upstream commit 80f0d631dcc76ee1b7755bfca1d8417d91d71414 ] The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself (see the sketch below). Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: 02205a6752f22 ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
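A minimal sketch of what the corrected error path looks like (illustrative only, not the verbatim patch; it assumes the hist-trigger helpers destroy_hist_field() and create_var() behave as described above):

    field_var = kzalloc(sizeof(struct field_var), GFP_KERNEL);
    if (!field_var) {
            destroy_hist_field(val, 0);   /* tears down what parse_atom() built */
            kfree(var->type);             /* allocated internally by create_var() */
            kfree(var->var.name);         /* ditto */
            kfree(var);
            ret = -ENOMEM;
            goto err;
    }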
2025-12-03  uprobe: Do not emulate/sstep original instruction when ip is changed  (Jiri Olsa)
[ Upstream commit 4363264111e1297fa37aa39b0598faa19298ecca ] If the uprobe handler changes the instruction pointer, we still execute (single step or emulate) the original instruction and increment the (new) ip with its length. This makes the new instruction pointer bogus and the application will likely crash on illegal instruction execution. If the user decided to take execution elsewhere, it makes little sense to execute the original instruction, so let's skip it. Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250916215301.664963-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
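A rough sketch of the idea in the breakpoint handler (hypothetical placement; the real change lives in kernel/events/uprobes.c around handle_swbp()):

    /* after running the consumer handlers for this uprobe ... */
    if (instruction_pointer(regs) != bp_vaddr) {
            /* a handler redirected execution; emulating or single-stepping
             * the original instruction would only corrupt the new flow */
            goto out;
    }
    /* otherwise fall through to emulate or single-step the instruction */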
2025-10-29  padata: Reset next CPU when reorder sequence wraps around  (Xiao Liang)
[ Upstream commit 501302d5cee0d8e8ec2c4a5919c37e0df9abc99b ] When seq_nr wraps around, the next reorder job with seq 0 is hashed to the first CPU in padata_do_serial(). Correspondingly, pd->cpu needs to be reset to the first CPU when pd->processed wraps around. Otherwise, if the number of used CPUs is not a power of 2, padata_find_next() will be checking the wrong list, hence a deadlock. Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in padata_reorder") Cc: <stable@vger.kernel.org> Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [ moved from padata_reorder() to padata_find_next() function ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29  sched/fair: Fix pelt lost idle time detection  (Vincent Guittot)
[ Upstream commit 17e3e88ed0b6318fde0d1c14df1a804711cab1b5 ] The check for some lost idle pelt time should always be done when pick_next_task_fair() fails to pick a task, and not only when we call it from the fair fast path. The case happens when the last running task on the rq is an RT or DL task. When the latter goes to sleep and the \Sum of util_sum of the rq is at the max value, we don't account for the lost idle time whereas we should. Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29  sched/balancing: Rename newidle_balance() => sched_balance_newidle()  (Ingo Molnar)
[ Upstream commit 7d058285cd77cc1411c91efd1b1673530bb1bee8 ] Standardize scheduler load-balancing function names on the sched_balance_() prefix. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20240308111819.1101550-11-mingo@kernel.org Stable-dep-of: 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29  sched/fair: Trivial correction of the newidle_balance() comment  (Barry Song)
[ Upstream commit 5b78f2dc315354c05300795064f587366a02c6ff ] idle_balance() has been renamed to newidle_balance(). To differentiate it from nohz_idle_balance(), refining the comment will be helpful for readers of the code. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao.hua@hisilicon.com Stable-dep-of: 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29  sched: Make newidle_balance() static again  (Chen Yu)
[ Upstream commit d91cecc156620ec75d94c55369509c807c3d07e6 ] After commit 6e2df0581f56 ("sched: Fix pick_next_task() vs 'change' pattern race"), there is no need to expose newidle_balance() as it is only used within the fair.c file. Change this function back to static. No functional change. Reported-by: kbuild test robot <lkp@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/83cd3030b031ca5d646cd5e225be10e7a0fdd8f5.1587464698.git.yu.c.chen@intel.com Stable-dep-of: 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29  pid: Add a judgment for ns null in pid_nr_ns  (gaoxiang17)
[ Upstream commit 006568ab4c5ca2309ceb36fa553e390b4aa9c0c7 ] __task_pid_nr_ns ns = task_active_pid_ns(current); pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns); if (pid && ns->level <= pid->level) { Sometimes null is returned for task_active_pid_ns. Then it will trigger kernel panic in pid_nr_ns. For example: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 Mem abort info: ESR = 0x0000000096000007 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x07: level 3 translation fault Data abort info: ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 39-bit VAs, pgdp=00000002175aa000 [0000000000000058] pgd=08000002175ab003, p4d=08000002175ab003, pud=08000002175ab003, pmd=08000002175be003, pte=0000000000000000 pstate: 834000c5 (Nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : __task_pid_nr_ns+0x74/0xd0 lr : __task_pid_nr_ns+0x24/0xd0 sp : ffffffc08001bd10 x29: ffffffc08001bd10 x28: ffffffd4422b2000 x27: 0000000000000001 x26: ffffffd442821168 x25: ffffffd442821000 x24: 00000f89492eab31 x23: 00000000000000c0 x22: ffffff806f5693c0 x21: ffffff806f5693c0 x20: 0000000000000001 x19: 0000000000000000 x18: 0000000000000000 x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: 00000000023a1adc x14: 0000000000000003 x13: 00000000007ef6d8 x12: 001167c391c78800 x11: 00ffffffffffffff x10: 0000000000000000 x9 : 0000000000000001 x8 : ffffff80816fa3c0 x7 : 0000000000000000 x6 : 49534d702d535449 x5 : ffffffc080c4c2c0 x4 : ffffffd43ee128c8 x3 : ffffffd43ee124dc x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffffff806f5693c0 Call trace: __task_pid_nr_ns+0x74/0xd0 ... __handle_irq_event_percpu+0xd4/0x284 handle_irq_event+0x48/0xb0 handle_fasteoi_irq+0x160/0x2d8 generic_handle_domain_irq+0x44/0x60 gic_handle_irq+0x4c/0x114 call_on_irq_stack+0x3c/0x74 do_interrupt_handler+0x4c/0x84 el1_interrupt+0x34/0x58 el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x68/0x6c account_kernel_stack+0x60/0x144 exit_task_stack_account+0x1c/0x80 do_exit+0x7e4/0xaf8 ... get_signal+0x7bc/0x8d8 do_notify_resume+0x128/0x828 el0_svc+0x6c/0x70 el0t_64_sync_handler+0x68/0xbc el0t_64_sync+0x1a8/0x1ac Code: 35fffe54 911a02a8 f9400108 b4000128 (b9405a69) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception in interrupt Signed-off-by: gaoxiang17 <gaoxiang17@xiaomi.com> Link: https://lore.kernel.org/20250802022123.3536934-1-gxxa03070307@gmail.com Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
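The shape of the fix is a simple NULL check; a sketch assuming the mainline pid_nr_ns() layout (not the verbatim patch):

    pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
    {
            struct upid *upid;
            pid_t nr = 0;

            /* also bail out when the caller passed a NULL namespace */
            if (pid && ns && ns->level <= pid->level) {
                    upid = &pid->numbers[ns->level];
                    if (upid->ns == ns)
                            nr = upid->nr;
            }
            return nr;
    }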
2025-10-29  tracing: Fix race condition in kprobe initialization causing NULL pointer dereference  (Yuan Chen)
[ Upstream commit 9cf9aa7b0acfde7545c1a1d912576e9bab28dc6f ] There is a critical race condition in kprobe initialization that can lead to NULL pointer dereference and kernel crash.

  [1135630.084782] Unable to handle kernel paging request at virtual address 0000710a04630000
  ...
  [1135630.260314] pstate: 404003c9 (nZcv DAIF +PAN -UAO)
  [1135630.269239] pc : kprobe_perf_func+0x30/0x260
  [1135630.277643] lr : kprobe_dispatcher+0x44/0x60
  [1135630.286041] sp : ffffaeff4977fa40
  [1135630.293441] x29: ffffaeff4977fa40 x28: ffffaf015340e400
  [1135630.302837] x27: 0000000000000000 x26: 0000000000000000
  [1135630.312257] x25: ffffaf029ed108a8 x24: ffffaf015340e528
  [1135630.321705] x23: ffffaeff4977fc50 x22: ffffaeff4977fc50
  [1135630.331154] x21: 0000000000000000 x20: ffffaeff4977fc50
  [1135630.340586] x19: ffffaf015340e400 x18: 0000000000000000
  [1135630.349985] x17: 0000000000000000 x16: 0000000000000000
  [1135630.359285] x15: 0000000000000000 x14: 0000000000000000
  [1135630.368445] x13: 0000000000000000 x12: 0000000000000000
  [1135630.377473] x11: 0000000000000000 x10: 0000000000000000
  [1135630.386411] x9 : 0000000000000000 x8 : 0000000000000000
  [1135630.395252] x7 : 0000000000000000 x6 : 0000000000000000
  [1135630.403963] x5 : 0000000000000000 x4 : 0000000000000000
  [1135630.412545] x3 : 0000710a04630000 x2 : 0000000000000006
  [1135630.421021] x1 : ffffaeff4977fc50 x0 : 0000710a04630000
  [1135630.429410] Call trace:
  [1135630.434828]  kprobe_perf_func+0x30/0x260
  [1135630.441661]  kprobe_dispatcher+0x44/0x60
  [1135630.448396]  aggr_pre_handler+0x70/0xc8
  [1135630.454959]  kprobe_breakpoint_handler+0x140/0x1e0
  [1135630.462435]  brk_handler+0xbc/0xd8
  [1135630.468437]  do_debug_exception+0x84/0x138
  [1135630.475074]  el1_dbg+0x18/0x8c
  [1135630.480582]  security_file_permission+0x0/0xd0
  [1135630.487426]  vfs_write+0x70/0x1c0
  [1135630.493059]  ksys_write+0x5c/0xc8
  [1135630.498638]  __arm64_sys_write+0x24/0x30
  [1135630.504821]  el0_svc_common+0x78/0x130
  [1135630.510838]  el0_svc_handler+0x38/0x78
  [1135630.516834]  el0_svc+0x8/0x1b0

kernel/trace/trace_kprobe.c: 1308
  0xffff3df8995039ec <kprobe_perf_func+0x2c>: ldr x21, [x24,#120]
include/linux/compiler.h: 294
  0xffff3df8995039f0 <kprobe_perf_func+0x30>: ldr x1, [x21,x0]

kernel/trace/trace_kprobe.c
  1308: head = this_cpu_ptr(call->perf_events);
  1309: if (hlist_empty(head))
  1310:         return 0;

crash> struct trace_event_call -o
struct trace_event_call {
  ...
  [120] struct hlist_head *perf_events; //(call->perf_event)
  ...
}

crash> struct trace_event_call ffffaf015340e528
struct trace_event_call {
  ...
  perf_events = 0xffff0ad5fa89f088, //this value is correct, but x21 = 0
  ...
}

Race Condition Analysis:

The race occurs between kprobe activation and perf_events initialization:

  CPU0                                     CPU1
  ====                                     ====
  perf_kprobe_init
    perf_trace_event_init
      tp_event->perf_events = list; (1)
      tp_event->class->reg (2)  ← KPROBE ACTIVE
                                           Debug exception triggers
                                           ...
                                           kprobe_dispatcher
                                             kprobe_perf_func
                                               (tk->tp.flags & TP_FLAG_PROFILE)
                                               head = this_cpu_ptr(call->perf_events) (3)
                                               (perf_events is still NULL)

Problem:
1. CPU0 executes (1) assigning tp_event->perf_events = list
2. CPU0 executes (2) enabling kprobe functionality via class->reg()
3. CPU1 triggers and reaches kprobe_dispatcher
4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
5. CPU1 calls kprobe_perf_func() and crashes at (3) because call->perf_events is still NULL

CPU1 sees that kprobe functionality is enabled but does not see that perf_events has been assigned.
Add pairing read and write memory barriers to guarantee that if CPU1 sees that kprobe functionality is enabled, it must also see that perf_events has been assigned. Link: https://lore.kernel.org/all/20251001022025.44626-1-chenyuan_fl@163.com/ Fixes: 50d780560785 ("tracing/kprobes: Add probe handler dispatcher to support perf and ftrace concurrent use") Cc: stable@vger.kernel.org Signed-off-by: Yuan Chen <chenyuan@kylinos.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> [ Dropped ftrace changes + context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
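A minimal sketch of the pairing described above (illustrative only; the flag and field names follow the commit text, and the exact placement in the perf/kprobe registration paths is simplified):

    /* CPU0: publisher side (perf event init path) */
    tp_event->perf_events = list;
    smp_wmb();      /* publish perf_events before the kprobe is armed */
    tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);

    /* CPU1: consumer side (kprobe_dispatcher()) */
    if (tk->tp.flags & TP_FLAG_PROFILE) {
            smp_rmb();      /* pairs with the smp_wmb() above */
            kprobe_perf_func(tk, regs);     /* now guaranteed to see perf_events */
    }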
2025-10-02  cgroup: split cgroup_destroy_wq into 3 workqueues  (Chen Ridong)
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ] A hung task can occur during [1] LTP cgroup testing when repeatedly mounting/unmounting perf_event and net_prio controllers with systemd.unified_cgroup_hierarchy=1. The hang manifests in cgroup_lock_and_drain_offline() during root destruction.

Related case:
  cgroup_fj_function_perf_event  cgroup_fj_function.sh perf_event
  cgroup_fj_function_net_prio    cgroup_fj_function.sh net_prio

Call Trace:
  cgroup_lock_and_drain_offline+0x14c/0x1e8
  cgroup_destroy_root+0x3c/0x2c0
  css_free_rwork_fn+0x248/0x338
  process_one_work+0x16c/0x3b8
  worker_thread+0x22c/0x3b0
  kthread+0xec/0x100
  ret_from_fork+0x10/0x20

Root Cause:
  CPU0                              CPU1
  mount perf_event                  umount net_prio
  cgroup1_get_tree                  cgroup_kill_sb
  rebind_subsystems                 // root destruction enqueues
                                    // cgroup_destroy_wq
  // kill all perf_event css
                                    // one perf_event css A is dying
                                    // css A offline enqueues cgroup_destroy_wq
                                    // root destruction will be executed first
                                    css_free_rwork_fn
                                    cgroup_destroy_root
                                    cgroup_lock_and_drain_offline
                                    // some perf descendants are dying
                                    // cgroup_destroy_wq max_active = 1
                                    // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution: Split cgroup_destroy_wq into three dedicated workqueues:
  cgroup_offline_wq – Handles CSS offline operations
  cgroup_release_wq – Manages resource release
  cgroup_free_wq    – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers

Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends") Reported-by: Gao Yingjie <gaoyingjie@uniontech.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
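A sketch of what the split looks like at init time (workqueue names taken from the commit text; ordered max_active=1 mirrors the original cgroup_destroy_wq, everything else is an assumption):

    static struct workqueue_struct *cgroup_offline_wq;   /* CSS offline */
    static struct workqueue_struct *cgroup_release_wq;   /* resource release */
    static struct workqueue_struct *cgroup_free_wq;      /* final freeing */

    static int __init cgroup_wq_init(void)
    {
            cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
            cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
            cgroup_free_wq    = alloc_workqueue("cgroup_free", 0, 1);
            BUG_ON(!cgroup_offline_wq || !cgroup_release_wq || !cgroup_free_wq);
            return 0;
    }
    core_initcall(cgroup_wq_init);

With separate queues, a root-destruction work item waiting in cgroup_lock_and_drain_offline() can no longer starve the offline work it is waiting for.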
2025-10-02  genirq: Provide new interfaces for affinity hints  (Thomas Gleixner)
[ Upstream commit 65c7cdedeb3026fabcc967a7aae2f755ad4d0783 ] The discussion about removing the side effect of irq_set_affinity_hint() of actually applying the cpumask (if not NULL) as affinity to the interrupt unearthed a few unpleasantries:

1) The modular perf drivers rely on the current behaviour for the very wrong reasons.
2) While none of the other drivers prevents user space from changing the affinity, a cursory inspection shows that there are at least expectations in some drivers.

#1 needs to be cleaned up anyway, so that's not a problem. #2 might result in subtle regressions especially when irqbalanced (which nowadays ignores the affinity hint) is disabled.

Provide new interfaces:
  irq_update_affinity_hint()  - Only sets the affinity hint pointer
  irq_set_affinity_and_hint() - Sets the pointer and applies the affinity to the interrupt

Make irq_set_affinity_hint() a wrapper around irq_apply_affinity_hint() and document it to be phased out. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210501021832.743094-1-jesse.brandeburg@intel.com Link: https://lore.kernel.org/r/20210903152430.244937-2-nitesh@redhat.com Stable-dep-of: 915470e1b44e ("i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path") Signed-off-by: Sasha Levin <sashal@kernel.org>
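Roughly, the resulting wrappers look like this (a sketch of the include/linux/interrupt.h helpers; the internal __irq_apply_affinity_hint() name and bool parameter are assumptions based on the commit text):

    /* only update the hint pointer; do not touch the affinity itself */
    static inline int irq_update_affinity_hint(unsigned int irq,
                                               const struct cpumask *m)
    {
            return __irq_apply_affinity_hint(irq, m, false);
    }

    /* update the hint and also apply it as the interrupt's affinity */
    static inline int irq_set_affinity_and_hint(unsigned int irq,
                                                const struct cpumask *m)
    {
            return __irq_apply_affinity_hint(irq, m, true);
    }

    /* legacy behaviour, kept as a wrapper and documented for phase-out */
    static inline int irq_set_affinity_hint(unsigned int irq,
                                            const struct cpumask *m)
    {
            return irq_set_affinity_and_hint(irq, m);
    }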
2025-10-02  genirq: Export affinity setter for modules  (Thomas Gleixner)
[ Upstream commit 4d80d6ca5d77fde9880da8466e5b64f250e5bf82 ] Perf modules abuse irq_set_affinity_hint() to set the affinity of system PMU interrupts just because irq_set_affinity() was not exported. The fact that irq_set_affinity_hint() actually sets the affinity is a non-documented side effect and the name is clearly saying it's a hint. To clean this up, export the real affinity setter. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20210518093117.968251441@linutronix.de Stable-dep-of: 915470e1b44e ("i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02  genirq/affinity: Add irq_update_affinity_desc()  (John Garry)
[ Upstream commit 1d3aec89286254487df7641c30f1b14ad1d127a5 ] Add a function to allow the affinity of an interrupt be switched to managed, such that interrupts allocated for platform devices may be managed. This new interface has certain limitations, and attempts to use it in the following circumstances will fail: - For when the kernel is configured for generic IRQ reservation mode (in config GENERIC_IRQ_RESERVATION_MODE). The reason being that it could conflict with managed vs. non-managed interrupt accounting. - The interrupt is already started, which should not be the case during init - The interrupt is already configured as managed, which means double init Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1606905417-183214-2-git-send-email-john.garry@huawei.com Stable-dep-of: 915470e1b44e ("i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09  cpufreq/sched: Explicitly synchronize limits_changed flag handling  (Rafael J. Wysocki)
[ Upstream commit 79443a7e9da3c9f68290a8653837e23aba0fa89f ] The handling of the limits_changed flag in struct sugov_policy needs to be explicitly synchronized to ensure that cpufreq policy limits updates will not be missed in some cases. Without that synchronization it is theoretically possible that the limits_changed update in sugov_should_update_freq() will be reordered with respect to the reads of the policy limits in cpufreq_driver_resolve_freq() and in that case, if the limits_changed update in sugov_limits() clobbers the one in sugov_should_update_freq(), the new policy limits may not take effect for a long time. Likewise, the limits_changed update in sugov_limits() may theoretically get reordered with respect to the updates of the policy limits in cpufreq_set_policy() and if sugov_should_update_freq() runs between them, the policy limits change may be missed. To ensure that the above situations will not take place, add memory barriers preventing the reordering in question from taking place and add READ_ONCE() and WRITE_ONCE() annotations around all of the limits_changed flag updates to prevent the compiler from messing up with that code. Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Cc: 5.3+ <stable@vger.kernel.org> # 5.3+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net [ Adjust context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
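A condensed sketch of the pattern the commit describes (simplified and hedged; the real code is in kernel/sched/cpufreq_schedutil.c and the exact barrier placement may differ):

    /* sugov_limits(): called after cpufreq_set_policy() updated the limits */
    /* order the policy-limit updates before setting the flag */
    smp_wmb();
    WRITE_ONCE(sg_policy->limits_changed, true);

    /* sugov_should_update_freq(): */
    if (unlikely(READ_ONCE(sg_policy->limits_changed))) {
            sg_policy->need_freq_update = true;
            WRITE_ONCE(sg_policy->limits_changed, false);
            /* order the flag clear before re-reading the policy limits
             * in cpufreq_driver_resolve_freq() */
            smp_mb();
            return true;
    }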
2025-09-04  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump  (Tengda Wu)
[ Upstream commit 4013aef2ced9b756a410f50d12df9ebe6a883e4a ] When calling ftrace_dump_one() concurrently with reading trace_pipe, a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race condition. The issue occurs because:

  CPU0 (ftrace_dump)                            CPU1 (reader)
  echo z > /proc/sysrq-trigger
  !trace_empty(&iter)
  trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
  trace_find_next_entry_inc(&iter)
    __find_next_entry
      ring_buffer_empty_cpu <- all empty
    return NULL
  trace_printk_seq(&iter.seq)
    WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc() during ftrace_dump, the ring buffer data was consumed by other readers. This caused trace_find_next_entry_inc to return NULL, failing to populate `iter.seq`. At this point, due to the prior trace_iterator_reset, both `iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal, the WARN_ON_ONCE condition is triggered. Move the trace_printk_seq() into the if block that checks to make sure the return value of trace_find_next_entry_inc() is non-NULL in ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before subsequent operations. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com Fixes: d769041f8653 ("ring_buffer: implement new locking") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
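In other words, the output call moves under the existing non-NULL check; a simplified sketch of the dump loop after the fix (not the verbatim patch):

    while (!trace_empty(&iter)) {
            ...
            if (trace_find_next_entry_inc(&iter) != NULL) {
                    int ret;

                    ret = print_trace_line(&iter);
                    if (ret != TRACE_TYPE_NO_CONSUME)
                            trace_consume(&iter);

                    /* only print once iter.seq was actually populated */
                    trace_printk_seq(&iter.seq);
            }
            touch_nmi_watchdog();
    }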
2025-08-28  mm: drop the assumption that VM_SHARED always implies writable  (Lorenzo Stoakes)
[ Upstream commit e8e17ee90eaf650c855adb0a3e5e965fd6692ff1 ] Patch series "permit write-sealed memfd read-only shared mappings", v4. The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:- Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM. With emphasis on 'writable'. In turns out in fact that currently the kernel simply disallows all new shared memory mappings for a memfd with F_SEAL_WRITE applied, rendering this documentation inaccurate. This matters because users are therefore unable to obtain a shared mapping to a memfd after write sealing altogether, which limits their usefulness. This was reported in the discussion thread [1] originating from a bug report [2]. This is a product of both using the struct address_space->i_mmap_writable atomic counter to determine whether writing may be permitted, and the kernel adjusting this counter when any VM_SHARED mapping is performed and more generally implicitly assuming VM_SHARED implies writable. It seems sensible that we should only update this mapping if VM_MAYWRITE is specified, i.e. whether it is possible that this mapping could at any point be written to. If we do so then all we need to do to permit write seals to function as documented is to clear VM_MAYWRITE when mapping read-only. It turns out this functionality already exists for F_SEAL_FUTURE_WRITE - we can therefore simply adapt this logic to do the same for F_SEAL_WRITE. We then hit a chicken and egg situation in mmap_region() where the check for VM_MAYWRITE occurs before we are able to clear this flag. To work around this, perform this check after we invoke call_mmap(), with careful consideration of error paths. Thanks to Andy Lutomirski for the suggestion! [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/ [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238 This patch (of 3): There is a general assumption that VMAs with the VM_SHARED flag set are writable. If the VM_MAYWRITE flag is not set, then this is simply not the case. Update those checks which affect the struct address_space->i_mmap_writable field to explicitly test for this by introducing [vma_]is_shared_maywrite() helper functions. This remains entirely conservative, as the lack of VM_MAYWRITE guarantees that the VMA cannot be written to. Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
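The helpers introduced here are small predicates; a sketch consistent with the names given above (the i_mmap_writable call site is illustrative):

    static inline bool is_shared_maywrite(vm_flags_t vm_flags)
    {
            return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
                   (VM_SHARED | VM_MAYWRITE);
    }

    static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
    {
            return is_shared_maywrite(vma->vm_flags);
    }

    /* e.g. the i_mmap_writable accounting then becomes: */
    if (vma_is_shared_maywrite(vma))
            mapping_allow_writable(vma->vm_file->f_mapping);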
2025-08-28  tracing: Add down_write(trace_event_sem) when adding trace event  (Steven Rostedt)
[ Upstream commit b5e8acc14dcb314a9b61ff19dcd9fdd0d88f70df ] When a module is loaded, it adds trace events defined by the module. It may also need to modify the module's trace printk formats to replace enum names with their values. If two modules are loaded at the same time, the adding of the event to the ftrace_events list can corrupt the walking of the list in the code that is modifying the printk format strings and crash the kernel. The addition of the event should take the trace_event_sem for write while it adds the new event. Also add a lockdep_assert_held() on that semaphore in __trace_add_event_dirs() as it iterates the list. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home Reported-by: Fusheng Huang(黄富生) <Fusheng.Huang@luxshare-ict.com> Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/ Fixes: 110bf2b764eb6 ("tracing: add protection around module events unload") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28  ftrace: Also allocate and copy hash for reading of filter files  (Steven Rostedt)
commit bfb336cf97df7b37b2b2edec0f69773e06d11955 upstream. Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds the pointer to the global tracer hash to its iterator. Unlike the writer that allocates a copy of the hash, the reader keeps the pointer to the filter hashes. This is problematic because this pointer is static across function calls that release the locks that can update the global tracer hashes. This can cause UAF and similar bugs. Allocate and copy the hash for reading the filter files like it is done for the writers. This not only fixes UAF bugs, but also makes the code a bit simpler as it doesn't have to differentiate when to free the iterator's hash between writers and readers. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home Fixes: c20489dad156 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read") Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/ Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/ Reported-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28  rcu: Protect ->defer_qs_iw_pending from data race  (Paul E. McKenney)
[ Upstream commit 90c09d57caeca94e6f3f87c49e96a91edd40cbfd ] On kernels built with CONFIG_IRQ_WORK=y, when rcu_read_unlock() is invoked within an interrupts-disabled region of code [1], it will invoke rcu_read_unlock_special(), which uses an irq-work handler to force the system to notice when the RCU read-side critical section actually ends. That end won't happen until interrupts are enabled at the soonest. In some kernels, such as those booted with rcutree.use_softirq=y, the irq-work handler is used unconditionally. The per-CPU rcu_data structure's ->defer_qs_iw_pending field is updated by the irq-work handler and is both read and updated by rcu_read_unlock_special(). This resulted in the following KCSAN splat: ------------------------------------------------------------------------ BUG: KCSAN: data-race in rcu_preempt_deferred_qs_handler / rcu_read_unlock_special read to 0xffff96b95f42d8d8 of 1 bytes by task 90 on cpu 8: rcu_read_unlock_special+0x175/0x260 __rcu_read_unlock+0x92/0xa0 rt_spin_unlock+0x9b/0xc0 __local_bh_enable+0x10d/0x170 __local_bh_enable_ip+0xfb/0x150 rcu_do_batch+0x595/0xc40 rcu_cpu_kthread+0x4e9/0x830 smpboot_thread_fn+0x24d/0x3b0 kthread+0x3bd/0x410 ret_from_fork+0x35/0x40 ret_from_fork_asm+0x1a/0x30 write to 0xffff96b95f42d8d8 of 1 bytes by task 88 on cpu 8: rcu_preempt_deferred_qs_handler+0x1e/0x30 irq_work_single+0xaf/0x160 run_irq_workd+0x91/0xc0 smpboot_thread_fn+0x24d/0x3b0 kthread+0x3bd/0x410 ret_from_fork+0x35/0x40 ret_from_fork_asm+0x1a/0x30 no locks held by irq_work/8/88. irq event stamp: 200272 hardirqs last enabled at (200272): [<ffffffffb0f56121>] finish_task_switch+0x131/0x320 hardirqs last disabled at (200271): [<ffffffffb25c7859>] __schedule+0x129/0xd70 softirqs last enabled at (0): [<ffffffffb0ee093f>] copy_process+0x4df/0x1cc0 softirqs last disabled at (0): [<0000000000000000>] 0x0 ------------------------------------------------------------------------ The problem is that irq-work handlers run with interrupts enabled, which means that rcu_preempt_deferred_qs_handler() could be interrupted, and that interrupt handler might contain an RCU read-side critical section, which might invoke rcu_read_unlock_special(). In the strict KCSAN mode of operation used by RCU, this constitutes a data race on the ->defer_qs_iw_pending field. This commit therefore disables interrupts across the portion of the rcu_preempt_deferred_qs_handler() that updates the ->defer_qs_iw_pending field. This suffices because this handler is not a fast path. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
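The change is confined to the irq-work handler; a sketch (not the verbatim patch) of rcu_preempt_deferred_qs_handler() with interrupts masked around the flag update:

    static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
    {
            unsigned long flags;
            struct rcu_data *rdp = container_of(iwp, struct rcu_data, defer_qs_iw);

            /* irq-work handlers run with irqs enabled; mask them so an
             * interrupting rcu_read_unlock_special() cannot race on this
             * field (this is not a fast path, so the cost is acceptable) */
            local_irq_save(flags);
            rdp->defer_qs_iw_pending = false;
            local_irq_restore(flags);
    }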
2025-08-28  PM: sleep: console: Fix the black screen issue  (tuhaowen)
[ Upstream commit 4266e8fa56d3d982bf451d382a410b9db432015c ] When the computer enters sleep status without a monitor connected, the system switches the console to the virtual terminal tty63(SUSPEND_CONSOLE). If a monitor is subsequently connected before waking up, the system skips the required VT restoration process during wake-up, leaving the console on tty63 instead of switching back to tty1. To fix this issue, a global flag vt_switch_done is introduced to record whether the system has successfully switched to the suspend console via vt_move_to_console() during suspend. If the switch was completed, vt_switch_done is set to 1. Later during resume, this flag is checked to ensure that the original console is restored properly by calling vt_move_to_console(orig_fgconsole, 0). This prevents scenarios where the resume logic skips console restoration due to incorrect detection of the console state, especially when a monitor is reconnected before waking up. Signed-off-by: tuhaowen <tuhaowen@uniontech.com> Link: https://patch.msgid.link/20250611032345.29962-1-tuhaowen@uniontech.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28  perf/core: Prevent VMA split of buffer mappings  (Thomas Gleixner)
commit b024d7b56c77191cde544f838debb7f8451cd0d6 upstream. The perf mmap code is careful about mmap()'ing the user page with the ringbuffer and additionally the auxiliary buffer, when the event supports it. Once the first mapping is established, subsequent mappings have to use the same offset and the same size in both cases. The reference counting for the ringbuffer and the auxiliary buffer depends on this being correct. Though perf does not prevent a related mapping from being split via mmap(2), munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls, which take reference counts, but then the subsequent perf_mmap_close() calls are no longer fulfilling the offset and size checks. This leads to reference count leaks. As perf already has the requirement for subsequent mappings to match the initial mapping, the obvious consequence is that VMA splits, caused by resizing of a mapping or partial unmapping, have to be prevented. Implement the vm_operations_struct::may_split() callback and return -EINVAL unconditionally. That ensures that the mapping offsets and sizes cannot be changed after the fact. Remapping to a different fixed address with the same size is still possible as it takes the references for the new mapping and drops those of the old mapping. Fixes: 45bfb2e50471 ("perf/core: Add AUX area to ring buffer for raw data streams") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
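The callback itself is tiny; a sketch of the shape described above (the callback function name is illustrative, the rest of the vm_operations_struct follows kernel/events/core.c):

    static int perf_mmap_may_split(struct vm_area_struct *vma, unsigned long addr)
    {
            /* splitting the ringbuffer/AUX mapping would break the
             * open/close refcount pairing, so refuse it outright */
            return -EINVAL;
    }

    static const struct vm_operations_struct perf_mmap_vmops = {
            .open         = perf_mmap_open,
            .close        = perf_mmap_close,
            .fault        = perf_mmap_fault,
            .page_mkwrite = perf_mmap_fault,
            .may_split    = perf_mmap_may_split,
    };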
2025-08-28  perf/core: Exit early on perf_mmap() fail  (Thomas Gleixner)
commit 07091aade394f690e7b655578140ef84d0e8d7b0 upstream. When perf_mmap() fails to allocate a buffer, it still invokes the event_mapped() callback of the related event. On X86 this might increase the perf_rdpmc_allowed reference counter. But nothing undoes this as perf_mmap_close() is never called in this case, which causes another reference count leak. Return early on failure to prevent that. Fixes: 1e0fb9ec679c ("perf/core: Add pmu callbacks to track event mapping and unmapping") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28  perf/core: Don't leak AUX buffer refcount on allocation failure  (Thomas Gleixner)
commit 5468c0fbccbb9d156522c50832244a8b722374fb upstream. Failure of the AUX buffer allocation leaks the reference count. Set the reference count to 1 only when the allocation succeeds. Fixes: 45bfb2e50471 ("perf/core: Add AUX area to ring buffer for raw data streams") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27  perf: Fix sample vs do_exit()  (Peter Zijlstra)
[ Upstream commit 4f6fc782128355931527cefe3eb45338abd8ab39 ] Baisheng Gao reported an ARM64 crash, which Mark decoded as being a synchronous external abort -- most likely due to trying to access MMIO in bad ways. The crash further shows perf trying to do a user stack sample while in exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address space it is trying to access. It turns out that we stop perf after we tear down the userspace mm; a recipe for disaster, since perf likes to access userspace for various reasons. Flip this order by moving up where we stop perf in do_exit(). Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER to abort when the current task does not have an mm (exit_mm() makes sure to set current->mm = NULL; before commencing with the actual teardown), such that CPU-wide events don't trip on this same problem. Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample") Reported-by: Baisheng Gao <baisheng.gao@unisoc.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27  posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del()  (Oleg Nesterov)
commit f90fff1e152dedf52b932240ebbd670d83330eca upstream. If an exiting non-autoreaping task has already passed exit_notify() and calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent or debugger right after unlock_task_sighand(). If a concurrent posix_cpu_timer_del() runs at that moment, it won't be able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or lock_task_sighand() will fail. Add the tsk->exit_state check into run_posix_cpu_timers() to fix this. This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because exit_task_work() is called before exit_notify(). But the check still makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail anyway in this case. Cc: stable@vger.kernel.org Reported-by: Benoît Sevens <bsevens@google.com> Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
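A sketch of the added check at the top of run_posix_cpu_timers() (tsk is the current task; the exact surrounding code varies between kernel versions):

    lockdep_assert_irqs_disabled();

    /*
     * An exiting task past exit_notify() may be reaped by its parent or
     * debugger right after unlock_task_sighand(); bail out so a concurrent
     * posix_cpu_timer_del() can still observe timer->it.cpu.firing.
     */
    if (tsk->exit_state)
            return;

    if (!fastpath_timer_check(tsk))
            return;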
2025-06-27  ftrace: Fix UAF when lookup kallsym after ftrace disabled  (Ye Bin)
commit f914b52c379c12288b7623bb814d0508dbe7481d upstream. The following issue happens with a buggy module: BUG: unable to handle page fault for address: ffffffffc05d0218 PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS RIP: 0010:sized_strscpy+0x81/0x2f0 RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000 RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68 R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038 R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ftrace_mod_get_kallsym+0x1ac/0x590 update_iter_mod+0x239/0x5b0 s_next+0x5b/0xa0 seq_read_iter+0x8c9/0x1070 seq_read+0x249/0x3b0 proc_reg_read+0x1b0/0x280 vfs_read+0x17f/0x920 ksys_read+0xf3/0x1c0 do_syscall_64+0x5f/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The above issue may happen as follows: (1) Add kprobe tracepoint; (2) insmod test.ko; (3) Module triggers ftrace disabled; (4) rmmod test.ko; (5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed; ftrace_mod_get_kallsym() ... strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); ... The problem is when a module triggers an issue with ftrace and sets ftrace_disable. The ftrace_disable is set when an anomaly is discovered and to prevent any more damage, ftrace stops all text modification. The issue that happened was that the ftrace_disable stops more than just the text modification. When a module is loaded, its init functions can also be traced. Because kallsyms deletes the init functions after a module has loaded, ftrace saves them when the module is loaded and function tracing is enabled. This allows the output of the function trace to show the init function names instead of just their raw memory addresses. When a module is removed, ftrace_release_mod() is called, and if ftrace_disable is set, it just returns without doing anything more. The problem here is that it leaves the mod_list still around and if kallsyms is called, it will call into this code and access the module memory that has already been freed as it will return: strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); Where the "mod" no longer exists and triggers a UAF bug. Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/ Cc: stable@vger.kernel.org Fixes: aba4b5c22cba ("ftrace: Save module init functions kallsyms symbols for tracing") Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27  bpf: Fix WARN() in get_bpf_raw_tp_regs  (Tao Chen)
[ Upstream commit 3880cdbed1c4607e378f58fa924c5d6df900d1d3 ] syzkaller reported an issue: WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861 Modules linked in: CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861 RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005 RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003 R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004 R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900 FS: 0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline] bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931 bpf_prog_ec3b2eefa702d8d3+0x43/0x47 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline] __bpf_prog_run include/linux/filter.h:718 [inline] bpf_prog_run include/linux/filter.h:725 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline] bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405 __bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47 __traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47 __do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline] trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline] __mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline] mmap_read_trylock include/linux/mmap_lock.h:204 [inline] stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157 __bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483 ____bpf_get_stack kernel/bpf/stackmap.c:499 [inline] bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline] bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931 bpf_prog_ec3b2eefa702d8d3+0x43/0x47 Tracepoint like trace_mmap_lock_acquire_returned may cause nested call as the corner case show above, which will be resolved with more general method in the future. As a result, WARN_ON_ONCE will be triggered. As Alexei suggested, remove the WARN_ON_ONCE first. Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data") Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
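The change amounts to dropping the WARN while keeping the overflow protection; a simplified sketch of get_bpf_raw_tp_regs() after the fix (based on kernel/trace/bpf_trace.c, details may differ per branch):

    static struct pt_regs *get_bpf_raw_tp_regs(void)
    {
            struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs);
            int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level);

            /* legitimate nesting (see the trace above) can exceed the
             * preallocated depth; fail gracefully instead of WARN()ing */
            if (nest_level > ARRAY_SIZE(tp_regs->regs)) {
                    this_cpu_dec(bpf_raw_tp_nest_level);
                    return ERR_PTR(-EBUSY);
            }

            return &tp_regs->regs[nest_level - 1];
    }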
2025-06-27  PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks()  (Zijun Hu)
[ Upstream commit f0050a3e214aa941b78ad4caf122a735a24d81a6 ] pm_show_wakelocks() is called to generate a string when showing attributes /sys/power/wake_(lock|unlock), but the string ends with an unwanted space that was added back by mistake by commit c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()"). Remove the unwanted space. Fixes: c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()") Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://patch.msgid.link/20250505-fix_power-v1-1-0f7f2c2f338c@quicinc.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27  perf/core: Fix broken throttling when max_samples_per_tick=1  (Qing Wang)
[ Upstream commit f51972e6f8b9a737b2b3eb588069acb538fa72de ] According to the throttling mechanism, the number of PMU interrupts cannot exceed max_samples_per_tick in one tick. But this mechanism is ineffective when max_samples_per_tick=1, because the throttling check is skipped during the first interrupt and only performed when the second interrupt arrives. Perhaps this bug has little impact within one tick, but on a larger time scale the problem cannot be underestimated. When max_samples_per_tick = 1:

  Allowed-interrupts-per-second  max-samples-per-second  default-HZ  ARCH
  200                            100                     100         X86
  500                            250                     250         ARM64
  ...

Obviously, the PMU interrupt number far exceeds what the user expects. Fixes: e050e3f0a71b ("perf: Fix broken interrupt rate throttling") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27  tracing: Fix compilation warning on arm32  (Pan Taixi)
commit 2fbdb6d8e03b70668c0876e635506540ae92ab05 upstream. On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is unsigned long. This hence triggers a compilation warning as min() asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t solves this issue and works on other target architectures as well. Compilation warning details: kernel/trace/trace.c: In function 'tracing_splice_read_pipe': ./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) ^ ./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck' (__typecheck(x, y) && __no_side_effects(x, y)) ^~~~~~~~~~~ ... kernel/trace/trace.c:6771:8: note: in expansion of macro 'min' min((size_t)trace_seq_used(&iter->seq), ^~~ Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com Fixes: f5178c41bb43 ("tracing: Fix oob write in trace_seq_to_buffer()") Reviewed-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
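The fix is just a cast so both min() operands have the same type; a sketch of the call site in tracing_splice_read_pipe() as described by the warning above (surrounding code elided):

    /* trace_seq_used() yields a size_t-compatible value after the earlier
     * oob fix, while PAGE_SIZE is unsigned long on arm32; cast both sides
     * so min()'s type check passes on every architecture */
    trace_seq_to_buffer(&iter->seq, page_address(spd.pages[i]),
                        min((size_t)trace_seq_used(&iter->seq),
                            (size_t)PAGE_SIZE));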
2025-06-04  fork: use pidfd_prepare()  (Christian Brauner)
commit ca7707f5430ad6b1c9cb7cee0a7f67d69328bb2d upstream. Stop open-coding get_unused_fd_flags() and anon_inode_getfile(). That's brittle just for keeping the flags between both calls in sync. Use the dedicated helper. Message-Id: <20230327-pidfd-file-api-v1-2-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04  pid: add pidfd_prepare()  (Christian Brauner)
commit 6ae930d9dbf2d093157be33428538c91966d8a9f upstream. Add a new helper that allows to reserve a pidfd and allocates a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open code this function or that call pidfd_create() but then have to call close_fd() because there are still failure points after pidfd_create() has been called. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230327-pidfd-file-api-v1-1-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04  pidfd: check pid has attached task in fdinfo  (Christian Brauner)
commit 3d6d8da48d0b214d65ea0227d47228abc75d7c88 upstream. Currently, when a task is dead we still print the pid it used to use in the fdinfo files of its pidfds. This doesn't make much sense since the pid may have already been reused. So verify that the task is still alive by introducing the pid_has_task() helper which will be used by other callers in follow-up patches. If the task is not alive anymore, we will print -1. This allows us to differentiate between a task not being present in a given pid namespace - in which case we already print 0 - and a task having been reaped. Note that this uses PIDTYPE_PID for the check. Technically, we could've checked PIDTYPE_TGID since pidfds currently only refer to thread-group leaders but if they won't anymore in the future then this check becomes problematic without it being immediately obvious to non-experts imho. If a thread is created via clone(CLONE_THREAD) then struct pid has a single non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even though the thread-group leader might still be very much alive. So checking PIDTYPE_PID is fine and is easier to maintain should we ever allow pidfds to refer to threads. Cc: Jann Horn <jannh@google.com> Cc: Christian Kellner <christian@kellner.me> Cc: linux-api@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20191017101832.5985-1-christian.brauner@ubuntu.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
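The helper boils down to an emptiness check on the pid's task list; a sketch consistent with the description above (the fdinfo line is illustrative):

    static inline bool pid_has_task(struct pid *pid, enum pid_type type)
    {
            return !hlist_empty(&pid->tasks[type]);
    }

    /* fdinfo can then print -1 once the task has been reaped: */
    nr = pid_has_task(pid, PIDTYPE_PID) ? pid_nr_ns(pid, ns) : -1;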
2025-06-04  rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y  (Ankur Arora)
[ Upstream commit 83b28cfe796464ebbde1cf7916c126da6d572685 ] With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent states for read-side critical sections via rcu_all_qs(). One reason why this was needed: lacking preempt-count, the tick handler has no way of knowing whether it is executing in a read-side critical section or not. With (PREEMPT_LAZY=y, PREEMPT_DYNAMIC=n), we get (PREEMPT_COUNT=y, PREEMPT_RCU=n). In this configuration cond_resched() is a stub and does not provide quiescent states via rcu_all_qs(). (PREEMPT_RCU=y provides this information via rcu_read_unlock() and its nesting counter.) So, use the availability of preempt_count() to report quiescent states in rcu_flavor_sched_clock_irq(). Suggested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04  posix-timers: Add cond_resched() to posix_timer_add() search loop  (Eric Dumazet)
[ Upstream commit 5f2909c6cd13564a07ae692a95457f52295c4f22 ] With a large number of POSIX timers the search for a valid ID might cause a soft lockup on PREEMPT_NONE/VOLUNTARY kernels. Add cond_resched() to the loop to prevent that. [ tglx: Split out from Eric's series ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04  cgroup: Fix compilation issue due to cgroup_mutex not being exported  (gaoxu)
[ Upstream commit 87c259a7a359e73e6c52c68fcbec79988999b4e6 ] When adding folio_memcg function call in the zram module for Android16-6.12, the following error occurs during compilation: ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined! This error is caused by the indirect call to lockdep_is_held(&cgroup_mutex) within folio_memcg. The export setting for cgroup_mutex is controlled by the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while CONFIG_PROVE_RCU is not, this compilation error will occur. To resolve this issue, add a parallel macro CONFIG_LOCKDEP control to ensure cgroup_mutex is properly exported when needed. Signed-off-by: gao xu <gaoxu2@honor.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04  module: ensure that kobject_put() is safe for module type kobjects  (Dmitry Antipov)
commit a6aeb739974ec73e5217c75a7c008a688d3d5cf1 upstream. In 'lookup_or_create_module_kobject()', an internal kobject is created using 'module_ktype'. So call to 'kobject_put()' on error handling path causes an attempt to use an uninitialized completion pointer in 'module_kobject_release()'. In this scenario, we just want to release kobject without an extra synchronization required for a regular module unloading process, so adding an extra check whether 'complete()' is actually required makes 'kobject_put()' safe. Reported-by: syzbot+7fb8a372e1f6add936dd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7fb8a372e1f6add936dd Fixes: 942e443127e9 ("module: Fix mod->mkobj.kobj potentially freed too early") Cc: stable@vger.kernel.org Suggested-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://lore.kernel.org/r/20250507065044.86529-1-dmantipov@yandex.ru Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04  tracing: Fix oob write in trace_seq_to_buffer()  (Jeongjun Park)
commit f5178c41bb43444a6008150fe6094497135d07cb upstream. syzbot reported this bug: ================================================================== BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline] BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822 Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260 CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xc3/0x670 mm/kasan/report.c:521 kasan_report+0xe0/0x110 mm/kasan/report.c:634 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189 __asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106 trace_seq_to_buffer kernel/trace/trace.c:1830 [inline] tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822 .... ================================================================== It has been reported that trace_seq_to_buffer() tries to copy more data than PAGE_SIZE to buf. Therefore, to prevent this, we should use the smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument. Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com Fixes: 3c56819b14b0 ("tracing: splice support for tracing_pipe") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02  ftrace: Add cond_resched() to ftrace_graph_set_hash()  (zhoumin)
commit 42ea22e754ba4f2b86f8760ca27f6f71da2d982c upstream. When the kernel contains a large number of functions that can be traced, the loop in ftrace_graph_set_hash() may take a lot of time to execute. This may trigger the softlockup watchdog. Add cond_resched() within the loop to allow the kernel to remain responsive even when processing a large number of functions. This matches the cond_resched() that is used in other locations of the code that iterates over all functions that can be traced. Cc: stable@vger.kernel.org Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables") Link: https://lore.kernel.org/tencent_3E06CE338692017B5809534B9C5C03DA7705@qq.com Signed-off-by: zhoumin <teczm@foxmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
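The change follows the usual pattern for long do_for_each_ftrace_rec() walks; a sketch of the loop in ftrace_graph_set_hash() with the added reschedule point (details elided):

    do_for_each_ftrace_rec(pg, rec) {
            if (ftrace_match_record(rec, &func_g, NULL, 0)) {
                    /* ... add or remove the entry in the graph hash ... */
            }
            /* iterating every traceable function can take long enough to
             * trip the softlockup watchdog on large kernels; yield here */
            cond_resched();
    } while_for_each_ftrace_rec();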
2025-05-02locking/lockdep: Decrease nr_unused_locks if lock unused in zap_class()Boqun Feng
commit 495f53d5cca0f939eaed9dca90b67e7e6fb0e30c upstream. Currently, when a lock class is allocated, nr_unused_locks is increased by 1; it stays that way until the class gets used, at which point nr_unused_locks is decreased by 1 in mark_lock(). However, one scenario is missed: a lock class may be zapped without even being used once. This can result in a situation where nr_unused_locks != 0 but no unused lock class is active in the system, and when running `cat /proc/lockdep_stats`, a WARN_ON() will be triggered in a CONFIG_DEBUG_LOCKDEP=y kernel: [...] DEBUG_LOCKS_WARN_ON(debug_atomic_read(nr_unused_locks) != nr_unused) [...] WARNING: CPU: 41 PID: 1121 at kernel/locking/lockdep_proc.c:283 lockdep_stats_show+0xba9/0xbd0 And as a result, lockdep will be disabled after this. Therefore, nr_unused_locks needs to be accounted correctly at zap_class() time. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250326180831.510348-1-boqun.feng@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
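A hedged sketch of the accounting idea in zap_class() (simplified; the exact counter helpers and surrounding teardown in lockdep may differ from this):

    static void zap_class(struct pending_free *pf, struct lock_class *class)
    {
        /* ... unlink the class from hash chains and lock lists ... */

        /*
         * A class can be zapped before it was ever marked used, in which
         * case it still counts as "unused"; drop it from the counter so
         * /proc/lockdep_stats stays consistent.
         */
        if (class->usage_mask == 0)
            debug_atomic_dec(nr_unused_locks);
    }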
2025-05-02tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTERGabriele Paoloni
[ Upstream commit 0c588ac0ca6c22b774d9ad4a6594681fdfa57d9d ] When __ftrace_event_enable_disable invokes the class callback to unregister the event, the return value is not reported up to the caller, hence leading to event unregister failures being silently ignored. This patch assigns the return value of the event unregister callback invocation to the ret variable, so that it is stored and reported to the caller, and raises a warning in case of error. Link: https://lore.kernel.org/20250321170821.101403-1-gpaoloni@redhat.com Signed-off-by: Gabriele Paoloni <gpaoloni@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
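Schematically, the fix amounts to no longer discarding the callback's return value in the disable path; a hedged sketch (not the verbatim patch):

    /* in __ftrace_event_enable_disable(), unregister branch */
    ret = call->class->reg(call, TRACE_REG_UNREGISTER, file);
    /* Surface unregister failures instead of silently ignoring them. */
    WARN_ON_ONCE(ret);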
2025-04-10tracing: Fix use-after-free in print_graph_function_flags during tracer switchingTengda Wu
commit 7f81f27b1093e4895e87b74143c59c055c3b1906 upstream. Kairui reported a UAF issue in print_graph_function_flags() during ftrace stress testing [1]. This issue can be reproduced by putting a 'mdelay(10)' after 'mutex_unlock(&trace_types_lock)' in s_start(), and executing the following script: $ echo function_graph > current_tracer $ cat trace > /dev/null & $ sleep 5 # Ensure the 'cat' reaches the 'mdelay(10)' point $ echo timerlat > current_tracer The root cause lies in the two calls to print_graph_function_flags within print_trace_line during each s_show(): * One through 'iter->trace->print_line()'; * Another through 'event->funcs->trace()', which is hidden in print_trace_fmt() before print_trace_line returns. Tracer switching only updates the former, while the latter continues to use the print_line function of the old tracer, which in the script above is print_graph_function_flags. Moreover, when switching from the 'function_graph' tracer to the 'timerlat' tracer, s_start only calls graph_trace_close of the 'function_graph' tracer to free 'iter->private', but does not set it to NULL. This provides an opportunity for 'event->funcs->trace()' to use an invalid 'iter->private'. To fix this issue, set 'iter->private' to NULL immediately after freeing it in graph_trace_close(), ensuring that an invalid pointer is not passed to other tracers. Additionally, clean up the unnecessary 'iter->private = NULL' during each 'cat trace' when using wakeup and irqsoff tracers. [1] https://lore.kernel.org/all/20231112150030.84609-1-ryncsn@gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Link: https://lore.kernel.org/20250320122137.23635-1-wutengda@huaweicloud.com Fixes: eecb91b9f98d ("tracing: Fix memleak due to race between current_tracer and trace") Closes: https://lore.kernel.org/all/CAMgjq7BW79KDSCyp+tZHjShSzHsScSiJxn5ffskp-QzVM06fxw@mail.gmail.com/ Reported-by: Kairui Song <kasong@tencent.com> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
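The heart of the fix, as described, is simply not leaving a stale pointer behind; a minimal sketch of graph_trace_close() with that change (structure abridged):

    void graph_trace_close(struct trace_iterator *iter)
    {
        struct fgraph_data *data = iter->private;

        if (data) {
            free_percpu(data->cpu_data);
            kfree(data);
            /*
             * Clear the pointer so a later tracer (or
             * event->funcs->trace()) cannot dereference freed memory.
             */
            iter->private = NULL;
        }
    }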
2025-04-10locking/semaphore: Use wake_q to wake up processes outside lock critical sectionWaiman Long
[ Upstream commit 85b2b9c16d053364e2004883140538e73b333cdb ] A circular lock dependency splat has been seen involving down_trylock(): ====================================================== WARNING: possible circular locking dependency detected 6.12.0-41.el10.s390x+debug ------------------------------------------------------ dd/32479 is trying to acquire lock: 0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90 but task is already holding lock: 000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0 the existing dependency chain (in reverse order) is: -> #4 (&zone->lock){-.-.}-{2:2}: -> #3 (hrtimer_bases.lock){-.-.}-{2:2}: -> #2 (&rq->__lock){-.-.}-{2:2}: -> #1 (&p->pi_lock){-.-.}-{2:2}: -> #0 ((console_sem).lock){-.-.}-{2:2}: The console_sem -> pi_lock dependency is due to calling try_to_wake_up() while holding the console_sem raw_spinlock. This dependency can be broken by using wake_q to do the wakeup instead of calling try_to_wake_up() under the console_sem lock. This will also make the semaphore's raw_spinlock become a terminal lock without taking any further locks underneath it. The hrtimer_bases.lock is a raw_spinlock while zone->lock is a spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via the debug_objects_fill_pool() helper function in the debugobjects code. -> #4 (&zone->lock){-.-.}-{2:2}: __lock_acquire+0xe86/0x1cc0 lock_acquire.part.0+0x258/0x630 lock_acquire+0xb8/0xe0 _raw_spin_lock_irqsave+0xb4/0x120 rmqueue_bulk+0xac/0x8f0 __rmqueue_pcplist+0x580/0x830 rmqueue_pcplist+0xfc/0x470 rmqueue.isra.0+0xdec/0x11b0 get_page_from_freelist+0x2ee/0xeb0 __alloc_pages_noprof+0x2c2/0x520 alloc_pages_mpol_noprof+0x1fc/0x4d0 alloc_pages_noprof+0x8c/0xe0 allocate_slab+0x320/0x460 ___slab_alloc+0xa58/0x12b0 __slab_alloc.isra.0+0x42/0x60 kmem_cache_alloc_noprof+0x304/0x350 fill_pool+0xf6/0x450 debug_object_activate+0xfe/0x360 enqueue_hrtimer+0x34/0x190 __run_hrtimer+0x3c8/0x4c0 __hrtimer_run_queues+0x1b2/0x260 hrtimer_interrupt+0x316/0x760 do_IRQ+0x9a/0xe0 do_irq_async+0xf6/0x160 Normally a raw_spinlock to spinlock dependency is not legitimate and will be warned about if CONFIG_PROVE_RAW_LOCK_NESTING is enabled, but debug_objects_fill_pool() is an exception as it explicitly allows this dependency for non-PREEMPT_RT kernels without causing a PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is legitimate and not a bug. Anyway, the semaphore is the only locking primitive left that still uses try_to_wake_up() to do the wakeup inside the critical section; all the other locking primitives have been migrated to use wake_q to do the wakeup outside of the critical section. It is also possible that there are other circular locking dependencies involving printk/console_sem or other existing/new semaphores lurking somewhere which may show up in the future. Let's just do the migration to wake_q now to avoid headaches like this. Reported-by: yzbot+ed801a886dfdbfe7136d@syzkaller.appspotmail.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
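The wake_q pattern being migrated to looks roughly like the sketch below (names simplified, not the literal patch): queue the waiter while holding the semaphore's raw_spinlock, and perform the wakeup only after the lock is released.

    void up(struct semaphore *sem)
    {
        unsigned long flags;
        DEFINE_WAKE_Q(wake_q);

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
            sem->count++;
        else
            __up(sem, &wake_q);        /* only queues the task to wake */
        raw_spin_unlock_irqrestore(&sem->lock, flags);

        /*
         * Wake outside the critical section, so console_sem's raw_spinlock
         * is never held while p->pi_lock is taken in try_to_wake_up().
         */
        wake_up_q(&wake_q);
    }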
2025-04-10sched/deadline: Use online cpus for validating runtimeShrikanth Hegde
[ Upstream commit 14672f059d83f591afb2ee1fff56858efe055e5a ] The ftrace selftest reported a failure because writing -1 to sched_rt_runtime_us returns -EBUSY. This happens when the possible CPUs are different from the active CPUs. Active CPUs are part of one root domain, while the remaining CPUs are part of def_root_domain. Since the active cpumask is being used, this results in cpus=0 when a non-active CPU is used in the loop. Fix it by looping over the online CPUs instead when validating the bandwidth calculations. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
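Schematically, the validation just switches which cpumask it walks; a hedged sketch of the counting loop (variable names assumed, not the exact upstream code):

    int cpus = 0, cpu;

    /*
     * Walk online CPUs (rather than the active mask), so CPUs that are
     * possible but not active can no longer leave cpus at 0 and trigger
     * a spurious -EBUSY from the bandwidth validation.
     */
    for_each_online_cpu(cpu)
        cpus++;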
2025-04-10ring-buffer: Fix bytes_dropped calculation issueFeng Yang
[ Upstream commit c73f0b69648501978e8b3e8fa7eef7f4197d0481 ] The calculations of bytes_dropped and bytes_dropped_nested are reversed. Although this does not affect the final calculation of total_dropped, it should still be fixed. Link: https://lore.kernel.org/20250223070106.6781-1-yangfeng59949@163.com Fixes: 6c43e554a2a5 ("ring-buffer: Add ring buffer startup selftest") Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
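The corrected branch in the selftest, sketched from the description above (the surrounding reserve/commit logic is omitted):

    /* the write was dropped: account it to the matching counter */
    if (nested)
        data->bytes_dropped_nested += len;
    else
        data->bytes_dropped += len;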
2025-04-10kexec: initialize ELF lowest address to ULONG_MAXSourabh Jain
[ Upstream commit 9986fb5164c8b21f6439cfd45ba36d8cc80c9710 ] Patch series "powerpc/crash: use generic crashkernel reservation", v3. Commit 0ab97169aa05 ("crash_core: add generic function to do reservation") added a generic function to reserve crashkernel memory. So let's use the same function on powerpc and remove the architecture-specific code that essentially does the same thing. The generic crashkernel reservation also provides a way to split the crashkernel reservation into high and low memory reservations, which can be enabled for powerpc in the future. Additionally move powerpc to use generic APIs to locate memory hole for kexec segments while loading kdump kernel. This patch (of 7): kexec_elf_load() loads an ELF executable and sets the address of the lowest PT_LOAD section to the address held by the lowest_load_addr function argument. To determine the lowest PT_LOAD address, a local variable lowest_addr (type unsigned long) is initialized to UINT_MAX. After loading each PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD address is lower, lowest_addr is updated. However, setting lowest_addr to UINT_MAX won't work when the kernel image is loaded above 4G, as the returned lowest PT_LOAD address would be invalid. This is resolved by initializing lowest_addr to ULONG_MAX instead. This issue was discovered while implementing crashkernel high/low reservation on the PowerPC architecture. Link: https://lkml.kernel.org/r/20250131113830.925179-1-sourabhjain@linux.ibm.com Link: https://lkml.kernel.org/r/20250131113830.925179-2-sourabhjain@linux.ibm.com Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
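The fix itself is a one-line initialization change; a hedged sketch of the relevant part of kexec_elf_load() (abridged):

    unsigned long lowest_addr = ULONG_MAX;  /* was UINT_MAX: breaks above 4G */

    /* for each PT_LOAD segment that gets loaded ... */
    if (load_addr < lowest_addr)
        lowest_addr = load_addr;

    if (lowest_load_addr)
        *lowest_load_addr = lowest_addr;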
2025-04-10bpf: Use preempt_count() directly in bpf_send_signal_common()Hou Tao
[ Upstream commit b4a8b5bba712a711d8ca1f7d04646db63f9c88f5 ] bpf_send_signal_common() uses preemptible() to check whether or not the current context is preemptible. If it is preemptible, it will use irq_work to send the signal asynchronously instead of trying to hold a spin-lock, because spin-lock is sleepable under PREEMPT_RT. However, preemptible() depends on CONFIG_PREEMPT_COUNT. When CONFIG_PREEMPT_COUNT is turned off (e.g., CONFIG_PREEMPT_VOLUNTARY=y), !preemptible() will be evaluated as 1 and bpf_send_signal_common() will use irq_work unconditionally. Fix it by unfolding "!preemptible()" and using "preempt_count() != 0 || irqs_disabled()" instead. Fixes: 87c544108b61 ("bpf: Send signals asynchronously if !preemptible") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250220042259.1583319-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
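A hedged sketch of the unfolded check in bpf_send_signal_common() (surrounding irq_work plumbing simplified; 'work' stands for the irq_work wrapper referred to in the message above):

    /*
     * !preemptible() is always true without CONFIG_PREEMPT_COUNT, so spell
     * out the condition we actually care about instead.
     */
    if (preempt_count() != 0 || irqs_disabled()) {
        /* cannot take a sleeping-on-RT spinlock here; defer via irq_work */
        irq_work_queue(&work->irq_work);
        return 0;
    }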
2025-04-10perf/ring_buffer: Allow the EPOLLRDNORM flag for pollTao Chen
[ Upstream commit c96fff391c095c11dc87dab35be72dee7d217cde ] The poll man page says POLLRDNORM is equivalent to POLLIN. For poll(), it seems that if the user sets pollfd with POLLRDNORM in userspace, perf_poll() will not return until the timeout even if perf_output_wakeup() is called, whereas with POLLIN it returns. Fixes: 76369139ceb9 ("perf: Split up buffer handling from core code") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314030036.2543180-1-chen.dylane@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
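The change presumably amounts to advertising both flags when the ring buffer becomes readable; a hedged one-line sketch in perf_output_wakeup() (not the verbatim diff):

    /* mark the buffer readable for poll(); POLLRDNORM == POLLIN per poll(2) */
    atomic_set(&handle->rb->poll, EPOLLIN | EPOLLRDNORM);  /* was: EPOLLIN */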
2025-04-10hrtimers: Mark is_migration_base() with __always_inlineAndy Shevchenko
[ Upstream commit 27af31e44949fa85550176520ef7086a0d00fd7b ] When is_migration_base() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: kernel/time/hrtimer.c:156:20: error: unused function 'is_migration_base' [-Werror,-Wunused-function] 156 | static inline bool is_migration_base(struct hrtimer_clock_base *base) | ^~~~~~~~~~~~~~~~~ Fix this by marking it with __always_inline. [ tglx: Use __always_inline instead of __maybe_unused and move it into the usage sites conditional ] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250116160745.243358-1-andriy.shevchenko@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
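The fix is purely an annotation; a minimal sketch of the SMP variant (body abridged to its essence):

    /*
     * __always_inline keeps clang from emitting -Wunused-function in
     * configurations where every caller is compiled out.
     */
    static __always_inline bool is_migration_base(struct hrtimer_clock_base *base)
    {
        return base == &migration_base;
    }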