summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-09-19hrtimers: Unconditionally update target CPU base after offline timer migrationXiongfeng Wang
[ Upstream commit e895f8e29119c8c966ea794af9e9100b10becb88 ] When testing softirq based hrtimers on an ARM32 board, with high resolution mode and NOHZ inactive, softirq based hrtimers fail to expire after being moved away from an offline CPU: CPU0 CPU1 hrtimer_start(..., HRTIMER_MODE_SOFT); cpu_down(CPU1) ... hrtimers_cpu_dying() // Migrate timers to CPU0 smp_call_function_single(CPU0, returgger_next_event); retrigger_next_event() if (!highres && !nohz) return; As retrigger_next_event() is a NOOP when both high resolution timers and NOHZ are inactive CPU0's hrtimer_cpu_base::softirq_expires_next is not updated and the migrated softirq timers never expire unless there is a softirq based hrtimer queued on CPU0 later. Fix this by removing the hrtimer_hres_active() and tick_nohz_active() check in retrigger_next_event(), which enforces a full update of the CPU base. As this is not a fast path the extra cost does not matter. [ tglx: Massaged change log ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250805081025.54235-1-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active()Jiapeng Chong
[ Upstream commit b7c8e1f8a7b4352c1d0b4310686385e3cf6c104a ] The function hrtimer_hres_active() are defined in the hrtimer.c file, but not called elsewhere, so rename __hrtimer_hres_active() to hrtimer_hres_active() and remove the old hrtimer_hres_active() function. kernel/time/hrtimer.c:653:19: warning: unused function 'hrtimer_hres_active'. Fixes: 82ccdf062a64 ("hrtimer: Remove unused function") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20240418023000.130324-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8778 Stable-dep-of: e895f8e29119 ("hrtimers: Unconditionally update target CPU base after offline timer migration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19hrtimer: Remove unused functionJiapeng Chong
[ Upstream commit 82ccdf062a64f3c4ac575c16179ce68edbbbe8e4 ] The function is defined, but not called anywhere: kernel/time/hrtimer.c:1880:20: warning: unused function '__hrtimer_peek_ahead_timers'. Remove it. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322070441.29646-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8611 Stable-dep-of: e895f8e29119 ("hrtimers: Unconditionally update target CPU base after offline timer migration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19tracing: Silence warning when chunk allocation fails in trace_pid_writePu Lehui
[ Upstream commit cd4453c5e983cf1fd5757e9acb915adb1e4602b6 ] Syzkaller trigger a fault injection warning: WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0 Modules linked in: CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0 Tainted: [U]=USER Hardware name: Google Compute Engine/Google Compute Engine RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294 Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283 RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000 RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0 FS: 00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464 register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline] register_pid_events kernel/trace/trace_events.c:2354 [inline] event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425 vfs_write+0x24c/0x1150 fs/read_write.c:677 ksys_write+0x12b/0x250 fs/read_write.c:731 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f We can reproduce the warning by following the steps below: 1. echo 8 >> set_event_notrace_pid. Let tr->filtered_pids owns one pid and register sched_switch tracepoint. 2. echo ' ' >> set_event_pid, and perform fault injection during chunk allocation of trace_pid_list_alloc. Let pid_list with no pid and assign to tr->filtered_pids. 3. echo ' ' >> set_event_pid. Let pid_list is NULL and assign to tr->filtered_pids. 4. echo 9 >> set_event_pid, will trigger the double register sched_switch tracepoint warning. The reason is that syzkaller injects a fault into the chunk allocation in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which may trigger double register of the same tracepoint. This only occurs when the system is about to crash, but to suppress this warning, let's add failure handling logic to trace_pid_list_set. Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com Fixes: 8d6e90983ade ("tracing: Create a sparse bitmask for pid filtering") Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19tracing: Fix tracing_marker may trigger page fault during preempt_disableLuo Gengkun
[ Upstream commit 3d62ab32df065e4a7797204a918f6489ddb8a237 ] Both tracing_mark_write and tracing_mark_raw_write call __copy_from_user_inatomic during preempt_disable. But in some case, __copy_from_user_inatomic may trigger page fault, and will call schedule() subtly. And if a task is migrated to other cpu, the following warning will be trigger: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) An example can illustrate this issue: process flow CPU --------------------------------------------------------------------- tracing_mark_raw_write(): cpu:0 ... ring_buffer_lock_reserve(): cpu:0 ... cpu = raw_smp_processor_id() cpu:0 cpu_buffer = buffer->buffers[cpu] cpu:0 ... ... __copy_from_user_inatomic(): cpu:0 ... # page fault do_mem_abort(): cpu:0 ... # Call schedule schedule() cpu:0 ... # the task schedule to cpu1 __buffer_unlock_commit(): cpu:1 ... ring_buffer_unlock_commit(): cpu:1 ... cpu = raw_smp_processor_id() cpu:1 cpu_buffer = buffer->buffers[cpu] cpu:1 As shown above, the process will acquire cpuid twice and the return values are not the same. To fix this problem using copy_from_user_nofault instead of __copy_from_user_inatomic, as the former performs 'access_ok' before copying. Link: https://lore.kernel.org/20250819105152.2766363-1-luogengkun@huaweicloud.com Fixes: 656c7f0d2d2b ("tracing: Replace kmap with copy_from_user() in trace_marker writing") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19tracing: Do not add length to print format in synthetic eventsSteven Rostedt
[ Upstream commit e1a453a57bc76be678bd746f84e3d73f378a9511 ] The following causes a vsnprintf fault: # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger Because the synthetic event's "wakee" field is created as a dynamic string (even though the string copied is not). The print format to print the dynamic string changed from "%*s" to "%s" because another location (__set_synth_event_print_fmt()) exported this to user space, and user space did not need that. But it is still used in print_synth_event(), and the output looks like: <idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155 sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58 <idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91 bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21 <idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129 sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50 The length isn't needed as the string is always nul terminated. Just print the string and not add the length (which was hard coded to the max string length anyway). Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home Fixes: 4d38328eb442d ("tracing: Fix synth event printk format for str fields"); Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> [ offset calculations instead of union-based data structures ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09cpufreq/sched: Explicitly synchronize limits_changed flag handlingRafael J. Wysocki
[ Upstream commit 79443a7e9da3c9f68290a8653837e23aba0fa89f ] The handling of the limits_changed flag in struct sugov_policy needs to be explicitly synchronized to ensure that cpufreq policy limits updates will not be missed in some cases. Without that synchronization it is theoretically possible that the limits_changed update in sugov_should_update_freq() will be reordered with respect to the reads of the policy limits in cpufreq_driver_resolve_freq() and in that case, if the limits_changed update in sugov_limits() clobbers the one in sugov_should_update_freq(), the new policy limits may not take effect for a long time. Likewise, the limits_changed update in sugov_limits() may theoretically get reordered with respect to the updates of the policy limits in cpufreq_set_policy() and if sugov_should_update_freq() runs between them, the policy limits change may be missed. To ensure that the above situations will not take place, add memory barriers preventing the reordering in question from taking place and add READ_ONCE() and WRITE_ONCE() annotations around all of the limits_changed flag updates to prevent the compiler from messing up with that code. Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Cc: 5.3+ <stable@vger.kernel.org> # 5.3+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net [ bw_min => bw_dl ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09bpf: Fix oob access in cgroup local storageDaniel Borkmann
[ Upstream commit abad3d0bad72a52137e0c350c59542d75ae4f513 ] Lonial reported that an out-of-bounds access in cgroup local storage can be crafted via tail calls. Given two programs each utilizing a cgroup local storage with a different value size, and one program doing a tail call into the other. The verifier will validate each of the indivial programs just fine. However, in the runtime context the bpf_cg_run_ctx holds an bpf_prog_array_item which contains the BPF program as well as any cgroup local storage flavor the program uses. Helpers such as bpf_get_local_storage() pick this up from the runtime context: ctx = container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx); storage = ctx->prog_item->cgroup_storage[stype]; if (stype == BPF_CGROUP_STORAGE_SHARED) ptr = &READ_ONCE(storage->buf)->data[0]; else ptr = this_cpu_ptr(storage->percpu_buf); For the second program which was called from the originally attached one, this means bpf_get_local_storage() will pick up the former program's map, not its own. With mismatching sizes, this can result in an unintended out-of-bounds access. To fix this issue, we need to extend bpf_map_owner with an array of storage_cookie[] to match on i) the exact maps from the original program if the second program was using bpf_get_local_storage(), or ii) allow the tail call combination if the second program was not using any of the cgroup local storage maps. Fixes: 7d9c3427894f ("bpf: Make cgroup storages shared between programs on the same cgroup") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09bpf: Move bpf map owner out of common structDaniel Borkmann
[ Upstream commit fd1c98f0ef5cbcec842209776505d9e70d8fcd53 ] Given this is only relevant for BPF tail call maps, it is adding up space and penalizing other map types. We also need to extend this with further objects to track / compare to. Therefore, lets move this out into a separate structure and dynamically allocate it only for BPF tail call maps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09bpf: Add cookie object to bpf mapsDaniel Borkmann
[ Upstream commit 12df58ad294253ac1d8df0c9bb9cf726397a671d ] Add a cookie to BPF maps to uniquely identify BPF maps for the timespan when the node is up. This is different to comparing a pointer or BPF map id which could get rolled over and reused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04dma/pool: Ensure DMA_DIRECT_REMAP allocations are decryptedShanker Donthineni
commit 89a2d212bdb4bc29bed8e7077abe054b801137ea upstream. When CONFIG_DMA_DIRECT_REMAP is enabled, atomic pool pages are remapped via dma_common_contiguous_remap() using the supplied pgprot. Currently, the mapping uses pgprot_dmacoherent(PAGE_KERNEL), which leaves the memory encrypted on systems with memory encryption enabled (e.g., ARM CCA Realms). This can cause the DMA layer to fail or crash when accessing the memory, as the underlying physical pages are not configured as expected. Fix this by requesting a decrypted mapping in the vmap() call: pgprot_decrypted(pgprot_dmacoherent(PAGE_KERNEL)) This ensures that atomic pool memory is consistently mapped unencrypted. Cc: stable@vger.kernel.org Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250811181759.998805-1-sdonthineni@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04ftrace: Fix potential warning in trace_printk_seq during ftrace_dumpTengda Wu
[ Upstream commit 4013aef2ced9b756a410f50d12df9ebe6a883e4a ] When calling ftrace_dump_one() concurrently with reading trace_pipe, a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race condition. The issue occurs because: CPU0 (ftrace_dump) CPU1 (reader) echo z > /proc/sysrq-trigger !trace_empty(&iter) trace_iterator_reset(&iter) <- len = size = 0 cat /sys/kernel/tracing/trace_pipe trace_find_next_entry_inc(&iter) __find_next_entry ring_buffer_empty_cpu <- all empty return NULL trace_printk_seq(&iter.seq) WARN_ON_ONCE(s->seq.len >= s->seq.size) In the context between trace_empty() and trace_find_next_entry_inc() during ftrace_dump, the ring buffer data was consumed by other readers. This caused trace_find_next_entry_inc to return NULL, failing to populate `iter.seq`. At this point, due to the prior trace_iterator_reset, both `iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal, the WARN_ON_ONCE condition is triggered. Move the trace_printk_seq() into the if block that checks to make sure the return value of trace_find_next_entry_inc() is non-NULL in ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before subsequent operations. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com Fixes: d769041f8653 ("ring_buffer: implement new locking") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28cgroup/cpuset: Use static_branch_enable_cpuslocked() on ↵Waiman Long
cpusets_insane_config_key [ Upstream commit 65f97cc81b0adc5f49cf6cff5d874be0058e3f41 ] The following lockdep splat was observed. [ 812.359086] ============================================ [ 812.359089] WARNING: possible recursive locking detected [ 812.359097] -------------------------------------------- [ 812.359100] runtest.sh/30042 is trying to acquire lock: [ 812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20 [ 812.359131] [ 812.359131] but task is already holding lock: [ 812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70 : [ 812.359267] Call Trace: [ 812.359272] <TASK> [ 812.359367] cpus_read_lock+0x3c/0xe0 [ 812.359382] static_key_enable+0xe/0x20 [ 812.359389] check_insane_mems_config.part.0+0x11/0x30 [ 812.359398] cpuset_write_resmask+0x9f2/0xa70 [ 812.359411] cgroup_file_write+0x1c7/0x660 [ 812.359467] kernfs_fop_write_iter+0x358/0x530 [ 812.359479] vfs_write+0xabe/0x1250 [ 812.359529] ksys_write+0xf9/0x1d0 [ 812.359558] do_syscall_64+0x5f/0xe0 Since commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order"), the ordering of cpu hotplug lock and cpuset_mutex had been reversed. That patch correctly used the cpuslocked version of the static branch API to enable cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the same for cpusets_insane_config_key. The cpusets_insane_config_key can be enabled in the check_insane_mems_config() which is called from update_nodemask() or cpuset_hotplug_update_tasks() with both cpu hotplug lock and cpuset_mutex held. Deadlock can happen with a pending hotplug event that tries to acquire the cpu hotplug write lock which will block further cpus_read_lock() attempt from check_insane_mems_config(). Fix that by switching to use static_branch_enable_cpuslocked(). Fixes: d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28tracing: Limit access to parser->buffer when trace_get_user failedPu Lehui
[ Upstream commit 6a909ea83f226803ea0e718f6e88613df9234d58 ] When the length of the string written to set_ftrace_filter exceeds FTRACE_BUFF_MAX, the following KASAN alarm will be triggered: BUG: KASAN: slab-out-of-bounds in strsep+0x18c/0x1b0 Read of size 1 at addr ffff0000d00bd5ba by task ash/165 CPU: 1 UID: 0 PID: 165 Comm: ash Not tainted 6.16.0-g6bcdbd62bd56-dirty Hardware name: linux,dummy-virt (DT) Call trace: show_stack+0x34/0x50 (C) dump_stack_lvl+0xa0/0x158 print_address_description.constprop.0+0x88/0x398 print_report+0xb0/0x280 kasan_report+0xa4/0xf0 __asan_report_load1_noabort+0x20/0x30 strsep+0x18c/0x1b0 ftrace_process_regex.isra.0+0x100/0x2d8 ftrace_regex_release+0x484/0x618 __fput+0x364/0xa58 ____fput+0x28/0x40 task_work_run+0x154/0x278 do_notify_resume+0x1f0/0x220 el0_svc+0xec/0xf0 el0t_64_sync_handler+0xa0/0xe8 el0t_64_sync+0x1ac/0x1b0 The reason is that trace_get_user will fail when processing a string longer than FTRACE_BUFF_MAX, but not set the end of parser->buffer to 0. Then an OOB access will be triggered in ftrace_regex_release-> ftrace_process_regex->strsep->strpbrk. We can solve this problem by limiting access to parser->buffer when trace_get_user failed. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250813040232.1344527-1-pulehui@huaweicloud.com Fixes: 8c9af478c06b ("ftrace: Handle commands when closing set_ftrace_filter file") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28tracing: Remove unneeded goto out logicSteven Rostedt
[ Upstream commit c89504a703fb779052213add0e8ed642f4a4f1c8 ] Several places in the trace.c file there's a goto out where the out is simply a return. There's no reason to jump to the out label if it's not doing any more logic but simply returning from the function. Replace the goto outs with a return and remove the out labels. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250801203857.538726745@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28ftrace: Also allocate and copy hash for reading of filter filesSteven Rostedt
commit bfb336cf97df7b37b2b2edec0f69773e06d11955 upstream. Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds the pointer to the global tracer hash to its iterator. Unlike the writer that allocates a copy of the hash, the reader keeps the pointer to the filter hashes. This is problematic because this pointer is static across function calls that release the locks that can update the global tracer hashes. This can cause UAF and similar bugs. Allocate and copy the hash for reading the filter files like it is done for the writers. This not only fixes UAF bugs, but also makes the code a bit simpler as it doesn't have to differentiate when to free the iterator's hash between writers and readers. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home Fixes: c20489dad156 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read") Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/ Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/ Reported-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Tengda Wu <wutengda@huaweicloud.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28mm: drop the assumption that VM_SHARED always implies writableLorenzo Stoakes
[ Upstream commit e8e17ee90eaf650c855adb0a3e5e965fd6692ff1 ] Patch series "permit write-sealed memfd read-only shared mappings", v4. The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:- Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM. With emphasis on 'writable'. In turns out in fact that currently the kernel simply disallows all new shared memory mappings for a memfd with F_SEAL_WRITE applied, rendering this documentation inaccurate. This matters because users are therefore unable to obtain a shared mapping to a memfd after write sealing altogether, which limits their usefulness. This was reported in the discussion thread [1] originating from a bug report [2]. This is a product of both using the struct address_space->i_mmap_writable atomic counter to determine whether writing may be permitted, and the kernel adjusting this counter when any VM_SHARED mapping is performed and more generally implicitly assuming VM_SHARED implies writable. It seems sensible that we should only update this mapping if VM_MAYWRITE is specified, i.e. whether it is possible that this mapping could at any point be written to. If we do so then all we need to do to permit write seals to function as documented is to clear VM_MAYWRITE when mapping read-only. It turns out this functionality already exists for F_SEAL_FUTURE_WRITE - we can therefore simply adapt this logic to do the same for F_SEAL_WRITE. We then hit a chicken and egg situation in mmap_region() where the check for VM_MAYWRITE occurs before we are able to clear this flag. To work around this, perform this check after we invoke call_mmap(), with careful consideration of error paths. Thanks to Andy Lutomirski for the suggestion! [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/ [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238 This patch (of 3): There is a general assumption that VMAs with the VM_SHARED flag set are writable. If the VM_MAYWRITE flag is not set, then this is simply not the case. Update those checks which affect the struct address_space->i_mmap_writable field to explicitly test for this by introducing [vma_]is_shared_maywrite() helper functions. This remains entirely conservative, as the lack of VM_MAYWRITE guarantees that the VMA cannot be written to. Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org [isaacmanjarres: resolved merge conflicts due to due to refactoring that happened in upstream commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")] Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28module: Prevent silent truncation of module name in delete_module(2)Petr Pavlu
[ Upstream commit a6323bd4e611567913e23df5b58f2d4e4da06789 ] Passing a module name longer than MODULE_NAME_LEN to the delete_module syscall results in its silent truncation. This really isn't much of a problem in practice, but it could theoretically lead to the removal of an incorrect module. It is more sensible to return ENAMETOOLONG or ENOENT in such a case. Update the syscall to return ENOENT, as documented in the delete_module(2) man page to mean "No module by that name exists." This is appropriate because a module with a name longer than MODULE_NAME_LEN cannot be loaded in the first place. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Link: https://lore.kernel.org/r/20250630143535.267745-2-petr.pavlu@suse.com Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28sched/fair: Bump sd->max_newidle_lb_cost when newidle balance failsChris Mason
[ Upstream commit 155213a2aed42c85361bf4f5c817f5cb68951c3b ] schbench (https://github.com/masoncl/schbench.git) is showing a regression from previous production kernels that bisected down to: sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc) The schbench command line was: schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0 This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker threads spread across the rest of the CPUs. Neither the worker threads or the message threads do any work, they just wake each other up and go back to sleep as soon as possible. The end result is the first 4 CPUs are pegged waking up those 1024 workers, and the rest of the CPUs are constantly banging in and out of idle. If I take a v6.9 Linus kernel and revert that one commit, performance goes from 3.4M RPS to 5.4M RPS. schedstat shows there are ~100x more new idle balance operations, and profiling shows the worker threads are spending ~20% of their CPU time on new idle balance. schedstats also shows that almost all of these new idle balance attemps are failing to find busy groups. The fix used here is to crank up the cost of the newidle balance whenever it fails. Since we don't want sd->max_newidle_lb_cost to grow out of control, this also changes update_newidle_cost() to use sysctl_sched_migration_cost as the upper limit on max_newidle_lb_cost. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28rcu: Protect ->defer_qs_iw_pending from data racePaul E. McKenney
[ Upstream commit 90c09d57caeca94e6f3f87c49e96a91edd40cbfd ] On kernels built with CONFIG_IRQ_WORK=y, when rcu_read_unlock() is invoked within an interrupts-disabled region of code [1], it will invoke rcu_read_unlock_special(), which uses an irq-work handler to force the system to notice when the RCU read-side critical section actually ends. That end won't happen until interrupts are enabled at the soonest. In some kernels, such as those booted with rcutree.use_softirq=y, the irq-work handler is used unconditionally. The per-CPU rcu_data structure's ->defer_qs_iw_pending field is updated by the irq-work handler and is both read and updated by rcu_read_unlock_special(). This resulted in the following KCSAN splat: ------------------------------------------------------------------------ BUG: KCSAN: data-race in rcu_preempt_deferred_qs_handler / rcu_read_unlock_special read to 0xffff96b95f42d8d8 of 1 bytes by task 90 on cpu 8: rcu_read_unlock_special+0x175/0x260 __rcu_read_unlock+0x92/0xa0 rt_spin_unlock+0x9b/0xc0 __local_bh_enable+0x10d/0x170 __local_bh_enable_ip+0xfb/0x150 rcu_do_batch+0x595/0xc40 rcu_cpu_kthread+0x4e9/0x830 smpboot_thread_fn+0x24d/0x3b0 kthread+0x3bd/0x410 ret_from_fork+0x35/0x40 ret_from_fork_asm+0x1a/0x30 write to 0xffff96b95f42d8d8 of 1 bytes by task 88 on cpu 8: rcu_preempt_deferred_qs_handler+0x1e/0x30 irq_work_single+0xaf/0x160 run_irq_workd+0x91/0xc0 smpboot_thread_fn+0x24d/0x3b0 kthread+0x3bd/0x410 ret_from_fork+0x35/0x40 ret_from_fork_asm+0x1a/0x30 no locks held by irq_work/8/88. irq event stamp: 200272 hardirqs last enabled at (200272): [<ffffffffb0f56121>] finish_task_switch+0x131/0x320 hardirqs last disabled at (200271): [<ffffffffb25c7859>] __schedule+0x129/0xd70 softirqs last enabled at (0): [<ffffffffb0ee093f>] copy_process+0x4df/0x1cc0 softirqs last disabled at (0): [<0000000000000000>] 0x0 ------------------------------------------------------------------------ The problem is that irq-work handlers run with interrupts enabled, which means that rcu_preempt_deferred_qs_handler() could be interrupted, and that interrupt handler might contain an RCU read-side critical section, which might invoke rcu_read_unlock_special(). In the strict KCSAN mode of operation used by RCU, this constitutes a data race on the ->defer_qs_iw_pending field. This commit therefore disables interrupts across the portion of the rcu_preempt_deferred_qs_handler() that updates the ->defer_qs_iw_pending field. This suffices because this handler is not a fast path. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28PM: sleep: console: Fix the black screen issuetuhaowen
[ Upstream commit 4266e8fa56d3d982bf451d382a410b9db432015c ] When the computer enters sleep status without a monitor connected, the system switches the console to the virtual terminal tty63(SUSPEND_CONSOLE). If a monitor is subsequently connected before waking up, the system skips the required VT restoration process during wake-up, leaving the console on tty63 instead of switching back to tty1. To fix this issue, a global flag vt_switch_done is introduced to record whether the system has successfully switched to the suspend console via vt_move_to_console() during suspend. If the switch was completed, vt_switch_done is set to 1. Later during resume, this flag is checked to ensure that the original console is restored properly by calling vt_move_to_console(orig_fgconsole, 0). This prevents scenarios where the resume logic skips console restoration due to incorrect detection of the console state, especially when a monitor is reconnected before waking up. Signed-off-by: tuhaowen <tuhaowen@uniontech.com> Link: https://patch.msgid.link/20250611032345.29962-1-tuhaowen@uniontech.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15perf/core: Prevent VMA split of buffer mappingsThomas Gleixner
commit b024d7b56c77191cde544f838debb7f8451cd0d6 upstream. The perf mmap code is careful about mmap()'ing the user page with the ringbuffer and additionally the auxiliary buffer, when the event supports it. Once the first mapping is established, subsequent mapping have to use the same offset and the same size in both cases. The reference counting for the ringbuffer and the auxiliary buffer depends on this being correct. Though perf does not prevent that a related mapping is split via mmap(2), munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls, which take reference counts, but then the subsequent perf_mmap_close() calls are not longer fulfilling the offset and size checks. This leads to reference count leaks. As perf already has the requirement for subsequent mappings to match the initial mapping, the obvious consequence is that VMA splits, caused by resizing of a mapping or partial unmapping, have to be prevented. Implement the vm_operations_struct::may_split() callback and return unconditionally -EINVAL. That ensures that the mapping offsets and sizes cannot be changed after the fact. Remapping to a different fixed address with the same size is still possible as it takes the references for the new mapping and drops those of the old mapping. Fixes: 45bfb2e50471 ("perf/core: Add AUX area to ring buffer for raw data streams") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15perf/core: Exit early on perf_mmap() failThomas Gleixner
commit 07091aade394f690e7b655578140ef84d0e8d7b0 upstream. When perf_mmap() fails to allocate a buffer, it still invokes the event_mapped() callback of the related event. On X86 this might increase the perf_rdpmc_allowed reference counter. But nothing undoes this as perf_mmap_close() is never called in this case, which causes another reference count leak. Return early on failure to prevent that. Fixes: 1e0fb9ec679c ("perf/core: Add pmu callbacks to track event mapping and unmapping") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15perf/core: Don't leak AUX buffer refcount on allocation failureThomas Gleixner
commit 5468c0fbccbb9d156522c50832244a8b722374fb upstream. Failure of the AUX buffer allocation leaks the reference count. Set the reference count to 1 only when the allocation succeeds. Fixes: 45bfb2e50471 ("perf/core: Add AUX area to ring buffer for raw data streams") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15ucount: fix atomic_long_inc_below() argument typeUros Bizjak
[ Upstream commit f8cd9193b62e92ad25def5370ca8ea2bc7585381 ] The type of u argument of atomic_long_inc_below() should be long to avoid unwanted truncation to int. The patch fixes the wrong argument type of an internal function to prevent unwanted argument truncation. It fixes an internal locking primitive; it should not have any direct effect on userspace. Mark said : AFAICT there's no problem in practice because atomic_long_inc_below() : is only used by inc_ucount(), and it looks like the value is : constrained between 0 and INT_MAX. : : In inc_ucount() the limit value is taken from : user_namespace::ucount_max[], and AFAICT that's only written by : sysctls, to the table setup by setup_userns_sysctls(), where : UCOUNT_ENTRY() limits the value between 0 and INT_MAX. : : This is certainly a cleanup, but there might be no functional issue in : practice as above. Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15kernel: trace: preemptirq_delay_test: use offstack cpu maskArnd Bergmann
[ Upstream commit adc353c0bfb243ebfd29b6222fa3bf149169a6de ] A CPU mask on the stack is broken for large values of CONFIG_NR_CPUS: kernel/trace/preemptirq_delay_test.c: In function ‘preemptirq_delay_run’: kernel/trace/preemptirq_delay_test.c:143:1: error: the frame size of 8512 bytes is larger than 1536 bytes [-Werror=frame-larger-than=] Fall back to dynamic allocation here. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Song Chen <chensong_2000@189.cn> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250620111215.3365305-1-arnd@kernel.org Fixes: 4b9091e1c194 ("kernel: trace: preemptirq_delay_test: add cpu affinity") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15bpf/preload: Don't select USERMODE_DRIVERThomas Weißschuh
[ Upstream commit 2b03164eee20eac7ce0fe3aa4fbda7efc1e5427a ] The usermode driver framework is not used anymore by the BPF preload code. Fixes: cb80ddc67152 ("bpf: Convert bpf_preload.ko to use light skeleton.") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-1-0d0083334382@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15kcsan: test: Initialize dummy variableMarco Elver
[ Upstream commit 9872916ad1a1a5e7d089e05166c85dbd65e5b0e8 ] Newer compiler versions rightfully point out: kernel/kcsan/kcsan_test.c:591:41: error: variable 'dummy' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer] 591 | KCSAN_EXPECT_READ_BARRIER(atomic_read(&dummy), false); | ^~~~~ 1 error generated. Although this particular test does not care about the value stored in the dummy atomic variable, let's silence the warning. Link: https://lkml.kernel.org/r/CA+G9fYu8JY=k-r0hnBRSkQQrFJ1Bz+ShdXNwC1TNeMt0eXaxeA@mail.gmail.com Fixes: 8bc32b348178 ("kcsan: test: Add test cases for memory barrier instrumentation") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reviewed-by: Alexander Potapenko <glider@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-24sched: Change nr_uninterruptible type to unsigned longAruna Ramakrishna
commit 36569780b0d64de283f9d6c2195fd1a43e221ee8 upstream. The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") changed nr_uninterruptible to an unsigned int. But the nr_uninterruptible values for each of the CPU runqueues can grow to large numbers, sometimes exceeding INT_MAX. This is valid, if, over time, a large number of tasks are migrated off of one CPU after going into an uninterruptible state. Only the sum of all nr_interruptible values across all CPUs yields the correct result, as explained in a comment in kernel/sched/loadavg.c. Change the type of nr_uninterruptible back to unsigned long to prevent overflows, and thus the miscalculation of load average. Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-24Revert "cgroup_freezer: cgroup_freezing: Check if not frozen"Chen Ridong
[ Upstream commit 14a67b42cb6f3ab66f41603c062c5056d32ea7dd ] This reverts commit cff5f49d433fcd0063c8be7dd08fa5bf190c6c37. Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") modified the cgroup_freezing() logic to verify that the FROZEN flag is not set, affecting the return value of the freezing() function, in order to address a warning in __thaw_task. A race condition exists that may allow tasks to escape being frozen. The following scenario demonstrates this issue: CPU 0 (get_signal path) CPU 1 (freezer.state reader) try_to_freeze read freezer.state __refrigerator freezer_read update_if_frozen WRITE_ONCE(current->__state, TASK_FROZEN); ... /* Task is now marked frozen */ /* frozen(task) == true */ /* Assuming other tasks are frozen */ freezer->state |= CGROUP_FROZEN; /* freezing(current) returns false */ /* because cgroup is frozen (not freezing) */ break out __set_current_state(TASK_RUNNING); /* Bug: Task resumes running when it should remain frozen */ The existing !frozen(p) check in __thaw_task makes the WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables reverting the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") to resolve the issue. The warning has been removed in the previous patch. This patch revert the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") to complete the fix. Fixes: cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-24bpf: Reject %p% format string in bprintf-like helpersPaul Chaignon
[ Upstream commit f8242745871f81a3ac37f9f51853d12854fd0b58 ] static const char fmt[] = "%p%"; bpf_trace_printk(fmt, sizeof(fmt)); The above BPF program isn't rejected and causes a kernel warning at runtime: Please remove unsupported %\x00 in format string WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0 This happens because bpf_bprintf_prepare skips over the second %, detected as punctuation, while processing %p. This patch fixes it by not skipping over punctuation. %\x00 is then processed in the next iteration and rejected. Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com Fixes: 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf") Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-24tracing: Add down_write(trace_event_sem) when adding trace eventSteven Rostedt
commit b5e8acc14dcb314a9b61ff19dcd9fdd0d88f70df upstream. When a module is loaded, it adds trace events defined by the module. It may also need to modify the modules trace printk formats to replace enum names with their values. If two modules are loaded at the same time, the adding of the event to the ftrace_events list can corrupt the walking of the list in the code that is modifying the printk format strings and crash the kernel. The addition of the event should take the trace_event_sem for write while it adds the new event. Also add a lockdep_assert_held() on that semaphore in __trace_add_event_dirs() as it iterates the list. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home Reported-by: Fusheng Huang(黄富生) <Fusheng.Huang@luxshare-ict.com> Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/ Fixes: 110bf2b764eb6 ("tracing: add protection around module events unload") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17rseq: Fix segfault on registration when rseq_cs is non-zeroMichael Jeanson
commit fd881d0a085fc54354414aed990ccf05f282ba53 upstream. The rseq_cs field is documented as being set to 0 by user-space prior to registration, however this is not currently enforced by the kernel. This can result in a segfault on return to user-space if the value stored in the rseq_cs field doesn't point to a valid struct rseq_cs. The correct solution to this would be to fail the rseq registration when the rseq_cs field is non-zero. However, some older versions of glibc will reuse the rseq area of previous threads without clearing the rseq_cs field and will also terminate the process if the rseq registration fails in a secondary thread. This wasn't caught in testing because in this case the leftover rseq_cs does point to a valid struct rseq_cs. What we can do is clear the rseq_cs field on registration when it's non-zero which will prevent segfaults on registration and won't break the glibc versions that reuse rseq areas on thread creation. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250306211223.109455-1-mjeanson@efficios.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17perf: Revert to requiring CAP_SYS_ADMIN for uprobesPeter Zijlstra
[ Upstream commit ba677dbe77af5ffe6204e0f3f547f3ba059c6302 ] Jann reports that uprobes can be used destructively when used in the middle of an instruction. The kernel only verifies there is a valid instruction at the requested offset, but due to variable instruction length cannot determine if this is an instruction as seen by the intended execution stream. Additionally, Mark Rutland notes that on architectures that mix data in the text segment (like arm64), a similar things can be done if the data word is 'mistaken' for an instruction. As such, require CAP_SYS_ADMIN for uprobes. Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10rcu: Return early if callback is not specifiedUladzislau Rezki (Sony)
[ Upstream commit 33b6a1f155d627f5bd80c7485c598ce45428f74f ] Currently the call_rcu() API does not check whether a callback pointer is NULL. If NULL is passed, rcu_core() will try to invoke it, resulting in NULL pointer dereference and a kernel crash. To prevent this and improve debuggability, this patch adds a check for NULL and emits a kernel stack trace to help identify a faulty caller. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27perf: Fix sample vs do_exit()Peter Zijlstra
[ Upstream commit 4f6fc782128355931527cefe3eb45338abd8ab39 ] Baisheng Gao reported an ARM64 crash, which Mark decoded as being a synchronous external abort -- most likely due to trying to access MMIO in bad ways. The crash further shows perf trying to do a user stack sample while in exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address space it is trying to access. It turns out that we stop perf after we tear down the userspace mm; a receipie for disaster, since perf likes to access userspace for various reasons. Flip this order by moving up where we stop perf in do_exit(). Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER to abort when the current task does not have an mm (exit_mm() makes sure to set current->mm = NULL; before commencing with the actual teardown). Such that CPU wide events don't trip on this same problem. Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample") Reported-by: Baisheng Gao <baisheng.gao@unisoc.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()Hou Tao
[ Upstream commit d4965578267e2e81f67c86e2608481e77e9c8569 ] bpf_map_lookup_percpu_elem() helper is also available for sleepable bpf program. When BPF JIT is disabled or under 32-bit host, bpf_map_lookup_percpu_elem() will not be inlined. Using it in a sleepable bpf program will trigger the warning in bpf_map_lookup_percpu_elem(), because the bpf program only holds rcu_read_lock_trace lock. Therefore, add the missed check. Reported-by: syzbot+dce5aae19ae4d6399986@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/000000000000176a130617420310@google.com/ Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250526062534.1105938-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27clocksource: Fix the CPUs' choice in the watchdog per CPU verificationGuilherme G. Piccoli
[ Upstream commit 08d7becc1a6b8c936e25d827becabfe3bff72a36 ] Right now, if the clocksource watchdog detects a clocksource skew, it might perform a per CPU check, for example in the TSC case on x86. In other words: supposing TSC is detected as unstable by the clocksource watchdog running at CPU1, as part of marking TSC unstable the kernel will also run a check of TSC readings on some CPUs to be sure it is synced between them all. But that check happens only on some CPUs, not all of them; this choice is based on the parameter "verify_n_cpus" and in some random cpumask calculation. So, the watchdog runs such per CPU checks on up to "verify_n_cpus" random CPUs among all online CPUs, with the risk of repeating CPUs (that aren't double checked) in the cpumask random calculation. But if "verify_n_cpus" > num_online_cpus(), it should skip the random calculation and just go ahead and check the clocksource sync between all online CPUs, without the risk of skipping some CPUs due to duplicity in the random cpumask calculation. Tests in a 4 CPU laptop with TSC skew detected led to some cases of the per CPU verification skipping some CPU even with verify_n_cpus=8, due to the duplicity on random cpumask generation. Skipping the randomization when the number of online CPUs is smaller than verify_n_cpus, solves that. Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27ftrace: Fix UAF when lookup kallsym after ftrace disabledYe Bin
commit f914b52c379c12288b7623bb814d0508dbe7481d upstream. The following issue happens with a buggy module: BUG: unable to handle page fault for address: ffffffffc05d0218 PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS RIP: 0010:sized_strscpy+0x81/0x2f0 RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000 RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68 R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038 R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ftrace_mod_get_kallsym+0x1ac/0x590 update_iter_mod+0x239/0x5b0 s_next+0x5b/0xa0 seq_read_iter+0x8c9/0x1070 seq_read+0x249/0x3b0 proc_reg_read+0x1b0/0x280 vfs_read+0x17f/0x920 ksys_read+0xf3/0x1c0 do_syscall_64+0x5f/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The above issue may happen as follows: (1) Add kprobe tracepoint; (2) insmod test.ko; (3) Module triggers ftrace disabled; (4) rmmod test.ko; (5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed; ftrace_mod_get_kallsym() ... strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); ... The problem is when a module triggers an issue with ftrace and sets ftrace_disable. The ftrace_disable is set when an anomaly is discovered and to prevent any more damage, ftrace stops all text modification. The issue that happened was that the ftrace_disable stops more than just the text modification. When a module is loaded, its init functions can also be traced. Because kallsyms deletes the init functions after a module has loaded, ftrace saves them when the module is loaded and function tracing is enabled. This allows the output of the function trace to show the init function names instead of just their raw memory addresses. When a module is removed, ftrace_release_mod() is called, and if ftrace_disable is set, it just returns without doing anything more. The problem here is that it leaves the mod_list still around and if kallsyms is called, it will call into this code and access the module memory that has already been freed as it will return: strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); Where the "mod" no longer exists and triggers a UAF bug. Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/ Cc: stable@vger.kernel.org Fixes: aba4b5c22cba ("ftrace: Save module init functions kallsyms symbols for tracing") Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27cgroup,freezer: fix incomplete freezing when attaching tasksChen Ridong
commit 37fb58a7273726e59f9429c89ade5116083a213d upstream. An issue was found: # cd /sys/fs/cgroup/freezer/ # mkdir test # echo FROZEN > test/freezer.state # cat test/freezer.state FROZEN # sleep 1000 & [1] 863 # echo 863 > test/cgroup.procs # cat test/freezer.state FREEZING When tasks are migrated to a frozen cgroup, the freezer fails to immediately freeze the tasks, causing the cgroup to remain in the "FREEZING". The freeze_task() function is called before clearing the CGROUP_FROZEN flag. This causes the freezing() check to incorrectly return false, preventing __freeze_task() from being invoked for the migrated task. To fix this issue, clear the CGROUP_FROZEN state before calling freeze_task(). Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27posix-cpu-timers: fix race between handle_posix_cpu_timers() and ↵Oleg Nesterov
posix_cpu_timer_del() commit f90fff1e152dedf52b932240ebbd670d83330eca upstream. If an exiting non-autoreaping task has already passed exit_notify() and calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent or debugger right after unlock_task_sighand(). If a concurrent posix_cpu_timer_del() runs at that moment, it won't be able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or lock_task_sighand() will fail. Add the tsk->exit_state check into run_posix_cpu_timers() to fix this. This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because exit_task_work() is called before exit_notify(). But the check still makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail anyway in this case. Cc: stable@vger.kernel.org Reported-by: Benoît Sevens <bsevens@google.com> Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27perf: Ensure bpf_perf_link path is properly serializedPeter Zijlstra
[ Upstream commit 7ed9138a72829d2035ecbd8dbd35b1bc3c137c40 ] Ravi reported that the bpf_perf_link_attach() usage of perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the PERF_EVENT_IOC_SET_BPF case. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27bpf: Avoid __bpf_prog_ret0_warn when jit failsKaFai Wan
[ Upstream commit 86bc9c742426a16b52a10ef61f5b721aecca2344 ] syzkaller reported an issue: WARNING: CPU: 3 PID: 217 at kernel/bpf/core.c:2357 __bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357 Modules linked in: CPU: 3 UID: 0 PID: 217 Comm: kworker/u32:6 Not tainted 6.15.0-rc4-syzkaller-00040-g8bac8898fe39 RIP: 0010:__bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357 Call Trace: <TASK> bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline] __bpf_prog_run include/linux/filter.h:718 [inline] bpf_prog_run include/linux/filter.h:725 [inline] cls_bpf_classify+0x74a/0x1110 net/sched/cls_bpf.c:105 ... When creating bpf program, 'fp->jit_requested' depends on bpf_jit_enable. This issue is triggered because of CONFIG_BPF_JIT_ALWAYS_ON is not set and bpf_jit_enable is set to 1, causing the arch to attempt JIT the prog, but jit failed due to FAULT_INJECTION. As a result, incorrectly treats the program as valid, when the program runs it calls `__bpf_prog_ret0_warn` and triggers the WARN_ON_ONCE(1). Reported-by: syzbot+0903f6d7f285e41cdf10@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/6816e34e.a70a0220.254cdc.002c.GAE@google.com Fixes: fa9dd599b4da ("bpf: get rid of pure_initcall dependency to enable jits") Signed-off-by: KaFai Wan <mannkafai@gmail.com> Link: https://lore.kernel.org/r/20250526133358.2594176-1-mannkafai@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27bpf: Fix WARN() in get_bpf_raw_tp_regsTao Chen
[ Upstream commit 3880cdbed1c4607e378f58fa924c5d6df900d1d3 ] syzkaller reported an issue: WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861 Modules linked in: CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861 RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005 RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003 R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004 R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900 FS: 0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline] bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931 bpf_prog_ec3b2eefa702d8d3+0x43/0x47 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline] __bpf_prog_run include/linux/filter.h:718 [inline] bpf_prog_run include/linux/filter.h:725 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline] bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405 __bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47 __traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47 __do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline] trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline] __mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline] mmap_read_trylock include/linux/mmap_lock.h:204 [inline] stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157 __bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483 ____bpf_get_stack kernel/bpf/stackmap.c:499 [inline] bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline] bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931 bpf_prog_ec3b2eefa702d8d3+0x43/0x47 Tracepoint like trace_mmap_lock_acquire_returned may cause nested call as the corner case show above, which will be resolved with more general method in the future. As a result, WARN_ON_ONCE will be triggered. As Alexei suggested, remove the WARN_ON_ONCE first. Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data") Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27tracing: Fix error handling in event_trigger_parse()Miaoqian Lin
[ Upstream commit c5dd28e7fb4f63475b50df4f58311df92939d011 ] According to trigger_data_alloc() doc, trigger_data_free() should be used to free an event_trigger_data object. This fixes a mismatch introduced when kzalloc was replaced with trigger_data_alloc without updating the corresponding deallocation calls. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://lore.kernel.org/20250507145455.944453325@goodmis.org Link: https://lore.kernel.org/20250318112737.4174-1-linmq006@gmail.com Fixes: e1f187d09e11 ("tracing: Have existing event_command.parse() implementations use helpers") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> [ SDR: Changed event_trigger_alloc/free() to trigger_data_alloc/free() ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27tracing: Rename event_trigger_alloc() to trigger_data_alloc()Steven Rostedt
[ Upstream commit f2947c4b7d0f235621c5daf78aecfbd6e22c05e5 ] The function event_trigger_alloc() creates an event_trigger_data descriptor and states that it needs to be freed via event_trigger_free(). This is incorrect, it needs to be freed by trigger_data_free() as event_trigger_free() adds ref counting. Rename event_trigger_alloc() to trigger_data_alloc() and state that it needs to be freed via trigger_data_free(). This naming convention was introducing bugs. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://lore.kernel.org/20250507145455.776436410@goodmis.org Fixes: 86599dbe2c527 ("tracing: Add helper functions to simplify event_command.parse() callback handling") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks()Zijun Hu
[ Upstream commit f0050a3e214aa941b78ad4caf122a735a24d81a6 ] pm_show_wakelocks() is called to generate a string when showing attributes /sys/power/wake_(lock|unlock), but the string ends with an unwanted space that was added back by mistake by commit c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()"). Remove the unwanted space. Fixes: c9d967b2ce40 ("PM: wakeup: simplify the output logic of pm_show_wakelocks()") Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://patch.msgid.link/20250505-fix_power-v1-1-0f7f2c2f338c@quicinc.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27perf/core: Fix broken throttling when max_samples_per_tick=1Qing Wang
[ Upstream commit f51972e6f8b9a737b2b3eb588069acb538fa72de ] According to the throttling mechanism, the pmu interrupts number can not exceed the max_samples_per_tick in one tick. But this mechanism is ineffective when max_samples_per_tick=1, because the throttling check is skipped during the first interrupt and only performed when the second interrupt arrives. Perhaps this bug may cause little influence in one tick, but if in a larger time scale, the problem can not be underestimated. When max_samples_per_tick = 1: Allowed-interrupts-per-second max-samples-per-second default-HZ ARCH 200 100 100 X86 500 250 250 ARM64 ... Obviously, the pmu interrupt number far exceed the user's expect. Fixes: e050e3f0a71b ("perf: Fix broken interrupt rate throttling") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27tracing: Fix compilation warning on arm32Pan Taixi
commit 2fbdb6d8e03b70668c0876e635506540ae92ab05 upstream. On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is unsigned long. This hence triggers a compilation warning as min() asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t solves this issue and works on other target architectures as well. Compilation warning details: kernel/trace/trace.c: In function 'tracing_splice_read_pipe': ./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) ^ ./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck' (__typecheck(x, y) && __no_side_effects(x, y)) ^~~~~~~~~~~ ... kernel/trace/trace.c:6771:8: note: in expansion of macro 'min' min((size_t)trace_seq_used(&iter->seq), ^~~ Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com Fixes: f5178c41bb43 ("tracing: Fix oob write in trace_seq_to_buffer()") Reviewed-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04fork: use pidfd_prepare()Christian Brauner
commit ca7707f5430ad6b1c9cb7cee0a7f67d69328bb2d upstream. Stop open-coding get_unused_fd_flags() and anon_inode_getfile(). That's brittle just for keeping the flags between both calls in sync. Use the dedicated helper. Message-Id: <20230327-pidfd-file-api-v1-2-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>