path: root/kernel
Age  Commit message  [Author]
2025-04-25  tracing: Fix filter string testing  [Steven Rostedt]
commit a8c5b0ed89a3f2c81c6ae0b041394e6eea0e7024 upstream. The filter string testing uses strncpy_from_kernel/user_nofault() to retrieve the string to test the filter against. The if() statement was incorrect as it considered 0 as a fault, when it is only negative that it faulted. Running the following commands: # cd /sys/kernel/tracing # echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter # echo 1 > events/syscalls/sys_enter_openat/enable # ls /proc/$$/maps # cat trace Would produce nothing, but with the fix it will produce something like: ls-1192 [007] ..... 8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0) Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home Fixes: 77360f9bbc7e5 ("tracing: Add test for user space strings when filtering on string pointers") Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
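A minimal sketch of the corrected check (buffer and variable names are illustrative, not the exact upstream diff): strncpy_from_user_nofault()/strncpy_from_kernel_nofault() return a non-negative length on success and a negative value only on a fault, so 0 must not be treated as an error:

    char buf[64];        /* illustrative size */
    long len;

    len = strncpy_from_user_nofault(buf, (const void __user *)addr, sizeof(buf));
    if (len < 0)         /* was "if (len <= 0)", which also rejected empty strings */
        return 0;        /* faulted: the string cannot match the filter */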
2025-04-25  cpufreq/sched: Explicitly synchronize limits_changed flag handling  [Rafael J. Wysocki]
commit 79443a7e9da3c9f68290a8653837e23aba0fa89f upstream. The handling of the limits_changed flag in struct sugov_policy needs to be explicitly synchronized to ensure that cpufreq policy limits updates will not be missed in some cases. Without that synchronization it is theoretically possible that the limits_changed update in sugov_should_update_freq() will be reordered with respect to the reads of the policy limits in cpufreq_driver_resolve_freq() and in that case, if the limits_changed update in sugov_limits() clobbers the one in sugov_should_update_freq(), the new policy limits may not take effect for a long time. Likewise, the limits_changed update in sugov_limits() may theoretically get reordered with respect to the updates of the policy limits in cpufreq_set_policy() and if sugov_should_update_freq() runs between them, the policy limits change may be missed. To ensure that the above situations will not take place, add memory barriers preventing the reordering in question from taking place and add READ_ONCE() and WRITE_ONCE() annotations around all of the limits_changed flag updates to prevent the compiler from messing up with that code. Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Cc: 5.3+ <stable@vger.kernel.org> # 5.3+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
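As an illustration only (field names follow kernel/sched/cpufreq_schedutil.c; the barrier placement is paraphrased from the changelog rather than copied from the patch), the reader side of the pattern looks roughly like this:

    if (unlikely(READ_ONCE(sg_policy->limits_changed))) {
        WRITE_ONCE(sg_policy->limits_changed, false);
        /*
         * Order the clearing of limits_changed before the subsequent
         * reads of the policy limits in cpufreq_driver_resolve_freq();
         * paired with the write side in sugov_limits().
         */
        smp_mb();
        return true;
    }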
2025-04-25  ftrace: fix incorrect hash size in register_ftrace_direct()  [Menglong Dong]
[ Upstream commit 92f1d3b40179b15630d72e2c6e4e25a899b67ba9 ] The maximum of the ftrace hash bits is made fls(32) in register_ftrace_direct(), which seems illogical. So, we fix it by making the max hash bits FTRACE_HASH_MAX_BITS instead. Link: https://lore.kernel.org/20250413014444.36724-1-dongml2@chinatelecom.cn Fixes: d05cb470663a ("ftrace: Fix modification of direct_function hash while in use") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
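The arithmetic behind the fix: fls(32) is 6, so capping the entry count at 32 before taking fls() limited the hash to 2^6 = 64 buckets no matter how many direct functions exist. A rough before/after sketch (helper names as used in kernel/trace/ftrace.c, shown here only as an illustration):

    /* before: bits could never exceed fls(32) == 6, i.e. 64 buckets */
    size = hash->count + direct_functions->count;
    if (size > 32)
        size = 32;
    new_hash = alloc_ftrace_hash(fls(size));

    /* after: cap the bit count itself at FTRACE_HASH_MAX_BITS */
    bits = fls(hash->count + direct_functions->count);
    if (bits > FTRACE_HASH_MAX_BITS)
        bits = FTRACE_HASH_MAX_BITS;
    new_hash = alloc_ftrace_hash(bits);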
2025-04-25  cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS  [Rafael J. Wysocki]
[ Upstream commit cfde542df7dd51d26cf667f4af497878ddffd85a ] Commit 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update") modified sugov_should_update_freq() to set the need_freq_update flag only for drivers with CPUFREQ_NEED_UPDATE_LIMITS set, but that flag generally needs to be set when the policy limits change because the driver callback may need to be invoked for the new limits to take effect. However, if the return value of cpufreq_driver_resolve_freq() after applying the new limits is still equal to the previously selected frequency, the driver callback needs to be invoked only in the case when CPUFREQ_NEED_UPDATE_LIMITS is set (which means that the driver specifically wants its callback to be invoked every time the policy limits change). Update the code accordingly to avoid missing policy limits changes for drivers without CPUFREQ_NEED_UPDATE_LIMITS. Fixes: 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update") Closes: https://lore.kernel.org/lkml/Z_Tlc6Qs-tYpxWYb@linaro.org/ Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3010358.e9J7NaK4W3@rjwysocki.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  sched_ext: create_dsq: Return -EEXIST on duplicate request  [Jake Hillion]
commit a8897ed8523d4c9d782e282b18005a3779c92714 upstream. create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently ignore duplicate entries. As a sched_ext scheduler is creating each DSQ for a different purpose this is surprising behaviour. Replace rhashtable_insert_fast which ignores duplicates with rhashtable_lookup_insert_fast that reports duplicates (though doesn't return their value). The rest of the code is structured correctly and this now returns -EEXIST. Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously this was ignored, now init fails with a -17 code. Also ran scx_lavd which continued to work well. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Acked-by: Andrea Righi <arighi@nvidia.com> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
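A sketch of the change (hash-table and field names follow kernel/sched/ext.c but are shown here only as an illustration): rhashtable_lookup_insert_fast() reports a duplicate key with -EEXIST, whereas rhashtable_insert_fast() would insert it silently:

    ret = rhashtable_lookup_insert_fast(&dsq_hash, &dsq->hash_node,
                                        dsq_hash_params);
    if (ret) {
        /* duplicate dsq_id: -EEXIST propagates to scx_bpf_create_dsq() */
        kfree(dsq);
        return ret;
    }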
2025-04-20  ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()  [Steven Rostedt]
commit e4d4b8670c44cdd22212cab3c576e2d317efa67c upstream. Some architectures do not have data cache coherency between user and kernel space. For these architectures, the cache needs to be flushed on both the kernel and user addresses so that user space can see the updates the kernel has made. Instead of using flush_dcache_folio() and playing with virt_to_folio() within the call to that function, use flush_kernel_vmap_range() which takes the virtual address and does the work for those architectures that need it. This also fixes a bug where the flush of the reader page only flushed one page. If the sub-buffer order is 1 or more, where the sub-buffer size would be greater than a page, it would miss the rest of the sub-buffer content, as the "reader page" is not just a page, but the size of a sub-buffer. Link: https://lore.kernel.org/all/CAG48ez3w0my4Rwttbc5tEbNsme6tc0mrSN95thjXUFaJ3aQ6SA@mail.gmail.com/ Cc: stable@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Link: https://lore.kernel.org/20250402144953.920792197@goodmis.org Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions"); Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
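Roughly, the reader-page flush becomes (structure member names follow kernel/trace/ring_buffer.c and are only illustrative; the key point is that the size argument covers the whole sub-buffer rather than a single page):

    flush_kernel_vmap_range(cpu_buffer->reader_page->page,
                            buffer->subbuf_size + BUF_PAGE_HDR_SIZE);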
2025-04-20  ftrace: Properly merge notrace hashes  [Andy Chiu]
commit 04a80a34c22f4db245f553d8696d1318d1c00ece upstream. The global notrace hash should be jointly decided by the intersection of each subops's notrace hash, but not the filter hash. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250408160258.48563-1-andybnac@gmail.com Fixes: 5fccc7552ccb ("ftrace: Add subops logic to allow one ops to manage many") Signed-off-by: Andy Chiu <andybnac@gmail.com> [ fixed removing of freeing of filter_hash ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20  ftrace: Add cond_resched() to ftrace_graph_set_hash()  [zhoumin]
commit 42ea22e754ba4f2b86f8760ca27f6f71da2d982c upstream. When the kernel contains a large number of functions that can be traced, the loop in ftrace_graph_set_hash() may take a lot of time to execute. This may trigger the softlockup watchdog. Add cond_resched() within the loop to allow the kernel to remain responsive even when processing a large number of functions. This matches the cond_resched() that is used in other locations of the code that iterates over all functions that can be traced. Cc: stable@vger.kernel.org Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables") Link: https://lore.kernel.org/tencent_3E06CE338692017B5809534B9C5C03DA7705@qq.com Signed-off-by: zhoumin <teczm@foxmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
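Schematically (the loop body and iteration variables are elided; do_for_each_ftrace_rec()/while_for_each_ftrace_rec() are the record-iteration macros used in kernel/trace/ftrace.c):

    do_for_each_ftrace_rec(pg, rec) {
        if (rec->flags & FTRACE_FL_DISABLED)
            continue;
        /* ... match the record against the glob and update the hash ... */

        /* the traceable-function list can be huge; avoid softlockups */
        cond_resched();
    } while_for_each_ftrace_rec();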
2025-04-20  tracing: Do not add length to print format in synthetic events  [Steven Rostedt]
commit e1a453a57bc76be678bd746f84e3d73f378a9511 upstream. The following causes a vsnprintf fault: # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger Because the synthetic event's "wakee" field is created as a dynamic string (even though the string copied is not). The print format to print the dynamic string changed from "%*s" to "%s" because another location (__set_synth_event_print_fmt()) exported this to user space, and user space did not need that. But it is still used in print_synth_event(), and the output looks like: <idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155 sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58 <idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91 bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21 <idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129 sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50 The length isn't needed as the string is always nul terminated. Just print the string and not add the length (which was hard coded to the max string length anyway). Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home Fixes: 4d38328eb442d ("tracing: Fix synth event printk format for str fields"); Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20  tracing: fprobe events: Fix possible UAF on modules  [Masami Hiramatsu (Google)]
commit dd941507a9486252d6fcf11814387666792020f3 upstream. Commit ac91052f0ae5 ("tracing: tprobe-events: Fix leakage of module refcount") moved try_module_get() from __find_tracepoint_module_cb() to the find_tracepoint() caller, but that introduced a possible UAF because the module can be unloaded before try_module_get(). In that case, the module object may already have been freed, so try_module_get() does not merely fail; it may access freed memory. To avoid that, call try_module_get() in __find_tracepoint_module_cb() again. Link: https://lore.kernel.org/all/174342990779.781946.9138388479067729366.stgit@devnote2/ Fixes: ac91052f0ae5 ("tracing: tprobe-events: Fix leakage of module refcount") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20  tracing: fprobe: Fix to lock module while registering fprobe  [Masami Hiramatsu (Google)]
commit d24fa977eec53399a9a49a2e1dc592430ea0a607 upstream. Since register_fprobe() does not get the module reference count while registering fgraph filter, if the target functions (symbols) are in modules, those modules can be unloaded when registering fprobe to fgraph. To avoid this issue, get the reference counter of module for each symbol, and put it after register the fprobe. Link: https://lore.kernel.org/all/174330568792.459674.16874380163991113156.stgit@devnote2/ Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250325130628.3a9e234c@gandalf.local.home/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
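The general pattern, sketched for a single address (the actual patch takes one reference per target symbol; sym_addr is a placeholder):

    struct module *mod;

    preempt_disable();
    mod = __module_text_address(sym_addr);
    if (mod && !try_module_get(mod))
        mod = NULL;
    preempt_enable();

    /* ... register the fprobe while the module cannot be unloaded ... */

    if (mod)
        module_put(mod);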
2025-04-20  uprobes: Avoid false-positive lockdep splat on CONFIG_PREEMPT_RT=y in the ri_timer() uprobe timer callback, use raw_write_seqcount_*()  [Andrii Nakryiko]
commit 0cd575cab10e114e95921321f069a08d45bc412e upstream. Avoid a false-positive lockdep warning in the CONFIG_PREEMPT_RT=y configuration when using write_seqcount_begin() in the uprobe timer callback by using raw_write_* APIs. Uprobe's use of the timer callback is guaranteed to not race with itself for a given uprobe_task, and as such seqcount's insistence on having preemption disabled on the writer side is irrelevant. So switch to raw_ variants of the seqcount API instead of disabling preemption unnecessarily. Also, point out in the comments more explicitly why we use seqcount despite our reader side being rather simple and never retrying. We favor a well-maintained kernel primitive over open-coding our own memory barriers. Fixes: 8622e45b5da1 ("uprobes: Reuse return_instances between multiple uretprobes within task") Reported-by: Alexei Starovoitov <ast@kernel.org> Suggested-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20250404194848.2109539-1-andrii@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
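A sketch of the writer side (the uprobe_task field names are assumptions based on the changelog, not copied from the patch):

    static void ri_timer(struct timer_list *timer)
    {
        struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);

        /*
         * The raw_ variants skip the preemption-disabled assertion of
         * write_seqcount_begin(); the timer callback is the only writer
         * for this uprobe_task, so the assertion is a false positive on
         * PREEMPT_RT.
         */
        raw_write_seqcount_begin(&utask->ri_seqcount);
        /* ... update the cached return_instance state ... */
        raw_write_seqcount_end(&utask->ri_seqcount);
    }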
2025-04-20  locking/lockdep: Decrease nr_unused_locks if lock unused in zap_class()  [Boqun Feng]
commit 495f53d5cca0f939eaed9dca90b67e7e6fb0e30c upstream. Currently, when a lock class is allocated, nr_unused_locks will be increased by 1, until it gets used: nr_unused_locks will be decreased by 1 in mark_lock(). However, one scenario is missed: a lock class may be zapped without even being used once. This could result into a situation that nr_unused_locks != 0 but no unused lock class is active in the system, and when `cat /proc/lockdep_stats`, a WARN_ON() will be triggered in a CONFIG_DEBUG_LOCKDEP=y kernel: [...] DEBUG_LOCKS_WARN_ON(debug_atomic_read(nr_unused_locks) != nr_unused) [...] WARNING: CPU: 41 PID: 1121 at kernel/locking/lockdep_proc.c:283 lockdep_stats_show+0xba9/0xbd0 And as a result, lockdep will be disabled after this. Therefore, nr_unused_locks needs to be accounted correctly at zap_class() time. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250326180831.510348-1-boqun.feng@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20  tracing: probe-events: Add comments about entry data storing code  [Masami Hiramatsu (Google)]
[ Upstream commit bb9c6020f4c3a07a90dc36826cb5fbe83f09efd5 ] Add comments about entry data storing code to __store_entry_arg() and traceprobe_get_entry_data_size(). These are a bit complicated because of building the entry data storing code and scanning it. This just add comments, no behavior change. Link: https://lore.kernel.org/all/174061715004.501424.333819546601401102.stgit@devnote2/ Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250226102223.586d7119@gandalf.local.home/ Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  tracing: probe-events: Log error for exceeding the number of arguments  [Masami Hiramatsu (Google)]
[ Upstream commit 57faaa04804ccbf16582f7fc7a6b986fd0c0e78c ] Add error message when the number of arguments exceeds the limitation. Link: https://lore.kernel.org/all/174055075075.4079315.10916648136898316476.stgit@mhiramat.tok.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTER  [Gabriele Paoloni]
[ Upstream commit 0c588ac0ca6c22b774d9ad4a6594681fdfa57d9d ] When __ftrace_event_enable_disable invokes the class callback to unregister the event, the return value is not reported up to the caller, hence leading to event unregister failures being silently ignored. This patch assigns the return value of the event unregister callback to the ret variable, so that it is stored and reported to the caller, and raises a warning in case of error. Link: https://lore.kernel.org/20250321170821.101403-1-gpaoloni@redhat.com Signed-off-by: Gabriele Paoloni <gpaoloni@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
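The shape of the fix (paraphrased, not the exact diff):

    /* was: call->class->reg(call, TRACE_REG_UNREGISTER, file); */
    ret = call->class->reg(call, TRACE_REG_UNREGISTER, file);
    WARN_ON_ONCE(ret);    /* surface unregister failures to the caller */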
2025-04-20  tracing: Disable branch profiling in noinstr code  [Josh Poimboeuf]
[ Upstream commit 2cbb20b008dba39893f0e296dc8ca312f40a9a0e ] CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  Flush console log from kernel_power_off()  [Paul E. McKenney]
[ Upstream commit 6ea9a1781c70a8be1fcdc49134fc1bf4baba8bca ] Kernels built with CONFIG_PREEMPT_RT=y can lose significant console output at shutdown time, which hides shutdown-time RCU issues from rcutorture. Therefore, make pr_flush() public and invoke it after the last print in kernel_power_off(). [ paulmck: Apply John Ogness feedback. ] [ paulmck: Apply Sebastian Andrzej Siewior feedback. ] [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/5f743488-dc2a-4f19-bdda-cf50b9314832@paulmck-laptop Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
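A sketch of where the flush lands (the surrounding power-off sequence is abbreviated and the timeout value is illustrative; pr_flush() takes a timeout in milliseconds and a reset-on-progress flag):

    void kernel_power_off(void)
    {
        kernel_shutdown_prepare(SYSTEM_POWER_OFF);
        /* ... power-off preparation ... */
        pr_emerg("Power down\n");
        pr_flush(1000, true);    /* drain pending console output first */
        kmsg_dump(KMSG_DUMP_SHUTDOWN);
        machine_power_off();
    }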
2025-04-20  PM: hibernate: Avoid deadlock in hibernate_compressor_param_set()  [Lizhi Xu]
[ Upstream commit 52323ed1444ea5c2a5f1754ea0a2d9c8c216ccdf ] syzbot reported a deadlock in lock_system_sleep() (see below). The write operation to "/sys/module/hibernate/parameters/compressor" conflicts with the registration of ieee80211 device, resulting in a deadlock when attempting to acquire system_transition_mutex under param_lock. To avoid this deadlock, change hibernate_compressor_param_set() to use mutex_trylock() for attempting to acquire system_transition_mutex and return -EBUSY when it fails. Task flags need not be saved or adjusted before calling mutex_trylock(&system_transition_mutex) because the caller is not going to end up waiting for this mutex and if it runs concurrently with system suspend in progress, it will be frozen properly when it returns to user space. syzbot report: syz-executor895/5833 is trying to acquire lock: ffffffff8e0828c8 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x87/0xa0 kernel/power/main.c:56 but task is already holding lock: ffffffff8e07dc68 (param_lock){+.+.}-{4:4}, at: kernel_param_lock kernel/params.c:607 [inline] ffffffff8e07dc68 (param_lock){+.+.}-{4:4}, at: param_attr_store+0xe6/0x300 kernel/params.c:586 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (param_lock){+.+.}-{4:4}: __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730 ieee80211_rate_control_ops_get net/mac80211/rate.c:220 [inline] rate_control_alloc net/mac80211/rate.c:266 [inline] ieee80211_init_rate_ctrl_alg+0x18d/0x6b0 net/mac80211/rate.c:1015 ieee80211_register_hw+0x20cd/0x4060 net/mac80211/main.c:1531 mac80211_hwsim_new_radio+0x304e/0x54e0 drivers/net/wireless/virtual/mac80211_hwsim.c:5558 init_mac80211_hwsim+0x432/0x8c0 drivers/net/wireless/virtual/mac80211_hwsim.c:6910 do_one_initcall+0x128/0x700 init/main.c:1257 do_initcall_level init/main.c:1319 [inline] do_initcalls init/main.c:1335 [inline] do_basic_setup init/main.c:1354 [inline] kernel_init_freeable+0x5c7/0x900 init/main.c:1568 kernel_init+0x1c/0x2b0 init/main.c:1457 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #2 (rtnl_mutex){+.+.}-{4:4}: __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730 wg_pm_notification drivers/net/wireguard/device.c:80 [inline] wg_pm_notification+0x49/0x180 drivers/net/wireguard/device.c:64 notifier_call_chain+0xb7/0x410 kernel/notifier.c:85 notifier_call_chain_robust kernel/notifier.c:120 [inline] blocking_notifier_call_chain_robust kernel/notifier.c:345 [inline] blocking_notifier_call_chain_robust+0xc9/0x170 kernel/notifier.c:333 pm_notifier_call_chain_robust+0x27/0x60 kernel/power/main.c:102 snapshot_open+0x189/0x2b0 kernel/power/user.c:77 misc_open+0x35a/0x420 drivers/char/misc.c:179 chrdev_open+0x237/0x6a0 fs/char_dev.c:414 do_dentry_open+0x735/0x1c40 fs/open.c:956 vfs_open+0x82/0x3f0 fs/open.c:1086 do_open fs/namei.c:3830 [inline] path_openat+0x1e88/0x2d80 fs/namei.c:3989 do_filp_open+0x20c/0x470 fs/namei.c:4016 do_sys_openat2+0x17a/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x175/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 ((pm_chain_head).rwsem){++++}-{4:4}: down_read+0x9a/0x330 kernel/locking/rwsem.c:1524 
blocking_notifier_call_chain_robust kernel/notifier.c:344 [inline] blocking_notifier_call_chain_robust+0xa9/0x170 kernel/notifier.c:333 pm_notifier_call_chain_robust+0x27/0x60 kernel/power/main.c:102 snapshot_open+0x189/0x2b0 kernel/power/user.c:77 misc_open+0x35a/0x420 drivers/char/misc.c:179 chrdev_open+0x237/0x6a0 fs/char_dev.c:414 do_dentry_open+0x735/0x1c40 fs/open.c:956 vfs_open+0x82/0x3f0 fs/open.c:1086 do_open fs/namei.c:3830 [inline] path_openat+0x1e88/0x2d80 fs/namei.c:3989 do_filp_open+0x20c/0x470 fs/namei.c:4016 do_sys_openat2+0x17a/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x175/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (system_transition_mutex){+.+.}-{4:4}: check_prev_add kernel/locking/lockdep.c:3163 [inline] check_prevs_add kernel/locking/lockdep.c:3282 [inline] validate_chain kernel/locking/lockdep.c:3906 [inline] __lock_acquire+0x249e/0x3c40 kernel/locking/lockdep.c:5228 lock_acquire.part.0+0x11b/0x380 kernel/locking/lockdep.c:5851 __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730 lock_system_sleep+0x87/0xa0 kernel/power/main.c:56 hibernate_compressor_param_set+0x1c/0x210 kernel/power/hibernate.c:1452 param_attr_store+0x18f/0x300 kernel/params.c:588 module_attr_store+0x55/0x80 kernel/params.c:924 sysfs_kf_write+0x117/0x170 fs/sysfs/file.c:139 kernfs_fop_write_iter+0x33d/0x500 fs/kernfs/file.c:334 new_sync_write fs/read_write.c:586 [inline] vfs_write+0x5ae/0x1150 fs/read_write.c:679 ksys_write+0x12b/0x250 fs/read_write.c:731 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: system_transition_mutex --> rtnl_mutex --> param_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(param_lock); lock(rtnl_mutex); lock(param_lock); lock(system_transition_mutex); *** DEADLOCK *** Reported-by: syzbot+ace60642828c074eb913@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ace60642828c074eb913 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Link: https://patch.msgid.link/20250224013139.3994500-1-lizhi.xu@windriver.com [ rjw: New subject matching the code changes, changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
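A sketch of the fix (the body of the setter is abbreviated; only the locking pattern matters here):

    static int hibernate_compressor_param_set(const char *val,
                                              const struct kernel_param *kp)
    {
        int ret;

        /*
         * Do not block on system_transition_mutex: the caller already holds
         * param_lock, and blocking here recreates the reported
         * param_lock -> rtnl_mutex -> system_transition_mutex deadlock.
         */
        if (!mutex_trylock(&system_transition_mutex))
            return -EBUSY;

        ret = param_set_copystring(val, kp);
        /* ... validate the name and switch the compressor on success ... */

        mutex_unlock(&system_transition_mutex);
        return ret;
    }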
2025-04-20  srcu: Force synchronization for srcu_get_delay()  [Paul E. McKenney]
[ Upstream commit d31e31365b5b6c0cdfc74d71be87234ced564395 ] Currently, srcu_get_delay() can be called concurrently, for example, by a CPU that is the first to request a new grace period and the CPU processing the current grace period. Although concurrent access is harmless, it unnecessarily expands the state space. Additionally, all calls to srcu_get_delay() are from slow paths. This commit therefore protects all calls to srcu_get_delay() with ssp->srcu_sup->lock, which is already held on the invocation from the srcu_funnel_gp_start() function. While in the area, this commit also adds a lockdep_assert_held() to srcu_get_delay() itself. Reported-by: syzbot+16a19b06125a2963eaee@syzkaller.appspotmail.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  perf: Fix hang while freeing sigtrap event  [Frederic Weisbecker]
[ Upstream commit 56799bc035658738f362acec3e7647bb84e68933 ] Perf can hang while freeing a sigtrap event if a related deferred signal hadn't managed to be sent before the file got closed: perf_event_overflow() task_work_add(perf_pending_task) fput() task_work_add(____fput()) task_work_run() ____fput() perf_release() perf_event_release_kernel() _free_event() perf_pending_task_sync() task_work_cancel() -> FAILED rcuwait_wait_event() Once task_work_run() is running, the list of pending callbacks is removed from the task_struct and from this point on task_work_cancel() can't remove any pending and not yet started work items, hence the task_work_cancel() failure and the hang on rcuwait_wait_event(). Task work could be changed to remove one work at a time, so a work running on the current task can always cancel a pending one, however the wait / wake design is still subject to inverted dependencies when remote targets are involved, as pictured by Oleg: T1 T2 fd = perf_event_open(pid => T2->pid); fd = perf_event_open(pid => T1->pid); close(fd) close(fd) <IRQ> <IRQ> perf_event_overflow() perf_event_overflow() task_work_add(perf_pending_task) task_work_add(perf_pending_task) </IRQ> </IRQ> fput() fput() task_work_add(____fput()) task_work_add(____fput()) task_work_run() task_work_run() ____fput() ____fput() perf_release() perf_release() perf_event_release_kernel() perf_event_release_kernel() _free_event() _free_event() perf_pending_task_sync() perf_pending_task_sync() rcuwait_wait_event() rcuwait_wait_event() Therefore the only option left is to acquire the event reference count upon queueing the perf task work and release it from the task work, just like it was done before 3a5465418f5f ("perf: Fix event leak upon exec and file release") but without the leaks it fixed. Some adjustments are necessary to make it work: * A child event might dereference its parent upon freeing. Care must be taken to release the parent last. * Some places assuming the event doesn't have any reference held and therefore can be freed right away must instead put the reference and let the reference counting to its job. Reported-by: "Yi Lai" <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/all/Zx9Losv4YcJowaP%2F@ly-workstation/ Reported-by: syzbot+3c4321e10eea460eb606@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/673adf75.050a0220.87769.0024.GAE@google.com/ Fixes: 3a5465418f5f ("perf: Fix event leak upon exec and file release") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250304135446.18905-1-frederic@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  perf/core: Simplify the perf_event_alloc() error path  [Peter Zijlstra]
[ Upstream commit c70ca298036c58a88686ff388d3d367e9d21acf0 ] The error cleanup sequence in perf_event_alloc() is a subset of the existing _free_event() function (it must of course be). Split this out into __free_event() and simplify the error path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135517.967889521@infradead.org Stable-dep-of: 56799bc03565 ("perf: Fix hang while freeing sigtrap event") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  tracing: fprobe: Cleanup fprobe hash when module unloading  [Masami Hiramatsu (Google)]
[ Upstream commit a3dc2983ca7b90fd35f978502de6d4664d965cfb ] Clean up the fprobe address hash table on module unloading, because the target symbols will disappear when the module is unloaded and there is no guarantee that the same symbol will be mapped at the same address if it is loaded again. Note that this at least disables the fprobes whose target symbols are in the unloaded module. Unlike kprobes, fprobe does not re-enable the probe point by itself. To do that, the caller should take care to register/unregister the fprobe when loading/unloading modules. This simplifies the fprobe state management related to module loading/unloading. Link: https://lore.kernel.org/all/174343534473.843280.13988101014957210732.stgit@devnote2/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  cgroup/cpuset: Fix race between newly created partition and dying one  [Waiman Long]
[ Upstream commit a22b3d54de94f82ca057cc2ebf9496fa91ebf698 ] There is a possible race between removing a cgroup directory that is a partition root and the creation of a new partition. The partition to be removed can be dying but still online; it does not currently participate in checking for exclusive CPUs conflict, but the exclusive CPUs are still there in subpartitions_cpus and isolated_cpus. These two cpumasks are global states that affect the operation of cpuset partitions. The exclusive CPUs in dying cpusets will only be removed when the cpuset_css_offline() function is called after an RCU delay. As a result, it is possible that a new partition can be created with exclusive CPUs that overlap with those of a dying one. When that dying partition is finally offlined, it removes those overlapping exclusive CPUs from subpartitions_cpus and maybe isolated_cpus, resulting in an incorrect CPU configuration. This bug was found when a warning was triggered in remote_partition_disable() during testing because the subpartitions_cpus mask was empty. One possible way to fix this is to iterate the dying cpusets as well and avoid using the exclusive CPUs in those dying cpusets. However, this can still cause random partition creation failures or other anomalies due to racing. A better way to fix this race is to reset the partition state at the moment when a cpuset is being killed. Introduce a new css_killed() CSS function pointer and call it, if defined, before setting the CSS_DYING flag in kill_css(). Also update the css_is_dying() helper to use the CSS_DYING flag introduced by commit 33c35aa48178 ("cgroup: Prevent kill_css() from being called more than once") for proper synchronization. Add a new cpuset_css_killed() function to reset the partition state of a valid partition root if it is being killed. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  cgroup/cpuset: Fix error handling in remote_partition_disable()  [Waiman Long]
[ Upstream commit 8bf450f3aec3d1bbd725d179502c64b8992588e4 ] When remote_partition_disable() is called to disable a remote partition, it always sets the partition to an invalid partition state. It should only do so if an error code (prs_err) has been set. Correct that and add proper error code in places where remote_partition_disable() is called due to error. Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()  [Waiman Long]
[ Upstream commit 668e041662e92ab3ebcb9eb606d3ec01884546ab ] Before commit f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to partition & cpus changes"), a cpuset partition could not be enabled if not all the requested CPUs could be granted from the parent cpuset. After that commit, a cpuset partition can be created even if the requested exclusive CPUs contain CPUs not allowed in its parent. The delmask containing exclusive CPUs to be removed from its parent wasn't adjusted accordingly. That is not a problem until the introduction of a new isolated_cpus mask in commit 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") as the CPUs in the delmask may be added directly into isolated_cpus. As a result, isolated_cpus may incorrectly contain CPUs that are not isolated, leading to incorrect data reporting. Fix this by adjusting the delmask to reflect the actual exclusive CPUs for the creation of the partition. Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  tracing: Do not use PERF enums when perf is not defined  [Steven Rostedt]
commit 8eb1518642738c6892bd629b46043513a3bf1a6a upstream. An update was made to up the module ref count when a synthetic event is registered for both trace and perf events. But if perf is not configured in, the perf enums used will cause the kernel to fail to build. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Link: https://lore.kernel.org/20250323152151.528b5ced@batman.local.home Fixes: 21581dd4e7ff ("tracing: Ensure module defining synth event cannot be unloaded while tracing") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503232230.TeREVy8R-lkp@intel.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
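Schematically, the perf-only enumerators have to stay behind the same guard that defines them (abbreviated; not the exact diff):

    switch (type) {
    case TRACE_REG_REGISTER:
    case TRACE_REG_UNREGISTER:
        /* trace enable/disable: always built */
        break;
    #ifdef CONFIG_PERF_EVENTS
    case TRACE_REG_PERF_REGISTER:
    case TRACE_REG_PERF_UNREGISTER:
        /* only defined when perf is enabled; unguarded use broke =n builds */
        break;
    #endif
    default:
        break;
    }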
2025-04-10  tracing: Verify event formats that have "%*p.."  [Steven Rostedt]
commit ea8d7647f9ddf1f81e2027ed305299797299aa03 upstream. The trace event verifier checks the formats of trace events to make sure that they do not point at memory that is not in the trace event itself or in data that will never be freed. If an event references data that was allocated when the event triggered and that same data is freed before the event is read, then the kernel can crash by reading freed memory. The verifier runs at boot up (or module load) and scans the print formats of the events and checks their arguments to make sure that dereferenced pointers are safe. If the format uses "%*p.." the verifier will ignore it, and that could be dangerous. Cover this case as well. Also add to the sample code a use case of "%*pbl". Link: https://lore.kernel.org/all/bcba4d76-2c3f-4d11-baf0-02905db953dd@oracle.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Link: https://lore.kernel.org/20250327195311.2d89ec66@gandalf.local.home Reported-by: Libo Chen <libo.chen@oracle.com> Reviewed-by: Libo Chen <libo.chen@oracle.com> Tested-by: Libo Chen <libo.chen@oracle.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
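For reference, this is what such a format looks like in practice: the '*' consumes an extra integer argument (here the number of bits), which the verifier now has to account for before it checks the pointer argument that follows:

    /* "%*pbl" prints a cpumask as a CPU list, e.g. "0-3,8" */
    pr_info("online CPUs: %*pbl\n", cpumask_pr_args(cpu_online_mask));
    /* cpumask_pr_args(m) expands to: nr_cpu_ids, cpumask_bits(m) */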
2025-04-10  tracing/osnoise: Fix possible recursive locking for cpus_read_lock()  [Ran Xiaokai]
commit 7e6b3fcc9c5294aeafed0dbe1a09a1bc899bd0f2 upstream. Lockdep reports this deadlock log: osnoise: could not start sampling thread ============================================ WARNING: possible recursive locking detected -------------------------------------------- CPU0 ---- lock(cpu_hotplug_lock); lock(cpu_hotplug_lock); Call Trace: <TASK> print_deadlock_bug+0x282/0x3c0 __lock_acquire+0x1610/0x29a0 lock_acquire+0xcb/0x2d0 cpus_read_lock+0x49/0x120 stop_per_cpu_kthreads+0x7/0x60 start_kthread+0x103/0x120 osnoise_hotplug_workfn+0x5e/0x90 process_one_work+0x44f/0xb30 worker_thread+0x33e/0x5e0 kthread+0x206/0x3b0 ret_from_fork+0x31/0x50 ret_from_fork_asm+0x11/0x20 </TASK> This is the deadlock scenario: osnoise_hotplug_workfn() guard(cpus_read_lock)(); // first lock call start_kthread(cpu) if (IS_ERR(kthread)) { stop_per_cpu_kthreads(); { cpus_read_lock(); // second lock call. Cause the AA deadlock } } It is not necessary to call stop_per_cpu_kthreads() which stops osnoise kthread for every other CPUs in the system if a failure occurs during hotplug of a certain CPU. For start_per_cpu_kthreads(), if the start_kthread() call fails, this function calls stop_per_cpu_kthreads() to handle the error. Therefore, similarly, there is no need to call stop_per_cpu_kthreads() again within start_kthread(). So just remove stop_per_cpu_kthreads() from start_kthread to solve this issue. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250321095249.2739397-1-ranxiaokai627@163.com Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10  tracing: Fix synth event printk format for str fields  [Douglas Raillard]
commit 4d38328eb442dc06aec4350fd9594ffa6488af02 upstream. The printk format for synth event uses "%.*s" to print string fields, but then only passes the pointer part as var arg. Replace %.*s with %s as the C string is guaranteed to be null-terminated. The output in print fmt should never have been updated as __get_str() handles the string limit because it can access the length of the string in the string meta data that is saved in the ring buffer. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 8db4d6bfbbf92 ("tracing: Change synthetic event string format to limit printed length") Link: https://lore.kernel.org/20250325165202.541088-1-douglas.raillard@arm.com Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
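The problem in plain printf terms (an illustrative example, not the event code itself):

    const char *wakee = "sshd-session";

    /*
     * Broken: "%.*s" consumes two arguments (an int precision, then the
     * string pointer), but only the pointer was supplied, so the pointer
     * is consumed as the precision and a garbage value is then read as
     * the string pointer.
     */
    /* printk("wakee=%.*s\n", wakee);     <- undefined behaviour */

    /* Fixed: the field is NUL-terminated, so plain "%s" is sufficient. */
    printk("wakee=%s\n", wakee);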
2025-04-10  tracing: Ensure module defining synth event cannot be unloaded while tracing  [Douglas Raillard]
commit 21581dd4e7ff6c07d0ab577e3c32b13a74b31522 upstream. Currently, using synth_event_delete() will fail if the event is being used (tracing in progress), but that is normally done in the module exit function. At that stage, failing is problematic as returning a non-zero status means the module will become locked (impossible to unload or reload again). Instead, ensure the module exit function does not get called in the first place by increasing the module refcnt when the event is enabled. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 35ca5207c2d11 ("tracing: Add synthetic event command generation functions") Link: https://lore.kernel.org/20250318180906.226841-1-douglas.raillard@arm.com Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10  tracing: Fix use-after-free in print_graph_function_flags during tracer switching  [Tengda Wu]
commit 7f81f27b1093e4895e87b74143c59c055c3b1906 upstream. Kairui reported a UAF issue in print_graph_function_flags() during ftrace stress testing [1]. This issue can be reproduced by putting a 'mdelay(10)' after 'mutex_unlock(&trace_types_lock)' in s_start(), and executing the following script: $ echo function_graph > current_tracer $ cat trace > /dev/null & $ sleep 5 # Ensure the 'cat' reaches the 'mdelay(10)' point $ echo timerlat > current_tracer The root cause lies in the two calls to print_graph_function_flags within print_trace_line during each s_show(): * One through 'iter->trace->print_line()'; * Another through 'event->funcs->trace()', which is hidden in print_trace_fmt() before print_trace_line returns. Tracer switching only updates the former, while the latter continues to use the print_line function of the old tracer, which in the script above is print_graph_function_flags. Moreover, when switching from the 'function_graph' tracer to the 'timerlat' tracer, s_start only calls graph_trace_close of the 'function_graph' tracer to free 'iter->private', but does not set it to NULL. This provides an opportunity for 'event->funcs->trace()' to use an invalid 'iter->private'. To fix this issue, set 'iter->private' to NULL immediately after freeing it in graph_trace_close(), ensuring that an invalid pointer is not passed to other tracers. Additionally, clean up the unnecessary 'iter->private = NULL' during each 'cat trace' when using wakeup and irqsoff tracers. [1] https://lore.kernel.org/all/20231112150030.84609-1-ryncsn@gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Link: https://lore.kernel.org/20250320122137.23635-1-wutengda@huaweicloud.com Fixes: eecb91b9f98d ("tracing: Fix memleak due to race between current_tracer and trace") Closes: https://lore.kernel.org/all/CAMgjq7BW79KDSCyp+tZHjShSzHsScSiJxn5ffskp-QzVM06fxw@mail.gmail.com/ Reported-by: Kairui Song <kasong@tencent.com> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
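The resulting close path looks roughly like this (close to the upstream function, reproduced here only as a sketch):

    static void graph_trace_close(struct trace_iterator *iter)
    {
        struct fgraph_data *data = iter->private;

        if (data) {
            free_percpu(data->cpu_data);
            kfree(data);
            /* keep later print paths from dereferencing freed memory */
            iter->private = NULL;
        }
    }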
2025-04-10  uprobes/x86: Harden uretprobe syscall trampoline check  [Jiri Olsa]
commit fa6192adc32f4fdfe5b74edd5b210e12afd6ecc0 upstream. Jann reported a possible issue when trampoline_check_ip returns address near the bottom of the address space that is allowed to call into the syscall if uretprobes are not set up: https://lore.kernel.org/bpf/202502081235.5A6F352985@keescook/T/#m9d416df341b8fbc11737dacbcd29f0054413cbbf Though the mmap minimum address restrictions will typically prevent creating mappings there, let's make sure uretprobe syscall checks for that. Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250212220433.3624297-1-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10  perf/core: Fix child_total_time_enabled accounting bug at task exit  [Yeoreum Yun]
[ Upstream commit a3c3c66670cee11eb13aa43905904bf29cb92d32 ] The perf events code fails to account for total_time_enabled of inactive events. Here is a failure case for accounting total_time_enabled for CPU PMU events: sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s ... armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740 armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440 ` ` ` ` ` > total_time_running with child ` > total_time_enabled with child > count with child Performance counter stats for 'stress-ng --pthread=2 -t 2s': 1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%) 1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%) The two events above are opened on two different CPU PMUs, for example, each event is opened for a cluster in an Arm big.LITTLE system, they will never run on the same CPU. In theory, the total enabled time should be same for both events, as two events are opened and closed together. As the result show, the two events' total enabled time including child event is different (2289429840 vs 1950025700). This is because child events are not accounted properly if a event is INACTIVE state when the task exits: perf_event_exit_event() `> perf_remove_from_context() `> __perf_remove_from_context() `> perf_child_detach() -> Accumulate child_total_time_enabled `> list_del_event() -> Update child event's time The problem is the time accumulation happens prior to child event's time updating. Thus, it misses to account the last period's time when the event exits. The perf core layer follows the rule that timekeeping is tied to state change. To address the issue, make __perf_remove_from_context() handle the task exit case by passing 'DETACH_EXIT' to it and invoke perf_event_state() for state alongside with accounting the time. Then, perf_child_detach() populates the time into the parent's time metrics. After this patch, the bug is fixed: sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 10s ... armv8_pmuv3_0/event=0x08/: 15396770398 32157963940 21898169000 armv8_pmuv3_1/event=0x08/: 22428964974 32157963940 10259794940 Performance counter stats for 'stress-ng --pthread=2 -t 10s': 15,396,770,398 armv8_pmuv3_0/event=0x08/ (68.10%) 22,428,964,974 armv8_pmuv3_1/event=0x08/ (31.90%) [ mingo: Clarified the changelog. ] Fixes: ef54c1a476aef ("perf: Rework perf_event_exit_event()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Leo Yan <leo.yan@arm.com> Link: https://lore.kernel.org/r/20250326082003.1630986-1-yeoreum.yun@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  ring-buffer: Fix bytes_dropped calculation issue  [Feng Yang]
[ Upstream commit c73f0b69648501978e8b3e8fa7eef7f4197d0481 ] The calculation of bytes-dropped and bytes_dropped_nested is reversed. Although it does not affect the final calculation of total_dropped, it should still be modified. Link: https://lore.kernel.org/20250223070106.6781-1-yangfeng59949@163.com Fixes: 6c43e554a2a5 ("ring-buffer: Add ring buffer startup selftest") Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  reboot: reboot, not shutdown, on hw_protection_reboot timeout  [Ahmad Fatoum]
[ Upstream commit bbf0ec4f57e0a34983ca25f00ec90f4765572d78 ] hw_protection_shutdown() will kick off an orderly shutdown and if that takes longer than a configurable amount of time, an emergency shutdown will occur. Recently, hw_protection_reboot() was added for those systems that don't implement a proper shutdown and are better served by rebooting and having the boot firmware worry about doing something about the critical condition. On timeout of the orderly reboot of hw_protection_reboot(), the system would go into shutdown, instead of reboot. This is not a good idea, as going into shutdown was explicitly not asked for. Fix this by always doing an emergency reboot if hw_protection_reboot() is called and the orderly reboot takes too long. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-2-e1c09b090c0c@pengutronix.de Fixes: 79fa723ba84c ("reboot: Introduce thermal_zone_device_critical_reboot()") Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  reboot: replace __hw_protection_shutdown bool action parameter with an enum  [Ahmad Fatoum]
[ Upstream commit 318f05a057157c400a92f17c5f2fa3dd96f57c7e ] Patch series "reboot: support runtime configuration of emergency hw_protection action", v3. We currently leave the decision of whether to shutdown or reboot to protect hardware in an emergency situation to the individual drivers. This works out in some cases, where the driver detecting the critical failure has inside knowledge: It binds to the system management controller for example or is guided by hardware description that defines what to do. This is inadequate in the general case though as a driver reporting e.g. an imminent power failure can't know whether a shutdown or a reboot would be more appropriate for a given hardware platform. To address this, this series adds a hw_protection kernel parameter and sysfs toggle that can be used to change the action from the shutdown default to reboot. A new hw_protection_trigger API then makes use of this default action. My particular use case is unattended embedded systems that don't have support for shutdown and that power on automatically when power is supplied: - A brief power cycle gets detected by the driver - The kernel powers down the system and SoC goes into shutdown mode - Power is restored - The system remains oblivious to the restored power - System needs to be manually power cycled for a duration long enough to drain the capacitors With this series, such systems can configure the kernel with hw_protection=reboot to have the boot firmware worry about critical conditions. This patch (of 12): Currently __hw_protection_shutdown() either reboots or shuts down the system according to its shutdown argument. To make the logic easier to follow, both inside __hw_protection_shutdown and at caller sites, lets replace the bool parameter with an enum. This will be extra useful, when in a later commit, a third action is added to the enumeration. No functional change. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-0-e1c09b090c0c@pengutronix.de Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-1-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: bbf0ec4f57e0 ("reboot: reboot, not shutdown, on hw_protection_reboot timeout") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  kexec: initialize ELF lowest address to ULONG_MAX  [Sourabh Jain]
[ Upstream commit 9986fb5164c8b21f6439cfd45ba36d8cc80c9710 ] Patch series "powerpc/crash: use generic crashkernel reservation", v3. Commit 0ab97169aa05 ("crash_core: add generic function to do reservation") added a generic function to reserve crashkernel memory. So let's use the same function on powerpc and remove the architecture-specific code that essentially does the same thing. The generic crashkernel reservation also provides a way to split the crashkernel reservation into high and low memory reservations, which can be enabled for powerpc in the future. Additionally move powerpc to use generic APIs to locate memory hole for kexec segments while loading kdump kernel. This patch (of 7): kexec_elf_load() loads an ELF executable and sets the address of the lowest PT_LOAD section to the address held by the lowest_load_addr function argument. To determine the lowest PT_LOAD address, a local variable lowest_addr (type unsigned long) is initialized to UINT_MAX. After loading each PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD address is lower, lowest_addr is updated. However, setting lowest_addr to UINT_MAX won't work when the kernel image is loaded above 4G, as the returned lowest PT_LOAD address would be invalid. This is resolved by initializing lowest_addr to ULONG_MAX instead. This issue was discovered while implementing crashkernel high/low reservation on the PowerPC architecture. Link: https://lkml.kernel.org/r/20250131113830.925179-1-sourabhjain@linux.ibm.com Link: https://lkml.kernel.org/r/20250131113830.925179-2-sourabhjain@linux.ibm.com Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
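An approximate sketch of the affected loop (structure and field names follow kernel/kexec_elf.c but are shown here only as an illustration):

    unsigned long lowest_addr = ULONG_MAX;    /* was UINT_MAX */

    for (i = 0; i < ehdr->e_phnum; i++) {
        const struct elf_phdr *phdr = &elf_info->proghdrs[i];

        if (phdr->p_type != PT_LOAD)
            continue;
        /* ... load the segment ... */
        if (phdr->p_paddr < lowest_addr)
            lowest_addr = phdr->p_paddr;
    }

    /*
     * With UINT_MAX, an image whose PT_LOAD segments all sit above 4G
     * would report 0xffffffff, an address that was never loaded.
     */
    *lowest_load_addr = lowest_addr;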
2025-04-10kernel/events/uprobes: handle device-exclusive entries correctly in __replace_page()David Hildenbrand
[ Upstream commit 096cbb80ab3fd85a9035ec17a1312c2a7db8bc8c ]

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). __replace_page() is not prepared for that, so teach it about these PFN swap PTEs.

Note that device-private entries are so far not applicable on that path, because GUP would never have returned such folios (conversion to device-private happens by page migration, not in-place conversion of the PTE).

There is a race between GUP and us locking the folio to look it up using page_vma_mapped_walk(), so this is likely a fix (unless something else could prevent that race, but it doesn't look like it). pte_pfn() on something that is not a present pte could give us garbage, and we'd wrongly mess up the mapcount because it was already adjusted by calling folio_remove_rmap_pte() when making the entry device-exclusive.

Link: https://lkml.kernel.org/r/20250210193801.781278-9-david@redhat.com
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Alistair Popple <apopple@nvidia.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lyude <lyude@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yanteng Si <si.yanteng@linux.dev>
Cc: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
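Conceptually (this is a sketch of the idea, not the exact upstream diff), the point is that pte_pfn() may only be trusted for present PTEs; for a device-exclusive entry, i.e. a non-present PFN swap PTE, the PFN has to come from the swap entry:

    pte_t pteval = ptep_get(pvmw.pte);
    unsigned long pfn;

    if (pte_present(pteval))
            pfn = pte_pfn(pteval);
    else
            /* device-exclusive: decode the PFN from the swap entry */
            pfn = swp_offset_pfn(pte_to_swp_entry(pteval));

    /* ... then work on page_folio(pfn_to_page(pfn)) rather than using
     * pte_pfn() unconditionally ... */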
2025-04-10bpf: Fix array bounds error with may_gotoJiayuan Chen
[ Upstream commit 6ebc5030e0c5a698f1dd9a6684cddf6ccaed64a0 ]

may_goto uses an additional 8 bytes on the stack, which causes the interpreters[] array to go out of bounds when the index is calculated from stack_size.

1. If a BPF program is rewritten, re-evaluate the stack size. For non-JIT cases, reject loading directly.
2. For non-JIT cases, calculating interpreters[idx] may still cause an out-of-bounds array access; just warn about it.
3. For jit_requested cases, the execution of bpf_func also needs to be warned about. So move the definition of the function __bpf_prog_ret0_warn outside the CONFIG_BPF_JIT_ALWAYS_ON guard.

Reported-by: syzbot+d2a2c639d03ac200a4f1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000f823606139faa5d@google.com/
Fixes: 011832b97b311 ("bpf: Introduce may_goto instruction")
Signed-off-by: Jiayuan Chen <mrpre@163.com>
Link: https://lore.kernel.org/r/20250214091823.46042-2-mrpre@163.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
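Roughly, the interpreter is selected by stack depth, which is where an unexpected extra 8 bytes can push the index past the end of interpreters[]. A sketch of the guarded selection (simplified; treat it as an approximation of the actual kernel/bpf/core.c code, not a verbatim copy):

    static void bpf_prog_select_func(struct bpf_prog *fp)
    {
    #ifndef CONFIG_BPF_JIT_ALWAYS_ON
            u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
            u32 idx = round_up(stack_depth, 32) / 32 - 1;

            /* may_goto can grow the stack beyond what the verifier assumed;
             * clamp and warn instead of indexing past interpreters[]. */
            if (!fp->jit_requested &&
                !WARN_ON_ONCE(idx >= ARRAY_SIZE(interpreters)))
                    fp->bpf_func = interpreters[idx];
            else
                    fp->bpf_func = __bpf_prog_ret0_warn;
    #else
            fp->bpf_func = __bpf_prog_ret0_warn;
    #endif
    }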
2025-04-10bpf: Use preempt_count() directly in bpf_send_signal_common()Hou Tao
[ Upstream commit b4a8b5bba712a711d8ca1f7d04646db63f9c88f5 ]

bpf_send_signal_common() uses preemptible() to check whether or not the current context is preemptible. If it is preemptible, it will use irq_work to send the signal asynchronously instead of trying to hold a spin-lock, because spin-locks are sleepable under PREEMPT_RT.

However, preemptible() depends on CONFIG_PREEMPT_COUNT. When CONFIG_PREEMPT_COUNT is turned off (e.g., CONFIG_PREEMPT_VOLUNTARY=y), !preemptible() will be evaluated as 1 and bpf_send_signal_common() will use irq_work unconditionally.

Fix it by unfolding "!preemptible()" and using "preempt_count() != 0 || irqs_disabled()" instead.

Fixes: 87c544108b61 ("bpf: Send signals asynchronously if !preemptible")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250220042259.1583319-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
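The change boils down to replacing the condition; a sketch (the "use_irq_work" variable is illustrative, the surrounding function body is elided):

    /* Before: unreliable when CONFIG_PREEMPT_COUNT is off, because
     * preemptible() then always evaluates to false. */
    if (!preemptible())
            use_irq_work = true;

    /* After: test the underlying conditions explicitly. */
    if (preempt_count() != 0 || irqs_disabled())
            use_irq_work = true;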
2025-04-10x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()David Hildenbrand
[ Upstream commit dc84bc2aba85a1508f04a936f9f9a15f64ebfb31 ]

If track_pfn_copy() fails, we already added the dst VMA to the maple tree. As fork() fails, we'll cleanup the maple tree, and stumble over the dst VMA for which we neither performed any reservation nor copied any page tables.

Consequently untrack_pfn() will see VM_PAT and try obtaining the PAT information from the page table -- which fails because the page table was not copied.

The easiest fix would be to simply clear the VM_PAT flag of the dst VMA if track_pfn_copy() fails. However, "simply" clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and performed a reservation, but copying the page tables fails, we'll simply clear the VM_PAT flag, not properly undoing the reservation ... which is also wrong.

So let's fix it properly: set the VM_PAT flag only if the reservation succeeded (leaving it clear initially), and undo the reservation if anything goes wrong while copying the page tables: clearing the VM_PAT flag after undoing the reservation.

Note that any copied page table entries will get zapped when the VMA will get removed later, after copy_page_range() succeeded; as VM_PAT is not set then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be happy. Note that leaving these page tables in place without a reservation is not a problem, as we are aborting fork(); this process will never run.

A reproducer can trigger this usually at the first try:

  https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c

  WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
  Modules linked in: ...
  CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:get_pat_info+0xf6/0x110
  ...
  Call Trace:
   <TASK>
   ...
   untrack_pfn+0x52/0x110
   unmap_single_vma+0xa6/0xe0
   unmap_vmas+0x105/0x1f0
   exit_mmap+0xf6/0x460
   __mmput+0x4b/0x120
   copy_process+0x1bf6/0x2aa0
   kernel_clone+0xab/0x440
   __do_sys_clone+0x66/0x90
   do_syscall_64+0x95/0x180

Likely this case was missed in:

  d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")

... and instead of undoing the reservation we simply cleared the VM_PAT flag.

Keep the documentation of these functions in include/linux/pgtable.h, one place is more than sufficient -- we should clean that up for the other functions like track_pfn_remap/untrack_pfn separately.

Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com
Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Signed-off-by: Sasha Levin <sashal@kernel.org>
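A condensed sketch of the ordering the fix establishes in copy_page_range() (error handling abbreviated; "copy_page_tables" is an illustrative stand-in for the real page-table copy helper):

    unsigned long pfn = 0;
    int ret;

    if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
            /* Make the PAT reservation; only on success does the dst VMA
             * end up with VM_PAT set. */
            ret = track_pfn_copy(dst_vma, src_vma, &pfn);
            if (ret)
                    return ret;
    }

    ret = copy_page_tables(dst_vma, src_vma);       /* illustrative name */

    if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
            /* Undo the reservation and clear VM_PAT again, so the later
             * VMA teardown (untrack_pfn()) has nothing to clean up. */
            untrack_pfn_copy(dst_vma, pfn);

    return ret;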
2025-04-10sched/deadline: Rebuild root domain accounting after every updateJuri Lelli
[ Upstream commit 2ff899e3516437354204423ef0a94994717b8e6a ]

Rebuilding of root domains accounting information (total_bw) is currently broken in some cases, e.g. suspend/resume on aarch64. The problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile.

Fix it by simplifying things: make sure bandwidth accounting is cleared and completely restored after root domain changes (after root domains are again stable).

To be sure we always call dl_rebuild_rd_accounting() while holding cpuset_mutex, we also add a cpuset_reset_sched_domains() wrapper.

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Co-developed-by: Waiman Long <llong@redhat.com>
Signed-off-by: Waiman Long <llong@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
Signed-off-by: Sasha Levin <sashal@kernel.org>
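As a sketch of the wrapper mentioned above (body shown for illustration only; the real helper may differ in detail), the point is simply to take cpuset_mutex around a full domain rebuild so dl_rebuild_rd_accounting() always runs under it:

    void cpuset_reset_sched_domains(void)
    {
            mutex_lock(&cpuset_mutex);
            /* Rebuild the default (single) sched domain set. */
            partition_sched_domains(1, NULL, NULL);
            mutex_unlock(&cpuset_mutex);
    }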
2025-04-10sched/deadline: Generalize unique visiting of root domainsJuri Lelli
[ Upstream commit 45007c6fb5860cf63556a9cadc87c8984927e23d ]

Bandwidth checks and updates that work on root domains currently employ a cookie mechanism for efficiency. This mechanism is very much tied to when root domains are first created and initialized.

Generalize the cookie mechanism so that it can also be used later at runtime while updating root domains. Additionally, guard it with sched_domains_mutex, since domains need to be stable while updating them (and it will be required for further dynamic changes).

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb
Signed-off-by: Sasha Levin <sashal@kernel.org>
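A rough sketch of the generalized cookie idea (field and variable names are illustrative, not the upstream identifiers): bump a generation cookie under sched_domains_mutex and visit each root domain at most once per pass.

    static u64 dl_cookie;   /* illustrative: bumped once per accounting pass */

    void dl_rebuild_rd_accounting(void)
    {
            int cpu;

            lockdep_assert_held(&sched_domains_mutex);
            dl_cookie++;

            for_each_possible_cpu(cpu) {
                    struct root_domain *rd = cpu_rq(cpu)->rd;

                    if (rd->visit_cookie == dl_cookie)
                            continue;       /* already handled in this pass */
                    rd->visit_cookie = dl_cookie;

                    /* ... clear and re-add per-task bandwidth for this rd ... */
            }
    }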
2025-04-10sched/topology: Wrappers for sched_domains_mutexJuri Lelli
[ Upstream commit 56209334dda1832c0a919e1d74768c6d0f3b2ca9 ] Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin <sashal@kernel.org>
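The wrappers are trivial; a sketch of their shape (names follow the mutex they wrap; the !CONFIG_SMP variants compile to no-ops because there is no mutex to take):

    #ifdef CONFIG_SMP
    void sched_domains_mutex_lock(void)
    {
            mutex_lock(&sched_domains_mutex);
    }

    void sched_domains_mutex_unlock(void)
    {
            mutex_unlock(&sched_domains_mutex);
    }
    #else
    static inline void sched_domains_mutex_lock(void) { }
    static inline void sched_domains_mutex_unlock(void) { }
    #endif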
2025-04-10sched/deadline: Ignore special tasks when rebuilding domainsJuri Lelli
[ Upstream commit f6147af176eaa4027b692fdbb1a0a60dfaa1e9b6 ] SCHED_DEADLINE special tasks get a fake bandwidth that is only used to make sure sleeping and priority inheritance 'work', but it is ignored for runtime enforcement and admission control. Be consistent with it also when rebuilding root domains. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250313170011.357208-2-juri.lelli@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
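Conceptually the fix is a skip while re-adding per-task bandwidth during the rebuild; a sketch with the surrounding context elided (dl_entity_is_special() is the existing check for such fake-bandwidth entities, e.g. the sugov kthreads):

    /* Skip "special" DL entities: their bandwidth is fake and already
     * ignored by admission control, so it must not be added back into
     * the root domain either. */
    if (!dl_task(p) || dl_entity_is_special(&p->dl))
            return;

    __dl_add(&rd->dl_bw, p->dl.dl_bw, cpumask_weight(rd->span));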
2025-04-10perf: Supply task information to sched_task()Kan Liang
[ Upstream commit d57e94f5b891925e4f2796266eba31edd5a01903 ]

To save/restore LBR call stack data in system-wide mode, the task_struct information is required.

Extend the parameters of sched_task() to supply task_struct information. When a task is scheduled in, the LBR call stack data for the new task will be restored. When a task is scheduled out, the LBR call stack data for the old task will be saved. Only the required task_struct information needs to be passed.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
Stable-dep-of: 3cec9fd03543 ("perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
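The interface change amounts to widening the PMU callback; a sketch of the resulting signature (struct abbreviated, argument order as described above):

    struct pmu {
            /* ... */

            /*
             * Called at context-switch time so the PMU driver can
             * save/restore per-task state (e.g. LBR call stack data)
             * for the task being switched in or out.
             */
            void (*sched_task)(struct perf_event_pmu_context *pmu_ctx,
                               struct task_struct *task, bool sched_in);

            /* ... */
    };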
2025-04-10perf: Save PMU specific data in task_structKan Liang
[ Upstream commit cb4369129339060218baca718a578bb0b826e734 ]

Some PMU specific data has to be saved/restored during context switch, e.g. LBR call stack data. Currently, the data is saved in the event context structure, but only for per-process events. For system-wide events, because the LBR call stack data is missing after a context switch, LBR call stacks are always shorter in comparison to per-process mode.

For example,

Per-process mode:

  $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

  -   99.90%    99.86%  tchain_edit  tchain_edit  [.] f3
       99.86% _start
          __libc_start_main
          generic_start_main
          main
          f1
        - f2
             f3

System-wide mode:

  $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

  -   99.88%    99.82%  tchain_edit  tchain_edit  [.] f3
     - 62.02% main
          f1
          f2
          f3
     - 28.83% f1
        - f2
             f3
     - 28.83% f1
        - f2
             f3
     - 8.88% generic_start_main
          main
          f1
          f2
          f3

It isn't practical to simply allocate the data for system-wide events in the CPU context structure for all tasks. We have no idea which CPU a task will be scheduled to. The duplicated LBR data would have to be maintained in every CPU context structure. That's a huge waste. Otherwise, the LBR data is still lost if the task is scheduled to another CPU.

Save the PMU specific data in task_struct. The size of the PMU specific data is 788 bytes for the LBR call stack. Usually, the overall number of threads doesn't exceed a few thousand. For 10K threads, keeping the LBR data would consume an additional ~8MB. The additional space will only be allocated during LBR call stack monitoring. It will be released when the monitoring is finished.

Furthermore, moving task_ctx_data from perf_event_context to task_struct can reduce complexity and make things clearer. E.g. perf doesn't need to swap task_ctx_data on the optimized context switch path. This patch set is just the first step. There could be other optimizations/extensions on top of this patch set. E.g. for cgroup profiling, perf just needs to save/store the LBR call stack information for tasks in a specific cgroup. That could reduce the additional space. Also, the LBR call stack could be made available for software events, or even allow debugging use cases, like LBRs on crash later.

Because of the alignment requirement of Intel Arch LBR, a kmem_cache is used to allocate the PMU specific data. It's required when a child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of the PMU specific data.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Link: https://lore.kernel.org/r/20250314172700.438923-1-kan.liang@linux.intel.com
Stable-dep-of: 3cec9fd03543 ("perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
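A sketch of the per-task container described above (the field set and comments are illustrative, following the description: refcounted, kmem_cache backed, holding the PMU specific data, referenced from task_struct under RCU):

    struct perf_ctx_data {
            struct rcu_head         rcu_head;       /* freed via RCU on last put */
            refcount_t              refcount;       /* users of the PMU specific data */
            int                     global;         /* allocated for system-wide events */
            struct kmem_cache       *ctx_cache;     /* cache honouring LBR alignment */
            void                    *data;          /* the PMU specific data itself */
    };

    /* referenced from the task, e.g.: */
    struct task_struct {
            /* ... */
            struct perf_ctx_data __rcu      *perf_ctx_data;
            /* ... */
    };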
2025-04-10perf/ring_buffer: Allow the EPOLLRDNORM flag for pollTao Chen
[ Upstream commit c96fff391c095c11dc87dab35be72dee7d217cde ]

The poll man page says POLLRDNORM is equivalent to POLLIN. For poll(), it seems that if the user sets pollfd with POLLRDNORM in userspace, perf_poll will not return until the timeout even if perf_output_wakeup() is called, whereas POLLIN returns.

Fixes: 76369139ceb9 ("perf: Split up buffer handling from core code")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250314030036.2543180-1-chen.dylane@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
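The fix is essentially a flag-mask change where the ring buffer marks itself readable; sketched:

    /* kernel/events/ring_buffer.c, perf_output_wakeup() -- sketch of the
     * one-line change: advertise readability with both flags so a
     * userspace poll() asking only for POLLRDNORM is woken up too. */
    atomic_set(&handle->rb->poll, EPOLLIN | EPOLLRDNORM);   /* was EPOLLIN */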
2025-04-10watchdog/hardlockup/perf: Fix perf_event memory leakLi Huafei
[ Upstream commit d6834d9c990333bfa433bc1816e2417f268eebbe ]

During stress-testing, we found a kmemleak report for perf_event:

  unreferenced object 0xff110001410a33e0 (size 1328):
    comm "kworker/4:11", pid 288, jiffies 4294916004
    hex dump (first 32 bytes):
      b8 be c2 3b 02 00 11 ff 22 01 00 00 00 00 ad de  ...;...."......
      f0 33 0a 41 01 00 11 ff f0 33 0a 41 01 00 11 ff  .3.A.....3.A....
    backtrace (crc 24eb7b3a):
      [<00000000e211b653>] kmem_cache_alloc_node_noprof+0x269/0x2e0
      [<000000009d0985fa>] perf_event_alloc+0x5f/0xcf0
      [<00000000084ad4a2>] perf_event_create_kernel_counter+0x38/0x1b0
      [<00000000fde96401>] hardlockup_detector_event_create+0x50/0xe0
      [<0000000051183158>] watchdog_hardlockup_enable+0x17/0x70
      [<00000000ac89727f>] softlockup_start_fn+0x15/0x40
      ...

Our stress test includes CPU online and offline cycles, and updating the watchdog configuration.

After reading the code, I found that there may be a race between cleaning up perf_event after updating watchdog and disabling event when the CPU goes offline:

  CPU0                                CPU1                                CPU2
  (update watchdog)                                                       (hotplug offline CPU1)

  ...                                                                     _cpu_down(CPU1)
  cpus_read_lock()                                                        // waiting for cpu lock
    softlockup_start_all
      smp_call_on_cpu(CPU1)
                                      softlockup_start_fn
                                      ...
                                        watchdog_hardlockup_enable(CPU1)
                                        perf create E1
                                        watchdog_ev[CPU1] = E1
  cpus_read_unlock()
                                                                          cpus_write_lock()
                                                                          cpuhp_kick_ap_work(CPU1)
                                      cpuhp_thread_fun
                                      ...
                                        watchdog_hardlockup_disable(CPU1)
                                          watchdog_ev[CPU1] = NULL
                                          dead_event[CPU1] = E1
  __lockup_detector_cleanup
    for each dead_events_mask
      release each dead_event
      /*
       * CPU1 has not been added to
       * dead_events_mask, then E1
       * will not be released
       */
                                          CPU1 -> dead_events_mask
      cpumask_clear(&dead_events_mask)
      // dead_events_mask is cleared, E1 is leaked

In this case, the leaked perf_event E1 matches the perf_event leak reported by kmemleak. Due to the low probability of problem recurrence (only reported once), I added some hack delays in the code:

  static void __lockup_detector_reconfigure(void)
  {
  	...
  	watchdog_hardlockup_start();
  	cpus_read_unlock();
  +	mdelay(100);
  	/*
  	 * Must be called outside the cpus locked section to prevent
  	 * recursive locking in the perf code.
  	...
  }

  void watchdog_hardlockup_disable(unsigned int cpu)
  {
  	...
  	perf_event_disable(event);
  	this_cpu_write(watchdog_ev, NULL);
  	this_cpu_write(dead_event, event);
  +	mdelay(100);
  	cpumask_set_cpu(smp_processor_id(), &dead_events_mask);
  	atomic_dec(&watchdog_cpus);
  	...
  }

  void hardlockup_detector_perf_cleanup(void)
  {
  	...
  		perf_event_release_kernel(event);
  		per_cpu(dead_event, cpu) = NULL;
  	}
  +	mdelay(100);
  	cpumask_clear(&dead_events_mask);
  }

Then, simultaneously performing CPU on/off and switching watchdog, it is almost certain to reproduce this leak.

The problem here is that releasing perf_event is not within the CPU hotplug read-write lock. Commit:

  941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock")

introduced deferred release to solve the deadlock caused by calling get_online_cpus() when releasing perf_event. Later, commit:

  efe951d3de91 ("perf/x86: Fix perf,x86,cpuhp deadlock")

removed the get_online_cpus() call on the perf_event release path to solve another deadlock problem.

Therefore, it is now possible to move the release of perf_event back into the CPU hotplug read-write lock, and release the event immediately after disabling it.

Fixes: 941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241021193004.308303-1-lihuafei1@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
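The shape of the fix described in the last paragraph, as a sketch (context abbreviated, not the verbatim diff): disable and release the event right away inside the CPU hotplug read-locked section, instead of parking it in dead_event for a deferred cleanup pass.

    void watchdog_hardlockup_disable(unsigned int cpu)
    {
            struct perf_event *event = this_cpu_read(watchdog_ev);

            if (event) {
                    perf_event_disable(event);
                    /* Release immediately; no dead_event/dead_events_mask
                     * bookkeeping, so nothing can be skipped and leaked. */
                    perf_event_release_kernel(event);
                    this_cpu_write(watchdog_ev, NULL);
                    atomic_dec(&watchdog_cpus);
            }
    }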