summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
21 hoursMerge tag 'wq-for-7.0-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Improve workqueue stall diagnostics: dump all busy workers (not just running ones), show wall-clock duration of in-flight work items, and add a sample module for reproducing stalls - Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id() - Rename pool->watchdog_ts to pool->last_progress_ts and related functions for clarity * tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope workqueue: Add stall detector sample module workqueue: Show all busy workers in stall diagnostics workqueue: Show in-flight work item duration in stall diagnostics workqueue: Rename pool->watchdog_ts to pool->last_progress_ts workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
21 hoursMerge tag 'cgroup-for-7.0-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks that haven't been removed yet, fixing a systemd timeout issue on PREEMPT_RT - Call rebuild_sched_domains() directly in CPU hotplug instead of deferring to a workqueue, fixing a race where online/offline CPUs could briefly appear in stale sched domains * tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Don't expose dead tasks in cgroup cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
22 hoursMerge tag 'sched_ext-for-7.0-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE() annotations for lock-free accesses to module parameters and dsq->seq - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and above) when passed through the int sched_class interface - Documentation updates: scheduling class precedence, task ownership state machine, example scheduler descriptions, config list cleanup - Selftest fix for format specifier and buffer length in file_write_long() * tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags sched_ext: Documentation: Update sched-ext.rst sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass() sched_ext: Documentation: Mention scheduling class precedence sched_ext: Document task ownership state machine sched_ext: Use READ_ONCE() for lock-free reads of module param variables sched_ext/selftests: Fix format specifier and buffer length in file_write_long() sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update
4 dayssched: idle: Make skipping governor callbacks more consistentRafael J. Wysocki
If the cpuidle governor .select() callback is skipped because there is only one idle state in the cpuidle driver, the .reflect() callback should be skipped as well, at least for consistency (if not for correctness), so do it. Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki
5 dayssched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointerzhidao su
scx_enable() uses double-checked locking to lazily initialize a static kthread_worker pointer. The fast path reads helper locklessly: if (!READ_ONCE(helper)) { // lockless read -- no helper_mutex The write side initializes helper under helper_mutex, but previously used a plain assignment: helper = kthread_run_worker(0, "scx_enable_helper"); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ plain write -- KCSAN data race with READ_ONCE() above Since READ_ONCE() on the fast path and the plain write on the initialization path access the same variable without a common lock, they constitute a data race. KCSAN requires that all sides of a lock-free access use READ_ONCE()/WRITE_ONCE() consistently. Use a temporary variable to stage the result of kthread_run_worker(), and only WRITE_ONCE() into helper after confirming the pointer is valid. This avoids a window where a concurrent caller on the fast path could observe an ERR pointer via READ_ONCE(helper) before the error check completes. Fixes: b06ccbabe250 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation") Signed-off-by: zhidao su <suzhidao@xiaomi.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
7 daysMerge tag 'timers-urgent-2026-03-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Make clock_adjtime() syscall timex validation slightly more permissive for auxiliary clocks, to not reject syscalls based on the status field that do not try to modify the status field. This makes the ABI behavior in clock_adjtime() consistent with CLOCK_REALTIME" * tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix timex status validation for auxiliary clocks
7 daysMerge tag 'sched-urgent-2026-03-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a DL scheduler bug that may corrupt internal metrics during PI and setscheduler() syscalls, resulting in kernel warnings and misbehavior. Found during stress-testing" * tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
7 daysMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard Zingerman) - Fix precision backtracking with linked registers (Eduard Zingerman) - Fix linker flags detection for resolve_btfids (Ihor Solodrai) - Fix race in update_ftrace_direct_add/del (Jiri Olsa) - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: resolve_btfids: Fix linker flags detection selftests/bpf: add reproducer for spurious precision propagation through calls bpf: collect only live registers in linked regs Revert "selftests/bpf: Update reg_bound range refinement logic" selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary bpf: Fix u32/s32 bounds when ranges cross min/max boundary bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
7 daysMerge tag 'trace-v7.0-rc2-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible NULL pointer dereference in trace_data_alloc() On the trace_data_alloc() error path, it can call trigger_data_free() with a NULL pointer. This used to be a kfree() but was changed to trigger_data_free() to clean up any partial initialization. The issue is that trigger_data_free() does not expect a NULL pointer. Have trigger_data_free() return safely on NULL pointer. - Fix multiple events on the command line and bootconfig If multiple events are enabled on the command line separately and not grouped, only the last event gets enabled. That is: trace_event=sched_switch trace_event=sched_waking will only enable sched_waking whereas: trace_event=sched_switch,sched_waking will enable both. The bootconfig makes it even worse as the second way is the more common method. The issue is that a temporary buffer is used to store the events to enable later in boot. Each time the cmdline callback is called, it overwrites what was previously there. Have the callback append the next value (delimited by a comma) if the temporary buffer already has content. - Fix command line trace_buffer_size if >= 2G The logic to allocate the trace buffer uses "int" for the size parameter in the command line code causing overflow issues if more that 2G is specified. * tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G tracing: Fix enabling multiple events on the kernel command line and bootconfig tracing: Add NULL pointer check to trigger_data_free()
7 dayssched_ext: Fix enqueue_task_scx() truncation of upper enqueue flagsTejun Heo
enqueue_task_scx() takes int enq_flags from the sched_class interface. SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are silently truncated when passed through activate_task(). extra_enq_flags was added as a workaround - storing high bits in rq->scx.extra_enq_flags and OR-ing them back in enqueue_task_scx(). However, the OR target is still the int parameter, so the high bits are lost anyway. The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT which is informational to the BPF scheduler - its loss means the scheduler doesn't know about preemption but doesn't cause incorrect behavior. Fix by renaming the int parameter to core_enq_flags and introducing a u64 enq_flags local that merges both sources. All downstream functions already take u64 enq_flags. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
8 daysbpf: collect only live registers in linked regsEduard Zingerman
Fix an inconsistency between func_states_equal() and collect_linked_regs(): - regsafe() uses check_ids() to verify that cached and current states have identical register id mapping. - func_states_equal() calls regsafe() only for registers computed as live by compute_live_registers(). - clean_live_states() is supposed to remove dead registers from cached states, but it can skip states belonging to an iterator-based loop. - collect_linked_regs() collects all registers sharing the same id, ignoring the marks computed by compute_live_registers(). Linked registers are stored in the state's jump history. - backtrack_insn() marks all linked registers for an instruction as precise whenever one of the linked registers is precise. The above might lead to a scenario: - There is an instruction I with register rY known to be dead at I. - Instruction I is reached via two paths: first A, then B. - On path A: - There is an id link between registers rX and rY. - Checkpoint C is created at I. - Linked register set {rX, rY} is saved to the jump history. - rX is marked as precise at I, causing both rX and rY to be marked precise at C. - On path B: - There is no id link between registers rX and rY, otherwise register states are sub-states of those in C. - Because rY is dead at I, check_ids() returns true. - Current state is considered equal to checkpoint C, propagate_precision() propagates spurious precision mark for register rY along the path B. - Depending on a program, this might hit verifier_bug() in the backtrack_insn(), e.g. if rY ∈ [r1..r5] and backtrack_insn() spots a function call. The reproducer program is in the next patch. This was hit by sched_ext scx_lavd scheduler code. Changes in tests: - verifier_scalar_ids.c selftests need modification to preserve some registers as live for __msg() checks. - exceptions_assert.c adjusted to match changes in the verifier log, R0 is dead after conditional instruction and thus does not get range. - precise.c adjusted to match changes in the verifier log, register r9 is dead after comparison and it's range is not important for test. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Fixes: 0fb3cf6110a5 ("bpf: use register liveness information for func_states_equal") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daystracing: Fix trace_buf_size= cmdline parameter with sizes >= 2GCalvin Owens
Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long. It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
8 daysbpf: Fix u32/s32 bounds when ranges cross min/max boundaryEduard Zingerman
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations: - s32 range crosses U32_MAX/0 boundary, positive part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s32 range xxxxxxxxx] [xxxxxxx| 0 S32_MAX S32_MIN -1 - s32 range crosses U32_MAX/0 boundary, negative part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxxxxxx] [xxxxxxxxxxxx s32 range | 0 S32_MAX S32_MIN -1 - No refinement if ranges overlap in two intervals. This helps for e.g. consider the following program: call %[bpf_get_prandom_u32]; w0 &= 0xffffffff; if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX] if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1] if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1] r10 = 0; 1: ...; The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves: ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1])) Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 dayscgroup: Don't expose dead tasks in cgroupSebastian Andrzej Siewior
Once a task exits it has its state set to TASK_DEAD and then it is removed from the cgroup it belonged to. The last step happens on the task gets out of its last schedule() invocation and is delayed on PREEMPT_RT due to locking constraints. As a result it is possible to receive a pid via waitpid() of a task which is still listed in cgroup.procs for the cgroup it belonged to. This is something that systemd does not expect and as a result it waits for its exit until a time out occurs. This can also be reproduced on !PREEMPT_RT kernel with a significant delay in do_exit() after exit_notify(). Hide the task from the output which have PF_EXITING set which is done before the parent is notified. Keeping zombies with live threads shouldn't break anything (suggested by Tejun). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/ Tested-by: Bert Karwatzki <spasswolf@web.de> Fixes: 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT") Cc: stable@vger.kernel.org # v6.19+ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
8 daystracing: Fix enabling multiple events on the kernel command line and bootconfigAndrei-Alexandru Tachici
Multiple events can be enabled on the kernel command line via a comma separator. But if the are specified one at a time, then only the last event is enabled. This is because the event names are saved in a temporary buffer, and each call by the init cmdline code will reset that buffer. This also affects names in the boot config file, as it may call the callback multiple times with an example of: kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss" Change the cmdline callback function to append a comma and the next value if the temporary buffer already has content. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
8 daystracing: Add NULL pointer check to trigger_data_free()Guenter Roeck
If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse() jumps to the out_free error path. While kfree() safely handles a NULL pointer, trigger_data_free() does not. This causes a NULL pointer dereference in trigger_data_free() when evaluating data->cmd_ops->set_filter. Fix the problem by adding a NULL pointer check to trigger_data_free(). The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y. Cc: Miaoqian Lin <linmq006@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()") Assisted-by: Gemini:gemini-3.1-pro Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
8 dayscgroup/cpuset: Call rebuild_sched_domains() directly in hotplugWaiman Long
Besides deferring the call to housekeeping_update(), commit 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue") also defers the rebuild_sched_domains() call to the workqueue. So a new offline CPU may still be in a sched domain or new online CPU not showing up in the sched domains for a short transition period. That could be a problem in some corner cases and can be the cause of a reported test failure[1]. Fix it by calling rebuild_sched_domains_cpuslocked() directly in hotplug as before. If isolated partition invalidation or recreation is being done, the housekeeping_update() call to update the housekeeping cpumasks will still be deferred to a workqueue. In commit 3bfe47967191 ("cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together"), housekeeping_update() is called before rebuild_sched_domains() because it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN cpumask is now changeable at run time. As a result, we can move the rebuild_sched_domains() call before housekeeping_update() with the slight advantage that it will be done in the same cpus_read_lock critical section without the possibility of interference by a concurrent cpu hot add/remove operation. As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after calling housekeeping_update() and immediately release them again, move the cpuset_full_unlock() operation inside update_hk_sched_domains() and rename it to cpuset_update_sd_hk_unlock() to signify that it will release the full set of locks. [1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com Fixes: 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue") Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
8 dayssched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()David Carlier
Commit 0927780c90ce ("sched_ext: Use READ_ONCE() for lock-free reads of module param variables") annotated the plain reads of scx_slice_bypass_us and scx_bypass_lb_intv_us in bypass_lb_cpu(), but missed a third site in scx_bypass(): WRITE_ONCE(scx_slice_dfl, scx_slice_bypass_us * NSEC_PER_USEC); scx_slice_bypass_us is a module parameter writable via sysfs in process context through set_slice_us() -> param_set_uint_minmax(), which performs a plain store without holding bypass_lock. scx_bypass() reads the variable under bypass_lock, but since the writer does not take that lock, the two accesses are concurrent. WRITE_ONCE() only applies volatile semantics to the store of scx_slice_dfl -- the val expression containing scx_slice_bypass_us is evaluated as a plain read, providing no protection against concurrent writes. Wrap the read with READ_ONCE() to complete the annotation started by commit 0927780c90ce and make the access KCSAN-clean, consistent with the existing READ_ONCE(scx_slice_bypass_us) in bypass_lb_cpu(). Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
8 daysworkqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scopeBreno Leitao
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU hogs — since commit 8823eaef45da ("workqueue: Show all busy workers in stall diagnostics"), they dump every in-flight worker in the pool's busy_hash. Rename them to show_cpu_pool_busy_workers() and show_cpu_pools_busy_workers() to accurately describe what they do. Also fix the pr_info() message to say "stalled worker pools" instead of "stalled CPU-bound worker pools", since sleeping/blocked workers are now included. No functional change. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
8 daysMerge tag 'block-7.0-20260305' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Improve quirk visibility and configurability (Maurizio) - Fix runtime user modification to queue setup (Keith) - Fix multipath leak on try_module_get failure (Keith) - Ignore ambiguous spec definitions for better atomics support (John) - Fix admin queue leak on controller reset (Ming) - Fix large allocation in persistent reservation read keys (Sungwoo Kim) - Fix fcloop callback handling (Justin) - Securely free DHCHAP secrets (Daniel) - Various cleanups and typo fixes (John, Wilfred) - Avoid a circular lock dependency issue in the sysfs nr_requests or scheduler store handling - Fix a circular lock dependency with the pcpu mutex and the queue freeze lock - Cleanup for bio_copy_kern(), using __bio_add_page() rather than the bio_add_page(), as adding a page here cannot fail. The exiting code had broken cleanup for the error condition, so make it clear that the error condition cannot happen - Fix for a __this_cpu_read() in preemptible context splat * tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: use trylock to avoid lockdep circular dependency in sysfs nvme: fix memory allocation in nvme_pr_read_keys() block: use __bio_add_page in bio_copy_kern block: break pcpu_alloc_mutex dependency on freeze_lock blktrace: fix __this_cpu_read/write in preemptible context nvme-multipath: fix leak on try_module_get failure nvmet-fcloop: Check remoteport port_state before calling done callback nvme-pci: do not try to add queue maps at runtime nvme-pci: cap queue creation to used queues nvme-pci: ensure we're polling a polled queue nvme: fix memory leak in quirks_param_set() nvme: correct comment about nvme_ns_remove() nvme: stop setting namespace gendisk device driver data nvme: add support for dynamic quirk configuration via module parameter nvme: fix admin queue leak on controller reset nvme-fabrics: use kfree_sensitive() for DHCHAP secrets nvme: stop using AWUPF nvme: expose active quirks in sysfs nvme/host: fixup some typos
8 daysbpf: drop kthread_exit from noreturn_denyChristian Loehle
kthread_exit became a macro to do_exit in commit 28aaa9c39945 ("kthread: consolidate kthread exit paths to prevent use-after-free"), so there is no kthread_exit function BTF ID to resolve. Remove it from noreturn_deny to avoid resolve_btfids unresolved symbol warnings. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 daysworkqueue: Show all busy workers in stall diagnosticsBreno Leitao
show_cpu_pool_hog() only prints workers whose task is currently running on the CPU (task_is_running()). This misses workers that are busy processing a work item but are sleeping or blocked — for example, a worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a worker still occupies a pool slot and prevents progress, yet produces an empty backtrace section in the watchdog output. This is happening on real arm64 systems, where toggle_allocation_gate() IPIs every single CPU in the machine (which lacks NMI), causing workqueue stalls that show empty backtraces because toggle_allocation_gate() is sleeping in wait_event_idle(). Remove the task_is_running() filter so every in-flight worker in the pool's busy_hash is dumped. The busy_hash is protected by pool->lock, which is already held. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
9 daysworkqueue: Show in-flight work item duration in stall diagnosticsBreno Leitao
When diagnosing workqueue stalls, knowing how long each in-flight work item has been executing is valuable. Add a current_start timestamp (jiffies) to struct worker, set it when a work item begins execution in process_one_work(), and print the elapsed wall-clock time in show_pwq(). Unlike current_at (which tracks CPU runtime and resets on wakeup for CPU-intensive detection), current_start is never reset because the diagnostic cares about total wall-clock time including sleeps. Before: in-flight: 165:stall_work_fn [wq_stall] After: in-flight: 165:stall_work_fn [wq_stall] for 100s Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
9 daysworkqueue: Rename pool->watchdog_ts to pool->last_progress_tsBreno Leitao
The watchdog_ts name doesn't convey what the timestamp actually tracks. This field tracks the last time a workqueue got progress. Rename it to last_progress_ts to make it clear that it records when the pool last made forward progress (started processing new work items). No functional change. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
9 daysworkqueue: Use POOL_BH instead of WQ_BH when checking pool flagsBreno Leitao
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a workqueue-level flag (defined in workqueue.h). Pool flags use a separate namespace with POOL_* constants (defined in workqueue.c). The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined as (1 << 0) so this has no behavioral impact, but it is semantically wrong and inconsistent with every other pool-level BH check in the file. Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
9 dayssched_ext: Document task ownership state machineAndrea Righi
The task ownership state machine in sched_ext is quite hard to follow from the code alone. The interaction of ownership states, memory ordering rules and cross-CPU "lock dancing" makes the overall model subtle. Extend the documentation next to scx_ops_state to provide a more structured and self-contained description of the state transitions and their synchronization rules. The new reference should make the code easier to reason about and maintain and can help future contributors understand the overall task-ownership workflow. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
9 dayssched_ext: Use READ_ONCE() for lock-free reads of module param variableszhidao su
bypass_lb_cpu() reads scx_bypass_lb_intv_us and scx_slice_bypass_us without holding any lock, in timer callback context where module parameter writes via sysfs can happen concurrently: min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV; ^^^^^^^^^^^^^^^^^^^^ plain read -- KCSAN data race if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us)) ^^^^^^^^^^^^^^^^^ plain read -- KCSAN data race scx_bypass_lb_intv_us already uses READ_ONCE() in scx_bypass_lb_timerfn() and scx_bypass() for its other lock-free read sites, leaving bypass_lb_cpu() inconsistent. scx_slice_bypass_us has the same lock-free access pattern in the same function. Fix both plain reads by using READ_ONCE() to complete the concurrent access annotation and make the code KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
9 daysMerge tag 'trace-v7.0-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix thresh_return of function graph tracer The update to store data on the shadow stack removed the abuse of using the task recursion word as a way to keep track of what functions to ignore. The trace_graph_return() was updated to handle this, but when function_graph tracer is using a threshold (only trace functions that took longer than a specified time), it uses trace_graph_thresh_return() instead. This function was still incorrectly using the task struct recursion word causing the function graph tracer to permanently set all functions to "notrace" - Fix thresh_return nosleep accounting When the calltime was moved to the shadow stack storage instead of being on the fgraph descriptor, the calculations for the amount of sleep time was updated. The calculation was done in the trace_graph_thresh_return() function, which also called the trace_graph_return(), which did the calculation again, causing the time to be doubled. Remove the call to trace_graph_return() as what it needed to do wasn't that much, and just do the work in trace_graph_thresh_return(). - Fix syscall trace event activation on boot up The syscall trace events are pseudo events attached to the raw_syscall tracepoints. When the first syscall event is enabled, it enables the raw_syscall tracepoint and doesn't need to do anything when a second syscall event is also enabled. When events are enabled via the kernel command line, syscall events are partially enabled as the enabling is called before rcu_init. This is due to allow early events to be enabled immediately. Because kernel command line events do not distinguish between different types of events, the syscall events are enabled here but are not fully functioning. After rcu_init, they are disabled and re-enabled so that they can be fully enabled. The problem happened is that this "disable-enable" is done one at a time. If more than one syscall event is specified on the command line, by disabling them one at a time, the counter never gets to zero, and the raw_syscall is not disabled and enabled, keeping the syscall events in their non-fully functional state. Instead, disable all events and re-enabled them all, as that will ensure the raw_syscall event is also disabled and re-enabled. - Disable preemption in ftrace pid filtering The ftrace pid filtering attaches to the fork and exit tracepoints to add or remove pids that should be traced. They access variables protected by RCU (preemption disabled). Now that tracepoint callbacks are called with preemption enabled, this protection needs to be added explicitly, and not depend on the functions being called with preemption disabled. - Disable preemption in event pid filtering The event pid filtering needs the same preemption disabling guards as ftrace pid filtering. - Fix accounting of the memory mapped ring buffer on fork Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But this does not prevent the application from calling madvise(MADVISE_DOFORK). This causes the mapping to be copied on fork. After the first tasks exits, the mapping is considered unmapped by everyone. But when he second task exits, the counter goes below zero and triggers a WARN_ON. Since nothing prevents two separate tasks from mmapping the ftrace ring buffer (although two mappings may mess each other up), there's no reason to stop the memory from being copied on fork. Update the vm_operations to have an ".open" handler to update the accounting and let the ring buffer know someone else has it mapped. - Add all ftrace headers in MAINTAINERS file The MAINTAINERS file only specifies include/linux/ftrace.h But misses ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get all *ftrace* files. * tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Add MAINTAINERS entries for all ftrace headers tracing: Fix WARN_ON in tracing_buffers_mmap_close tracing: Disable preemption in the tracepoint callbacks handling filtered pids ftrace: Disable preemption in the tracepoint callbacks handling filtered pids tracing: Fix syscall events activation by ensuring refcount hits zero fgraph: Fix thresh_return nosleeptime double-adjust fgraph: Fix thresh_return clear per-task notrace
10 daysMerge tag 'modules-7.0-rc3.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull module fixes from Sami Tolvanen: - Fix a potential kernel panic in the module loader by adding a bounds check for the ELF section index. This prevents crashes if attempting to load a module that uses SHN_XINDEX or is corrupted. - Fix the Kconfig menu layout for module versioning, signing, and compression options so they correctly appear as submenus in menuconfig. - Remove a redundant lockdep_free_key_range() call in the load_module() error path. This is already handled by module_deallocate() calling free_mod_mem() since the module_memory rework. * tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: module: Fix kernel panic when a symbol st_shndx is out of bounds module: Fix the modversions and signing submenus module: Remove duplicate freeing of lockdep classes
10 daysMerge tag 'vfs-7.0-rc3.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - kthread: consolidate kthread exit paths to prevent use-after-free - iomap: - don't mark folio uptodate if read IO has bytes pending - don't report direct-io retries to fserror - reject delalloc mappings during writeback - ns: tighten visibility checks - netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence * tag 'vfs-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: reject delalloc mappings during writeback iomap: don't mark folio uptodate if read IO has bytes pending selftests: fix mntns iteration selftests nstree: tighten permission checks for listing nsfs: tighten permission checks for handle opening nsfs: tighten permission checks for ns iteration ioctls netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence kthread: consolidate kthread exit paths to prevent use-after-free iomap: don't report direct-io retries to fserror
10 daystimekeeping: Fix timex status validation for auxiliary clocksMiroslav Lichvar
The timekeeping_validate_timex() function validates the timex status of an auxiliary system clock even when the status is not to be changed, which causes unexpected errors for applications that make read-only clock_adjtime() calls, or set some other timex fields, but without clearing the status field. Do the AUX-specific status validation only when the modes field contains ADJ_STATUS, i.e. the application is actually trying to change the status. This makes the AUX-specific clock_adjtime() behavior consistent with CLOCK_REALTIME. Fixes: 4eca49d0b621 ("timekeeping: Prepare do_adtimex() for auxiliary clocks") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com
10 dayssched_ext: Use WRITE_ONCE() for the write side of dsq->seq updatezhidao su
bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding any lock, making dsq->seq a lock-free concurrently accessed variable. However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain increment without the matching WRITE_ONCE() on the write side: dsq->seq++; ^^^^^^^^^^^ plain write -- KCSAN data race The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain write leaves the pair incomplete and will trigger KCSAN warnings. Fix by using WRITE_ONCE() for the write side of the update: WRITE_ONCE(dsq->seq, dsq->seq + 1); This is consistent with bpf_iter_scx_dsq_new() and makes the concurrent access annotation complete and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
10 daysMerge tag 'sysctl-7.00-fixes-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl fix from Joel Granados: - Fix error when reporting jiffies converted values back to user space Return the converted value instead of "Invalid argument" error * tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
10 dayssched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boostingJuri Lelli
Running stress-ng --schedpolicy 0 on an RT kernel on a big machine might lead to the following WARNINGs (edited). sched: DL de-boosted task PID 22725: REPLENISH flag missing WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8 ... (running_bw underflow) Call trace: dequeue_task_dl+0x15c/0x1f8 (P) dequeue_task+0x80/0x168 deactivate_task+0x24/0x50 push_dl_task+0x264/0x2e0 dl_task_timer+0x1b0/0x228 __hrtimer_run_queues+0x188/0x378 hrtimer_interrupt+0xfc/0x260 ... The problem is that when a SCHED_DEADLINE task (lock holder) is changed to a lower priority class via sched_setscheduler(), it may fail to properly inherit the parameters of potential DEADLINE donors if it didn't already inherit them in the past (shorter deadline than donor's at that time). This might lead to bandwidth accounting corruption, as enqueue_task_dl() won't recognize the lock holder as boosted. The scenario occurs when: 1. A DEADLINE task (donor) blocks on a PI mutex held by another DEADLINE task (holder), but the holder doesn't inherit parameters (e.g., it already has a shorter deadline) 2. sched_setscheduler() changes the holder from DEADLINE to a lower class while still holding the mutex 3. The holder should now inherit DEADLINE parameters from the donor and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen Fix the issue by introducing __setscheduler_dl_pi(), which detects when a DEADLINE (proper or boosted) task gets setscheduled to a lower priority class. In case, the function makes the task inherit DEADLINE parameters of the donoer (pi_se) and sets ENQUEUE_REPLENISH flag to ensure proper bandwidth accounting during the next enqueue operation. Fixes: 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes") Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com
10 daystime/jiffies: Fix sysctl file error on configurations where USER_HZ < HZGerd Rausch
Commit 2dc164a48e6fd ("sysctl: Create converter functions with two new macros") incorrectly returns error to user space when jiffies sysctl converter is used. The old overflow check got replaced with an unconditional one: + if (USER_HZ < HZ) + return -EINVAL; which will always be true on configurations with "USER_HZ < HZ". Remove the check; it is no longer needed as clock_t_to_jiffies() returns ULONG_MAX for the overflow case and proc_int_u2k_conv_uop() checks for "> INT_MAX" after conversion Fixes: 2dc164a48e6fd ("sysctl: Create converter functions with two new macros") Reported-by: Colm Harrington <colm.harrington@oracle.com> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
11 daystracing: Fix WARN_ON in tracing_buffers_mmap_closeQing Wang
When a process forks, the child process copies the parent's VMAs but the user_mapped reference count is not incremented. As a result, when both the parent and child processes exit, tracing_buffers_mmap_close() is called twice. On the second call, user_mapped is already 0, causing the function to return -ENODEV and triggering a WARN_ON. Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set. But this is only a hint, and the application can call madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the application does that, it can trigger this issue on fork. Fix it by incrementing the user_mapped reference count without re-mapping the pages in the VMA's open callback. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daystracing: Disable preemption in the tracepoint callbacks handling filtered pidsMasami Hiramatsu (Google)
Filtering PIDs for events triggered the following during selftests: [37] event tracing - restricts events based on pid notrace filtering [ 155.874095] [ 155.874869] ============================= [ 155.876037] WARNING: suspicious RCU usage [ 155.877287] 7.0.0-rc1-00004-g8cd473a19bc7 #7 Not tainted [ 155.879263] ----------------------------- [ 155.882839] kernel/trace/trace_events.c:1057 suspicious rcu_dereference_check() usage! [ 155.889281] [ 155.889281] other info that might help us debug this: [ 155.889281] [ 155.894519] [ 155.894519] rcu_scheduler_active = 2, debug_locks = 1 [ 155.898068] no locks held by ftracetest/4364. [ 155.900524] [ 155.900524] stack backtrace: [ 155.902645] CPU: 1 UID: 0 PID: 4364 Comm: ftracetest Not tainted 7.0.0-rc1-00004-g8cd473a19bc7 #7 PREEMPT(lazy) [ 155.902648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 [ 155.902651] Call Trace: [ 155.902655] <TASK> [ 155.902659] dump_stack_lvl+0x67/0x90 [ 155.902665] lockdep_rcu_suspicious+0x154/0x1a0 [ 155.902672] event_filter_pid_sched_process_fork+0x9a/0xd0 [ 155.902678] kernel_clone+0x367/0x3a0 [ 155.902689] __x64_sys_clone+0x116/0x140 [ 155.902696] do_syscall_64+0x158/0x460 [ 155.902700] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 155.902702] ? trace_irq_disable+0x1d/0xc0 [ 155.902709] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 155.902711] RIP: 0033:0x4697c3 [ 155.902716] Code: 1f 84 00 00 00 00 00 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00 [ 155.902718] RSP: 002b:00007ffc41150428 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 [ 155.902721] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004697c3 [ 155.902722] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 [ 155.902724] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000003fccf990 [ 155.902725] R10: 000000003fccd690 R11: 0000000000000246 R12: 0000000000000001 [ 155.902726] R13: 000000003fce8103 R14: 0000000000000001 R15: 0000000000000000 [ 155.902733] </TASK> [ 155.902747] The tracepoint callbacks recently were changed to allow preemption. The event PID filtering callbacks that were attached to the fork and exit tracepoints expected preemption disabled in order to access the RCU protected PID lists. Add a guard(preempt)() to protect the references to the PID list. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260303215738.6ab275af@fedora Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Link: https://patch.msgid.link/20260303131706.96057f61a48a34c43ce1e396@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysftrace: Disable preemption in the tracepoint callbacks handling filtered pidsSteven Rostedt
When function trace PID filtering is enabled, the function tracer will attach a callback to the fork tracepoint as well as the exit tracepoint that will add the forked child PID to the PID filtering list as well as remove the PID that is exiting. Commit a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") removed the disabling of preemption when calling tracepoint callbacks. The callbacks used for the PID filtering accounting depended on preemption being disabled, and now the trigger a "suspicious RCU usage" warning message. Make them explicitly disable preemption. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260302213546.156e3e4f@gandalf.local.home Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
11 daystracing: Fix syscall events activation by ensuring refcount hits zeroHuiwen He
When multiple syscall events are specified in the kernel command line (e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close), they are often not captured after boot, even though they appear enabled in the tracing/set_event file. The issue stems from how syscall events are initialized. Syscall tracepoints require the global reference count (sys_tracepoint_refcount) to transition from 0 to 1 to trigger the registration of the syscall work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1). The current implementation of early_enable_events() with disable_first=true used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B". If multiple syscalls are enabled, the refcount never drops to zero, preventing the 0->1 transition that triggers actual registration. Fix this by splitting early_enable_events() into two distinct phases: 1. Disable all events specified in the buffer. 2. Enable all events specified in the buffer. This ensures the refcount hits zero before re-enabling, allowing syscall events to be properly activated during early boot. The code is also refactored to use a helper function to avoid logic duplication between the disable and enable phases. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn Fixes: ce1039bd3a89 ("tracing: Fix enabling of syscall events on the command line") Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysfgraph: Fix thresh_return nosleeptime double-adjustShengming Hu
trace_graph_thresh_return() called handle_nosleeptime() and then delegated to trace_graph_return(), which calls handle_nosleeptime() again. When sleep-time accounting is disabled this double-adjusts calltime and can produce bogus durations (including underflow). Fix this by computing rettime once, applying handle_nosleeptime() only once, using the adjusted calltime for threshold comparison, and writing the return event directly via __trace_graph_return() when the threshold is met. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260221113314048jE4VRwIyZEALiYByGK0My@zte.com.cn Fixes: 3c9880f3ab52b ("ftrace: Use a running sleeptime instead of saving on shadow stack") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysfgraph: Fix thresh_return clear per-task notraceShengming Hu
When tracing_thresh is enabled, function graph tracing uses trace_graph_thresh_return() as the return handler. Unlike trace_graph_return(), it did not clear the per-task TRACE_GRAPH_NOTRACE flag set by the entry handler for set_graph_notrace addresses. This could leave the task permanently in "notrace" state and effectively disable function graph tracing for that task. Mirror trace_graph_return()'s per-task notrace handling by clearing TRACE_GRAPH_NOTRACE and returning early when set. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260221113007819YgrZsMGABff4Rc-O_fZxL@zte.com.cn Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysbpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shimLang Xu
The root cause of this bug is that when 'bpf_link_put' reduces the refcount of 'shim_link->link.link' to zero, the resource is considered released but may still be referenced via 'tr->progs_hlist' in 'cgroup_shim_find'. The actual cleanup of 'tr->progs_hlist' in 'bpf_shim_tramp_link_release' is deferred. During this window, another process can cause a use-after-free via 'bpf_trampoline_link_cgroup_shim'. Based on Martin KaFai Lau's suggestions, I have created a simple patch. To fix this: Add an atomic non-zero check in 'bpf_trampoline_link_cgroup_shim'. Only increment the refcount if it is not already zero. Testing: I verified the fix by adding a delay in 'bpf_shim_tramp_link_release' to make the bug easier to trigger: static void bpf_shim_tramp_link_release(struct bpf_link *link) { /* ... */ if (!shim_link->trampoline) return; + msleep(100); WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link, shim_link->trampoline, NULL)); bpf_trampoline_put(shim_link->trampoline); } Before the patch, running a PoC easily reproduced the crash(almost 100%) with a call trace similar to KaiyanM's report. After the patch, the bug no longer occurs even after millions of iterations. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3c4ebb0b.46ff8.19abab8abe2.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/279EEE1BA1DDB49D+20260303095217.34436-1-xulang@uniontech.com
11 daysMerge tag 'cgroup-for-7.0-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix circular locking dependency in cpuset partition code by deferring housekeeping_update() calls to a workqueue instead of calling them directly under cpus_read_lock - Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when generate_sched_domains() returns NULL due to kmalloc failure - Fix incorrect cpuset behavior for effective_xcpus in partition_xcpus_del() and cpuset_update_tasks_cpumask() in update_cpumasks_hier() - Fix race between task migration and cgroup iteration * tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed cgroup/cpuset: Clarify exclusion rules for cpuset internal variables cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() cgroup: fix race between task migration and iteration
11 daysMerge tag 'sched_ext-for-7.0-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix starvation of scx_enable() under fair-class saturation by offloading the enable path to an RT kthread - Fix out-of-bounds access in idle mask initialization on systems with non-contiguous NUMA node IDs - Fix a preemption window during scheduler exit and a refcount underflow in cgroup init error path - Fix SCX_EFLAG_INITIALIZED being a no-op flag - Add READ_ONCE() annotations for KCSAN-clean lockless accesses and replace naked scx_root dereferences with container_of() in kobject callbacks - Tooling and selftest fixes: compilation issues with clang 17, strtoul() misuse, unused options cleanup, and Kconfig sync * tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix starvation of scx_enable() under fair-class saturation sched_ext: Remove redundant css_put() in scx_cgroup_init() selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17 selftests/sched_ext: Add -fms-extensions to bpf build flags tools/sched_ext: Add -fms-extensions to bpf build flags sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout sched_ext: Replace naked scx_root dereferences in kobject callbacks sched_ext: Use READ_ONCE() for the read side of dsq->nr update tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq() sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag sched_ext: Fix out-of-bounds access in scx_idle_init_masks() sched_ext: Disable preemption between scx_claim_exit() and kicking helper work tools/sched_ext: Add Kconfig to sync with upstream tools/sched_ext: Sync README.md Kconfig with upstream scx selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c tools/sched_ext: scx_sdt: Remove unused '-f' option tools/sched_ext: scx_central: Remove unused '-p' option selftests/sched_ext: Fix unused-result warning for read() selftests/sched_ext: Abort test loop on signal
11 dayssched_ext: Fix starvation of scx_enable() under fair-class saturationTejun Heo
During scx_enable(), the READY -> ENABLED task switching loop changes the calling thread's sched_class from fair to ext. Since fair has higher priority than ext, saturating fair-class workloads can indefinitely starve the enable thread, hanging the system. This was introduced when the enable path switched from preempt_disable() to scx_bypass() which doesn't protect against fair-class starvation. Note that the original preempt_disable() protection wasn't complete either - in partial switch modes, the calling thread could still be starved after preempt_enable() as it may have been switched to ext class. Fix it by offloading the enable body to a dedicated system-wide RT (SCHED_FIFO) kthread which cannot be starved by either fair or ext class tasks. scx_enable() lazily creates the kthread on first use and passes the ops pointer through a struct scx_enable_cmd containing the kthread_work, then synchronously waits for completion. The workfn runs on a different kthread from sch->helper (which runs disable_work), so it can safely flush disable_work on the error path without deadlock. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
11 dayssched_ext: Remove redundant css_put() in scx_cgroup_init()Cheng-Yang Chou
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy under cgroup_lock(). It does not increment the reference counts on yielded css structs. According to the cgroup documentation, css_put() should only be used to release a reference obtained via css_get() or css_tryget_online(). Since the iterator does not use either of these to acquire a reference, calling css_put() in the error path of scx_cgroup_init() causes a refcount underflow. Remove the unbalanced css_put() to prevent a potential Use-After-Free (UAF) vulnerability. Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
11 dayssched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeoutzhidao su
scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable(): WRITE_ONCE(scx_watchdog_timeout, timeout); However, three read-side accesses use plain reads without the matching READ_ONCE(): /* check_rq_for_timeouts() - L2824 */ last_runnable + scx_watchdog_timeout /* scx_watchdog_workfn() - L2852 */ scx_watchdog_timeout / 2 /* scx_enable() - L5179 */ scx_watchdog_timeout / 2 The KCSAN documentation requires that if one accessor uses WRITE_ONCE() to annotate lock-free access, all other accesses must also use the appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair incomplete and can trigger KCSAN warnings. Note that scx_tick() already uses the correct READ_ONCE() annotation: last_check + READ_ONCE(scx_watchdog_timeout) Fix the three remaining plain reads to match, making all accesses to scx_watchdog_timeout consistently annotated and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
12 daysftrace: Add missing ftrace_lock to update_ftrace_direct_add/delJiri Olsa
Ihor and Kumar reported splat from ftrace_get_addr_curr [1], which happened because of the missing ftrace_lock in update_ftrace_direct_add/del functions allowing concurrent access to ftrace internals. The ftrace_update_ops function must be guarded by ftrace_lock, adding that. Fixes: 05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function") Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function") Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Closes: https://lore.kernel.org/bpf/1b58ffb2-92ae-433a-ba46-95294d6edea2@linux.dev/ Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260302081622.165713-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 dayssched_ext: Replace naked scx_root dereferences in kobject callbackszhidao su
scx_attr_ops_show() and scx_uevent() access scx_root->ops.name directly. This is problematic for two reasons: 1. The file-level comment explicitly identifies naked scx_root dereferences as a temporary measure that needs to be replaced with proper per-instance access. 2. scx_attr_events_show(), the neighboring sysfs show function in the same group, already uses the correct pattern: struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); Having inconsistent access patterns in the same sysfs/uevent group is error-prone. The kobject embedded in struct scx_sched is initialized as: kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); so container_of(kobj, struct scx_sched, kobj) correctly retrieves the owning scx_sched instance in both callbacks. Replace the naked scx_root dereferences with container_of()-based access, consistent with scx_attr_events_show() and in preparation for proper multi-instance scx_sched support. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
12 dayssched_ext: Use READ_ONCE() for the read side of dsq->nr updatezhidao su
scx_bpf_dsq_nr_queued() reads dsq->nr via READ_ONCE() without holding any lock, making dsq->nr a lock-free concurrently accessed variable. However, dsq_mod_nr(), the sole writer of dsq->nr, only uses WRITE_ONCE() on the write side without the matching READ_ONCE() on the read side: WRITE_ONCE(dsq->nr, dsq->nr + delta); ^^^^^^^ plain read -- KCSAN data race The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain read on the right-hand side of WRITE_ONCE() leaves the pair incomplete and will trigger KCSAN warnings. Fix by using READ_ONCE() for the read side of the update: WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta); This is consistent with scx_bpf_dsq_nr_queued() and makes the concurrent access annotation complete and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>