path: root/kernel
Age | Commit message | Author
2023-11-02  perf/core: Fix potential NULL deref  (Peter Zijlstra)
commit a71ef31485bb51b846e8db8b3a35e432cc15afb5 upstream. Smatch is awesome. Fixes: 32671e3799ca ("perf: Disallow mis-matched inherited group reads") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-02  tracing/kprobes: Fix the description of variable length arguments  (Yujie Liu)
commit e0f831836cead677fb07d54bd6bf499df35640c2 upstream. Fix the following kernel-doc warnings: kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start' kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields' Refer to the usage of variable length arguments elsewhere in the kernel code, "@..." is the proper way to express it in the description. Link: https://lore.kernel.org/all/20231027041315.2613166-1-yujie.liu@intel.com/ Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310190437.paI6LYJF-lkp@intel.com/ Signed-off-by: Yujie Liu <yujie.liu@intel.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
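For illustration, a minimal kernel-doc sketch (the function and parameter names are hypothetical, not the actual trace_kprobe.c code) showing the "@...:" form the fix switches to for variadic arguments:

    /**
     * example_gen_cmd_start - build a command string from variadic args
     * @cmd: dynamic command buffer being filled in (hypothetical)
     * @name: name of the command being generated (hypothetical)
     * @...: NULL-terminated list of argument strings
     *
     * Describing the trailing arguments as "@..." rather than "@args"
     * keeps kernel-doc from warning about an excess parameter.
     */
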
2023-11-02  tracing/kprobes: Fix symbol counting logic by looking at modules as well  (Andrii Nakryiko)
commit 926fe783c8a64b33997fec405cf1af3e61aed441 upstream. Recent changes to count the number of matching symbols when creating a kprobe event failed to take kernel modules into account. As such, they break kprobes on kernel module symbols by assuming there is no match. Fix this by calling module_kallsyms_on_each_symbol() in addition to kallsyms_on_each_match_symbol() to perform a proper count. Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/ Cc: Francis Laniel <flaniel@linux.microsoft.com> Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25  tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols  (Francis Laniel)
commit b022f0c7e404887a7c5229788fc99eff9f9a80d5 upstream. When a kprobe is attached to a function whose name is not unique (it is static and shares its name with other functions in the kernel), the kprobe is attached to the first function found. This is a bug, because the function it attaches to is not necessarily the one the user wants to attach to. Instead of blindly picking one of the ambiguous functions, return EADDRNOTAVAIL to let the user know that this function is not unique, and that they must instead use another unique function with an address offset to reach the function they want to attach to. Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/ Cc: stable@vger.kernel.org Fixes: 413d37d1eb69 ("tracing: Add kprobe-based event tracer") Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com> Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25  perf: Disallow mis-matched inherited group reads  (Peter Zijlstra)
commit 32671e3799ca2e4590773fd0e63aaa4229e50c06 upstream. Because group consistency is non-atomic between parent (filedesc) and children (inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum non-matching counter groups -- with non-sensical results. Add group_generation to distinguish the case where a parent group removes and adds an event and thus has the same number, but a different configuration of events as inherited groups. This became a problem when commit fa8c269353d5 ("perf/core: Invert perf_read_group() loops") flipped the order of child_list and sibling_list. Previously it would iterate the group (sibling_list) first, and for each sibling traverse the child_list. In this order, only the group composition of the parent is relevant. By flipping the order the group composition of the child (inherited) events becomes an issue and the mis-match in group composition becomes evident. That said; even prior to this commit, while reading of a group that is not equally inherited was not broken, it still made no sense. (Ab)use ECHILD as error return to indicate issues with child process group composition. Fixes: fa8c269353d5 ("perf/core: Invert perf_read_group() loops") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
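A rough sketch of the generation-counter idea described above (struct layout and function names are illustrative, not the actual perf code): the parent bumps a counter on every add/remove, and a grouped read refuses to sum when the child's composition no longer matches.

    #include <errno.h>

    struct group {
        int nr_members;
        unsigned int generation;    /* bumped on every add/remove */
    };

    static void group_add(struct group *g)    { g->nr_members++; g->generation++; }
    static void group_remove(struct group *g) { g->nr_members--; g->generation++; }

    /* Reader side: the same member count is not enough, generations must match too. */
    static int group_read_check(const struct group *parent, const struct group *child)
    {
        if (parent->nr_members != child->nr_members ||
            parent->generation != child->generation)
            return -ECHILD;         /* composition changed, refuse the summed read */
        return 0;
    }
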
2023-10-25  cpufreq: schedutil: Update next_freq when cpufreq_limits change  (Xuewen Yan)
[ Upstream commit 9e0bc36ab07c550d791bf17feeb479f1dfc42d89 ] When cpufreq's policy is 'single', there is a scenario in which sg_policy's next_freq cannot be updated. When the CPU's util is always max, cpufreq runs at the maximum frequency. If we then change the policy's scaling_max_freq to a lower frequency, sg_policy's next_freq should change to that lower frequency; however, because the CPU is busy, next_freq stays at max_freq. For example, with cpu7 as a single-CPU policy:

  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done&
  [1] 4737
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
  pid 4737's current affinity mask: ff
  pid 4737's new affinity mask: 80
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2171000

At this time, the sg_policy's next_freq stays at 2301000, which is wrong. To fix this, add a check for the ->need_freq_update flag. [ mingo: Clarified the changelog. ] Co-developed-by: Guohua Yan <guohua.yan@unisoc.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Guohua Yan <guohua.yan@unisoc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25  tracing: relax trace_event_eval_update() execution with cond_resched()  (Clément Léger)
[ Upstream commit 23cce5f25491968b23fb9c399bbfb25f13870cd9 ] When the kernel is compiled without preemption, eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted until it completes. This can actually cause a problem, since if another CPU calls stop_machine(), that call will have to wait for eval_map_work_func() to finish executing in the workqueue before it can be scheduled. This problem was observed on an SMP system at boot time, when the CPU running the initcalls executed clocksource_done_booting(), which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously, as before, without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Clément Léger <cleger@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
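The pattern is simply a voluntary reschedule point inside the long loop; a minimal sketch (the loop body and names are illustrative, only cond_resched() is the real kernel API):

    /* Walk a long list of entries; on a non-preemptible kernel nothing else
     * runs on this CPU until we yield, so offer a reschedule each iteration. */
    static void eval_update_all(struct map_entry *maps, int nr_maps)
    {
        int i;

        for (i = 0; i < nr_maps; i++) {
            update_one_map(&maps[i]);   /* hypothetical per-entry work */
            cond_resched();             /* let stop_machine() etc. make progress */
        }
    }
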
2023-10-25  fprobe: Fix to ensure the number of active retprobes is not zero  (Masami Hiramatsu (Google))
commit 700b2b439766e8aab8a7174991198497345bd411 upstream. The number of active retprobes can be zero but it is not acceptable, so return EINVAL error if detected. Link: https://lore.kernel.org/all/169750018550.186853.11198884812017796410.stgit@devnote2/ Reported-by: wuqiang.matt <wuqiang.matt@bytedance.com> Closes: https://lore.kernel.org/all/20231016222103.cb9f426edc60220eabd8aa6a@kernel.org/ Fixes: 5b0ab78998e3 ("fprobe: Add exit_handler support") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25  audit,io_uring: io_uring openat triggers audit reference count underflow  (Dan Clash)
commit 03adc61edad49e1bbecfb53f7ea5d78f398fe368 upstream. An io_uring openat operation can update an audit reference count from multiple threads, resulting in the call trace below. A call to io_uring_submit() with a single openat op with a flag of IOSQE_ASYNC results in the following reference count updates. The first part of the system call performs two increments that do not race:

  do_syscall_64()
    __do_sys_io_uring_enter()
      io_submit_sqes()
        io_openat_prep()
          __io_openat_prep()
            getname()
              getname_flags()        /* update 1 (increment) */
                __audit_getname()    /* update 2 (increment) */

The openat op is queued to an io_uring worker thread, which opens the opportunity for a race. The system call exit performs one decrement:

  do_syscall_64()
    syscall_exit_to_user_mode()
      syscall_exit_to_user_mode_prepare()
        __audit_syscall_exit()
          audit_reset_context()
            putname()                /* update 3 (decrement) */

The io_uring worker thread performs one increment and two decrements. These updates can race with the system call decrement:

  io_wqe_worker()
    io_worker_handle_work()
      io_wq_submit_work()
        io_issue_sqe()
          io_openat()
            io_openat2()
              do_filp_open()
                path_openat()
                  __audit_inode()    /* update 4 (increment) */
              putname()              /* update 5 (decrement) */
          __audit_uring_exit()
            audit_reset_context()
              putname()              /* update 6 (decrement) */

The fix is to change the refcnt member of struct audit_names from int to atomic_t.

  kernel BUG at fs/namei.c:262!
  Call Trace:
  ...
   ? putname+0x68/0x70
   audit_reset_context.part.0.constprop.0+0xe1/0x300
   __audit_uring_exit+0xda/0x1c0
   io_issue_sqe+0x1f3/0x450
   ? lock_timer_base+0x3b/0xd0
   io_wq_submit_work+0x8d/0x2b0
   ? __try_to_del_timer_sync+0x67/0xa0
   io_worker_handle_work+0x17c/0x2b0
   io_wqe_worker+0x10a/0x350

Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/ Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring") Signed-off-by: Dan Clash <daclash@linux.microsoft.com> Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
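The shape of the fix, as described above, is switching the reference count to an atomic type so that concurrent updates from the syscall path and the io_uring worker cannot be lost. A simplified sketch (not the actual audit_names layout; free_it() is hypothetical):

    #include <linux/atomic.h>

    struct audit_name_ref {
        atomic_t refcnt;            /* was a plain int before the fix */
    };

    static void name_get(struct audit_name_ref *n)
    {
        atomic_inc(&n->refcnt);     /* safe against a racing putname() on another thread */
    }

    static void name_put(struct audit_name_ref *n)
    {
        if (atomic_dec_and_test(&n->refcnt))
            free_it(n);             /* hypothetical release of the entry */
    }
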
2023-10-19  cgroup: Remove duplicates in cgroup v1 tasks file  (Michal Koutný)
commit 1ca0b605150501b7dc59f3016271da4eb3e96fce upstream. One PID may appear multiple times in a preloaded pidlist. (Possibly due to PID recycling, but we have reports of the same task_struct appearing with different PIDs, thus possibly involving transfer of PID via de_thread().) Because the v1 seq_file iterator uses PIDs as position, it leads to a message:

  > seq_file: buggy .next function kernfs_seq_next did not update position index

A conservative and quick fix consists of removing duplicates from the `tasks` file (as opposed to removing pidlists altogether). It doesn't affect correctness (it's sufficient to show a PID once), and the performance impact is hidden by the unconditional sorting of the pidlist already in place (asymptotically). Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/ Suggested-by: Firo Yang <firo.yang@suse.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
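Since the pidlist is already sorted unconditionally, removing duplicates is a single linear in-place pass; a minimal sketch of that idea (not the actual cgroup code):

    /* Compact a sorted array of PIDs in place, dropping repeated values.
     * Returns the new length. Mirrors the "show each PID once" fix above. */
    static int dedup_sorted(int *pids, int len)
    {
        int src, dst = 0;

        for (src = 0; src < len; src++)
            if (dst == 0 || pids[src] != pids[dst - 1])
                pids[dst++] = pids[src];
        return dst;
    }
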
2023-10-19  workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()  (Waiman Long)
[ Upstream commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7 ] Commit 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") enabled an implicit ordered attribute to be added to WQ_UNBOUND workqueues with max_active of 1. This prevented changing the attributes of these workqueues, leading to fix commit 0a94efb5acbb ("workqueue: implicit ordered attribute should be overridable"). However, workqueue_apply_unbound_cpumask() was not updated at that time, so sysfs changes to wq_unbound_cpumask have no effect on WQ_UNBOUND workqueues with the implicit ordered attribute. Since not all WQ_UNBOUND workqueues are visible in sysfs, we are not able to make all the necessary cpumask changes even if we iterate over all the workqueue cpumasks in sysfs and change them one by one. Fix this problem by applying the corresponding change made to apply_workqueue_attrs_locked() in the fix commit to workqueue_apply_unbound_cpumask(). Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-19  bpf: Fix verifier log for async callback return values  (David Vernet)
[ Upstream commit 829955981c557c7fc7416581c4cd68a8a0c28620 ] The verifier, as part of check_return_code(), verifies that async callbacks such as from e.g. timers, will return 0. It does this by correctly checking that R0->var_off is in tnum_const(0), which effectively checks that it's in a range of 0. If this condition fails, however, it prints an error message which says that the value should have been in (0x0; 0x1). This results in possibly confusing output such as the following in which an async callback returns 1: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1) The fix is easy -- we should just pass the tnum_const(0) as the correct range to verbose_invalid_scalar(), which will then print the following: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0) Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10  bpf: unconditionally reset backtrack_state masks on global func exit  (Andrii Nakryiko)
[ Upstream commit 81335f90e8a88b81932df011105c46e708744f44 ] In mark_chain_precision() logic, when we reach the entry to a global func, it is expected that R1-R5 might be still requested to be marked precise. This would correspond to some integer input arguments being tracked as precise. This is all expected and handled as a special case. What's not expected is that we'll leave backtrack_state structure with some register bits set. This is because for subsequent precision propagations backtrack_state is reused without clearing masks, as all code paths are carefully written in a way to leave empty backtrack_state with zeroed out masks, for speed. The fix is trivial, we always clear register bit in the register mask, and then, optionally, set reg->precise if register is SCALAR_VALUE type. Reported-by: Chris Mason <clm@meta.com> Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06  tracing/user_events: Align set_bit() address for all archs  (Beau Belgrave)
commit 2de9ee94054263940122aee8720e902b30c27930 upstream. All architectures should use a long aligned address passed to set_bit(). User processes can pass either a 32-bit or 64-bit sized value to be updated when tracing is enabled when on a 64-bit kernel. Both cases are ensured to be naturally aligned, however, that is not enough. The address must be long aligned without affecting checks on the value within the user process which require different adjustments for the bit for little and big endian CPUs. Add a compat flag to user_event_enabler that indicates when a 32-bit value is being used on a 64-bit kernel. Long align addresses and correct the bit to be used by set_bit() to account for this alignment. Ensure compat flags are copied during forks and used during deletion clears. Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/ Cc: stable@vger.kernel.org Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Reported-by: Clément Léger <cleger@rivosinc.com> Suggested-by: Clément Léger <cleger@rivosinc.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06  bpf: Add override check to kprobe multi link attach  (Jiri Olsa)
commit 41bc46c12a8053a1b3279a379bd6b5e87b045b85 upstream. Currently the multi_kprobe link attach does not check error injection list for programs with bpf_override_return helper and allows them to attach anywhere. Adding the missing check. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20230907200652.926951-1-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06  ring-buffer: Update "shortest_full" in polling  (Steven Rostedt (Google))
commit 1e0cb399c7653462d9dadf8ab9425337c355d358 upstream. It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the constant waking of the writer, it would wake up over 11,000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall <julia.lawall@inria.fr> Tested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06  ring-buffer: Fix bytes info in per_cpu buffer stats  (Zheng Yejian)
commit 45d99ea451d0c30bfd4864f0fe485d7dac014902 upstream. The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of bytes in the cpu buffer that have not been consumed. However, currently after consuming data by reading file 'trace_pipe', the 'bytes' info was not changed as expected.

  # cat per_cpu/cpu0/stats
  entries: 0
  overrun: 0
  commit overrun: 0
  bytes: 568             <--- 'bytes' is problematical !!!
  oldest event ts:  8651.371479
  now ts:  8653.912224
  dropped events: 0
  read events: 8

The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it:
  1. When stat 'read_bytes', account consumed event in rb_advance_reader();
  2. When stat 'entries_bytes', exclude the discarded padding event which is smaller than minimum size because it is invisible to reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for page-based read/remove/overrun.

Also correct the comments of ring_buffer_bytes_cpu() in this patch. Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06  sched/rt: Fix live lock between select_fallback_rq() and RT push  (Joel Fernandes (Google))
commit fc09027786c900368de98d03d40af058bcb01ad9 upstream. During RCU-boost testing with the TREE03 rcutorture config, I found that after a few hours, the machine locks up. On tracing, I found that there is a live lock happening between 2 CPUs. One CPU has an RT task running, while another CPU is being offlined which also has an RT task running. During this offlining, all threads are migrated. The migration thread is repeatedly scheduled to migrate actively running tasks on the CPU being offlined. This results in a live lock because select_fallback_rq() keeps picking the CPU that an RT task is already running on only to get pushed back to the CPU being offlined. It is anyway pointless to pick CPUs for pushing tasks to if they are being offlined only to get migrated away to somewhere else. This could also add unwanted latency to this task. Fix these issues by not selecting CPUs in RT if they are not 'active' for scheduling, using the cpu_active_mask. Other parts in core.c already use cpu_active_mask to prevent tasks from being put on CPUs going offline. With this fix I ran the tests for days and could not reproduce the hang. Without the patch, I hit it in a few hours. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
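A sketch of the approach described above, restricting RT push targets to CPUs that are still active for scheduling (the wrapper function is illustrative; cpumask_and(), cpumask_empty(), cpumask_first() and cpu_active_mask are the real kernel APIs):

    static int pick_push_target(struct cpumask *lowest_mask)
    {
        /* Drop CPUs that are going offline; pushing a task there is pointless
         * since it would immediately be migrated away again. */
        cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
        if (cpumask_empty(lowest_mask))
            return -1;              /* nowhere sensible to push to */
        return cpumask_first(lowest_mask);
    }
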
2023-10-06  kernel/sched: Modify initial boot task idle setup  (Liam R. Howlett)
commit cff9b2332ab762b7e0586c793c431a8f2ea4db04 upstream. Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early. This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up. Fix this issue by setting the PF_IDLE flag later in the boot sequence. Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06  ring-buffer: Do not attempt to read past "commit"  (Steven Rostedt (Google))
[ Upstream commit 95a404bd60af6c4d9d8db01ad14fe8957ece31ca ] When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There are barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks that detect corruption of reads by the writer need to see the length of the event. If the length in the first 4 bytes is zero, then the length is stored in the second 4 bytes. But if the writer is in the process of updating that length, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes, which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06  ring-buffer: Avoid softlockup in ring_buffer_resize()  (Zheng Yejian)
[ Upstream commit f6bd2c92488c30ef53b5bd80c52f0a7eee9d545a ] When a user resizes all trace ring buffers through the file 'buffer_size_kb', ring_buffer_resize() allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and many buffer pages to be allocated, it may not give up the cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06  dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lock  (Sergey Senozhatsky)
[ Upstream commit fb5a4315591dae307a65fc246ca80b5159d296e1 ] __dma_entry_alloc_check_leak() calls into printk -> serial console output (qcom geni) and grabs port->lock under free_entries_lock spin lock, which is a reverse locking dependency chain as qcom_geni IRQ handler can call into dma-debug code and grab free_entries_lock under port->lock. Move __dma_entry_alloc_check_leak() call out of free_entries_lock scope so that we don't acquire serial console's port->lock under it. Trimmed-down lockdep splat:

  The existing dependency chain (in reverse order) is:

  -> #2 (free_entries_lock){-.-.}-{2:2}:
         _raw_spin_lock_irqsave+0x60/0x80
         dma_entry_alloc+0x38/0x110
         debug_dma_map_page+0x60/0xf8
         dma_map_page_attrs+0x1e0/0x230
         dma_map_single_attrs.constprop.0+0x6c/0xc8
         geni_se_rx_dma_prep+0x40/0xcc
         qcom_geni_serial_isr+0x310/0x510
         __handle_irq_event_percpu+0x110/0x244
         handle_irq_event_percpu+0x20/0x54
         handle_irq_event+0x50/0x88
         handle_fasteoi_irq+0xa4/0xcc
         handle_irq_desc+0x28/0x40
         generic_handle_domain_irq+0x24/0x30
         gic_handle_irq+0xc4/0x148
         do_interrupt_handler+0xa4/0xb0
         el1_interrupt+0x34/0x64
         el1h_64_irq_handler+0x18/0x24
         el1h_64_irq+0x64/0x68
         arch_local_irq_enable+0x4/0x8
         ____do_softirq+0x18/0x24
         ...

  -> #1 (&port_lock_key){-.-.}-{2:2}:
         _raw_spin_lock_irqsave+0x60/0x80
         qcom_geni_serial_console_write+0x184/0x1dc
         console_flush_all+0x344/0x454
         console_unlock+0x94/0xf0
         vprintk_emit+0x238/0x24c
         vprintk_default+0x3c/0x48
         vprintk+0xb4/0xbc
         _printk+0x68/0x90
         register_console+0x230/0x38c
         uart_add_one_port+0x338/0x494
         qcom_geni_serial_probe+0x390/0x424
         platform_probe+0x70/0xc0
         really_probe+0x148/0x280
         __driver_probe_device+0xfc/0x114
         driver_probe_device+0x44/0x100
         __device_attach_driver+0x64/0xdc
         bus_for_each_drv+0xb0/0xd8
         __device_attach+0xe4/0x140
         device_initial_probe+0x1c/0x28
         bus_probe_device+0x44/0xb0
         device_add+0x538/0x668
         of_device_add+0x44/0x50
         of_platform_device_create_pdata+0x94/0xc8
         of_platform_bus_create+0x270/0x304
         of_platform_populate+0xac/0xc4
         devm_of_platform_populate+0x60/0xac
         geni_se_probe+0x154/0x160
         platform_probe+0x70/0xc0
         ...

  -> #0 (console_owner){-...}-{0:0}:
         __lock_acquire+0xdf8/0x109c
         lock_acquire+0x234/0x284
         console_flush_all+0x330/0x454
         console_unlock+0x94/0xf0
         vprintk_emit+0x238/0x24c
         vprintk_default+0x3c/0x48
         vprintk+0xb4/0xbc
         _printk+0x68/0x90
         dma_entry_alloc+0xb4/0x110
         debug_dma_map_sg+0xdc/0x2f8
         __dma_map_sg_attrs+0xac/0xe4
         dma_map_sgtable+0x30/0x4c
         get_pages+0x1d4/0x1e4 [msm]
         msm_gem_pin_pages_locked+0x38/0xac [msm]
         msm_gem_pin_vma_locked+0x58/0x88 [msm]
         msm_ioctl_gem_submit+0xde4/0x13ac [msm]
         drm_ioctl_kernel+0xe0/0x15c
         drm_ioctl+0x2e8/0x3f4
         vfs_ioctl+0x30/0x50
         ...

  Chain exists of:
    console_owner --> &port_lock_key --> free_entries_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(free_entries_lock);
                                lock(&port_lock_key);
                                lock(free_entries_lock);
   lock(console_owner);

   *** DEADLOCK ***

  Call trace:
   dump_backtrace+0xb4/0xf0
   show_stack+0x20/0x30
   dump_stack_lvl+0x60/0x84
   dump_stack+0x18/0x24
   print_circular_bug+0x1cc/0x234
   check_noncircular+0x78/0xac
   __lock_acquire+0xdf8/0x109c
   lock_acquire+0x234/0x284
   console_flush_all+0x330/0x454
   console_unlock+0x94/0xf0
   vprintk_emit+0x238/0x24c
   vprintk_default+0x3c/0x48
   vprintk+0xb4/0xbc
   _printk+0x68/0x90
   dma_entry_alloc+0xb4/0x110
   debug_dma_map_sg+0xdc/0x2f8
   __dma_map_sg_attrs+0xac/0xe4
   dma_map_sgtable+0x30/0x4c
   get_pages+0x1d4/0x1e4 [msm]
   msm_gem_pin_pages_locked+0x38/0xac [msm]
   msm_gem_pin_vma_locked+0x58/0x88 [msm]
   msm_ioctl_gem_submit+0xde4/0x13ac [msm]
   drm_ioctl_kernel+0xe0/0x15c
   drm_ioctl+0x2e8/0x3f4
   vfs_ioctl+0x30/0x50
   ...
Reported-by: Rob Clark <robdclark@chromium.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
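The general shape of the fix is to note the condition while holding the spinlock and only call into printk after dropping it; a hedged sketch of that pattern (helper names are illustrative, not the actual dma-debug code):

    static DEFINE_SPINLOCK(free_entries_lock);

    static struct dma_debug_entry *entry_alloc(void)
    {
        struct dma_debug_entry *entry;
        bool report_leak = false;
        unsigned long flags;

        spin_lock_irqsave(&free_entries_lock, flags);
        entry = take_free_entry();              /* hypothetical: pop from the free list */
        if (too_many_outstanding())             /* hypothetical leak heuristic */
            report_leak = true;                 /* just remember it for now */
        spin_unlock_irqrestore(&free_entries_lock, flags);

        if (report_leak)
            pr_err("dma-debug: possible entry leak\n");  /* printk outside the lock */
        return entry;
    }

This keeps the serial console's port->lock from ever being taken while free_entries_lock is held, breaking the circular dependency in the splat above.
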
2023-10-06  swiotlb: use the calculated number of areas  (Ross Lagerwall)
[ Upstream commit a6a241764f69c62d23fc6960920cc662ae4069e9 ] Commit 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size") calculated the reduced number of areas in swiotlb_init_remap() but didn't actually use the value. Replace usage of default_nareas accordingly. Fixes: 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06  bpf: Fix an erroneous check after snprintf()  (Christophe JAILLET)
[ Upstream commit a8f12572860ad8ba659d96eee9cf09e181f6ebcc ] snprintf() does not return negative error code on error, it returns the number of characters which *would* be generated for the given input. Fix the error handling check. Fixes: 57539b1c0ac2 ("bpf: Enable annotating trusted nested pointers") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/393bdebc87b22563c08ace094defa7160eb7a6c0.1694190795.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
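As the commit notes, snprintf() reports the length the full string would have needed, so the useful check is for truncation against the buffer size rather than for a negative return. A small self-contained sketch of the corrected pattern (the format string and function are illustrative):

    #include <stdio.h>
    #include <errno.h>

    /* Build "prefix_<name>" into buf; fail if it would not fit. */
    static int build_name(char *buf, size_t size, const char *name)
    {
        int len = snprintf(buf, size, "prefix_%s", name);

        /* snprintf() returns the would-be length, not a negative errno,
         * so truncation is the condition that must be checked. */
        if (len < 0 || (size_t)len >= size)
            return -ENAMETOOLONG;
        return 0;
    }
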
2023-10-06  bpf: Avoid dummy bpf_offload_netdev in __bpf_prog_dev_bound_init  (Eduard Zingerman)
[ Upstream commit 1a49f4195d3498fe458a7f5ff7ec5385da70d92e ] Fix for a bug observable under the following sequence of events: 1. Create a network device that does not support XDP offload. 2. Load a device bound XDP program with BPF_F_XDP_DEV_BOUND_ONLY flag (such programs are not offloaded). 3. Load a device bound XDP program with zero flags (such programs are offloaded). At step (2) __bpf_prog_dev_bound_init() associates with device (1) a dummy bpf_offload_netdev struct with .offdev field set to NULL. At step (3) __bpf_prog_dev_bound_init() would reuse dummy struct allocated at step (2). However, downstream usage of the bpf_offload_netdev assumes that .offdev field can't be NULL, e.g. in bpf_prog_offload_verifier_prep(). Adjust __bpf_prog_dev_bound_init() to require bpf_offload_netdev with non-NULL .offdev for offloaded BPF programs. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Reported-by: syzbot+291100dcb32190ec02a8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/000000000000d97f3c060479c4f8@google.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230912005539.2248244-2-eddyz87@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06  bpf: Avoid deadlock when using queue and stack maps from NMI  (Toke Høiland-Jørgensen)
[ Upstream commit a34a9f1a19afe9c60ca0ea61dfeee63a1c2baac8 ] Sysbot discovered that the queue and stack maps can deadlock if they are being used from a BPF program that can be called from NMI context (such as one that is attached to a perf HW counter event). To fix this, add an in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if grabbing the lock fails. Fixes: f1a2e44a3aec ("bpf: add queue and stack maps") Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Tested-by: Hsin-Wei Hung <hsinweih@uci.edu> Co-developed-by: Hsin-Wei Hung <hsinweih@uci.edu> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
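The fix pattern described above, sketched: in NMI context only try the lock and bail out with -EBUSY instead of spinning (the map structure and push helper are illustrative; in_nmi() and the raw spinlock calls are real kernel APIs):

    struct my_map {
        raw_spinlock_t lock;
        /* ... queue storage ... */
    };

    static long queue_map_push(struct my_map *m, void *value)
    {
        unsigned long flags;

        if (in_nmi()) {
            /* The NMI may have interrupted a path on this CPU that already
             * holds the lock; spinning here would deadlock. */
            if (!raw_spin_trylock_irqsave(&m->lock, flags))
                return -EBUSY;
        } else {
            raw_spin_lock_irqsave(&m->lock, flags);
        }

        /* ... enqueue value ... */

        raw_spin_unlock_irqrestore(&m->lock, flags);
        return 0;
    }
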
2023-09-23  tracing: Have option files inc the trace array ref count  (Steven Rostedt (Google))
commit 7e2cfbd2d3c86afcd5c26b5c4b1dd251f63c5838 upstream. The option files update the options for a given trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when an option file is opened. Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
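The recurring pattern in this series of fixes is: take a reference on the trace_array in the file's open handler and drop it in release, so the instance cannot be freed while the file is open. A hedged sketch (the handler names are illustrative; tracing_check_open_get_tr() and trace_array_put() are the existing helpers named in the Fixes tag):

    static int option_open(struct inode *inode, struct file *filp)
    {
        struct trace_array *tr = inode->i_private;
        int ret;

        ret = tracing_check_open_get_tr(tr);    /* takes a ref, fails if tracing is disabled */
        if (ret)
            return ret;
        filp->private_data = tr;
        return 0;
    }

    static int option_release(struct inode *inode, struct file *filp)
    {
        struct trace_array *tr = filp->private_data;

        trace_array_put(tr);                    /* drop the reference taken at open */
        return 0;
    }
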
2023-09-23  tracing: Have current_trace inc the trace array ref count  (Steven Rostedt (Google))
commit 9b37febc578b2e1ad76a105aab11d00af5ec3d27 upstream. The current_trace updates the trace array tracer. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace array when current_trace is opened. Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23  tracing: Increase trace array ref count on enable and filter files  (Steven Rostedt (Google))
commit f5ca233e2e66dc1c249bf07eefa37e34a6c9346a upstream. When the trace event enable and filter files are opened, increment the trace array ref counter, otherwise they can be accessed when the trace array is being deleted. The ref counter keeps the trace array from being deleted while those files are opened. Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23  tracing/synthetic: Print out u64 values properly  (Tero Kristo)
commit 62663b849662c1a5126b6274d91671b90566ef13 upstream. The synth traces incorrectly print pointer to the synthetic event values instead of the actual value when using u64 type. Fix by addressing the contents of the union properly. Link: https://lore.kernel.org/linux-trace-kernel/20230911141704.3585965-1-tero.kristo@linux.intel.com Fixes: ddeea494a16f ("tracing/synthetic: Use union instead of casts") Cc: stable@vger.kernel.org Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23  tracing: Have event inject files inc the trace array ref count  (Steven Rostedt (Google))
commit e5c624f027ac74f97e97c8f36c69228ac9f1102d upstream. The event inject files add events for a specific trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when a event inject file is opened. Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 6c3edaf9fd6a ("tracing: Introduce trace event injection") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23  tracing: Have tracing_max_latency inc the trace array ref count  (Steven Rostedt (Google))
commit 7d660c9b2bc95107f90a9f4c4759be85309a6550 upstream. The tracing_max_latency file points to the trace_array max_latency field. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when tracing_max_latency is opened. Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23  panic: Reenable preemption in WARN slowpath  (Lukas Wunner)
[ Upstream commit cccd32816506cbac3a4c65d9dff51b3125ef1a03 ] Commit: 5a5d7e9badd2 ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG") amended warn_slowpath_fmt() to disable preemption until the WARN splat has been emitted. However the commit neglected to reenable preemption in the !fmt codepath, i.e. when a WARN splat is emitted without additional format string. One consequence is that users may see more splats than intended. E.g. a WARN splat emitted in a work item results in at least two extra splats: BUG: workqueue leaked lock or atomic (emitted by process_one_work()) BUG: scheduling while atomic (emitted by worker_thread() -> schedule()) Ironically the point of the commit was to *avoid* extra splats. ;) Fix it. Fixes: 5a5d7e9badd2 ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG") Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/3ec48fde01e4ee6505f77908ba351bad200ae3d1.1694763684.git.lukas@wunner.de Signed-off-by: Sasha Levin <sashal@kernel.org>
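The omission described above amounts to an early-return path that skips the cleanup; a simplified sketch of the balanced form (not the actual warn_slowpath_fmt() body, emit_warn() is hypothetical):

    void warn_path(const char *fmt)
    {
        preempt_disable();           /* held across the splat, per the earlier commit */

        if (!fmt) {
            emit_warn(NULL);         /* hypothetical splat helper */
            preempt_enable();        /* the re-enable that was missing in the !fmt branch */
            return;
        }

        emit_warn(fmt);
        preempt_enable();
    }
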
2023-09-23  PM: hibernate: Fix the exclusive get block device in test_resume mode  (Chen Yu)
[ Upstream commit 148b6f4cc3920e563094540fe1a12d00d3bbccae ] Commit 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode") fixes a hibernation issue under test_resume mode. That commit is supposed to open the block device in non-exclusive mode when in test_resume. However the code does the opposite, which is against its description. In summary, the swap device is only opened exclusively by swsusp_check() with its corresponding *close(), and must be in non test_resume mode. This is to avoid the race condition that different processes scribble the device at the same time. All the other cases should use non-exclusive mode. Fix it by really disabling exclusive mode under test_resume. Fixes: 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode") Closes: https://lore.kernel.org/lkml/000000000000761f5f0603324129@google.com/ Reported-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Chenzhou Feng <chenzhoux.feng@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  printk: Rename abandon_console_lock_in_panic() to other_cpu_in_panic()  (John Ogness)
[ Upstream commit 132a90d1527fedba2d95085c951ccf00dbbebe41 ] Currently abandon_console_lock_in_panic() is only used to determine if the current CPU should immediately release the console lock because another CPU is in panic. However, later this function will be used by the CPU to immediately release other resources in this situation. Rename the function to other_cpu_in_panic(), which is a better description and does not assume it is related to the console lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230717194607.145135-8-john.ogness@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  printk: Consolidate console deferred printing  (John Ogness)
[ Upstream commit 696ffaf50e1f8dbc66223ff614473f945f5fb8d8 ] Printing to consoles can be deferred for several reasons:

  - explicitly with printk_deferred()
  - printk() in NMI context
  - recursive printk() calls

The current implementation is not consistent. For printk_deferred(), irq work is scheduled twice. For NMI and recursive, panic CPU suppression and caller delays are not properly enforced. Correct these inconsistencies by consolidating the deferred printing code so that vprintk_deferred() is the top-level function for deferred printing and vprintk_emit() will perform whichever irq_work queueing is appropriate. Also add kerneldoc for wake_up_klogd() and defer_console_output() to clarify their differences and appropriate usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230717194607.145135-6-john.ogness@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  printk: Do not take console lock for console_flush_on_panic()  (John Ogness)
[ Upstream commit eacb04ff3c5b8662a65f380ae450250698448cff ] Currently console_flush_on_panic() will attempt to acquire the console lock when flushing the buffer on panic. If it fails to acquire the lock, it continues anyway because this is the last chance to get any pending records printed. The reason why the console lock was attempted at all was to prevent any other CPUs from acquiring the console lock for printing while the panic CPU was printing. But as of the previous commit, non-panic CPUs will no longer attempt to acquire the console lock in a panic situation. Therefore it is no longer strictly necessary for a panic CPU to acquire the console lock. Avoiding taking the console lock when flushing in panic has the additional benefit of avoiding possible deadlocks due to semaphore usage in NMI context (semaphores are not NMI-safe) and avoiding possible deadlocks if another CPU accesses the semaphore and is stopped while holding one of the semaphore's internal spinlocks. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230717194607.145135-5-john.ogness@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  printk: Keep non-panic-CPUs out of console lock  (John Ogness)
[ Upstream commit 51a1d258e50e03a0216bf42b6af9ff34ec402ac1 ] When in a panic situation, non-panic CPUs should avoid holding the console lock so as not to contend with the panic CPU. This is already implemented with abandon_console_lock_in_panic(), which is checked after each printed line. However, non-panic CPUs should also avoid trying to acquire the console lock during a panic. Modify console_trylock() to fail and console_lock() to block() when called from a non-panic CPU during a panic. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230717194607.145135-4-john.ogness@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  printk: Reduce console_unblank() usage in unsafe scenarios  (John Ogness)
[ Upstream commit 7b23a66db55ed0a55b020e913f0d6f6d52a1ad2c ] A semaphore is not NMI-safe, even when using down_trylock(). Both down_trylock() and up() are using internal spinlocks and up() might even call wake_up_process(). In the panic() code path it gets even worse because the internal spinlocks of the semaphore may have been taken by a CPU that has been stopped. To reduce the risk of deadlocks caused by the console semaphore in the panic path, make the following changes: - First check if any consoles have implemented the unblank() callback. If not, then there is no reason to take the console semaphore anyway. (This check is also useful for the non-panic path since the locking/unlocking of the console lock can be quite expensive due to console printing.) - If the panic path is in NMI context, bail out without attempting to take the console semaphore or calling any unblank() callbacks. Bailing out is acceptable because console_unblank() would already bail out if the console semaphore is contended. The alternative of ignoring the console semaphore and calling the unblank() callbacks anyway is a bad idea because these callbacks are also not NMI-safe. If consoles with unblank() callbacks exist and console_unblank() is called from a non-NMI panic context, it will still attempt a down_trylock(). This could still result in a deadlock if one of the stopped CPUs is holding the semaphore internal spinlock. But this is a risk that the kernel has been (and continues to be) willing to take. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230717194607.145135-3-john.ogness@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  bpf: Consider non-owning refs to refcounted nodes RCU protected  (Dave Marchevsky)
[ Upstream commit 0816b8c6bf7fc87cec4273dc199e8f0764b9e7b1 ] An earlier patch in the series ensures that the underlying memory of nodes with bpf_refcount - which can have multiple owners - is not reused until RCU grace period has elapsed. This prevents use-after-free with non-owning references that may point to recently-freed memory. While RCU read lock is held, it's safe to dereference such a non-owning ref, as by definition RCU GP couldn't have elapsed and therefore underlying memory couldn't have been reused. From the perspective of verifier "trustedness" non-owning refs to refcounted nodes are now trusted only in RCU CS and therefore should no longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them MEM_RCU in order to reflect this new state. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230821193311.3290257-6-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  scftorture: Forgive memory-allocation failure if KASAN  (Paul E. McKenney)
[ Upstream commit 013608cd0812bdb21fc26d39ed8fdd2fc76e8b9b ] Kernels built with CONFIG_KASAN=y quarantine newly freed memory in order to better detect use-after-free errors. However, this can exhaust memory more quickly in allocator-heavy tests, which can result in spurious scftorture failure. This commit therefore forgives memory-allocation failure in kernels built with CONFIG_KASAN=y, but continues counting the errors for use in detailed test-result analyses. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  rcuscale: Move rcu_scale_writer() schedule_timeout_uninterruptible() to _idle()  (Zqiang)
[ Upstream commit e60c122a1614b4f65b29a7bef9d83b9fd30e937a ] The rcuscale.holdoff module parameter can be used to delay the start of the rcu_scale_writer() kthread. However, the hung-task timeout will trigger when the timeout specified by rcuscale.holdoff is greater than hung_task_timeout_secs:

  runqemu kvm nographic slirp qemuparams="-smp 4 -m 2048M" bootparams="rcuscale.shutdown=0 rcuscale.holdoff=300"

  [ 247.071753] INFO: task rcu_scale_write:59 blocked for more than 122 seconds.
  [ 247.072529] Not tainted 6.4.0-rc1-00134-gb9ed6de8d4ff #7
  [ 247.073400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 247.074331] task:rcu_scale_write state:D stack:30144 pid:59 ppid:2 flags:0x00004000
  [ 247.075346] Call Trace:
  [ 247.075660] <TASK>
  [ 247.075965] __schedule+0x635/0x1280
  [ 247.076448] ? __pfx___schedule+0x10/0x10
  [ 247.076967] ? schedule_timeout+0x2dc/0x4d0
  [ 247.077471] ? __pfx_lock_release+0x10/0x10
  [ 247.078018] ? enqueue_timer+0xe2/0x220
  [ 247.078522] schedule+0x84/0x120
  [ 247.078957] schedule_timeout+0x2e1/0x4d0
  [ 247.079447] ? __pfx_schedule_timeout+0x10/0x10
  [ 247.080032] ? __pfx_rcu_scale_writer+0x10/0x10
  [ 247.080591] ? __pfx_process_timeout+0x10/0x10
  [ 247.081163] ? __pfx_sched_set_fifo_low+0x10/0x10
  [ 247.081760] ? __pfx_rcu_scale_writer+0x10/0x10
  [ 247.082287] rcu_scale_writer+0x6b1/0x7f0
  [ 247.082773] ? mark_held_locks+0x29/0xa0
  [ 247.083252] ? __pfx_rcu_scale_writer+0x10/0x10
  [ 247.083865] ? __pfx_rcu_scale_writer+0x10/0x10
  [ 247.084412] kthread+0x179/0x1c0
  [ 247.084759] ? __pfx_kthread+0x10/0x10
  [ 247.085098] ret_from_fork+0x2c/0x50
  [ 247.085433] </TASK>

This commit therefore replaces schedule_timeout_uninterruptible() with schedule_timeout_idle(). Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23  kernel/fork: beware of __put_task_struct() calling context  (Wander Lairson Costa)
[ Upstream commit d243b34459cea30cfe5f3a9b2feb44e7daff9938 ] Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping locks. Therefore, it can't be called from a non-preemptible context. One practical example is the splat inside inactive_task_timer(), which is called in interrupt context:

  CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
  Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
  Call Trace:
   dump_stack_lvl+0x57/0x7d
   mark_lock_irq.cold+0x33/0xba
   mark_lock+0x1e7/0x400
   mark_usage+0x11d/0x140
   __lock_acquire+0x30d/0x930
   lock_acquire.part.0+0x9c/0x210
   rt_spin_lock+0x27/0xe0
   refill_obj_stock+0x3d/0x3a0
   kmem_cache_free+0x357/0x560
   inactive_task_timer+0x1ad/0x340
   __run_hrtimer+0x8a/0x1a0
   __hrtimer_run_queues+0x91/0x130
   hrtimer_interrupt+0x10f/0x220
   __sysvec_apic_timer_interrupt+0x7b/0xd0
   sysvec_apic_timer_interrupt+0x4f/0xd0
   asm_sysvec_apic_timer_interrupt+0x12/0x20
   RIP: 0033:0x7fff196bf6f5

Instead of calling __put_task_struct() directly, we defer it using call_rcu(). A more natural approach would use a workqueue, but since in PREEMPT_RT we can't allocate dynamic memory from atomic context, the code would become more complex because we would need to put the work_struct instance in the task_struct and initialize it when we allocate a new task_struct. The issue is reproducible with stress-ng:

  while true; do
      stress-ng --sched deadline --sched-period 1000000000 \
          --sched-runtime 800000000 --sched-deadline \
          1000000000 --mmapfork 23 -t 20
  done

Reported-by: Hu Chunyu <chuhu@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Valentin Schneider <vschneid@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
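A minimal sketch of the call_rcu() deferral pattern described above (the struct layout and helper names are illustrative, not the exact kernel change):

    struct task_like {
        refcount_t usage;
        struct rcu_head rcu;        /* reused to queue the deferred free */
    };

    static void delayed_free(struct rcu_head *rhp)
    {
        struct task_like *t = container_of(rhp, struct task_like, rcu);

        real_free(t);               /* hypothetical: the part that may sleep under PREEMPT_RT */
    }

    static void put_task_like(struct task_like *t)
    {
        if (refcount_dec_and_test(&t->usage))
            call_rcu(&t->rcu, delayed_free);   /* defer the free out of atomic context */
    }

The deferral means the last put can safely happen from a timer or other non-preemptible context, with the sleeping-lock work pushed to the RCU callback.
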
2023-09-19  bpf: fix bpf_probe_read_kernel prototype mismatch  (Arnd Bergmann)
[ Upstream commit 6a5a148aaf14747570cc634f9cdfcb0393f5617f ] bpf_probe_read_kernel() has a __weak definition in core.c and another definition with an incompatible prototype in kernel/trace/bpf_trace.c, when CONFIG_BPF_EVENTS is enabled. Since the two are incompatible, there cannot be a shared declaration in a header file, but the lack of a prototype causes a W=1 warning: kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes] On 32-bit architectures, the local prototype u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr) passes arguments in other registers as the one in bpf_trace.c BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size, const void *, unsafe_ptr) which uses 64-bit arguments in pairs of registers. As both versions of the function are fairly simple and only really differ in one line, just move them into a header file as an inline function that does not add any overhead for the bpf_trace.c callers and actually avoids a function call for the other one. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-19bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_allocMartin KaFai Lau
[ Upstream commit 55d49f750b1cb1f177fb1b00ae02cba4613bcfb7 ]

The commit c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage
functions for reuse") refactored bpf_{sk,task,inode}_storage_free()
into bpf_local_storage_unlink_nolock(), which was later renamed to
bpf_local_storage_destroy(). The commit accidentally passed the
"bool uncharge_mem = false" argument to
bpf_selem_unlink_storage_nolock(), which then stopped the uncharge
from happening to sk->sk_omem_alloc.

This missing uncharge only happens when the sk is going away (during
__sk_destruct).

This patch fixes it by always passing "uncharge_mem = true". It is a
no-op for the task/inode/cgroup storage because they do not have
map_local_storage_(un)charge enabled in the map_ops. A followup patch
will be done in bpf-next to remove the uncharge_mem argument.

A selftest is added in the next patch.

Fixes: c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions for reuse")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
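
A simplified sketch of the destroy path with the fix applied; the real
function takes additional arguments and bookkeeping that are elided here:

	void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
	{
		struct bpf_local_storage_elem *selem;
		struct hlist_node *n;

		hlist_for_each_entry_safe(selem, n, &local_storage->list, snode)
			/*
			 * uncharge_mem must be true so the sk_omem_alloc
			 * charge taken at allocation time is returned when
			 * the socket is destroyed.
			 */
			bpf_selem_unlink_storage_nolock(local_storage, selem,
							true /* uncharge_mem */,
							true);
	}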
2023-09-19bpf: bpf_sk_storage: Fix invalid wait context lockdep reportMartin KaFai Lau
[ Upstream commit a96a44aba556c42b432929d37d60158aca21ad4c ]

'./test_progs -t test_local_storage' reported a splat:

[ 27.137569] =============================
[ 27.138122] [ BUG: Invalid wait context ]
[ 27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G O
[ 27.139542] -----------------------------
[ 27.140106] test_progs/1729 is trying to lock:
[ 27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
[ 27.141834] other info that might help us debug this:
[ 27.142437] context-{5:5}
[ 27.142856] 2 locks held by test_progs/1729:
[ 27.143352] #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
[ 27.144492] #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
[ 27.145855] stack backtrace:
[ 27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G O 6.5.0-03980-gd11ae1b16b0a #247
[ 27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 27.149127] Call Trace:
[ 27.149490] <TASK>
[ 27.149867] dump_stack_lvl+0x130/0x1d0
[ 27.152609] dump_stack+0x14/0x20
[ 27.153131] __lock_acquire+0x1657/0x2220
[ 27.153677] lock_acquire+0x1b8/0x510
[ 27.157908] local_lock_acquire+0x29/0x130
[ 27.159048] obj_cgroup_charge+0xf4/0x3c0
[ 27.160794] slab_pre_alloc_hook+0x28e/0x2b0
[ 27.161931] __kmem_cache_alloc_node+0x51/0x210
[ 27.163557] __kmalloc+0xaa/0x210
[ 27.164593] bpf_map_kzalloc+0xbc/0x170
[ 27.165147] bpf_selem_alloc+0x130/0x510
[ 27.166295] bpf_local_storage_update+0x5aa/0x8e0
[ 27.167042] bpf_fd_sk_storage_update_elem+0xdb/0x1a0
[ 27.169199] bpf_map_update_value+0x415/0x4f0
[ 27.169871] map_update_elem+0x413/0x550
[ 27.170330] __sys_bpf+0x5e9/0x640
[ 27.174065] __x64_sys_bpf+0x80/0x90
[ 27.174568] do_syscall_64+0x48/0xa0
[ 27.175201] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 27.175932] RIP: 0033:0x7effb40e41ad
[ 27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
[ 27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[ 27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
[ 27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
[ 27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
[ 27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
[ 27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
[ 27.184958] </TASK>

It complains about acquiring a local_lock while holding a
raw_spin_lock. It means it should not allocate memory while holding a
raw_spin_lock since it is not safe for RT.

raw_spin_lock is needed because bpf_local_storage supports tracing
context. In particular for task local storage, it is easy to get a
"current" task PTR_TO_BTF_ID in a tracing bpf prog. However, task (and
cgroup) local storage has already been moved to the bpf mem allocator,
which can be used after raw_spin_lock.

The splat is for the sk storage. For sk (and inode) storage, it has
not been moved to the bpf mem allocator. Using raw_spin_lock or not,
kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
However, the local storage helper requires a verifier-accepted sk
pointer (PTR_TO_BTF_ID), so it is hypothetical whether that situation
(running a bpf prog in a kzalloc-unsafe context while also holding a
verifier-accepted sk pointer) could happen.

This patch avoids kzalloc after raw_spin_lock to silence the splat.
There is an existing kzalloc before the raw_spin_lock. At that point,
a kzalloc is very likely required because a lookup has just been done
before. Thus, this patch always does the kzalloc before acquiring the
raw_spin_lock and removes the later kzalloc usage after the
raw_spin_lock. After this change, it will have a charge and then
uncharge during the syscall bpf_map_update_elem() code path. This
patch opts for simplicity and does not continue the old optimization
to save one charge and uncharge.

This issue dates back to the very first commit of bpf_sk_storage,
which has since been refactored multiple times to create task, inode,
and cgroup storage. This patch uses a Fixes tag with a more recent
commit that should be easier to backport.

Fixes: b00fa38a9c1c ("bpf: Enable non-atomic allocations in local storage")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
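
A condensed sketch of the reordered update path; the helper names
(bpf_selem_alloc(), bpf_selem_free()) and error handling are assumptions
here and may be spelled differently in the stable branch:

	/* Allocate (and charge) before taking the raw_spin_lock ... */
	alloc_selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
	if (!alloc_selem)
		return ERR_PTR(-ENOMEM);

	raw_spin_lock_irqsave(&local_storage->lock, flags);
	/* ... link alloc_selem or reuse an existing element; no kzalloc() in here ... */
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);

	/* ... and free (and uncharge) the preallocation if it was not consumed. */
	if (alloc_selem)
		bpf_selem_free(alloc_selem, smap, true);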
2023-09-19bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.Sebastian Andrzej Siewior
[ Upstream commit 6764e767f4af1e35f87f3497e1182d945de37f93 ]

__bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
before performing the recursion check, which means in case of a
recursion __bpf_prog_exit_recur() uses the previously set
bpf_tramp_run_ctx::saved_run_ctx value.

__bpf_prog_enter_sleepable_recur() assigns
bpf_tramp_run_ctx::saved_run_ctx after the recursion check, which
means in case of a recursion __bpf_prog_exit_sleepable_recur() uses an
uninitialized value. This does not look right. If I read the entry
trampoline code right, then bpf_tramp_run_ctx isn't initialized
upfront.

Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur()
and set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is
made. Remove the assignment of saved_run_ctx in kern_sys_bpf() since
it happens a few cycles later.

Fixes: e384c7b7b46d0 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
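
A reconstruction of the intended ordering in the sleepable entry helper,
sketched from the description above rather than copied verbatim from the
upstream function:

	static u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
							    struct bpf_tramp_run_ctx *run_ctx)
	{
		rcu_read_lock_trace();
		migrate_disable();
		might_fault();

		/* Save the previous run_ctx before the recursion check ... */
		run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

		/* ... so the exit helper always restores a valid value. */
		if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
			bpf_prog_inc_misses_counter(prog);
			return 0;
		}
		return bpf_prog_start_time();
	}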
2023-09-19bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().Sebastian Andrzej Siewior
[ Upstream commit 7645629f7dc88cd777f98970134bf1a54c8d77e3 ]

If __bpf_prog_enter_sleepable_recur() detects recursion then it
returns 0 without undoing rcu_read_lock_trace(), migrate_disable() or
decrementing the recursion counter. This is fine in the JIT case
because the JIT code will jump in the 0 case to the end and invoke the
matching exit trampoline (__bpf_prog_exit_sleepable_recur()).

This is not the case in kern_sys_bpf(), which returns directly to the
caller with an error code.

Add __bpf_prog_exit_sleepable_recur() as cleanup in the recursion
case.

Fixes: b1d18a7574d0d ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
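
The corresponding cleanup in kern_sys_bpf() then looks roughly like the
fragment below; the surrounding code and the exact error value are
assumptions in this sketch:

	run_ctx.bpf_cookie = 0;
	if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) {
		/*
		 * Recursion detected: run the matching exit helper so
		 * rcu_read_lock_trace(), migrate_disable() and the recursion
		 * counter are all undone before failing the prog run.
		 */
		__bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx);
		bpf_prog_put(prog);
		return -EBUSY;
	}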
2023-09-19netfilter: nf_tables: Audit log rule resetPhil Sutter
[ Upstream commit ea078ae9108e25fc881c84369f7c03931d22e555 ]

Resetting rules' stateful data happens outside of the transaction
logic, so the 'get' and 'dump' handlers have to emit audit log entries
themselves.

Fixes: 8daa8fde3fc3f ("netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
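
A hedged illustration of such an explicit call using the generic
audit_log_nfcfg() helper; the helper wrapper, the op constant, and the way
the entry count is gathered are assumptions here, not necessarily what the
patch does:

	/* Hypothetical wrapper: emit one audit record per rule-reset request. */
	static void nft_audit_log_rule_reset_sketch(const char *table_name,
						    u8 family,
						    unsigned int nentries)
	{
		audit_log_nfcfg(table_name, family, nentries,
				AUDIT_NFT_OP_RULE_RESET /* assumed constant */,
				GFP_ATOMIC);
	}

The setelem reset described in the following entry would follow the same
pattern with a set-element-specific operation code.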
2023-09-19netfilter: nf_tables: Audit log setelem resetPhil Sutter
[ Upstream commit 7e9be1124dbe7888907e82cab20164578e3f9ab7 ]

Since set element reset is not integrated into nf_tables' transaction
logic, an explicit log call is needed, similar to NFT_MSG_GETOBJ_RESET
handling.

For the sake of simplicity, catchall element reset will always generate
a dedicated log entry. This relieves nf_tables_dump_set() from having
to adjust the logged element count depending on whether a catchall
element was found or not.

Fixes: 079cd633219d7 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>