summaryrefslogtreecommitdiff
path: root/kernel/trace
AgeCommit message (Collapse)Author
2023-10-25tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbolsFrancis Laniel
commit b022f0c7e404887a7c5229788fc99eff9f9a80d5 upstream. When a kprobe is attached to a function that's name is not unique (is static and shares the name with other functions in the kernel), the kprobe is attached to the first function it finds. This is a bug as the function that it is attaching to is not necessarily the one that the user wants to attach to. Instead of blindly picking a function to attach to what is ambiguous, error with EADDRNOTAVAIL to let the user know that this function is not unique, and that the user must use another unique function with an address offset to get to the function they want to attach to. Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/ Cc: stable@vger.kernel.org Fixes: 413d37d1eb69 ("tracing: Add kprobe-based event tracer") Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com> Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25tracing: relax trace_event_eval_update() execution with cond_resched()Clément Léger
[ Upstream commit 23cce5f25491968b23fb9c399bbfb25f13870cd9 ] When kernel is compiled without preemption, the eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted up to its complete execution. This can actually cause a problem since if another CPU call stop_machine(), the call will have to wait for the eval_map_work_func() function to finish executing in the workqueue before being able to be scheduled. This problem was observe on a SMP system at boot time, when the CPU calling the initcalls executed clocksource_done_booting() which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously like before without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Clément Léger <cleger@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25fprobe: Fix to ensure the number of active retprobes is not zeroMasami Hiramatsu (Google)
commit 700b2b439766e8aab8a7174991198497345bd411 upstream. The number of active retprobes can be zero but it is not acceptable, so return EINVAL error if detected. Link: https://lore.kernel.org/all/169750018550.186853.11198884812017796410.stgit@devnote2/ Reported-by: wuqiang.matt <wuqiang.matt@bytedance.com> Closes: https://lore.kernel.org/all/20231016222103.cb9f426edc60220eabd8aa6a@kernel.org/ Fixes: 5b0ab78998e3 ("fprobe: Add exit_handler support") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06tracing/user_events: Align set_bit() address for all archsBeau Belgrave
commit 2de9ee94054263940122aee8720e902b30c27930 upstream. All architectures should use a long aligned address passed to set_bit(). User processes can pass either a 32-bit or 64-bit sized value to be updated when tracing is enabled when on a 64-bit kernel. Both cases are ensured to be naturally aligned, however, that is not enough. The address must be long aligned without affecting checks on the value within the user process which require different adjustments for the bit for little and big endian CPUs. Add a compat flag to user_event_enabler that indicates when a 32-bit value is being used on a 64-bit kernel. Long align addresses and correct the bit to be used by set_bit() to account for this alignment. Ensure compat flags are copied during forks and used during deletion clears. Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/ Cc: stable@vger.kernel.org Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Reported-by: Clément Léger <cleger@rivosinc.com> Suggested-by: Clément Léger <cleger@rivosinc.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06bpf: Add override check to kprobe multi link attachJiri Olsa
commit 41bc46c12a8053a1b3279a379bd6b5e87b045b85 upstream. Currently the multi_kprobe link attach does not check error injection list for programs with bpf_override_return helper and allows them to attach anywhere. Adding the missing check. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20230907200652.926951-1-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06ring-buffer: Update "shortest_full" in pollingSteven Rostedt (Google)
commit 1e0cb399c7653462d9dadf8ab9425337c355d358 upstream. It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the constant waking of the writer, it would wake up over 11,0000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall <julia.lawall@inria.fr> Tested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06ring-buffer: Fix bytes info in per_cpu buffer statsZheng Yejian
commit 45d99ea451d0c30bfd4864f0fe485d7dac014902 upstream. The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of bytes in cpu buffer that have not been consumed. However, currently after consuming data by reading file 'trace_pipe', the 'bytes' info was not changed as expected. # cat per_cpu/cpu0/stats entries: 0 overrun: 0 commit overrun: 0 bytes: 568 <--- 'bytes' is problematical !!! oldest event ts: 8651.371479 now ts: 8653.912224 dropped events: 0 read events: 8 The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it: 1. When stat 'read_bytes', account consumed event in rb_advance_reader(); 2. When stat 'entries_bytes', exclude the discarded padding event which is smaller than minimum size because it is invisible to reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for page-based read/remove/overrun. Also correct the comments of ring_buffer_bytes_cpu() in this patch. Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06ring-buffer: Do not attempt to read past "commit"Steven Rostedt (Google)
[ Upstream commit 95a404bd60af6c4d9d8db01ad14fe8957ece31ca ] When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There's barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks to detect the corruption by the writer to reads needs to see the length of the event. If the length in the first 4 bytes is zero then the length is stored in the second 4 bytes. But if the writer is in the process of updating that code, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ring-buffer: Avoid softlockup in ring_buffer_resize()Zheng Yejian
[ Upstream commit f6bd2c92488c30ef53b5bd80c52f0a7eee9d545a ] When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be allocated, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23tracing: Have option files inc the trace array ref countSteven Rostedt (Google)
commit 7e2cfbd2d3c86afcd5c26b5c4b1dd251f63c5838 upstream. The option files update the options for a given trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when an option file is opened. Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracing: Have current_trace inc the trace array ref countSteven Rostedt (Google)
commit 9b37febc578b2e1ad76a105aab11d00af5ec3d27 upstream. The current_trace updates the trace array tracer. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace array when current_trace is opened. Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracing: Increase trace array ref count on enable and filter filesSteven Rostedt (Google)
commit f5ca233e2e66dc1c249bf07eefa37e34a6c9346a upstream. When the trace event enable and filter files are opened, increment the trace array ref counter, otherwise they can be accessed when the trace array is being deleted. The ref counter keeps the trace array from being deleted while those files are opened. Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracing/synthetic: Print out u64 values properlyTero Kristo
commit 62663b849662c1a5126b6274d91671b90566ef13 upstream. The synth traces incorrectly print pointer to the synthetic event values instead of the actual value when using u64 type. Fix by addressing the contents of the union properly. Link: https://lore.kernel.org/linux-trace-kernel/20230911141704.3585965-1-tero.kristo@linux.intel.com Fixes: ddeea494a16f ("tracing/synthetic: Use union instead of casts") Cc: stable@vger.kernel.org Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracing: Have event inject files inc the trace array ref countSteven Rostedt (Google)
commit e5c624f027ac74f97e97c8f36c69228ac9f1102d upstream. The event inject files add events for a specific trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when a event inject file is opened. Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 6c3edaf9fd6a ("tracing: Introduce trace event injection") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracing: Have tracing_max_latency inc the trace array ref countSteven Rostedt (Google)
commit 7d660c9b2bc95107f90a9f4c4759be85309a6550 upstream. The tracing_max_latency file points to the trace_array max_latency field. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when tracing_max_latency is opened. Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19bpf: fix bpf_probe_read_kernel prototype mismatchArnd Bergmann
[ Upstream commit 6a5a148aaf14747570cc634f9cdfcb0393f5617f ] bpf_probe_read_kernel() has a __weak definition in core.c and another definition with an incompatible prototype in kernel/trace/bpf_trace.c, when CONFIG_BPF_EVENTS is enabled. Since the two are incompatible, there cannot be a shared declaration in a header file, but the lack of a prototype causes a W=1 warning: kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes] On 32-bit architectures, the local prototype u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr) passes arguments in other registers as the one in bpf_trace.c BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size, const void *, unsafe_ptr) which uses 64-bit arguments in pairs of registers. As both versions of the function are fairly simple and only really differ in one line, just move them into a header file as an inline function that does not add any overhead for the bpf_trace.c callers and actually avoids a function call for the other one. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13tracing: Fix race issue between cpu buffer write and swapZheng Yejian
[ Upstream commit 3163f635b20e9e1fb4659e74f47918c9dddfe64e ] Warning happened in rb_end_commit() at code: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142 rb_commit+0x402/0x4a0 Call Trace: ring_buffer_unlock_commit+0x42/0x250 trace_buffer_unlock_commit_regs+0x3b/0x250 trace_event_buffer_commit+0xe5/0x440 trace_event_buffer_reserve+0x11c/0x150 trace_event_raw_event_sched_switch+0x23c/0x2c0 __traceiter_sched_switch+0x59/0x80 __schedule+0x72b/0x1580 schedule+0x92/0x120 worker_thread+0xa0/0x6f0 It is because the race between writing event into cpu buffer and swapping cpu buffer through file per_cpu/cpu0/snapshot: Write on CPU 0 Swap buffer by per_cpu/cpu0/snapshot on CPU 1 -------- -------- tracing_snapshot_write() [...] ring_buffer_lock_reserve() cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a'; [...] rb_reserve_next_event() [...] ring_buffer_swap_cpu() if (local_read(&cpu_buffer_a->committing)) goto out_dec; if (local_read(&cpu_buffer_b->committing)) goto out_dec; buffer_a->buffers[cpu] = cpu_buffer_b; buffer_b->buffers[cpu] = cpu_buffer_a; // 2. cpu_buffer has swapped here. rb_start_commit(cpu_buffer); if (unlikely(READ_ONCE(cpu_buffer->buffer) != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer' [...] // has not changed here. return NULL; } cpu_buffer_b->buffer = buffer_a; cpu_buffer_a->buffer = buffer_b; [...] // 4. Reserve event from 'cpu_buffer_a'. ring_buffer_unlock_commit() [...] cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!! rb_commit(cpu_buffer) rb_end_commit() // 6. WARN for the wrong 'committing' state !!! Based on above analysis, we can easily reproduce by following testcase: ``` bash #!/bin/bash dmesg -n 7 sysctl -w kernel.panic_on_warn=1 TR=/sys/kernel/tracing echo 7 > ${TR}/buffer_size_kb echo "sched:sched_switch" > ${TR}/set_event while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & ``` To fix it, IIUC, we can use smp_call_function_single() to do the swap on the target cpu where the buffer is located, so that above race would be avoided. Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Fixes: f1affcaaa861 ("tracing: Add snapshot in the per_cpu trace directories") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13tracing: Remove extra space at the end of hwlat_detector/modeMikhail Kobuk
[ Upstream commit 2cf0dee989a8b2501929eaab29473b6b1fa11057 ] Space is printed after each mode value including the last one: $ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\" "none [round-robin] per-cpu " Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 8fa826b7344d ("trace/hwlat: Implement the mode config option") Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru> Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13bpf: Clear the probe_addr for uprobeYafang Shao
[ Upstream commit 5125e757e62f6c1d5478db4c2b61a744060ddf3f ] To avoid returning uninitialized or random values when querying the file descriptor (fd) and accessing probe_addr, it is necessary to clear the variable prior to its use. Fixes: 41bdc4b40ed6 ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230709025630.3735-6-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-06tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSYBrian Foster
commit 3d07fa1dd19035eb0b13ae6697efd5caa9033e74 upstream. The pipe cpumask used to serialize opens between the main and percpu trace pipes is not zeroed or initialized. This can result in spurious -EBUSY returns if underlying memory is not fully zeroed. This has been observed by immediate failure to read the main trace_pipe file on an otherwise newly booted and idle system: # cat /sys/kernel/debug/tracing/trace_pipe cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy Zero the allocation of pipe_cpumask to avoid the problem. Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com Cc: stable@vger.kernel.org Fixes: c2489bb7e6be ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes") Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-21tracing: Introduce pipe_cpumask to avoid race on trace_pipesZheng Yejian
There is race issue when concurrently splice_read main trace_pipe and per_cpu trace_pipes which will result in data read out being different from what actually writen. As suggested by Steven: > I believe we should add a ref count to trace_pipe and the per_cpu > trace_pipes, where if they are opened, nothing else can read it. > > Opening trace_pipe locks all per_cpu ref counts, if any of them are > open, then the trace_pipe open will fail (and releases any ref counts > it had taken). > > Opening a per_cpu trace_pipe will up the ref count for just that > CPU buffer. This will allow multiple tasks to read different per_cpu > trace_pipe files, but will prevent the main trace_pipe file from > being opened. But because we only need to know whether per_cpu trace_pipe is open or not, using a cpumask instead of using ref count may be easier. After this patch, users will find that: - Main trace_pipe can be opened by only one user, and if it is opened, all per_cpu trace_pipes cannot be opened; - Per_cpu trace_pipes can be opened by multiple users, but each per_cpu trace_pipe can only be opened by one user. And if one of them is opened, main trace_pipe cannot be opened. Link: https://lore.kernel.org/linux-trace-kernel/20230818022645.1948314-1-zhengyejian1@huawei.com Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-17tracing: Fix memleak due to race between current_tracer and traceZheng Yejian
Kmemleak report a leak in graph_trace_open(): unreferenced object 0xffff0040b95f4a00 (size 128): comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s) hex dump (first 32 bytes): e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}.......... f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e... backtrace: [<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0 [<000000007df90faa>] graph_trace_open+0xb0/0x344 [<00000000737524cd>] __tracing_open+0x450/0xb10 [<0000000098043327>] tracing_open+0x1a0/0x2a0 [<00000000291c3876>] do_dentry_open+0x3c0/0xdc0 [<000000004015bcd6>] vfs_open+0x98/0xd0 [<000000002b5f60c9>] do_open+0x520/0x8d0 [<00000000376c7820>] path_openat+0x1c0/0x3e0 [<00000000336a54b5>] do_filp_open+0x14c/0x324 [<000000002802df13>] do_sys_openat2+0x2c4/0x530 [<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4 [<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394 [<00000000313647bf>] do_el0_svc+0xac/0xec [<000000002ef1c651>] el0_svc+0x20/0x30 [<000000002fd4692a>] el0_sync_handler+0xb0/0xb4 [<000000000c309c35>] el0_sync+0x160/0x180 The root cause is descripted as follows: __tracing_open() { // 1. File 'trace' is being opened; ... *iter->trace = *tr->current_trace; // 2. Tracer 'function_graph' is // currently set; ... iter->trace->open(iter); // 3. Call graph_trace_open() here, // and memory are allocated in it; ... } s_start() { // 4. The opened file is being read; ... *iter->trace = *tr->current_trace; // 5. If tracer is switched to // 'nop' or others, then memory // in step 3 are leaked!!! ... } To fix it, in s_start(), close tracer before switching then reopen the new tracer after switching. And some tracers like 'wakeup' may not update 'iter->private' in some cases when reopen, then it should be cleared to avoid being mistakenly closed again. Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com Fixes: d7350c3f4569 ("tracing/core: make the read callbacks reentrants") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16tracing/synthetic: Allocate one additional element for sizeSven Schnelle
While debugging another issue I noticed that the stack trace contains one invalid entry at the end: <idle>-0 [008] d..4. 26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK: => __schedule+0xac6/0x1a98 => schedule+0x126/0x2c0 => schedule_timeout+0x150/0x2c0 => kcompactd+0x9ca/0xc20 => kthread+0x2f6/0x3d8 => __ret_from_fork+0x8a/0xe8 => 0x6b6b6b6b6b6b6b6b This is because the code failed to add the one element containing the number of entries to field_size. Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16tracing/synthetic: Skip first entry for stack tracesSven Schnelle
While debugging another issue I noticed that the stack trace output contains the number of entries on top: <idle>-0 [000] d..4. 203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK: => 0x10 => __schedule+0xac6/0x1a98 => schedule+0x126/0x2c0 => schedule_timeout+0x242/0x2c0 => __wait_for_common+0x434/0x680 => __wait_rcu_gp+0x198/0x3e0 => synchronize_rcu+0x112/0x138 => ring_buffer_reset_online_cpus+0x140/0x2e0 => tracing_reset_online_cpus+0x15c/0x1d0 => tracing_set_clock+0x180/0x1d8 => hist_register_trigger+0x486/0x670 => event_hist_trigger_parse+0x494/0x1318 => trigger_process_regex+0x1d4/0x258 => event_trigger_write+0xb4/0x170 => vfs_write+0x210/0xad0 => ksys_write+0x122/0x208 Fix this by skipping the first element. Also replace the pointer logic with an index variable which is easier to read. Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16tracing/synthetic: Use union instead of castsSven Schnelle
The current code uses a lot of casts to access the fields member in struct synth_trace_events with different sizes. This makes the code hard to read, and had already introduced an endianness bug. Use a union and struct instead. Link: https://lkml.kernel.org/r/20230816154928.4171614-2-svens@linux.ibm.com Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 00cf3d672a9dd ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-16tracing: Fix cpu buffers unavailable due to 'record_disabled' missedZheng Yejian
Trace ring buffer can no longer record anything after executing following commands at the shell prompt: # cd /sys/kernel/tracing # cat tracing_cpumask fff # echo 0 > tracing_cpumask # echo 1 > snapshot # echo fff > tracing_cpumask # echo 1 > tracing_on # echo "hello world" > trace_marker -bash: echo: write error: Bad file descriptor The root cause is that: 1. After `echo 0 > tracing_cpumask`, 'record_disabled' of cpu buffers in 'tr->array_buffer.buffer' became 1 (see tracing_set_cpumask()); 2. After `echo 1 > snapshot`, 'tr->array_buffer.buffer' is swapped with 'tr->max_buffer.buffer', then the 'record_disabled' became 0 (see update_max_tr()); 3. After `echo fff > tracing_cpumask`, the 'record_disabled' become -1; Then array_buffer and max_buffer are both unavailable due to value of 'record_disabled' is not 0. To fix it, enable or disable both array_buffer and max_buffer at the same time in tracing_set_cpumask(). Link: https://lkml.kernel.org/r/20230805033816.3284594-2-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <vnagarnaik@google.com> Cc: <shuah@kernel.org> Fixes: 71babb2705e2 ("tracing: change CPU ring buffer state from tracing_cpumask") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-03Merge tag 'net-6.5-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and wireless. Nothing scary here. Feels like the first wave of regressions from v6.5 is addressed - one outstanding fix still to come in TLS for the sendpage rework. Current release - regressions: - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES - dsa: fix older DSA drivers using phylink Previous releases - regressions: - gro: fix misuse of CB in udp socket lookup - mlx5: unregister devlink params in case interface is down - Revert "wifi: ath11k: Enable threaded NAPI" Previous releases - always broken: - sched: cls_u32: fix match key mis-addressing - sched: bind logic fixes for cls_fw, cls_u32 and cls_route - add bound checks to a number of places which hand-parse netlink - bpf: disable preemption in perf_event_output helpers code - qed: fix scheduling in a tasklet while getting stats - avoid using APIs which are not hardirq-safe in couple of drivers, when we may be in a hard IRQ (netconsole) - wifi: cfg80211: fix return value in scan logic, avoid page allocator warning - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D (DBDC) Misc: - drop handful of inactive maintainers, put some new in place" * tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits) MAINTAINERS: update TUN/TAP maintainers test/vsock: remove vsock_perf executable on `make clean` tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen tcp_metrics: annotate data-races around tm->tcpm_net tcp_metrics: annotate data-races around tm->tcpm_vals[] tcp_metrics: annotate data-races around tm->tcpm_lock tcp_metrics: annotate data-races around tm->tcpm_stamp tcp_metrics: fix addr_same() helper prestera: fix fallback to previous version on same major version udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES net/mlx5e: Set proper IPsec source port in L4 selector net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio net/mlx5: fs_core: Make find_closest_ft more generic wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1() vxlan: Fix nexthop hash size ip6mr: Fix skb_under_panic in ip6mr_cache_report() s390/qeth: Don't call dev_close/dev_open (DOWN/UP) net: tap_open(): set sk_uid from current_fsuid() net: tun_chr_open(): set sk_uid from current_fsuid() net: dcb: choose correct policy to parse DCB_ATTR_BCN ...
2023-08-03Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Martin KaFai Lau says: ==================== pull-request: bpf 2023-08-03 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 3 files changed, 37 insertions(+), 20 deletions(-). The main changes are: 1) Disable preemption in perf_event_output helpers code, from Jiri Olsa 2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing, from Lin Ma 3) Multiple warning splat fixes in cpumap from Hou Tao * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, cpumap: Handle skb as well when clean up ptr_ring bpf, cpumap: Make sure kthread is running before map update returns bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing bpf: Disable preemption in bpf_event_output bpf: Disable preemption in bpf_perf_event_output ==================== Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-30Merge tag 'probes-fixes-v6.5-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe fixes from Masami Hiramatsu: - probe-events: add NULL check for some BTF API calls which can return error code and NULL. - ftrace selftests: check fprobe and kprobe event correctly. This fixes a miss condition of the test command. - kprobes: do not allow probing functions that start with "__cfi_" or "__pfx_" since those are auto generated for kernel CFI and not executed. * tag 'probes-fixes-v6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Prohibit probing on CFI preamble symbol selftests/ftrace: Fix to check fprobe event eneblement tracing/probes: Fix to add NULL check for BTF APIs
2023-07-28tracing: Fix warning in trace_buffered_event_disable()Zheng Yejian
Warning happened in trace_buffered_event_disable() at WARN_ON_ONCE(!trace_buffered_event_ref) Call Trace: ? __warn+0xa5/0x1b0 ? trace_buffered_event_disable+0x189/0x1b0 __ftrace_event_enable_disable+0x19e/0x3e0 free_probe_data+0x3b/0xa0 unregister_ftrace_function_probe_func+0x6b8/0x800 event_enable_func+0x2f0/0x3d0 ftrace_process_regex.isra.0+0x12d/0x1b0 ftrace_filter_write+0xe6/0x140 vfs_write+0x1c9/0x6f0 [...] The cause of the warning is in __ftrace_event_enable_disable(), trace_buffered_event_enable() was called once while trace_buffered_event_disable() was called twice. Reproduction script show as below, for analysis, see the comments: ``` #!/bin/bash cd /sys/kernel/tracing/ # 1. Register a 'disable_event' command, then: # 1) SOFT_DISABLED_BIT was set; # 2) trace_buffered_event_enable() was called first time; echo 'cmdline_proc_show:disable_event:initcall:initcall_finish' > \ set_ftrace_filter # 2. Enable the event registered, then: # 1) SOFT_DISABLED_BIT was cleared; # 2) trace_buffered_event_disable() was called first time; echo 1 > events/initcall/initcall_finish/enable # 3. Try to call into cmdline_proc_show(), then SOFT_DISABLED_BIT was # set again!!! cat /proc/cmdline # 4. Unregister the 'disable_event' command, then: # 1) SOFT_DISABLED_BIT was cleared again; # 2) trace_buffered_event_disable() was called second time!!! echo '!cmdline_proc_show:disable_event:initcall:initcall_finish' > \ set_ftrace_filter ``` To fix it, IIUC, we can change to call trace_buffered_event_enable() at fist time soft-mode enabled, and call trace_buffered_event_disable() at last time soft-mode disabled. Link: https://lore.kernel.org/linux-trace-kernel/20230726095804.920457-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28tracing: Fix kernel-doc warnings in trace_seq.cGaosheng Cui
Fix kernel-doc warning: kernel/trace/trace_seq.c:142: warning: Function parameter or member 'args' not described in 'trace_seq_vprintf' Link: https://lkml.kernel.org/r/20230724140827.1023266-5-cuigaosheng1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28tracing: Fix kernel-doc warnings in trace_events_trigger.cGaosheng Cui
Fix kernel-doc warnings: kernel/trace/trace_events_trigger.c:59: warning: Function parameter or member 'buffer' not described in 'event_triggers_call' kernel/trace/trace_events_trigger.c:59: warning: Function parameter or member 'event' not described in 'event_triggers_call' Link: https://lkml.kernel.org/r/20230724140827.1023266-4-cuigaosheng1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.cGaosheng Cui
Fix kernel-doc warning: kernel/trace/trace_events_synth.c:1257: warning: Function parameter or member 'mod' not described in 'synth_event_gen_cmd_array_start' Link: https://lkml.kernel.org/r/20230724140827.1023266-3-cuigaosheng1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28ring-buffer: Fix kernel-doc warnings in ring_buffer.cGaosheng Cui
Fix kernel-doc warnings: kernel/trace/ring_buffer.c:954: warning: Function parameter or member 'cpu' not described in 'ring_buffer_wake_waiters' kernel/trace/ring_buffer.c:3383: warning: Excess function parameter 'event' description in 'ring_buffer_unlock_commit' kernel/trace/ring_buffer.c:5359: warning: Excess function parameter 'cpu' description in 'ring_buffer_reset_online_cpus' Link: https://lkml.kernel.org/r/20230724140827.1023266-2-cuigaosheng1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28ring-buffer: Fix wrong stat of cpu_buffer->readZheng Yejian
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set to 0 in order to make sure any read iterators reset themselves. However, this will mess 'entries' stating, see following steps: # cd /sys/kernel/tracing/ # 1. Enlarge ring buffer prepare for later reducing: # echo 20 > per_cpu/cpu0/buffer_size_kb # 2. Write a log into ring buffer of cpu0: # taskset -c 0 echo "hello1" > trace_marker # 3. Read the log: # cat per_cpu/cpu0/trace_pipe <...>-332 [000] ..... 62.406844: tracing_mark_write: hello1 # 4. Stop reading and see the stats, now 0 entries, and 1 event readed: # cat per_cpu/cpu0/stats entries: 0 [...] read events: 1 # 5. Reduce the ring buffer # echo 7 > per_cpu/cpu0/buffer_size_kb # 6. Now entries became unexpected 1 because actually no entries!!! # cat per_cpu/cpu0/stats entries: 1 [...] read events: 0 To fix it, introduce 'page_removed' field to count total removed pages since last reset, then use it to let read iterators reset themselves instead of changing the 'read' pointer. Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <vnagarnaik@google.com> Fixes: 83f40318dab0 ("ring-buffer: Make removal of ring buffer pages atomic") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-26tracing/probes: Fix to add NULL check for BTF APIsMasami Hiramatsu (Google)
Since find_btf_func_param() abd btf_type_by_id() can return NULL, the caller must check the return value correctly. Link: https://lore.kernel.org/all/169024903951.395371.11361556840733470934.stgit@devnote2/ Fixes: b576e09701c7 ("tracing/probes: Support function parameters if BTF is available") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-25bpf: Disable preemption in bpf_event_outputJiri Olsa
We received report [1] of kernel crash, which is caused by using nesting protection without disabled preemption. The bpf_event_output can be called by programs executed by bpf_prog_run_array_cg function that disabled migration but keeps preemption enabled. This can cause task to be preempted by another one inside the nesting protection and lead eventually to two tasks using same perf_sample_data buffer and cause crashes like: BUG: kernel NULL pointer dereference, address: 0000000000000001 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page ... ? perf_output_sample+0x12a/0x9a0 ? finish_task_switch.isra.0+0x81/0x280 ? perf_event_output+0x66/0xa0 ? bpf_event_output+0x13a/0x190 ? bpf_event_output_data+0x22/0x40 ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb ? xa_load+0x87/0xe0 ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0 ? release_sock+0x3e/0x90 ? sk_setsockopt+0x1a1/0x12f0 ? udp_pre_connect+0x36/0x50 ? inet_dgram_connect+0x93/0xa0 ? __sys_connect+0xb4/0xe0 ? udp_setsockopt+0x27/0x40 ? __pfx_udp_push_pending_frames+0x10/0x10 ? __sys_setsockopt+0xdf/0x1a0 ? __x64_sys_connect+0xf/0x20 ? do_syscall_64+0x3a/0x90 ? entry_SYSCALL_64_after_hwframe+0x72/0xdc Fixing this by disabling preemption in bpf_event_output. [1] https://github.com/cilium/cilium/issues/26756 Cc: stable@vger.kernel.org Reported-by: Oleg "livelace" Popov <o.popov@livelace.ru> Closes: https://github.com/cilium/cilium/issues/26756 Fixes: 2a916f2f546c ("bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.") Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230725084206.580930-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25bpf: Disable preemption in bpf_perf_event_outputJiri Olsa
The nesting protection in bpf_perf_event_output relies on disabled preemption, which is guaranteed for kprobes and tracepoints. However bpf_perf_event_output can be also called from uprobes context through bpf_prog_run_array_sleepable function which disables migration, but keeps preemption enabled. This can cause task to be preempted by another one inside the nesting protection and lead eventually to two tasks using same perf_sample_data buffer and cause crashes like: kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle page fault for address: ffffffff82be3eea ... Call Trace: ? __die+0x1f/0x70 ? page_fault_oops+0x176/0x4d0 ? exc_page_fault+0x132/0x230 ? asm_exc_page_fault+0x22/0x30 ? perf_output_sample+0x12b/0x910 ? perf_event_output+0xd0/0x1d0 ? bpf_perf_event_output+0x162/0x1d0 ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87 ? __uprobe_perf_func+0x12b/0x540 ? uprobe_dispatcher+0x2c4/0x430 ? uprobe_notify_resume+0x2da/0xce0 ? atomic_notifier_call_chain+0x7b/0x110 ? exit_to_user_mode_prepare+0x13e/0x290 ? irqentry_exit_to_user_mode+0x5/0x30 ? asm_exc_int3+0x35/0x40 Fixing this by disabling preemption in bpf_perf_event_output. Cc: stable@vger.kernel.org Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230725084206.580930-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-23tracing/histograms: Return an error if we fail to add histogram to hist_vars ↵Mohamed Khalfella
list Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") added a check to fail histogram creation if save_hist_vars() failed to add histogram to hist_vars list. But the commit failed to set ret to failed return code before jumping to unregister histogram, fix it. Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com Cc: stable@vger.kernel.org Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23ring-buffer: Do not swap cpu_buffer during resize processChen Lin
When ring_buffer_swap_cpu was called during resize process, the cpu buffer was swapped in the middle, resulting in incorrect state. Continuing to run in the wrong state will result in oops. This issue can be easily reproduced using the following two scripts: /tmp # cat test1.sh //#! /bin/sh for i in `seq 0 100000` do echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb sleep 0.5 echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb sleep 0.5 done /tmp # cat test2.sh //#! /bin/sh for i in `seq 0 100000` do echo irqsoff > /sys/kernel/debug/tracing/current_tracer sleep 1 echo nop > /sys/kernel/debug/tracing/current_tracer sleep 1 done /tmp # ./test1.sh & /tmp # ./test2.sh & A typical oops log is as follows, sometimes with other different oops logs. [ 231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8 [ 231.713375] Modules linked in: [ 231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15 [ 231.716750] Hardware name: linux,dummy-virt (DT) [ 231.718152] Workqueue: events update_pages_handler [ 231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 231.721171] pc : rb_update_pages+0x378/0x3f8 [ 231.722212] lr : rb_update_pages+0x25c/0x3f8 [ 231.723248] sp : ffff800082b9bd50 [ 231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000 [ 231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0 [ 231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a [ 231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000 [ 231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510 [ 231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002 [ 231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558 [ 231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001 [ 231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000 [ 231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208 [ 231.744196] Call trace: [ 231.744892] rb_update_pages+0x378/0x3f8 [ 231.745893] update_pages_handler+0x1c/0x38 [ 231.746893] process_one_work+0x1f0/0x468 [ 231.747852] worker_thread+0x54/0x410 [ 231.748737] kthread+0x124/0x138 [ 231.749549] ret_from_fork+0x10/0x20 [ 231.750434] ---[ end trace 0000000000000000 ]--- [ 233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 233.721696] Mem abort info: [ 233.721935] ESR = 0x0000000096000004 [ 233.722283] EC = 0x25: DABT (current EL), IL = 32 bits [ 233.722596] SET = 0, FnV = 0 [ 233.722805] EA = 0, S1PTW = 0 [ 233.723026] FSC = 0x04: level 0 translation fault [ 233.723458] Data abort info: [ 233.723734] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 233.724176] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 233.724589] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000 [ 233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 233.726720] Modules linked in: [ 233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15 [ 233.727777] Hardware name: linux,dummy-virt (DT) [ 233.728225] Workqueue: events update_pages_handler [ 233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 233.729054] pc : rb_update_pages+0x1a8/0x3f8 [ 233.729334] lr : rb_update_pages+0x154/0x3f8 [ 233.729592] sp : ffff800082b9bd50 [ 233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000 [ 233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418 [ 233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003 [ 233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58 [ 233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001 [ 233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000 [ 233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c [ 233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0 [ 233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000 [ 233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000 [ 233.734418] Call trace: [ 233.734593] rb_update_pages+0x1a8/0x3f8 [ 233.734853] update_pages_handler+0x1c/0x38 [ 233.735148] process_one_work+0x1f0/0x468 [ 233.735525] worker_thread+0x54/0x410 [ 233.735852] kthread+0x124/0x138 [ 233.736064] ret_from_fork+0x10/0x20 [ 233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060) [ 233.736959] ---[ end trace 0000000000000000 ]--- After analysis, the seq of the error is as follows [1-5]: int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, int cpu_id) { for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; //1. get cpu_buffer, aka cpu_buffer(A) ... ... schedule_work_on(cpu, &cpu_buffer->update_pages_work); //2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to // update_pages_handler, do the update process, set 'update_done' in // complete(&cpu_buffer->update_done) and to wakeup resize process. //----> //3. Just at this moment, ring_buffer_swap_cpu is triggered, //cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer. //ring_buffer_swap_cpu is called as the 'Call trace' below. Call trace: dump_backtrace+0x0/0x2f8 show_stack+0x18/0x28 dump_stack+0x12c/0x188 ring_buffer_swap_cpu+0x2f8/0x328 update_max_tr_single+0x180/0x210 check_critical_timing+0x2b4/0x2c8 tracer_hardirqs_on+0x1c0/0x200 trace_hardirqs_on+0xec/0x378 el0_svc_common+0x64/0x260 do_el0_svc+0x90/0xf8 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb8 el0_sync+0x180/0x1c0 //<---- /* wait for all the updates to complete */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; //4. get cpu_buffer, cpu_buffer(B) is used in the following process, //the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong. //for example, cpu_buffer(A)->update_done will leave be set 1, and will //not 'wait_for_completion' at the next resize round. if (!cpu_buffer->nr_pages_to_update) continue; if (cpu_online(cpu)) wait_for_completion(&cpu_buffer->update_done); cpu_buffer->nr_pages_to_update = 0; } ... } //5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong, //Continuing to run in the wrong state, then oops occurs. Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn Signed-off-by: Chen Lin <chen.lin5@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-23tracing: Remove unused extern declaration tracing_map_set_field_descr()YueHaibing
Since commit 08d43a5fa063 ("tracing: Add lock-free tracing_map"), this is never used, so can be removed. Link: https://lore.kernel.org/linux-trace-kernel/20230722032123.24664-1-yuehaibing@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-16Merge tag 'probes-fixes-v6.5-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe fixes from Masami Hiramatsu: - fprobe: Add a comment why fprobe will be skipped if another kprobe is running in fprobe_kprobe_handler(). - probe-events: Fix some issues related to fetch-arguments: - Fix double counting of the string length for user-string and symstr. This will require longer buffer in the array case. - Fix not to count error code (minus value) for the total used length in array argument. This makes the total used length shorter. - Fix to update dynamic used data size counter only if fetcharg uses the dynamic size data. This may mis-count the used dynamic data size and corrupt data. - Revert "tracing: Add "(fault)" name injection to kernel probes" because that did not work correctly with a bug, and we agreed the current '(fault)' output (instead of '"(fault)"' like a string) explains what happened more clearly. - Fix to record 0-length (means fault access) data_loc data in fetch function itself, instead of store_trace_args(). If we record an array of string, this will fix to save fault access data on each entry of the array correctly. * tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails Revert "tracing: Add "(fault)" name injection to kernel probes" tracing/probes: Fix to update dynamic data counter if fetcharg uses it tracing/probes: Fix not to count error code to total length tracing/probes: Fix to avoid double count of the string length on the array fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
2023-07-14tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if ↵Masami Hiramatsu (Google)
fails Fix to record 0-length data to data_loc in fetch_store_string*() if it fails to get the string data. Currently those expect that the data_loc is updated by store_trace_args() if it returns the error code. However, that does not work correctly if the argument is an array of strings. In that case, store_trace_args() only clears the first entry of the array (which may have no error) and leaves other entries. So it should be cleared by fetch_store_string*() itself. Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated only if it is used (ret > 0 and argument is a dynamic data.) Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/ Fixes: 40b53b771806 ("tracing: probeevent: Add array type support") Cc: stable@vger.kernel.org Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-07-13Merge tag 'trace-v6.5-rc1-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix some missing-prototype warnings - Fix user events struct args (did not include size of struct) When creating a user event, the "struct" keyword is to denote that the size of the field will be passed in. But the parsing failed to handle this case. - Add selftest to struct sizes for user events - Fix sample code for direct trampolines. The sample code for direct trampolines attached to handle_mm_fault(). But the prototype changed and the direct trampoline sample code was not updated. Direct trampolines needs to have the arguments correct otherwise it can fail or crash the system. - Remove unused ftrace_regs_caller_ret() prototype. - Quiet false positive of FORTIFY_SOURCE Due to backward compatibility, the structure used to save stack traces in the kernel had a fixed size of 8. This structure is exported to user space via the tracing format file. A change was made to allow more than 8 functions to be recorded, and user space now uses the size field to know how many functions are actually in the stack. But the structure still has size of 8 (even though it points into the ring buffer that has the required amount allocated to hold a full stack. This was fine until the fortifier noticed that the memcpy(&entry->caller, stack, size) was greater than the 8 functions and would complain at runtime about it. Hide this by using a pointer to the stack location on the ring buffer instead of using the address of the entry structure caller field. - Fix a deadloop in reading trace_pipe that was caused by a mismatch between ring_buffer_empty() returning false which then asked to read the data, but the read code uses rb_num_of_entries() that returned zero, and causing a infinite "retry". - Fix a warning caused by not using all pages allocated to store ftrace functions, where this can happen if the linker inserts a bunch of "NULL" entries, causing the accounting of how many pages needed to be off. - Fix histogram synthetic event crashing when the start event is removed and the end event is still using a variable from it - Fix memory leak in freeing iter->temp in tracing_release_pipe() * tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix memory leak of iter->temp when reading trace_pipe tracing/histograms: Add histograms to hist_vars if they have referenced variables tracing: Stop FORTIFY_SOURCE complaining about stack trace caller ftrace: Fix possible warning on checking all pages used in ftrace_process_locs() ring-buffer: Fix deadloop issue on reading trace_pipe tracing: arm64: Avoid missing-prototype warnings selftests/user_events: Test struct size match cases tracing/user_events: Fix struct arg size match check x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret() arm64: ftrace: Add direct call trampoline samples support samples: ftrace: Save required argument registers in sample trampolines
2023-07-14Revert "tracing: Add "(fault)" name injection to kernel probes"Masami Hiramatsu (Google)
This reverts commit 2e9906f84fc7c99388bb7123ade167250d50f1c0. It was turned out that commit 2e9906f84fc7 ("tracing: Add "(fault)" name injection to kernel probes") did not work correctly and probe events still show just '(fault)' (instead of '"(fault)"'). Also, current '(fault)' is more explicit that it faulted. This also moves FAULT_STRING macro to trace.h so that synthetic event can keep using it, and uses it in trace_probe.c too. Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/ Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/ Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix to update dynamic data counter if fetcharg uses itMasami Hiramatsu (Google)
Fix to update dynamic data counter ('dyndata') and max length ('maxlen') only if the fetcharg uses the dynamic data. Also get out arg->dynamic from unlikely(). This makes dynamic data address wrong if process_fetch_insn() returns error on !arg->dynamic case. Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/ Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix not to count error code to total lengthMasami Hiramatsu (Google)
Fix not to count the error code (which is minus value) to the total used length of array, because it can mess up the return code of process_fetch_insn_bottom(). Also clear the 'ret' value because it will be used for calculating next data_loc entry. Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 9b960a38835f ("tracing: probeevent: Unify fetch_insn processing common part") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix to avoid double count of the string length on the arrayMasami Hiramatsu (Google)
If an array is specified with the ustring or symstr, the length of the strings are accumlated on both of 'ret' and 'total', which means the length is double counted. Just set the length to the 'ret' value for avoiding double counting. Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 88903c464321 ("tracing/probe: Add ustring type for user-space string") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is runningMasami Hiramatsu (Google)
Add a comment the reason why fprobe_kprobe_handler() exits if any other kprobe is running. Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230706120916.3c6abf15@gandalf.local.home/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13tracing: Fix memory leak of iter->temp when reading trace_pipeZheng Yejian
kmemleak reports: unreferenced object 0xffff88814d14e200 (size 256): comm "cat", pid 336, jiffies 4294871818 (age 779.490s) hex dump (first 32 bytes): 04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00 ................ 0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff .........Z...... backtrace: [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140 [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0 [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0 [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950 [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0 [<ffffffff9bf03a43>] vfs_read+0x143/0x520 [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160 [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90 [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 when reading file 'trace_pipe', 'iter->temp' is allocated or relocated in trace_find_next_entry() but not freed before 'trace_pipe' is closed. To fix it, free 'iter->temp' in tracing_release_pipe(). Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>