summaryrefslogtreecommitdiff
path: root/kernel/trace
AgeCommit message (Collapse)Author
2025-02-17tracing/osnoise: Fix resetting of tracepointsSteven Rostedt
commit e3ff4245928f948f3eb2e852aa350b870421c358 upstream. If a timerlat tracer is started with the osnoise option OSNOISE_WORKLOAD disabled, but then that option is enabled and timerlat is removed, the tracepoints that were enabled on timerlat registration do not get disabled. If the option is disabled again and timelat is started, then it triggers a warning in the tracepoint code due to registering the tracepoint again without ever disabling it. Do not use the same user space defined options to know to disable the tracepoints when timerlat is removed. Instead, set a global flag when it is enabled and use that flag to know to disable the events. ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options ~# echo timerlat > /sys/kernel/tracing/current_tracer ~# echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options ~# echo nop > /sys/kernel/tracing/current_tracer ~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options ~# echo timerlat > /sys/kernel/tracing/current_tracer Triggers: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 1337 at kernel/tracepoint.c:294 tracepoint_add_func+0x3b6/0x3f0 Modules linked in: CPU: 6 UID: 0 PID: 1337 Comm: rtla Not tainted 6.13.0-rc4-test-00018-ga867c441128e-dirty #73 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:tracepoint_add_func+0x3b6/0x3f0 Code: 48 8b 53 28 48 8b 73 20 4c 89 04 24 e8 23 59 11 00 4c 8b 04 24 e9 36 fe ff ff 0f 0b b8 ea ff ff ff 45 84 e4 0f 84 68 fe ff ff <0f> 0b e9 61 fe ff ff 48 8b 7b 18 48 85 ff 0f 84 4f ff ff ff 49 8b RSP: 0018:ffffb9b003a87ca0 EFLAGS: 00010202 RAX: 00000000ffffffef RBX: ffffffff92f30860 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9bf59e91ccd0 RDI: ffffffff913b6410 RBP: 000000000000000a R08: 00000000000005c7 R09: 0000000000000002 R10: ffffb9b003a87ce0 R11: 0000000000000002 R12: 0000000000000001 R13: ffffb9b003a87ce0 R14: ffffffffffffffef R15: 0000000000000008 FS: 00007fce81209240(0000) GS:ffff9bf6fdd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055e99b728000 CR3: 00000001277c0002 CR4: 0000000000172ef0 Call Trace: <TASK> ? __warn.cold+0xb7/0x14d ? tracepoint_add_func+0x3b6/0x3f0 ? report_bug+0xea/0x170 ? handle_bug+0x58/0x90 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? __pfx_trace_sched_migrate_callback+0x10/0x10 ? tracepoint_add_func+0x3b6/0x3f0 ? __pfx_trace_sched_migrate_callback+0x10/0x10 ? __pfx_trace_sched_migrate_callback+0x10/0x10 tracepoint_probe_register+0x78/0xb0 ? __pfx_trace_sched_migrate_callback+0x10/0x10 osnoise_workload_start+0x2b5/0x370 timerlat_tracer_init+0x76/0x1b0 tracing_set_tracer+0x244/0x400 tracing_set_trace_write+0xa0/0xe0 vfs_write+0xfc/0x570 ? do_sys_openat2+0x9c/0xe0 ksys_write+0x72/0xf0 do_syscall_64+0x79/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Gabriele Monaco <gmonaco@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250123204159.4450c88e@gandalf.local.home Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08bpf: Send signals asynchronously if !preemptiblePuranjay Mohan
[ Upstream commit 87c544108b612512b254c8f79aa5c0a8546e2cc4 ] BPF programs can execute in all kinds of contexts and when a program running in a non-preemptible context uses the bpf_send_signal() kfunc, it will cause issues because this kfunc can sleep. Change `irqs_disabled()` to `!preemptible()`. Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/ Fixes: 1bc7896e9ef4 ("bpf: Fix deadlock with rq_lock in bpf_send_signal()") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250115103647.38487-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08perf/core: Save raw sample data conditionally based on sample typeYabin Cui
[ Upstream commit b9c44b91476b67327a521568a854babecc4070ab ] Currently, space for raw sample data is always allocated within sample records for both BPF output and tracepoint events. This leads to unused space in sample records when raw sample data is not requested. This patch enforces checking sample type of an event in perf_sample_save_raw_data(). So raw sample data will only be saved if explicitly requested, reducing overhead when it is not needed. Fixes: 0a9081cf0a11 ("perf/core: Add perf_sample_save_raw_data() helper") Signed-off-by: Yabin Cui <yabinc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240515193610.2350456-2-yabinc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09tracing: Check "%s" dereference via the field and not the TP_printk formatSteven Rostedt
[ Upstream commit afd2627f727b89496d79a6b934a025fc916d4ded ] The TP_printk() portion of a trace event is executed at the time a event is read from the trace. This can happen seconds, minutes, hours, days, months, years possibly later since the event was recorded. If the print format contains a dereference to a string via "%s", and that string was allocated, there's a chance that string could be freed before it is read by the trace file. To protect against such bugs, there are two functions that verify the event. The first one is test_event_printk(), which is called when the event is created. It reads the TP_printk() format as well as its arguments to make sure nothing may be dereferencing a pointer that was not copied into the ring buffer along with the event. If it is, it will trigger a WARN_ON(). For strings that use "%s", it is not so easy. The string may not reside in the ring buffer but may still be valid. Strings that are static and part of the kernel proper which will not be freed for the life of the running system, are safe to dereference. But to know if it is a pointer to a static string or to something on the heap can not be determined until the event is triggered. This brings us to the second function that tests for the bad dereferencing of strings, trace_check_vprintf(). It would walk through the printf format looking for "%s", and when it finds it, it would validate that the pointer is safe to read. If not, it would produces a WARN_ON() as well and write into the ring buffer "[UNSAFE-MEMORY]". The problem with this is how it used va_list to have vsnprintf() handle all the cases that it didn't need to check. Instead of re-implementing vsnprintf(), it would make a copy of the format up to the %s part, and call vsnprintf() with the current va_list ap variable, where the ap would then be ready to point at the string in question. For architectures that passed va_list by reference this was possible. For architectures that passed it by copy it was not. A test_can_verify() function was used to differentiate between the two, and if it wasn't possible, it would disable it. Even for architectures where this was feasible, it was a stretch to rely on such a method that is undocumented, and could cause issues later on with new optimizations of the compiler. Instead, the first function test_event_printk() was updated to look at "%s" as well. If the "%s" argument is a pointer outside the event in the ring buffer, it would find the field type of the event that is the problem and mark the structure with a new flag called "needs_test". The event itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that this event has a field that needs to be verified before the event can be printed using the printf format. When the event fields are created from the field type structure, the fields would copy the field type's "needs_test" value. Finally, before being printed, a new function ignore_event() is called which will check if the event has the TEST_STR flag set (if not, it returns false). If the flag is set, it then iterates through the events fields looking for the ones that have the "needs_test" flag set. Then it uses the offset field from the field structure to find the pointer in the ring buffer event. It runs the tests to make sure that pointer is safe to print and if not, it triggers the WARN_ON() and also adds to the trace output that the event in question has an unsafe memory access. The ignore_event() makes the trace_check_vprintf() obsolete so it is removed. Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09tracing: Fix trace_check_vprintf() when tp_printk is usedSteven Rostedt
[ Upstream commit 50a3242d84ee1625b0bfef29b95f935958dccfbe ] When the tp_printk kernel command line is used, the trace events go directly to printk(). It is still checked via the trace_check_vprintf() function to make sure the pointers of the trace event are legit. The addition of reading buffers from previous boots required adding a delta between the addresses of the previous boot and the current boot so that the pointers in the old buffer can still be used. But this required adding a trace_array pointer to acquire the delta offsets. The tp_printk code does not provide a trace_array (tr) pointer, so when the offsets were examined, a NULL pointer dereference happened and the kernel crashed. If the trace_array does not exist, just default the delta offsets to zero, as that also means the trace event is not being read from a previous boot. Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions") Reported-by: Alison Schofield <alison.schofield@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: afd2627f727b ("tracing: Check "%s" dereference via the field and not the TP_printk format") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09tracing: Handle old buffer mappings for event strings and functionsSteven Rostedt (Google)
[ Upstream commit 07714b4bb3f9800261c8b4b2f47e9010ed60979d ] Use the saved text_delta and data_delta of a persistent memory mapped ring buffer that was saved from a previous boot, and use the delta in the trace event print output so that strings and functions show up normally. That is, for an event like trace_kmalloc() that prints the callsite via "%pS", if it used the address saved in the ring buffer it will not match the function that was saved in the previous boot if the kernel remaps itself between boots. For RCU events that point to saved static strings where only the address of the string is saved in the ring buffer, it too will be adjusted to point to where the string is on the current boot. Link: https://lkml.kernel.org/r/20240612232026.821020753@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Youssef Esmat <youssefesmat@google.com> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: afd2627f727b ("tracing: Check "%s" dereference via the field and not the TP_printk format") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()Kees Cook
[ Upstream commit dcc4e5728eeaeda84878ca0018758cff1abfca21 ] Solve two ergonomic issues with struct seq_buf; 1) Too much boilerplate is required to initialize: struct seq_buf s; char buf[32]; seq_buf_init(s, buf, sizeof(buf)); Instead, we can build this directly on the stack. Provide DECLARE_SEQ_BUF() macro to do this: DECLARE_SEQ_BUF(s, 32); 2) %NUL termination is fragile and requires 2 steps to get a valid C String (and is a layering violation exposing the "internals" of seq_buf): seq_buf_terminate(s); do_something(s->buffer); Instead, we can just return s->buffer directly after terminating it in the refactored seq_buf_terminate(), now known as seq_buf_str(): do_something(seq_buf_str(s)); Link: https://lore.kernel.org/linux-trace-kernel/20231027155634.make.260-kees@kernel.org Link: https://lore.kernel.org/linux-trace-kernel/20231026194033.it.702-kees@kernel.org/ Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Justin Stitt <justinstitt@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yun Zhou <yun.zhou@windriver.com> Cc: Jacob Keller <jacob.e.keller@intel.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: afd2627f727b ("tracing: Check "%s" dereference via the field and not the TP_printk format") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09tracing: Move readpos from seq_buf to trace_seqMatthew Wilcox (Oracle)
[ Upstream commit d0ed46b60396cfa7e0056f55e1ce0b43c7db57b6 ] To make seq_buf more lightweight as a string buf, move the readpos member from seq_buf to its container, trace_seq. That puts the responsibility of maintaining the readpos entirely in the tracing code. If some future users want to package up the readpos with a seq_buf, we can define a new struct then. Link: https://lore.kernel.org/linux-trace-kernel/20231020033545.2587554-2-willy@infradead.org Cc: Kees Cook <keescook@chromium.org> Cc: Justin Stitt <justinstitt@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: afd2627f727b ("tracing: Check "%s" dereference via the field and not the TP_printk format") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-09tracing: Have process_string() also allow arraysSteven Rostedt
commit afc6717628f959941d7b33728570568b4af1c4b8 upstream. In order to catch a common bug where a TRACE_EVENT() TP_fast_assign() assigns an address of an allocated string to the ring buffer and then references it in TP_printk(), which can be executed hours later when the string is free, the function test_event_printk() runs on all events as they are registered to make sure there's no unwanted dereferencing. It calls process_string() to handle cases in TP_printk() format that has "%s". It returns whether or not the string is safe. But it can have some false positives. For instance, xe_bo_move() has: TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s", __entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size, xe_mem_type_to_name[__entry->old_placement], xe_mem_type_to_name[__entry->new_placement], __get_str(device_id)) Where the "%s" references into xe_mem_type_to_name[]. This is an array of pointers that should be safe for the event to access. Instead of flagging this as a bad reference, if a reference points to an array, where the record field is the index, consider it safe. Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home Fixes: 65a25d9f7ac02 ("tracing: Add "%s" check in test_event_printk()") Reported-by: Genes Lists <lists@sapience.com> Tested-by: Gene C <arch@sapience.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-02tracing: Prevent bad count for tracing_cpumask_writeLizhi Xu
commit 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0 upstream. If a large count is provided, it will trigger a warning in bitmap_parse_user. Also check zero for it. Cc: stable@vger.kernel.org Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions") Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-02tracing/kprobe: Make trace_kprobe's module callback called after jump_label ↵Masami Hiramatsu (Google)
update [ Upstream commit d685d55dfc86b1a4bdcec77c3c1f8a83f181264e ] Make sure the trace_kprobe's module notifer callback function is called after jump_label's callback is called. Since the trace_kprobe's callback eventually checks jump_label address during registering new kprobe on the loading module, jump_label must be updated before this registration happens. Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/ Fixes: 614243181050 ("tracing/kprobes: Support module init function probing") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-27tracing: Add "%s" check in test_event_printk()Steven Rostedt
commit 65a25d9f7ac02e0cf361356e834d1c71d36acca9 upstream. The test_event_printk() code makes sure that when a trace event is registered, any dereferenced pointers in from the event's TP_printk() are pointing to content in the ring buffer. But currently it does not handle "%s", as there's cases where the string pointer saved in the ring buffer points to a static string in the kernel that will never be freed. As that is a valid case, the pointer needs to be checked at runtime. Currently the runtime check is done via trace_check_vprintf(), but to not have to replicate everything in vsnprintf() it does some logic with the va_list that may not be reliable across architectures. In order to get rid of that logic, more work in the test_event_printk() needs to be done. Some of the strings can be validated at this time when it is obvious the string is valid because the string will be saved in the ring buffer content. Do all the validation of strings in the ring buffer at boot in test_event_printk(), and make sure that the field of the strings that point into the kernel are accessible. This will allow adding checks at runtime that will validate the fields themselves and not rely on paring the TP_printk() format at runtime. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.685917008@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27tracing: Add missing helper functions in event pointer dereference checkSteven Rostedt
commit 917110481f6bc1c96b1e54b62bb114137fbc6d17 upstream. The process_pointer() helper function looks to see if various trace event macros are used. These macros are for storing data in the event. This makes it safe to dereference as the dereference will then point into the event on the ring buffer where the content of the data stays with the event itself. A few helper functions were missing. Those were: __get_rel_dynamic_array() __get_dynamic_array_len() __get_rel_dynamic_array_len() __get_rel_sockaddr() Also add a helper function find_print_string() to not need to use a middle man variable to test if the string exists. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.521836792@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27tracing: Fix test_event_printk() to process entire print argumentSteven Rostedt
commit a6629626c584200daf495cc9a740048b455addcd upstream. The test_event_printk() analyzes print formats of trace events looking for cases where it may dereference a pointer that is not in the ring buffer which can possibly be a bug when the trace event is read from the ring buffer and the content of that pointer no longer exists. The function needs to accurately go from one print format argument to the next. It handles quotes and parenthesis that may be included in an argument. When it finds the start of the next argument, it uses a simple "c = strstr(fmt + i, ',')" to find the end of that argument! In order to include "%s" dereferencing, it needs to process the entire content of the print format argument and not just the content of the first ',' it finds. As there may be content like: ({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char *access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux" }; union kvm_mmu_page_role role; role.word = REC->role; trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe %sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level, role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "", access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? "" : "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ? "unsync" : "sync", 0); saved_ptr; }) Which is an example of a full argument of an existing event. As the code already handles finding the next print format argument, process the argument at the end of it and not the start of it. This way it has both the start of the argument as well as the end of it. Add a helper function "process_pointer()" that will do the processing during the loop as well as at the end. It also makes the code cleaner and easier to read. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.362271189@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-19tracing/kprobes: Skip symbol counting logic for module symbols in ↵Nikolay Kuratov
create_local_trace_kprobe() commit b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") avoids checking number_of_same_symbols() for module symbol in __trace_kprobe_create(), but create_local_trace_kprobe() should avoid this check too. Doing this check leads to ENOENT for module_name:symbol_name constructions passed over perf_event_open. No bug in newer kernels as it was fixed more generally by commit 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads") Link: https://lore.kernel.org/linux-trace-kernel/20240705161030.b3ddb33a8167013b9b1da202@kernel.org Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-19bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_progJiri Olsa
commit 978c4486cca5c7b9253d3ab98a88c8e769cb9bbd upstream. Syzbot reported [1] crash that happens for following tracing scenario: - create tracepoint perf event with attr.inherit=1, attach it to the process and set bpf program to it - attached process forks -> chid creates inherited event the new child event shares the parent's bpf program and tp_event (hence prog_array) which is global for tracepoint - exit both process and its child -> release both events - first perf_event_detach_bpf_prog call will release tp_event->prog_array and second perf_event_detach_bpf_prog will crash, because tp_event->prog_array is NULL The fix makes sure the perf_event_detach_bpf_prog checks prog_array is valid before it tries to remove the bpf program from it. [1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc95e896a7ef9ff93b0b8ad Fixes: 0ee288e69d03 ("bpf,perf: Fix perf_event_detach_bpf_prog error handling") Reported-by: syzbot+2e0d2840414ce817aaac@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-19bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()Jann Horn
commit 7d0d673627e20cfa3b21a829a896ce03b58a4f1c upstream. Currently, the pointer stored in call->prog_array is loaded in __uprobe_perf_func(), with no RCU annotation and no immediately visible RCU protection, so it looks as if the loaded pointer can immediately be dangling. Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical section, but this is too late. It then uses rcu_dereference_check(), but this use of rcu_dereference_check() does not actually dereference anything. Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller provide rcu_read_lock_trace() protection and then load call->prog_array with rcu_dereference_check(). This issue seems to be theoretical: I don't know of any way to reach this code without having handle_swbp() further up the stack, which is already holding a rcu_read_lock_trace() lock, so where we take rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe() doesn't actually have any effect. Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241210-bpf-fix-uprobe-uaf-v4-1-5fc8959b2b74@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-19bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavorsJann Horn
commit ef1b808e3b7c98612feceedf985c2fbbeb28f956 upstream. Uprobes always use bpf_prog_run_array_uprobe() under tasks-trace-RCU protection. But it is possible to attach a non-sleepable BPF program to a uprobe, and non-sleepable BPF programs are freed via normal RCU (see __bpf_prog_put_noref()). This leads to UAF of the bpf_prog because a normal RCU grace period does not imply a tasks-trace-RCU grace period. Fix it by explicitly waiting for a tasks-trace-RCU grace period after removing the attachment of a bpf_prog to a perf_event. Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20241210-bpf-fix-actual-uprobe-uaf-v1-1-19439849dd44@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14tracing/eprobe: Fix to release eprobe when failed to add dyn_eventMasami Hiramatsu (Google)
[ Upstream commit 494b332064c0ce2f7392fa92632bc50191c1b517 ] Fix eprobe event to unregister event call and release eprobe when it fails to add dynamic event correctly. Link: https://lore.kernel.org/all/173289886698.73724.1959899350183686006.stgit@devnote2/ Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14tracing: Use atomic64_inc_return() in trace_clock_counter()Uros Bizjak
[ Upstream commit eb887c4567d1b0e7684c026fe7df44afa96589e6 ] Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241007085651.48544-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14tracing/ftrace: disable preemption in syscall probeMathieu Desnoyers
[ Upstream commit 13d750c2c03e9861e15268574ed2c239cca9c9d5 ] In preparation for allowing system call enter/exit instrumentation to handle page faults, make sure that ftrace can handle this change by explicitly disabling preemption within the ftrace system call tracepoint probes to respect the current expectations within ftrace ring buffer code. This change does not yet allow ftrace to take page faults per se within its probe, but allows its existing probes to adapt to the upcoming change. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-3-mathieu.desnoyers@efficios.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14tracing: Fix cmp_entries_dup() to respect sort() comparison rulesKuan-Wei Chiu
commit e63fbd5f6810ed756bbb8a1549c7d4132968baa9 upstream. The cmp_entries_dup() function used as the comparator for sort() violated the symmetry and transitivity properties required by the sorting algorithm. Specifically, it returned 1 whenever memcmp() was non-zero, which broke the following expectations: * Symmetry: If x < y, then y > x. * Transitivity: If x < y and y < z, then x < z. These violations could lead to incorrect sorting and failure to correctly identify duplicate elements. Fix the issue by directly returning the result of memcmp(), which adheres to the required comparison properties. Cc: stable@vger.kernel.org Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map") Link: https://lore.kernel.org/20241203202228.1274403-1-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-09ftrace: Fix regression with module command in stack_trace_filterguoweikang
commit 45af52e7d3b8560f21d139b3759735eead8b1653 upstream. When executing the following command: # echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter The current mod command causes a null pointer dereference. While commit 0f17976568b3f ("ftrace: Fix regression with module command in stack_trace_filter") has addressed part of the issue, it left a corner case unhandled, which still results in a kernel crash. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241120052750.275463-1-guoweikang.kernel@gmail.com Fixes: 04ec7bb642b77 ("tracing: Have the trace_array hold the list of registered func probes"); Signed-off-by: guoweikang <guoweikang.kernel@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-09trace/trace_event_perf: remove duplicate samples on the first tracepoint eventLevi Yun
[ Upstream commit afe5960dc208fe069ddaaeb0994d857b24ac19d1 ] When a tracepoint event is created with attr.freq = 1, 'hwc->period_left' is not initialized correctly. As a result, in the perf_swevent_overflow() function, when the first time the event occurs, it calculates the event overflow and the perf_swevent_set_period() returns 3, this leads to the event are recorded for three duplicate times. Step to reproduce: 1. Enable the tracepoint event & starting tracing $ echo 1 > /sys/kernel/tracing/events/module/module_free $ echo 1 > /sys/kernel/tracing/tracing_on 2. Record with perf $ perf record -a --strict-freq -F 1 -e "module:module_free" 3. Trigger module_free event. $ modprobe -i sunrpc $ modprobe -r sunrpc Result: - Trace pipe result: $ cat trace_pipe modprobe-174509 [003] ..... 6504.868896: module_free: sunrpc - perf sample: modprobe 174509 [003] 6504.868980: module:module_free: sunrpc modprobe 174509 [003] 6504.868980: module:module_free: sunrpc modprobe 174509 [003] 6504.868980: module:module_free: sunrpc By setting period_left via perf_swevent_set_period() as other sw_event did, This problem could be solved. After patch: - Trace pipe result: $ cat trace_pipe modprobe 1153096 [068] 613468.867774: module:module_free: xfs - perf sample modprobe 1153096 [068] 613468.867794: module:module_free: xfs Link: https://lore.kernel.org/20240913021347.595330-1-yeoreum.yun@arm.com Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment") Signed-off-by: Levi Yun <yeoreum.yun@arm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09bpf: Force uprobe bpf program to always return 0Jiri Olsa
[ Upstream commit f505005bc7426f4309880da94cfbfc37efa225bd ] As suggested by Andrii make uprobe multi bpf programs to always return 0, so they can't force uprobe removal. Keeping the int return type for uprobe_prog_run, because it will be used in following session changes. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-3-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01tracing: probes: Fix to zero initialize a local variableMasami Hiramatsu (Google)
commit 0add699ad068d26e5b1da9ff28b15461fc4005df upstream. Fix to initialize 'val' local variable with zero. Dan reported that Smatch static code checker reports an error that a local 'val' variable needs to be initialized. Actually, the 'val' is expected to be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So initialize it with zero. Link: https://lore.kernel.org/all/171092223833.237219.17304490075697026697.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/ Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and fprobe)") Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-01bpf,perf: Fix perf_event_detach_bpf_prog error handlingJiri Olsa
[ Upstream commit 0ee288e69d033850bc87abe0f9cc3ada24763d7f ] Peter reported that perf_event_detach_bpf_prog might skip to release the bpf program for -ENOENT error from bpf_prog_array_copy. This can't happen because bpf program is stored in perf event and is detached and released only when perf event is freed. Let's drop the -ENOENT check and make sure the bpf program is released in any case. Fixes: 170a7e3ea070 ("bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not found") Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241023200352.3488610-1-jolsa@kernel.org Closes: https://lore.kernel.org/lkml/20241022111638.GC16066@noisy.programming.kicks-ass.net/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01bpf: Add MEM_WRITE attributeDaniel Borkmann
[ Upstream commit 6fad274f06f038c29660aa53fbad14241c9fd976 ] Add a MEM_WRITE attribute for BPF helper functions which can be used in bpf_func_proto to annotate an argument type in order to let the verifier know that the helper writes into the memory passed as an argument. In the past MEM_UNINIT has been (ab)used for this function, but the latter merely tells the verifier that the passed memory can be uninitialized. There have been bugs with overloading the latter but aside from that there are also cases where the passed memory is read + written which currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 8ea607330a39 ("bpf: Fix overloading of MEM_UNINIT's meaning") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01tracing: Consider the NULL character when validating the event lengthLeo Yan
[ Upstream commit 0b6e2e22cb23105fcb171ab92f0f7516c69c8471 ] strlen() returns a string length excluding the null byte. If the string length equals to the maximum buffer length, the buffer will have no space for the NULL terminating character. This commit checks this condition and returns failure for it. Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/ Fixes: dec65d79fd26 ("tracing/probe: Check event name length correctly") Signed-off-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01tracing/probes: Fix MAX_TRACE_ARGS limit handlingMikel Rychliski
[ Upstream commit 73f35080477e893aa6f4c8d388352b871b288fbc ] When creating a trace_probe we would set nr_args prior to truncating the arguments to MAX_TRACE_ARGS. However, we would only initialize arguments up to the limit. This caused invalid memory access when attempting to set up probes with more than 128 fetchargs. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:__set_print_fmt+0x134/0x330 Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return an error when there are too many arguments instead of silently truncating. Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/ Fixes: 035ba76014c0 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init") Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01uprobe: avoid out-of-bounds memory access of fetching argsQiao Ma
[ Upstream commit 373b9338c9722a368925d83bc622c596896b328e ] Uprobe needs to fetch args into a percpu buffer, and then copy to ring buffer to avoid non-atomic context problem. Sometimes user-space strings, arrays can be very large, but the size of percpu buffer is only page size. And store_trace_args() won't check whether these data exceeds a single page or not, caused out-of-bounds memory access. It could be reproduced by following steps: 1. build kernel with CONFIG_KASAN enabled 2. save follow program as test.c ``` \#include <stdio.h> \#include <stdlib.h> \#include <string.h> // If string length large than MAX_STRING_SIZE, the fetch_store_strlen() // will return 0, cause __get_data_size() return shorter size, and // store_trace_args() will not trigger out-of-bounds access. // So make string length less than 4096. \#define STRLEN 4093 void generate_string(char *str, int n) { int i; for (i = 0; i < n; ++i) { char c = i % 26 + 'a'; str[i] = c; } str[n-1] = '\0'; } void print_string(char *str) { printf("%s\n", str); } int main() { char tmp[STRLEN]; generate_string(tmp, STRLEN); print_string(tmp); return 0; } ``` 3. compile program `gcc -o test test.c` 4. get the offset of `print_string()` ``` objdump -t test | grep -w print_string 0000000000401199 g F .text 000000000000001b print_string ``` 5. configure uprobe with offset 0x1199 ``` off=0x1199 cd /sys/kernel/debug/tracing/ echo "p /root/test:${off} arg1=+0(%di):ustring arg2=\$comm arg3=+0(%di):ustring" > uprobe_events echo 1 > events/uprobes/enable echo 1 > tracing_on ``` 6. run `test`, and kasan will report error. ================================================================== BUG: KASAN: use-after-free in strncpy_from_user+0x1d6/0x1f0 Write of size 8 at addr ffff88812311c004 by task test/499CPU: 0 UID: 0 PID: 499 Comm: test Not tainted 6.12.0-rc3+ #18 Hardware name: Red Hat KVM, BIOS 1.16.0-4.al8 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x55/0x70 print_address_description.constprop.0+0x27/0x310 kasan_report+0x10f/0x120 ? strncpy_from_user+0x1d6/0x1f0 strncpy_from_user+0x1d6/0x1f0 ? rmqueue.constprop.0+0x70d/0x2ad0 process_fetch_insn+0xb26/0x1470 ? __pfx_process_fetch_insn+0x10/0x10 ? _raw_spin_lock+0x85/0xe0 ? __pfx__raw_spin_lock+0x10/0x10 ? __pte_offset_map+0x1f/0x2d0 ? unwind_next_frame+0xc5f/0x1f80 ? arch_stack_walk+0x68/0xf0 ? is_bpf_text_address+0x23/0x30 ? kernel_text_address.part.0+0xbb/0xd0 ? __kernel_text_address+0x66/0xb0 ? unwind_get_return_address+0x5e/0xa0 ? __pfx_stack_trace_consume_entry+0x10/0x10 ? arch_stack_walk+0xa2/0xf0 ? _raw_spin_lock_irqsave+0x8b/0xf0 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? depot_alloc_stack+0x4c/0x1f0 ? _raw_spin_unlock_irqrestore+0xe/0x30 ? stack_depot_save_flags+0x35d/0x4f0 ? kasan_save_stack+0x34/0x50 ? kasan_save_stack+0x24/0x50 ? mutex_lock+0x91/0xe0 ? __pfx_mutex_lock+0x10/0x10 prepare_uprobe_buffer.part.0+0x2cd/0x500 uprobe_dispatcher+0x2c3/0x6a0 ? __pfx_uprobe_dispatcher+0x10/0x10 ? __kasan_slab_alloc+0x4d/0x90 handler_chain+0xdd/0x3e0 handle_swbp+0x26e/0x3d0 ? __pfx_handle_swbp+0x10/0x10 ? uprobe_pre_sstep_notifier+0x151/0x1b0 irqentry_exit_to_user_mode+0xe2/0x1b0 asm_exc_int3+0x39/0x40 RIP: 0033:0x401199 Code: 01 c2 0f b6 45 fb 88 02 83 45 fc 01 8b 45 fc 3b 45 e4 7c b7 8b 45 e4 48 98 48 8d 50 ff 48 8b 45 e8 48 01 d0 ce RSP: 002b:00007ffdf00576a8 EFLAGS: 00000206 RAX: 00007ffdf00576b0 RBX: 0000000000000000 RCX: 0000000000000ff2 RDX: 0000000000000ffc RSI: 0000000000000ffd RDI: 00007ffdf00576b0 RBP: 00007ffdf00586b0 R08: 00007feb2f9c0d20 R09: 00007feb2f9c0d20 R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000401040 R13: 00007ffdf0058780 R14: 0000000000000000 R15: 0000000000000000 </TASK> This commit enforces the buffer's maxlen less than a page-size to avoid store_trace_args() out-of-memory access. Link: https://lore.kernel.org/all/20241015060148.1108331-1-mqaio@linux.alibaba.com/ Fixes: dcad1a204f72 ("tracing/uprobes: Fetch args before reserving a ring buffer") Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01uprobes: prevent mutex_lock() under rcu_read_lock()Andrii Nakryiko
[ Upstream commit 699646734ab51bf5b1cd4a7a30c20074f6e74f6e ] Recent changes made uprobe_cpu_buffer preparation lazy, and moved it deeper into __uprobe_trace_func(). This is problematic because __uprobe_trace_func() is called inside rcu_read_lock()/rcu_read_unlock() block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() -> mutex_lock(&ucb->mutex), leading to a splat about using mutex under non-sleepable RCU: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 ... Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 __might_resched+0x24c/0x270 ? prepare_uprobe_buffer+0xd5/0x1d0 __mutex_lock+0x41/0x820 ? ___perf_sw_event+0x206/0x290 ? __perf_event_task_sched_in+0x54/0x660 ? __perf_event_task_sched_in+0x54/0x660 prepare_uprobe_buffer+0xd5/0x1d0 __uprobe_trace_func+0x4a/0x140 uprobe_dispatcher+0x135/0x280 ? uprobe_dispatcher+0x94/0x280 uprobe_notify_resume+0x650/0xec0 ? atomic_notifier_call_chain+0x21/0x110 ? atomic_notifier_call_chain+0xf8/0x110 irqentry_exit_to_user_mode+0xe2/0x1e0 asm_exc_int3+0x35/0x40 RIP: 0033:0x7f7e1d4da390 Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00 <cc> 0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e RSP: 002b:00007ffd2abc3608 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000076d325f1 RCX: 0000000000000000 RDX: 0000000076d325f1 RSI: 000000000000000a RDI: 00007ffd2abc3690 RBP: 000000000000000a R08: 00017fb700000000 R09: 00017fb700000000 R10: 00017fb700000000 R11: 0000000000000246 R12: 0000000000017ff2 R13: 00007ffd2abc3610 R14: 0000000000000000 R15: 00007ffd2abc3780 </TASK> Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside of RCU locked section. This still keeps this buffer preparation lazy and helps avoid the overhead when it's not needed. E.g., if there is only BPF uprobe handler installed on a given uprobe, buffer won't be initialized. Note, the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not affected, as it doesn't prepare buffer under RCU read lock. Link: https://lore.kernel.org/all/20240521053017.3708530-1-andrii@kernel.org/ Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 373b9338c972 ("uprobe: avoid out-of-bounds memory access of fetching args") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01uprobes: prepare uprobe args buffer lazilyAndrii Nakryiko
[ Upstream commit 1b8f85defbc82e2eb8f27c5f6060ea507ad4d5a3 ] uprobe_cpu_buffer and corresponding logic to store uprobe args into it are used for uprobes/uretprobes that are created through tracefs or perf events. BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't need uprobe_cpu_buffer and associated data. For BPF-only use cases this buffer handling and preparation is a pure overhead. At the same time, BPF-only uprobe/uretprobe usage is very common in practice. Also, for a lot of cases applications are very senstivie to performance overheads, as they might be tracing a very high frequency functions like malloc()/free(), so every bit of performance improvement matters. All that is to say that this uprobe_cpu_buffer preparation is an unnecessary overhead that each BPF user of uprobes/uretprobe has to pay. This patch is changing this by making uprobe_cpu_buffer preparation optional. It will happen only if either tracefs-based or perf event-based uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For BPF-only use cases this step will be skipped. We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0]) to estimate the improvements. We have 3 uprobe and 3 uretprobe scenarios, which vary an instruction that is replaced by uprobe: nop (fastest uprobe case), `push rbp` (typical case), and non-simulated `ret` instruction (slowest case). Benchmark thread is constantly calling user space function in a tight loop. User space function has attached BPF uprobe or uretprobe program doing nothing but atomic counter increments to count number of triggering calls. Benchmark emits throughput in millions of executions per second. BEFORE these changes ==================== uprobe-nop : 2.657 ± 0.024M/s uprobe-push : 2.499 ± 0.018M/s uprobe-ret : 1.100 ± 0.006M/s uretprobe-nop : 1.356 ± 0.004M/s uretprobe-push : 1.317 ± 0.019M/s uretprobe-ret : 0.785 ± 0.007M/s AFTER these changes =================== uprobe-nop : 2.732 ± 0.022M/s (+2.8%) uprobe-push : 2.621 ± 0.016M/s (+4.9%) uprobe-ret : 1.105 ± 0.007M/s (+0.5%) uretprobe-nop : 1.396 ± 0.007M/s (+2.9%) uretprobe-push : 1.347 ± 0.008M/s (+2.3%) uretprobe-ret : 0.800 ± 0.006M/s (+1.9) So the improvements on this particular machine seems to be between 2% and 5%. [0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/all/20240318181728.2795838-3-andrii@kernel.org/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 373b9338c972 ("uprobe: avoid out-of-bounds memory access of fetching args") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01uprobes: encapsulate preparation of uprobe args bufferAndrii Nakryiko
[ Upstream commit 3eaea21b4d27cff0017c20549aeb53034c58fc23 ] Move the logic of fetching temporary per-CPU uprobe buffer and storing uprobes args into it to a new helper function. Store data size as part of this buffer, simplifying interfaces a bit, as now we only pass single uprobe_cpu_buffer reference around, instead of pointer + dsize. This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher, and now will be centralized. All this is also in preparation to make this uprobe_cpu_buffer handling logic optional in the next patch. Link: https://lore.kernel.org/all/20240318181728.2795838-2-andrii@kernel.org/ [Masami: update for v6.9-rc3 kernel] Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 373b9338c972 ("uprobe: avoid out-of-bounds memory access of fetching args") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01tracing/probes: Support $argN in return probe (kprobe and fprobe)Masami Hiramatsu (Google)
[ Upstream commit 25f00e40ce7953db197af3a59233711d154c9d80 ] Support accessing $argN in the return probe events. This will help users to record entry data in function return (exit) event for simplfing the function entry/exit information in one event, and record the result values (e.g. allocated object/initialized object) at function exit. For example, if we have a function `int init_foo(struct foo *obj, int param)` sometimes we want to check how `obj` is initialized. In such case, we can define a new return event like below; # echo 'r init_foo retval=$retval param=$arg2 field1=+0($arg1)' >> kprobe_events Thus it records the function parameter `param` and its result `obj->field1` (the dereference will be done in the function exit timing) value at once. This also support fprobe, BTF args and'$arg*'. So if CONFIG_DEBUG_INFO_BTF is enabled, we can trace both function parameters and the return value by following command. # echo 'f target_function%return $arg* $retval' >> dynamic_events Link: https://lore.kernel.org/all/170952365552.229804.224112990211602895.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 373b9338c972 ("uprobe: avoid out-of-bounds memory access of fetching args") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_initMasami Hiramatsu (Google)
[ Upstream commit 035ba76014c096316fa809a46ce0a1b9af1cde0d ] Instead of incrementing the trace_probe::nr_args, init it at trace_probe_init(). Without this change, there is no way to get the number of trace_probe arguments while parsing it. This is a cleanup, so the behavior is not changed. Link: https://lore.kernel.org/all/170952363585.229804.13060759900346411951.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 373b9338c972 ("uprobe: avoid out-of-bounds memory access of fetching args") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe eventMasami Hiramatsu (Google)
[ Upstream commit 7e37b6bc3cc096e24709908076807bb9c3cf0d38 ] Despite the fprobe event, "Kretprobe" was commented. So fix it. Link: https://lore.kernel.org/all/170952361630.229804.10832200172327797860.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 373b9338c972 ("uprobe: avoid out-of-bounds memory access of fetching args") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01bpf: Add missed value to kprobe perf link infoJiri Olsa
[ Upstream commit 3acf8ace68230e9558cf916847f1cc9f208abdf1 ] Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org Stable-dep-of: 4deecdd29cf2 ("bpf: fix unpopulated name_len field in perf_event link info") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-17tracing: Have saved_cmdlines arrays all in one allocationSteven Rostedt (Google)
[ Upstream commit 0b18c852cc6fb8284ac0ab97e3e840974a6a8a64 ] The saved_cmdlines have three arrays for mapping PIDs to COMMs: - map_pid_to_cmdline[] - map_cmdline_to_pid[] - saved_cmdlines The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index into the other arrays. The map_cmdline_to_pid[] is a mapping back to the full pid as it can be larger than PID_MAX_DEFAULT. And the saved_cmdlines[] just holds the COMMs associated to the pids. Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated together (in reality the saved_cmdlines is just in the memory of the rounding of the allocation of the structure as it is always allocated in powers of two). The map_cmdline_to_pid[] array is allocated separately. Since the rounding to a power of two is rather large (it allows for 8000 elements in saved_cmdlines), also include the map_cmdline_to_pid[] array. (This drops it to 6000 by default, which is still plenty for most use cases). This saves even more memory as the map_cmdline_to_pid[] array doesn't need to be allocated. Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-17tracing: Remove precision vsnprintf() check from print eventSteven Rostedt (Google)
[ Upstream commit 5efd3e2aef91d2d812290dcb25b2058e6f3f532c ] This reverts 60be76eeabb3d ("tracing: Add size check when printing trace_marker output"). The only reason the precision check was added was because of a bug that miscalculated the write size of the string into the ring buffer and it truncated it removing the terminating nul byte. On reading the trace it crashed the kernel. But this was due to the bug in the code that happened during development and should never happen in practice. If anything, the precision can hide bugs where the string in the ring buffer isn't nul terminated and it will not be checked. Link: https://lore.kernel.org/all/C7E7AF1A-D30F-4D18-B8E5-AF1EF58004F5@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279ac2@gandalf.local.home Link: https://lore.kernel.org/all/20240302111244.3a1674be@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240304174341.2a561d9f@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-10tracing/timerlat: Fix duplicated kthread creation due to CPU online/offlineWei Li
commit 0bb0a5c12ecf36ad561542bbb95f96355e036a02 upstream. osnoise_hotplug_workfn() is the asynchronous online callback for "trace/osnoise:online". It may be congested when a CPU goes online and offline repeatedly and is invoked for multiple times after a certain online. This will lead to kthread leak and timer corruption. Add a check in start_kthread() to prevent this situation. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-10tracing/timerlat: Fix a race during cpuhp processingWei Li
commit 829e0c9f0855f26b3ae830d17b24aec103f7e915 upstream. There is another found exception that the "timerlat/1" thread was scheduled on CPU0, and lead to timer corruption finally: ``` ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220 WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0 Modules linked in: CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:debug_print_object+0x7d/0xb0 ... Call Trace: <TASK> ? __warn+0x7c/0x110 ? debug_print_object+0x7d/0xb0 ? report_bug+0xf1/0x1d0 ? prb_read_valid+0x17/0x20 ? handle_bug+0x3f/0x70 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? debug_print_object+0x7d/0xb0 ? debug_print_object+0x7d/0xb0 ? __pfx_timerlat_irq+0x10/0x10 __debug_object_init+0x110/0x150 hrtimer_init+0x1d/0x60 timerlat_main+0xab/0x2d0 ? __pfx_timerlat_main+0x10/0x10 kthread+0xb7/0xe0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ``` After tracing the scheduling event, it was discovered that the migration of the "timerlat/1" thread was performed during thread creation. Further analysis confirmed that it is because the CPU online processing for osnoise is implemented through workers, which is asynchronous with the offline processing. When the worker was scheduled to create a thread, the CPU may has already been removed from the cpu_online_mask during the offline process, resulting in the inability to select the right CPU: T1 | T2 [CPUHP_ONLINE] | cpu_device_down() osnoise_hotplug_workfn() | | cpus_write_lock() | takedown_cpu(1) | cpus_write_unlock() [CPUHP_OFFLINE] | cpus_read_lock() | start_kthread(1) | cpus_read_unlock() | To fix this, skip online processing if the CPU is already offline. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-10tracing/timerlat: Drop interface_lock in stop_kthread()Wei Li
commit b484a02c9cedf8703eff8f0756f94618004bd165 upstream. stop_kthread() is the offline callback for "trace/osnoise:online", since commit 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()"), the following ABBA deadlock scenario is introduced: T1 | T2 [BP] | T3 [AP] osnoise_hotplug_workfn() | work_for_cpu_fn() | cpuhp_thread_fun() | _cpu_down() | osnoise_cpu_die() mutex_lock(&interface_lock) | | stop_kthread() | cpus_write_lock() | mutex_lock(&interface_lock) cpus_read_lock() | cpuhp_kick_ap() | As the interface_lock here in just for protecting the "kthread" field of the osn_var, use xchg() instead to fix this issue. Also use for_each_online_cpu() back in stop_per_cpu_kthreads() as it can take cpu_read_lock() again. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240924094515.3561410-3-liwei391@huawei.com Fixes: 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-10tracing/hwlat: Fix a race during cpuhp processingWei Li
commit 2a13ca2e8abb12ee43ada8a107dadca83f140937 upstream. The cpuhp online/offline processing race also exists in percpu-mode hwlat tracer in theory, apply the fix too. That is: T1 | T2 [CPUHP_ONLINE] | cpu_device_down() hwlat_hotplug_workfn() | | cpus_write_lock() | takedown_cpu(1) | cpus_write_unlock() [CPUHP_OFFLINE] | cpus_read_lock() | start_kthread(1) | cpus_read_unlock() | Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations") Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-04bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()Oleg Nesterov
commit 5fe6e308abaea082c20fbf2aa5df8e14495622cf upstream. If bpf_link_prime() fails, bpf_uprobe_multi_link_attach() goes to the error_free label and frees the array of bpf_uprobe's without calling bpf_uprobe_unregister(). This leaks bpf_uprobe->uprobe and worse, this frees bpf_uprobe->consumer without removing it from the uprobe->consumers list. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Closes: https://lore.kernel.org/all/000000000000382d39061f59f2dd@google.com/ Reported-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240813152524.GA7292@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-04bpf: Fix helper writes to read-only mapsDaniel Borkmann
[ Upstream commit 32556ce93bc45c730829083cb60f95a2728ea48b ] Lonial found an issue that despite user- and BPF-side frozen BPF map (like in case of .rodata), it was still possible to write into it from a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT} as arguments. In check_func_arg() when the argument is as mentioned, the meta->raw_mode is never set. Later, check_helper_mem_access(), under the case of PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the subsequent call to check_map_access_type() and given the BPF map is read-only it succeeds. The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT when results are written into them as opposed to read out of them. The latter indicates that it's okay to pass a pointer to uninitialized memory as the memory is written to anyway. However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM just with additional alignment requirement. So it is better to just get rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the fixed size memory types. For this, add MEM_ALIGNED to additionally ensure alignment given these helpers write directly into the args via *<ptr> = val. The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>). MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated argument types, since in !MEM_FIXED_SIZE cases the verifier does not know the buffer size a priori and therefore cannot blindly write *<ptr> = val. Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-18tracing/osnoise: Fix build when timerlat is not enabledSteven Rostedt
commit af178143343028fdec9d5960a22d17f5587fd3f5 upstream. To fix some critical section races, the interface_lock was added to a few locations. One of those locations was above where the interface_lock was declared, so the declaration was moved up before that usage. Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER ifdef block. As the interface_lock is used outside that config, this broke the build when CONFIG_OSNOISE_TRACER was enabled but CONFIG_TIMERLAT_TRACER was not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Helena Anna" <helena.anna.dubel@intel.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home Fixes: e6a53481da29 ("tracing/timerlat: Only clear timer if a kthread exists") Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12tracing/timerlat: Add interface_lock around clearing of kthread in ↵Steven Rostedt
stop_kthread() commit 5bfbcd1ee57b607fd29e4645c7f350dd385dd9ad upstream. The timerlat interface will get and put the task that is part of the "kthread" field of the osn_var to keep it around until all references are released. But here's a race in the "stop_kthread()" code that will call put_task_struct() on the kthread if it is not a kernel thread. This can race with the releasing of the references to that task struct and the put_task_struct() can be called twice when it should have been called just once. Take the interface_lock() in stop_kthread() to synchronize this change. But to do so, the function stop_per_cpu_kthreads() needs to change the loop from for_each_online_cpu() to for_each_possible_cpu() and remove the cpu_read_lock(), as the interface_lock can not be taken while the cpu locks are held. The only side effect of this change is that it may do some extra work, as the per_cpu variables of the offline CPUs would not be set anyway, and would simply be skipped in the loop. Remove unneeded "return;" in stop_kthread(). Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12tracing: Avoid possible softlockup in tracing_iter_reset()Zheng Yejian
commit 49aa8a1f4d6800721c7971ed383078257f12e8f9 upstream. In __tracing_open(), when max latency tracers took place on the cpu, the time start of its buffer would be updated, then event entries with timestamps being earlier than start of the buffer would be skipped (see tracing_iter_reset()). Softlockup will occur if the kernel is non-preemptible and too many entries were skipped in the loop that reset every cpu buffer, so add cond_resched() to avoid it. Cc: stable@vger.kernel.org Fixes: 2f26ebd549b9a ("tracing: use timestamp to determine start of latency traces") Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12tracing/timerlat: Only clear timer if a kthread existsSteven Rostedt
commit e6a53481da292d970d1edf0d8831121d1c5e2f0d upstream. The timerlat tracer can use user space threads to check for osnoise and timer latency. If the program using this is killed via a SIGTERM, the threads are shutdown one at a time and another tracing instance can start up resetting the threads before they are fully closed. That causes the hrtimer assigned to the kthread to be shutdown and freed twice when the dying thread finally closes the file descriptors, causing a use-after-free bug. Only cancel the hrtimer if the associated thread is still around. Also add the interface_lock around the resetting of the tlat_var->kthread. Note, this is just a quick fix that can be backported to stable. A real fix is to have a better synchronization between the shutdown of old threads and the starting of new ones. Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Link: https://lore.kernel.org/20240905085330.45985730@gandalf.local.home Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Reported-by: Tomas Glozar <tglozar@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>