summaryrefslogtreecommitdiff
path: root/kernel/trace/trace.c
AgeCommit message (Collapse)Author
7 daystracing: Fix trace_marker copy link list updatesSteven Rostedt
When the "copy_trace_marker" option is enabled for an instance, anything written into /sys/kernel/tracing/trace_marker is also copied into that instances buffer. When the option is set, that instance's trace_array descriptor is added to the marker_copies link list. This list is protected by RCU, as all iterations uses an RCU protected list traversal. When the instance is deleted, all the flags that were enabled are cleared. This also clears the copy_trace_marker flag and removes the trace_array descriptor from the list. The issue is after the flags are called, a direct call to update_marker_trace() is performed to clear the flag. This function returns true if the state of the flag changed and false otherwise. If it returns true here, synchronize_rcu() is called to make sure all readers see that its removed from the list. But since the flag was already cleared, the state does not change and the synchronization is never called, leaving a possible UAF bug. Move the clearing of all flags below the updating of the copy_trace_marker option which then makes sure the synchronization is performed. Also use the flag for checking the state in update_marker_trace() instead of looking at if the list is empty. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home Fixes: 7b382efd5e8a ("tracing: Allow the top level trace_marker to write into another instances") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
7 daystracing: Fix failure to read user space from system call trace eventsSteven Rostedt
The system call trace events call trace_user_fault_read() to read the user space part of some system calls. This is done by grabbing a per-cpu buffer, disabling migration, enabling preemption, calling copy_from_user(), disabling preemption, enabling migration and checking if the task was preempted while preemption was enabled. If it was, the buffer is considered corrupted and it tries again. There's a safety mechanism that will fail out of this loop if it fails 100 times (with a warning). That warning message was triggered in some pi_futex stress tests. Enabling the sched_switch trace event and traceoff_on_warning, showed the problem: pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 What happened was the task 1375 was flagged to be migrated. When preemption was enabled, the migration thread woke up to migrate that task, but failed because migration for that task was disabled. This caused the loop to fail to exit because the task scheduled out while trying to read user space. Every time the task enabled preemption the migration thread would schedule in, try to migrate the task, fail and let the task continue. But because the loop would only enable preemption with migration disabled, it would always fail because each time it enabled preemption to read user space, the migration thread would try to migrate it. To solve this, when the loop fails to read user space without being scheduled out, enabled and disable preemption with migration enabled. This will allow the migration task to successfully migrate the task and the next loop should succeed to read user space without being scheduled out. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home Fixes: 64cf7d058a005 ("tracing: Have trace_marker use per-cpu data to read user space") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2GCalvin Owens
Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long. It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03tracing: Fix WARN_ON in tracing_buffers_mmap_closeQing Wang
When a process forks, the child process copies the parent's VMAs but the user_mapped reference count is not incremented. As a result, when both the parent and child processes exit, tracing_buffers_mmap_close() is called twice. On the second call, user_mapped is already 0, causing the function to return -ENODEV and triggering a WARN_ON. Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set. But this is only a hint, and the application can call madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the application does that, it can trigger this issue on fork. Fix it by incrementing the user_mapped reference count without re-mapping the pages in the VMA's open callback. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-11tracing: Fix indentation of return statement in print_trace_fmt()Haoyang LIU
The return statement inside the nested if block in print_trace_fmt() is not properly indented, making the code structure unclear. This was flagged by smatch as a warning. Add proper indentation to the return statement to match the kernel coding style and improve readability. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260210153903.8041-1-tttturtleruss@gmail.com Signed-off-by: Haoyang LIU <tttturtleruss@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11tracing: Reset last_boot_info if ring buffer is resetMasami Hiramatsu (Google)
Commit 32dc0042528d ("tracing: Reset last-boot buffers when reading out all cpu buffers") resets the last_boot_info when user read out all data via trace_pipe* files. But it is not reset when user resets the buffer from other files. (e.g. write `trace` file) Reset it when the corresponding ring buffer is reset too. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177071302364.2293046.17895165659153977720.stgit@mhiramat.tok.corp.google.com Fixes: 32dc0042528d ("tracing: Reset last-boot buffers when reading out all cpu buffers") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11tracing: Fix to set write permission to per-cpu buffer_size_kbMasami Hiramatsu (Google)
Since the per-cpu buffer_size_kb file is writable for changing per-cpu ring buffer size, the file should have the write access permission. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com Fixes: 21ccc9cd7211 ("tracing: Disable "other" permission bits in the tracefs files") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Better separate SNAPSHOT and MAX_TRACE optionsSteven Rostedt
The latency tracers (scheduler, irqsoff, etc) were created when tracing was first added. These tracers required a "snapshot" buffer that was the same size as the ring buffer being written to. When a new max latency was hit, the main ring buffer would swap with the snapshot buffer so that the trace leading up to the latency would be saved in the snapshot buffer (The snapshot buffer is never written to directly and the data within it can be viewed without fear of being overwritten). Later, a new feature was added to allow snapshots to be taken by user space or even event triggers. This created a "snapshot" file that allowed users to trigger a snapshot from user space to save the current trace. The config for this new feature (CONFIG_TRACER_SNAPSHOT) would select the latency tracer config (CONFIG_TRACER_MAX_LATENCY) as it would need all the functionality from it as it already existed. But this was incorrect. As the snapshot feature is really what the latency tracers need and not the other way around. Have CONFIG_TRACER_MAX_TRACE select CONFIG_TRACER_SNAPSHOT where the tracers that needs the max latency buffer selects the TRACE_MAX_TRACE which will then select TRACER_SNAPSHOT. Also, go through trace.c and trace.h and make the code that only needs the TRACER_MAX_TRACE protected by that and the code that always requires the snapshot to be protected by TRACER_SNAPSHOT. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208183856.767870992@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Add tracer_uses_snapshot() helper to remove #ifdefsSteven Rostedt
Instead of having #ifdef CONFIG_TRACER_MAX_TRACE around every access to the struct tracer's use_max_tr field, add a helper function for that access and if CONFIG_TRACER_MAX_TRACE is not configured it just returns false. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208183856.599390238@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Rename trace_array field max_buffer to snapshot_bufferSteven Rostedt
When tracing was first added, there were latency tracers that would take a snapshot of the current trace when a new max latency was hit. This snapshot buffer was called "max_buffer". Since then, a snapshot feature was added that allowed user space or event triggers to trigger a snapshot of the current buffer using the same max_buffer of the trace_array. As this snapshot buffer now has a more generic use case, calling it "max_buffer" is confusing. Rename it to snapshot_buffer. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208183856.428446729@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Move pid filtering into trace_pid.cSteven Rostedt
The trace.c file was a dumping ground for most tracing code. Start organizing it better by moving various functions out into their own files. Move the PID filtering functions from trace.c into its own trace_pid.c file. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032450.998330662@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Move trace_printk functions out of trace.c and into trace_printk.cSteven Rostedt
The file trace.c has become a catchall for most things tracing. Start making it smaller by breaking out various aspects into their own files. Move the functions associated to the trace_printk operations out of trace.c and into trace_printk.c. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032450.828744197@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Use system_state in trace_printk_init_buffers()Steven Rostedt
The function trace_printk_init_buffers() is used to expand tha trace_printk buffers when trace_printk() is used within the kernel or in modules. On kernel boot up, it holds off from starting the sched switch cmdline recorder, but will start it immediately when it is added by a module. Currently it uses a trick to see if the global_trace buffer has been allocated or not to know if it was called by module load or not. But this is more of a hack, and can not be used when this code is moved out of trace.c. Instead simply look at the system_state and if it is running then it is know that it could only be called by module load. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032450.660237094@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Have trace_printk functions use flags instead of using global_traceSteven Rostedt
The trace.c file has become a dumping ground for all tracing code and has become quite large. In order to move the trace_printk functions out of it these functions can not access global_trace directly, as that is something that needs to stay static in trace.c. Instead of testing the trace_array tr pointer to &global_trace, test the tr->flags to see if TRACE_ARRAY_FL_GLOBAL set. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032450.491116245@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Make tracing_update_buffers() take NULL for global_traceSteven Rostedt
The trace.c file has become a dumping ground for all tracing code and has become quite large. In order to move the trace_printk functions out of it these functions can not access global_trace directly, as that is something that needs to stay static in trace.c. Have tracing_update_buffers() take NULL for its trace_array to denote it should work on the global_trace top level trace_array allows that function to be used outside of trace.c and still update the global_trace trace_array. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032450.318864210@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Make printk_trace global for tracing systemSteven Rostedt
The printk_trace is used to determine which trace_array trace_printk() writes to. By making it a global variable among the tracing subsystem it will allow the trace_printk functions to be moved out of trace.c and still have direct access to that variable. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032450.144525891@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Move ftrace_trace_stack() out of trace.c and into trace.hSteven Rostedt
The file trace.c has become a catchall for most things tracing. Start making it smaller by breaking out various aspects into their own files. Make ftrace_trace_stack() into a static inline that tests if stack tracing is enabled and if so to call __ftrace_trace_stack() to do the stack trace. This keeps the test inlined in the fast paths and only does the function call if stack tracing is enabled. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032449.974218132@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Move __trace_buffer_{un}lock_*() functions to trace.hSteven Rostedt
The file trace.c has become a catchall for most things tracing. Start making it smaller by breaking out various aspects into their own files. Move the __always_inline functions __trace_buffer_lock_reserve(), __trace_buffer_unlock_commit() and trace_event_setup() into trace.h. The trace.c file will be split up and these functions will be used in more than one of these files. As they are already __always_inline they can easily be moved into the trace.h header file. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032449.813550600@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Make tracing_selftest_running global to the tracing subsystemSteven Rostedt
The file trace.c has become a catchall for most things tracing. Start making it smaller by breaking out various aspects into their own files. Make the variable tracing_selftest_running global so that it can be used by other files in the tracing subsystem and trace.c can be split up. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032449.648932796@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Make tracing_disabled global for tracing systemSteven Rostedt
The tracing_disabled variable is set to one on boot up to prevent some parts of tracing to access the tracing infrastructure before it is set up. It also can be set after boot if an anomaly is discovered. It is currently a static variable in trace.c and can be accessed via a function call trace_is_disabled(). There's really no reason to use a function call as the tracing subsystem should be able to access it directly. By making the variable accessed directly, code can be moved out of trace.c without adding overhead of a function call to see if tracing is disabled or not. Make tracing_disabled global and remove the tracing_is_disabled() helper function. Also add some "unlikely()"s around tracing_disabled where it's checked in hot paths. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260208032449.483690153@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Clean up use of trace_create_maxlat_file()Steven Rostedt
In trace.c, the function trace_create_maxlat_file() is defined behind the #ifdef CONFIG_TRACER_MAX_TRACE block. The #else part defines it as: #define trace_create_maxlat_file(tr, d_tracer) \ trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, \ d_tracer, tr, &tracing_max_lat_fops) But the one place that it it used has: #ifdef CONFIG_TRACER_MAX_TRACE trace_create_maxlat_file(tr, d_tracer); #endif Which is pointless and also wrong! It only gets created when both CONFIG_TRACE_MAX_TRACE and CONFIG_FS_NOTIFY is defined, but the file itself should not be dependent on CONFIG_FS_NOTIFY. Always create that file when TRACE_MAX_TRACE is defined regardless if FS_NOTIFY is or is not. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260207191101.0e014abd@robin Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08tracing: Move tracing_set_filter_buffering() into trace_events_hist.cSteven Rostedt
The function tracing_set_filter_buffering() is only used in trace_events_hist.c. Move it to that file and make it static. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20260206195936.617080218@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-06tracing: Rename `eval_map_wq` and allow other parts of tracing use itYaxiong Tian
The eval_map_work_func() function, though queued in eval_map_wq, holds the trace_event_sem read-write lock for a long time during kernel boot. This causes blocking issues for other functions. Rename eval_map_wq to trace_init_wq and make it global, thereby allowing other parts of tracing to schedule work on this queue asynchronously and avoiding blockage of the main boot thread. Link: https://patch.msgid.link/20260204015344.162818-1-tianyaxiong@kylinos.cn Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26tracing: Disable trace_printk buffer on warning tooSteven Rostedt
When /proc/sys/kernel/traceoff_on_warning is set to 1, the top level tracing buffer is disabled when a warning happens. This is very useful when debugging and want the tracing buffer to stop taking new data when a warning triggers keeping the events that lead up to the warning from being overwritten. Now that there is also a persistent ring buffer and an option to have trace_printk go to that buffer, the same holds true for that buffer. A warning could happen just before a crash but still write enough events to lose the events that lead up to the first warning that was the reason for the crash. When /proc/sys/kernel/traceoff_on_warning is set to 1 and a warning is triggered, not only disable the top level tracing buffer, but also disable the buffer that trace_printk()s are written to. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260121093858.5c5d7e7b@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26tracing: Check the return value of tracing_update_buffers()Steven Rostedt
In the very unlikely event that tracing_update_buffers() fails in trace_printk_init_buffers(), report the failure so that it is known. Link: https://lore.kernel.org/all/20220917020353.3836285-1-floridsleeves@gmail.com/ Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260107161510.4dc98b15@gandalf.local.home Suggested-by: Li Zhong <floridsleeves@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-23tracing: Avoid possible signed 64-bit truncationIan Rogers
64-bit truncation to 32-bit can result in the sign of the truncated value changing. The cmp_mod_entry is used in bsearch and so the truncation could result in an invalid search order. This would only happen were the addresses more than 2GB apart and so unlikely, but let's fix the potentially broken compare anyway. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260108002625.333331-1-irogers@google.com Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07trace: ftrace_dump_on_oops[] is not exported, make it staticBen Dooks
The ftrace_dump_on_oops string is not used outside of trace.c so make it static to avoid the export warning from sparse: kernel/trace/trace.c:141:6: warning: symbol 'ftrace_dump_on_oops' was not declared. Should it be static? Fixes: dd293df6395a2 ("tracing: Move trace sysctls into trace.c") Link: https://patch.msgid.link/20260106231054.84270-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07tracing: Add recursion protection in kernel stack trace recordingSteven Rostedt
A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17tracing: Fix UBSAN warning in __remove_instance()Darrick J. Wong
xfs/558 triggers the following UBSAN warning: ------------[ cut here ]------------ UBSAN: shift-out-of-bounds in kernel/trace/trace.c:10510:10 shift exponent 32 is too large for 32-bit type 'int' CPU: 1 UID: 0 PID: 888674 Comm: rmdir Not tainted 6.19.0-rc1-xfsx #rc1 PREEMPT(lazy) dbf607ef4c142c563f76d706e71af9731d7b9c90 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4a/0x70 ubsan_epilogue+0x5/0x2b __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113 __remove_instance.part.0.constprop.0.cold+0x18/0x26f instance_rmdir+0xf3/0x110 tracefs_syscall_rmdir+0x4d/0x90 vfs_rmdir+0x139/0x230 do_rmdir+0x143/0x230 __x64_sys_rmdir+0x1d/0x20 do_syscall_64+0x44/0x230 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f7ae8e51f17 Code: f0 ff ff 73 01 c3 48 8b 0d de 2e 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 54 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 2e 0e 00 f7 d8 64 89 02 b8 RSP: 002b:00007ffd90743f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000054 RAX: ffffffffffffffda RBX: 00007ffd907440f8 RCX: 00007f7ae8e51f17 RDX: 00007f7ae8f3c5c0 RSI: 00007ffd90744a21 RDI: 00007ffd90744a21 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f7ae8f35ac0 R11: 0000000000000246 R12: 00007ffd90744a21 R13: 0000000000000001 R14: 00007f7ae8f8b000 R15: 000055e5283e6a98 </TASK> ---[ end trace ]--- whilst tearing down an ftrace instance. TRACE_FLAGS_MAX_SIZE is now 64bit, so the mask comparison expression must be typecast to a u64 value to avoid an overflow. AFAICT, ZEROED_TRACE_FLAGS is already cast to ULL so this is ok. Link: https://patch.msgid.link/20251216174950.GA7705@frogsfrogsfrogs Fixes: bbec8e28cac592 ("tracing: Allow tracer to add more than 32 options") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05tracing: Fix multiple typos in trace.cMaurice Hieronymus
Fix multiple typos in comments: "alse" -> "also" "enabed" -> "enabled" "instane" -> "instance" "outputing" -> "outputting" "seperated" -> "separated" Link: https://patch.msgid.link/20251121221835.28032-7-mhi@mailbox.org Signed-off-by: Maurice Hieronymus <mhi@mailbox.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05tracing: Fix enabling of tracing on file releaseSteven Rostedt
The trace file will pause tracing if the tracing instance has the "pause-on-trace" option is set. This happens when the file is opened, and it is unpaused when the file is closed. When this was first added, there was only one user that paused tracing. On open, the check to pause was: if (!iter->snapshot && (tr->trace_flags & TRACE_ITER(PAUSE_ON_TRACE))) Where if it is not the snapshot tracer and the "pause-on-trace" option is set, then it increments a "stop_count" of the trace instance. On close, the check is: if (!iter->snapshot && tr->stop_count) That is, if it is not the snapshot buffer and it was stopped, it will re-enable tracing. Now there's more places that stop tracing. This means, if something else stops tracing the tr->stop_count will be non-zero, and that means if the trace file is closed, it will decrement the stop_count even though it never incremented it. This causes a warning because when the user that stopped tracing enables it again, the stop_count goes below zero. Instead of relying on the stop_count being set to know if the close of the trace file should enable tracing again, add a new flag to the trace iterator. The trace iterator is unique per open of the trace file, and if the open stops tracing set the trace iterator PAUSE flag. On close, if the PAUSE flag is set, then re-enable it again. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251202161751.24abaaf1@gandalf.local.home Fixes: 06e0a548bad0f ("tracing: Do not disable tracing when reading the trace file") Reported-by: syzbot+ccdec3bfe0beec58a38d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/692f44a5.a70a0220.2ea503.00c8.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05Merge tag 'trace-rv-6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier updates from Steven Rostedt: - Adapt the ftracetest script to be run from a different folder This uses the already existing OPT_TEST_DIR but extends it further to run independent tests, then add an --rv flag to allow using the script for testing RV (mostly) independently on ftrace. - Add basic RV selftests in selftests/verification for more validations Add more validations for available/enabled monitors and reactors. This could have caught the bug introducing kernel panic solved above. Tests use ftracetest. - Convert react() function in reactor to use va_list directly Use a central helper to handle the variadic arguments. Clean up macros and mark functions as static. - Add lockdep annotations to reactors to have lockdep complain of errors If the reactors are called from improper context. Useful to develop new reactors. This highlights a warning in the panic reactor that is related to the printk subsystem and not to RV. - Convert core RV code to use lock guards and __free helpers This completely removes goto statements. - Fix compilation if !CONFIG_RV_REACTORS Fix the warning by keeping LTL monitor variable as always static. * tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Fix compilation if !CONFIG_RV_REACTORS rv: Convert to use __free rv: Convert to use lock guard rv: Add explicit lockdep context for reactors rv: Make rv_reacting_on() static rv: Pass va_list to reactors selftests/verification: Add initial RV tests selftest/ftrace: Generalise ftracetest to use with RV
2025-12-05Merge tag 'trace-v6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Extend tracing option mask to 64 bits The trace options were defined by a 32 bit variable. This limits the tracing instances to have a total of 32 different options. As that limit has been hit, and more options are being added, increase the option mask to a 64 bit number, doubling the number of options available. As this is required for the kprobe topic branches as well as the tracing topic branch, a separate branch was created and merged into both. - Make trace_user_fault_read() available for the rest of tracing The function trace_user_fault_read() is used by trace_marker file read to allow reading user space to be done fast and without locking or allocations. Make this available so that the system call trace events can use it too. - Have system call trace events read user space values Now that the system call trace events callbacks are called in a faultable context, take advantage of this and read the user space buffers for various system calls. For example, show the path name of the openat system call instead of just showing the pointer to that path name in user space. Also show the contents of the buffer of the write system call. Several system call trace events are updated to make tracing into a light weight strace tool for all applications in the system. - Update perf system call tracing to do the same - And a config and syscall_user_buf_size file to control the size of the buffer Limit the amount of data that can be read from user space. The default size is 63 bytes but that can be expanded to 165 bytes. - Allow the persistent ring buffer to print system calls normally The persistent ring buffer prints trace events by their type and ignores the print_fmt. This is because the print_fmt may change from kernel to kernel. As the system call output is fixed by the system call ABI itself, there's no reason to limit that. This makes reading the system call events in the persistent ring buffer much nicer and easier to understand. - Add options to show text offset to function profiler The function profiler that counts the number of times a function is hit currently lists all functions by its name and offset. But this becomes ambiguous when there are several functions with the same name. Add a tracing option that changes the output to be that of '_text+offset' instead. Now a user space tool can use this information to map the '_text+offset' to the unique function it is counting. - Report bad dynamic event command If a bad command is passed to the dynamic_events file, report it properly in the error log. - Clean up tracer options Clean up the tracer option code a bit, by removing some useless code and also using switch statements instead of a series of if statements. - Have tracing options be instance specific Tracers can have their own options (function tracer, irqsoff tracer, function graph tracer, etc). But now that the same tracer can be enabled in multiple trace instances, their options are still global. The API is per instance, thus changing one affects other instances. This isn't even consistent, as the option take affect differently depending on when an tracer started in an instance. Make the options for instances only affect the instance it is changed under. - Optimize pid_list lock contention Whenever the pid_list is read, it uses a spin lock. This happens at every sched switch. Taking the lock at sched switch can be removed by instead using a seqlock counter. - Clean up the trace trigger structures The trigger code uses two different structures to implement a single tigger. This was due to trying to reuse code for the two different types of triggers (always on trigger, and count limited trigger). But by adding a single field to one structure, the other structure could be absorbed into the first structure making he code easier to understand. - Create a bulk garbage collector for trace triggers If user space has triggers for several hundreds of events and then removes them, it can take several seconds to complete. This is because each removal calls tracepoint_synchronize_unregister() that can take hundreds of milliseconds to complete. Instead, create a helper thread that will do the clean up. When a trigger is removed, it will create the kthread if it isn't already created, and then add the trigger to a llist. The kthread will take the items off the llist, call tracepoint_synchronize_unregister(), and then remove the items it took off. It will then check if there's more items to free before sleeping. This makes user space removing all these triggers to finish in less than a second. - Allow function tracing of some of the tracing infrastructure code Because the tracing code can cause recursion issues if it is traced by the function tracer the entire tracing directory disables function tracing. But not all of tracing causes issues if it is traced. Namely, the event tracing code. Add a config that enables some of the tracing code to be traced to help in debugging it. Note, when this is enabled, it does add noise to general function tracing, especially if events are enabled as well (which is a common case). - Add boot-time backup instance for persistent buffer The persistent ring buffer is used mostly for kernel crash analysis in the field. One issue is that if there's a crash, the data in the persistent ring buffer must be read before tracing can begin using it. This slows down the boot process. Once tracing starts in the persistent ring buffer, the old data must be freed and the addresses no longer match and old events can't be in the buffer with new events. Create a way to create a backup buffer that copies the persistent ring buffer at boot up. Then after a crash, the always on tracer can begin immediately as well as the normal boot process while the crash analysis tooling uses the backup buffer. After the backup buffer is finished being read, it can be removed. - Enable function graph args and return address options at the same time Currently the when reading of arguments in the function graph tracer is enabled, the option to record the parent function in the entry event can not be enabled. Update the code so that it can. - Add new struct_offset() helper macro Add a new macro that takes a pointer to a structure and a name of one of its members and it will return the offset of that member. This allows the ring buffer code to simplify the following: From: size = struct_size(entry, buf, cnt - sizeof(entry->id)); To: size = struct_offset(entry, id) + cnt; There should be other simplifications that this macro can help out with as well * tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits) overflow: Introduce struct_offset() to get offset of member function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously tracing: Add boot-time backup of persistent ring buffer ftrace: Allow tracing of some of the tracing code tracing: Use strim() in trigger_process_regex() instead of skip_spaces() tracing: Add bulk garbage collection of freeing event_trigger_data tracing: Remove unneeded event_mutex lock in event_trigger_regex_release() tracing: Merge struct event_trigger_ops into struct event_command tracing: Remove get_trigger_ops() and add count_func() from trigger ops tracing: Show the tracer options in boot-time created instance ftrace: Avoid redundant initialization in register_ftrace_direct tracing: Remove unused variable in tracing_trace_options_show() fgraph: Make fgraph_no_sleep_time signed tracing: Convert function graph set_flags() to use a switch() statement tracing: Have function graph tracer option sleep-time be per instance tracing: Move graph-time out of function graph options tracing: Have function graph tracer option funcgraph-irqs be per instance trace/pid_list: optimize pid_list->lock contention tracing: Have function graph tracer define options per instance tracing: Have function tracer define options per instance ...
2025-12-02rv: Convert to use __freeNam Cao
Convert to use __free to tidy up the code. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/62854e2fcb8f8dd2180a98a9700702dcf89a6980.1763370183.git.namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-11-27overflow: Introduce struct_offset() to get offset of memberSteven Rostedt
The trace_marker_raw file in tracefs takes a buffer from user space that contains an id as well as a raw data string which is usually a binary structure. The structure used has the following: struct raw_data_entry { struct trace_entry ent; unsigned int id; char buf[]; }; Since the passed in "cnt" variable is both the size of buf as well as the size of id, the code to allocate the location on the ring buffer had: size = struct_size(entry, buf, cnt - sizeof(entry->id)); Which is quite ugly and hard to understand. Instead, add a helper macro called struct_offset() which then changes the above to a simple and easy to understand: size = struct_offset(entry, id) + cnt; This will likely come in handy for other use cases too. Link: https://lore.kernel.org/all/CAHk-=whYZVoEdfO1PmtbirPdBMTV9Nxt9f09CK0k6S+HJD3Zmg@mail.gmail.com/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://patch.msgid.link/20251126145249.05b1770a@gandalf.local.home Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26tracing: Add boot-time backup of persistent ring bufferMasami Hiramatsu (Google)
Currently, the persistent ring buffer instance needs to be read before using it. This means we have to wait for boot up user space and dump the persistent ring buffer. However, in that case we can not start tracing on it from the kernel cmdline. To solve this limitation, this adds an option which allows to create a trace instance as a backup of the persistent ring buffer at boot. If user specifies trace_instance=<BACKUP>=<PERSIST_RB> then the <BACKUP> instance is made as a copy of the <PERSIST_RB> instance. For example, the below kernel cmdline records all syscalls, scheduler and interrupt events on the persistent ring buffer `boot_map` but before starting the tracing, it makes a `backup` instance from the `boot_map`. Thus, the `backup` instance has the previous boot events. 'reserve_mem=12M:4M:trace trace_instance=boot_map@trace,syscalls:*,sched:*,irq:* trace_instance=backup=boot_map' As you can see, this just make a copy of entire reserved area and make a backup instance on it. So you can release (or shrink) the backup instance after use it to save the memory usage. /sys/kernel/tracing/instances # free total used free shared buff/cache available Mem: 1999284 55704 1930520 10132 13060 1914628 Swap: 0 0 0 /sys/kernel/tracing/instances # rmdir backup/ /sys/kernel/tracing/instances # free total used free shared buff/cache available Mem: 1999284 40640 1945584 10132 13060 1929692 Swap: 0 0 0 Note: since there is no reason to make a copy of empty buffer, this backup only accepts a persistent ring buffer as the original instance. Also, since this backup is based on vmalloc(), it does not support user-space mmap(). Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/176377150002.219692.9425536150438129267.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26tracing: Show the tracer options in boot-time created instanceMasami Hiramatsu (Google)
Since tracer_init_tracefs_work_func() only updates the tracer options for the global_trace, the instances created by the kernel cmdline do not have those options. Fix to update tracer options for those boot-time created instances to show those options. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/176354112555.2356172.3989277078358802353.stgit@mhiramat.tok.corp.google.com Fixes: 428add559b69 ("tracing: Have tracer option be instance specific") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26tracing: Remove unused variable in tracing_trace_options_show()Steven Rostedt
The flags and opts used in tracing_trace_options_show() now come directly from the trace array "current_trace_flags" and not the current_trace. The variable "trace" was still being assigned to tr->current_trace but never used. This caused a warning in clang. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251117120637.43ef995d@gandalf.local.home Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Closes: https://lore.kernel.org/all/aRtHWXzYa8ijUIDa@black.igk.intel.com/ Fixes: 428add559b692 ("tracing: Have tracer option be instance specific") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-25tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAsDeepanshu Kartikey
When a VMA is split (e.g., by partial munmap or MAP_FIXED), the kernel calls vm_ops->close on each portion. For trace buffer mappings, this results in ring_buffer_unmap() being called multiple times while ring_buffer_map() was only called once. This causes ring_buffer_unmap() to return -ENODEV on subsequent calls because user_mapped is already 0, triggering a WARN_ON. Trace buffer mappings cannot support partial mappings because the ring buffer structure requires the complete buffer including the meta page. Fix this by adding a may_split callback that returns -EINVAL to prevent VMA splits entirely. Cc: stable@vger.kernel.org Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Link: https://patch.msgid.link/20251119064019.25904-1-kartikey406@gmail.com Closes: https://syzkaller.appspot.com/bug?extid=a72c325b042aae6403c7 Tested-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com Reported-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Move graph-time out of function graph optionsSteven Rostedt
The option "graph-time" affects the function profiler when it is using the function graph infrastructure. It has nothing to do with the function graph tracer itself. The option only affects the global function profiler and does nothing to the function graph tracer. Move it out of the function graph tracer options and make it a global option that is only available at the top level instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.781711154@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-12tracing: Have tracer option be instance specificSteven Rostedt
Tracers can add specify options to modify them. This logic was added before instances were created and the tracer flags were global variables. After instances were created where a tracer may exist in more than one instance, the flags were not updated from being global into instance specific. This causes confusion with these options. For example, the function tracer has an option to enable function arguments: # cd /sys/kernel/tracing # mkdir instances/foo # echo function > instances/foo/current_tracer # echo 1 > options/func-args # echo function > current_tracer # cat trace [..] <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt # cat instances/foo/options/func-args 1 # cat instances/foo/trace [..] kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update [..] The above shows that setting "func-args" in the top level instance also set it in the instance "foo", but since the interface of the trace flags are per instance, the update didn't take affect in the "foo" instance. Update the infrastructure to allow tracers to add a "default_flags" field in the tracer structure that can be set instead of "flags" which will make the flags per instance. If a tracer needs to keep the flags global (like blktrace), keeping the "flags" field set will keep the old behavior. This does not update function or the function graph tracers. That will be handled later. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.305317942@kernel.org Fixes: f20a580627f43 ("ftrace: Allow instances to use function tracing") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Use switch statement instead of ifs in set_tracer_flag()Steven Rostedt
The "mask" passed in to set_trace_flag() has a single bit set. The function then checks if the mask is equal to one of the option masks and performs the appropriate function associated to that option. Instead of having a bunch of "if ()" statement, use a "switch ()" statement instead to make it cleaner and a bit more optimal. No function changes. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.890298562@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Exit out immediately after update_marker_trace()Steven Rostedt
The call to update_marker_trace() in set_tracer_flag() performs the update to the tr->trace_flags. There's no reason to perform it again after it is called. Return immediately instead. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.726406870@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Have add_tracer_options() error pass up to callersSteven Rostedt
The function add_tracer_options() can fail, but currently it is ignored. Pass the status of add_tracer_options() up to adding a new tracer as well as when an instance is created. Have the instance creation fail if the add_tracer_options() fail. Only print a warning for the top level instance, like it does with other failures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.375299297@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Remove dummy options and flagsSteven Rostedt
When a tracer does not define their own flags, dummy options and flags are used so that the values are always valid. There's not that many locations that reference these values so having dummy versions just complicates the code. Remove the dummy values and just check for NULL when appropriate. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.206093132@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-04Merge branch 'topic/func-profiler-offset' of ↵Steven Rostedt
git://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux into trace/trace/core Updates to the function profiler adds new options to tracefs. The options are currently defined by an enum as flags. The added options brings the number of options over 32, which means they can no longer be held in a 32 bit enum. The TRACE_ITER_* flags are converted to a macro TRACE_ITER(*) to allow the creation of options to still be done by macros. This change is intrusive, as it affects all TRACE_ITER* options throughout the trace code. Merge the branch that added these options and converted the TRACE_ITER_* enum into a TRACE_ITER(*) macro, to allow the topic branches to still be developed without conflict. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-04tracing: Add an option to show symbols in _text+offset for function profilerMasami Hiramatsu (Google)
Function profiler shows the hit count of each function using its symbol name. However, there are some same-name local symbols, which we can not distinguish. To solve this issue, this introduces an option to show the symbols in "_text+OFFSET" format. This can avoid exposing the random shift of KASLR. The functions in modules are shown as "MODNAME+OFFSET" where the offset is from ".text". E.g. for the kernel text symbols, specify vmlinux and the output to addr2line, you can find the actual function and source info; $ addr2line -fie vmlinux _text+3078208 __balance_callbacks kernel/sched/core.c:5064 for modules, specify the module file and .text+OFFSET; $ addr2line -fie samples/trace_events/trace-events-sample.ko .text+8224 do_simple_thread_func samples/trace_events/trace-events-sample.c:23 Link: https://lore.kernel.org/all/176187878064.994619.8878296550240416558.stgit@devnote2/ Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>