| Age | Commit message (Collapse) | Author |
|
[ Upstream commit 5125e757e62f6c1d5478db4c2b61a744060ddf3f ]
To avoid returning uninitialized or random values when querying the file
descriptor (fd) and accessing probe_addr, it is necessary to clear the
variable prior to its use.
Fixes: 41bdc4b40ed6 ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-6-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit eecb91b9f98d6427d4af5fdb8f108f52572a39e7 ]
Kmemleak report a leak in graph_trace_open():
unreferenced object 0xffff0040b95f4a00 (size 128):
comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
hex dump (first 32 bytes):
e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
backtrace:
[<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
[<000000007df90faa>] graph_trace_open+0xb0/0x344
[<00000000737524cd>] __tracing_open+0x450/0xb10
[<0000000098043327>] tracing_open+0x1a0/0x2a0
[<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
[<000000004015bcd6>] vfs_open+0x98/0xd0
[<000000002b5f60c9>] do_open+0x520/0x8d0
[<00000000376c7820>] path_openat+0x1c0/0x3e0
[<00000000336a54b5>] do_filp_open+0x14c/0x324
[<000000002802df13>] do_sys_openat2+0x2c4/0x530
[<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
[<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
[<00000000313647bf>] do_el0_svc+0xac/0xec
[<000000002ef1c651>] el0_svc+0x20/0x30
[<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
[<000000000c309c35>] el0_sync+0x160/0x180
The root cause is descripted as follows:
__tracing_open() { // 1. File 'trace' is being opened;
...
*iter->trace = *tr->current_trace; // 2. Tracer 'function_graph' is
// currently set;
...
iter->trace->open(iter); // 3. Call graph_trace_open() here,
// and memory are allocated in it;
...
}
s_start() { // 4. The opened file is being read;
...
*iter->trace = *tr->current_trace; // 5. If tracer is switched to
// 'nop' or others, then memory
// in step 3 are leaked!!!
...
}
To fix it, in s_start(), close tracer before switching then reopen the
new tracer after switching. And some tracers like 'wakeup' may not update
'iter->private' in some cases when reopen, then it should be cleared
to avoid being mistakenly closed again.
Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com
Fixes: d7350c3f4569 ("tracing/core: make the read callbacks reentrants")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dea499781a1150d285c62b26659f62fb00824fce ]
Warning happened in trace_buffered_event_disable() at
WARN_ON_ONCE(!trace_buffered_event_ref)
Call Trace:
? __warn+0xa5/0x1b0
? trace_buffered_event_disable+0x189/0x1b0
__ftrace_event_enable_disable+0x19e/0x3e0
free_probe_data+0x3b/0xa0
unregister_ftrace_function_probe_func+0x6b8/0x800
event_enable_func+0x2f0/0x3d0
ftrace_process_regex.isra.0+0x12d/0x1b0
ftrace_filter_write+0xe6/0x140
vfs_write+0x1c9/0x6f0
[...]
The cause of the warning is in __ftrace_event_enable_disable(),
trace_buffered_event_enable() was called once while
trace_buffered_event_disable() was called twice.
Reproduction script show as below, for analysis, see the comments:
```
#!/bin/bash
cd /sys/kernel/tracing/
# 1. Register a 'disable_event' command, then:
# 1) SOFT_DISABLED_BIT was set;
# 2) trace_buffered_event_enable() was called first time;
echo 'cmdline_proc_show:disable_event:initcall:initcall_finish' > \
set_ftrace_filter
# 2. Enable the event registered, then:
# 1) SOFT_DISABLED_BIT was cleared;
# 2) trace_buffered_event_disable() was called first time;
echo 1 > events/initcall/initcall_finish/enable
# 3. Try to call into cmdline_proc_show(), then SOFT_DISABLED_BIT was
# set again!!!
cat /proc/cmdline
# 4. Unregister the 'disable_event' command, then:
# 1) SOFT_DISABLED_BIT was cleared again;
# 2) trace_buffered_event_disable() was called second time!!!
echo '!cmdline_proc_show:disable_event:initcall:initcall_finish' > \
set_ftrace_filter
```
To fix it, IIUC, we can change to call trace_buffered_event_enable() at
fist time soft-mode enabled, and call trace_buffered_event_disable() at
last time soft-mode disabled.
Link: https://lore.kernel.org/linux-trace-kernel/20230726095804.920457-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2d093282b0d4357373497f65db6a05eb0c28b7c8 ]
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set
to 0 in order to make sure any read iterators reset themselves. However,
this will mess 'entries' stating, see following steps:
# cd /sys/kernel/tracing/
# 1. Enlarge ring buffer prepare for later reducing:
# echo 20 > per_cpu/cpu0/buffer_size_kb
# 2. Write a log into ring buffer of cpu0:
# taskset -c 0 echo "hello1" > trace_marker
# 3. Read the log:
# cat per_cpu/cpu0/trace_pipe
<...>-332 [000] ..... 62.406844: tracing_mark_write: hello1
# 4. Stop reading and see the stats, now 0 entries, and 1 event readed:
# cat per_cpu/cpu0/stats
entries: 0
[...]
read events: 1
# 5. Reduce the ring buffer
# echo 7 > per_cpu/cpu0/buffer_size_kb
# 6. Now entries became unexpected 1 because actually no entries!!!
# cat per_cpu/cpu0/stats
entries: 1
[...]
read events: 0
To fix it, introduce 'page_removed' field to count total removed pages
since last reset, then use it to let read iterators reset themselves
instead of changing the 'read' pointer.
Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Fixes: 83f40318dab0 ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 26efd79c4624294e553aeaa3439c646729bad084 ]
As comments in ftrace_process_locs(), there may be NULL pointers in
mcount_loc section:
> Some architecture linkers will pad between
> the different mcount_loc sections of different
> object files to satisfy alignments.
> Skip any NULL pointers.
After commit 20e5227e9f55 ("ftrace: allow NULL pointers in mcount_loc"),
NULL pointers will be accounted when allocating ftrace pages but skipped
before adding into ftrace pages, this may result in some pages not being
used. Then after commit 706c81f87f84 ("ftrace: Remove extra helper
functions"), warning may occur at:
WARN_ON(pg->next);
To fix it, only warn for case that no pointers skipped but pages not used
up, then free those unused pages after releasing ftrace_lock.
Link: https://lore.kernel.org/linux-trace-kernel/20230712060452.3175675-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: 706c81f87f84 ("ftrace: Remove extra helper functions")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit db42523b4f3e83ff86b53cdda219a9767c8b047f ]
Instead of saving the size of the records field of the ftrace_page, store
the order it uses to allocate the pages, as that is what is needed to know
in order to free the pages. This simplifies the code.
Link: https://lore.kernel.org/lkml/CAHk-=whyMxheOqXAORt9a7JK9gc9eHTgCJ55Pgs4p=X3RrQubQ@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ change log written by Steven Rostedt ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Stable-dep-of: 26efd79c4624 ("ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 59300b36f85f254260c81d9dd09195fa49eb0f98 ]
It is possible that on error pg->size can be zero when getting its order,
which would return a -1 value. It is dangerous to pass in an order of -1
to free_pages(). Check if order is greater than or equal to zero before
calling free_pages().
Link: https://lore.kernel.org/lkml/20210330093916.432697c7@gandalf.local.home/
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Stable-dep-of: 26efd79c4624 ("ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit da537f0aef1372c5204356a7df06be8769467b7b ]
Looking for ways to shrink the size of the dyn_ftrace structure, knowing the
information about how many pages and the number of groups of those pages, is
useful in working out the best ways to save on memory.
This adds one info print on how many groups of pages were used to allocate
the ftrace dyn_ftrace structures, and also shows the number of pages and
groups in the dyn_ftrace_total_info (which is used for debugging).
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Stable-dep-of: 26efd79c4624 ("ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
list
commit 4b8b3905165ef98386a3c06f196c85d21292d029 upstream.
Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.
Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
variables
commit 6018b585e8c6fa7d85d4b38d9ce49a5b67be7078 upstream.
Hist triggers can have referenced variables without having direct
variables fields. This can be the case if referenced variables are added
for trigger actions. In this case the newly added references will not
have field variables. Not taking such referenced variables into
consideration can result in a bug where it would be possible to remove
hist trigger with variables being refenced. This will result in a bug
that is easily reproducable like so
$ cd /sys/kernel/tracing
$ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events
$ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
$ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger
$ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
[ 100.263533] ==================================================================
[ 100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180
[ 100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439
[ 100.266320]
[ 100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4
[ 100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[ 100.268561] Call Trace:
[ 100.268902] <TASK>
[ 100.269189] dump_stack_lvl+0x4c/0x70
[ 100.269680] print_report+0xc5/0x600
[ 100.270165] ? resolve_var_refs+0xc7/0x180
[ 100.270697] ? kasan_complete_mode_report_info+0x80/0x1f0
[ 100.271389] ? resolve_var_refs+0xc7/0x180
[ 100.271913] kasan_report+0xbd/0x100
[ 100.272380] ? resolve_var_refs+0xc7/0x180
[ 100.272920] __asan_load8+0x71/0xa0
[ 100.273377] resolve_var_refs+0xc7/0x180
[ 100.273888] event_hist_trigger+0x749/0x860
[ 100.274505] ? kasan_save_stack+0x2a/0x50
[ 100.275024] ? kasan_set_track+0x29/0x40
[ 100.275536] ? __pfx_event_hist_trigger+0x10/0x10
[ 100.276138] ? ksys_write+0xd1/0x170
[ 100.276607] ? do_syscall_64+0x3c/0x90
[ 100.277099] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 100.277771] ? destroy_hist_data+0x446/0x470
[ 100.278324] ? event_hist_trigger_parse+0xa6c/0x3860
[ 100.278962] ? __pfx_event_hist_trigger_parse+0x10/0x10
[ 100.279627] ? __kasan_check_write+0x18/0x20
[ 100.280177] ? mutex_unlock+0x85/0xd0
[ 100.280660] ? __pfx_mutex_unlock+0x10/0x10
[ 100.281200] ? kfree+0x7b/0x120
[ 100.281619] ? ____kasan_slab_free+0x15d/0x1d0
[ 100.282197] ? event_trigger_write+0xac/0x100
[ 100.282764] ? __kasan_slab_free+0x16/0x20
[ 100.283293] ? __kmem_cache_free+0x153/0x2f0
[ 100.283844] ? sched_mm_cid_remote_clear+0xb1/0x250
[ 100.284550] ? __pfx_sched_mm_cid_remote_clear+0x10/0x10
[ 100.285221] ? event_trigger_write+0xbc/0x100
[ 100.285781] ? __kasan_check_read+0x15/0x20
[ 100.286321] ? __bitmap_weight+0x66/0xa0
[ 100.286833] ? _find_next_bit+0x46/0xe0
[ 100.287334] ? task_mm_cid_work+0x37f/0x450
[ 100.287872] event_triggers_call+0x84/0x150
[ 100.288408] trace_event_buffer_commit+0x339/0x430
[ 100.289073] ? ring_buffer_event_data+0x3f/0x60
[ 100.292189] trace_event_raw_event_sys_enter+0x8b/0xe0
[ 100.295434] syscall_trace_enter.constprop.0+0x18f/0x1b0
[ 100.298653] syscall_enter_from_user_mode+0x32/0x40
[ 100.301808] do_syscall_64+0x1a/0x90
[ 100.304748] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 100.307775] RIP: 0033:0x7f686c75c1cb
[ 100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48
[ 100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021
[ 100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb
[ 100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a
[ 100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a
[ 100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[ 100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007
[ 100.338381] </TASK>
We hit the bug because when second hist trigger has was created
has_hist_vars() returned false because hist trigger did not have
variables. As a result of that save_hist_vars() was not called to add
the trigger to trace_array->hist_vars. Later on when we attempted to
remove the first histogram find_any_var_ref() failed to detect it is
being used because it did not find the second trigger in hist_vars list.
With this change we wait until trigger actions are created so we can take
into consideration if hist trigger has variable references. Also, now we
check the return value of save_hist_vars() and fail trigger creation if
save_hist_vars() fails.
Link: https://lore.kernel.org/linux-trace-kernel/20230712223021.636335-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7e42907f3a7b4ce3a2d1757f6d78336984daf8f5 upstream.
Soft lockup occurs when reading file 'trace_pipe':
watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [cat:4488]
[...]
RIP: 0010:ring_buffer_empty_cpu+0xed/0x170
RSP: 0018:ffff88810dd6fc48 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffffffff93d1aaeb
RDX: ffff88810a280040 RSI: 0000000000000008 RDI: ffff88811164b218
RBP: ffff88811164b218 R08: 0000000000000000 R09: ffff88815156600f
R10: ffffed102a2acc01 R11: 0000000000000001 R12: 0000000051651901
R13: 0000000000000000 R14: ffff888115e49500 R15: 0000000000000000
[...]
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8d853c2000 CR3: 000000010dcd8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__find_next_entry+0x1a8/0x4b0
? peek_next_entry+0x250/0x250
? down_write+0xa5/0x120
? down_write_killable+0x130/0x130
trace_find_next_entry_inc+0x3b/0x1d0
tracing_read_pipe+0x423/0xae0
? tracing_splice_read_pipe+0xcb0/0xcb0
vfs_read+0x16b/0x490
ksys_read+0x105/0x210
? __ia32_sys_pwrite64+0x200/0x200
? switch_fpu_return+0x108/0x220
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Through the vmcore, I found it's because in tracing_read_pipe(),
ring_buffer_empty_cpu() found some buffer is not empty but then it
cannot read anything due to "rb_num_of_entries() == 0" always true,
Then it infinitely loop the procedure due to user buffer not been
filled, see following code path:
tracing_read_pipe() {
... ...
waitagain:
tracing_wait_pipe() // 1. find non-empty buffer here
trace_find_next_entry_inc() // 2. loop here try to find an entry
__find_next_entry()
ring_buffer_empty_cpu(); // 3. find non-empty buffer
peek_next_entry() // 4. but peek always return NULL
ring_buffer_peek()
rb_buffer_peek()
rb_get_reader_page()
// 5. because rb_num_of_entries() == 0 always true here
// then return NULL
// 6. user buffer not been filled so goto 'waitgain'
// and eventually leads to an deadloop in kernel!!!
}
By some analyzing, I found that when resetting ringbuffer, the 'entries'
of its pages are not all cleared (see rb_reset_cpu()). Then when reducing
the ringbuffer, and if some reduced pages exist dirty 'entries' data, they
will be added into 'cpu_buffer->overrun' (see rb_remove_pages()), which
cause wrong 'overrun' count and eventually cause the deadloop issue.
To fix it, we need to clear every pages in rb_reset_cpu().
Link: https://lore.kernel.org/linux-trace-kernel/20230708225144.3785600-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: a5fb833172eca ("ring-buffer: Fix uninitialized read_stamp")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream.
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 675751bb20634f981498c7d66161584080cc061e upstream.
If something was written to the buffer just before destruction,
it may be possible (maybe not in a real system, but it did
happen in ARCH=um with time-travel) to destroy the ringbuffer
before the IRQ work ran, leading this KASAN report (or a crash
without KASAN):
BUG: KASAN: slab-use-after-free in irq_work_run_list+0x11a/0x13a
Read of size 8 at addr 000000006d640a48 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Tainted: G W O 6.3.0-rc1 #7
Stack:
60c4f20f 0c203d48 41b58ab3 60f224fc
600477fa 60f35687 60c4f20f 601273dd
00000008 6101eb00 6101eab0 615be548
Call Trace:
[<60047a58>] show_stack+0x25e/0x282
[<60c609e0>] dump_stack_lvl+0x96/0xfd
[<60c50d4c>] print_report+0x1a7/0x5a8
[<603078d3>] kasan_report+0xc1/0xe9
[<60308950>] __asan_report_load8_noabort+0x1b/0x1d
[<60232844>] irq_work_run_list+0x11a/0x13a
[<602328b4>] irq_work_tick+0x24/0x34
[<6017f9dc>] update_process_times+0x162/0x196
[<6019f335>] tick_sched_handle+0x1a4/0x1c3
[<6019fd9e>] tick_sched_timer+0x79/0x10c
[<601812b9>] __hrtimer_run_queues.constprop.0+0x425/0x695
[<60182913>] hrtimer_interrupt+0x16c/0x2c4
[<600486a3>] um_timer+0x164/0x183
[...]
Allocated by task 411:
save_stack_trace+0x99/0xb5
stack_trace_save+0x81/0x9b
kasan_save_stack+0x2d/0x54
kasan_set_track+0x34/0x3e
kasan_save_alloc_info+0x25/0x28
____kasan_kmalloc+0x8b/0x97
__kasan_kmalloc+0x10/0x12
__kmalloc+0xb2/0xe8
load_elf_phdrs+0xee/0x182
[...]
The buggy address belongs to the object at 000000006d640800
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 584 bytes inside of
freed 1024-byte region [000000006d640800, 000000006d640c00)
Add the appropriate irq_work_sync() so the work finishes before
the buffers are destroyed.
Prior to the commit in the Fixes tag below, there was only a
single global IRQ work, so this issue didn't exist.
Link: https://lore.kernel.org/linux-trace-kernel/20230427175920.a76159263122.I8295e405c44362a86c995e9c2c37e3e03810aa56@changeid
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 15693458c4bc ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6455b6163d8c680366663cdb8c679514d55fc30c upstream.
When user reads file 'trace_pipe', kernel keeps printing following logs
that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in
rb_get_reader_page(). It just looks like there's an infinite loop in
tracing_read_pipe(). This problem occurs several times on arm64 platform
when testing v5.10 and below.
Call trace:
rb_get_reader_page+0x248/0x1300
rb_buffer_peek+0x34/0x160
ring_buffer_peek+0xbc/0x224
peek_next_entry+0x98/0xbc
__find_next_entry+0xc4/0x1c0
trace_find_next_entry_inc+0x30/0x94
tracing_read_pipe+0x198/0x304
vfs_read+0xb4/0x1e0
ksys_read+0x74/0x100
__arm64_sys_read+0x24/0x30
el0_svc_common.constprop.0+0x7c/0x1bc
do_el0_svc+0x2c/0x94
el0_svc+0x20/0x30
el0_sync_handler+0xb0/0xb4
el0_sync+0x160/0x180
Then I dump the vmcore and look into the problematic per_cpu ring_buffer,
I found that tail_page/commit_page/reader_page are on the same page while
reader_page->read is obviously abnormal:
tail_page == commit_page == reader_page == {
.write = 0x100d20,
.read = 0x8f9f4805, // Far greater than 0xd20, obviously abnormal!!!
.entries = 0x10004c,
.real_end = 0x0,
.page = {
.time_stamp = 0x857257416af0,
.commit = 0xd20, // This page hasn't been full filled.
// .data[0...0xd20] seems normal.
}
}
The root cause is most likely the race that reader and writer are on the
same page while reader saw an event that not fully committed by writer.
To fix this, add memory barriers to make sure the reader can see the
content of what is committed. Since commit a0fcaaed0c46 ("ring-buffer: Fix
race between reset page and reading page") has added the read barrier in
rb_get_reader_page(), here we just need to add the write barrier.
Link: https://lore.kernel.org/linux-trace-kernel/20230325021247.2923907-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: 77ae365eca89 ("ring-buffer: make lockless")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ee92fa443358f4fc0017c1d0d325c27b37802504 upstream.
KASAN reported follow problem:
BUG: KASAN: use-after-free in lookup_rec
Read of size 8 at addr ffff000199270ff0 by task modprobe
CPU: 2 Comm: modprobe
Call trace:
kasan_report
__asan_load8
lookup_rec
ftrace_location
arch_check_ftrace_location
check_kprobe_address_safe
register_kprobe
When checking pg->records[pg->index - 1].ip in lookup_rec(), it can get a
pg which is newly added to ftrace_pages_start in ftrace_process_locs().
Before the first pg->index++, index is 0 and accessing pg->records[-1].ip
will cause this problem.
Don't check the ip when pg->index is 0.
Link: https://lore.kernel.org/linux-trace-kernel/20230309080230.36064-1-chenzhongjin@huawei.com
Cc: stable@vger.kernel.org
Fixes: 9644302e3315 ("ftrace: Speed up search by skipping pages by address")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9f116f76fa8c04c81aef33ad870dbf9a158e5b70 upstream.
The function hist_field_name() cannot handle being passed a NULL field
parameter. It should never be NULL, but due to a previous bug, NULL was
passed to the function and the kernel crashed due to a NULL dereference.
Mark Rutland reported this to me on IRC.
The bug was fixed, but to prevent future bugs from crashing the kernel,
check the field and add a WARN_ON() if it is NULL.
Link: https://lkml.kernel.org/r/20230302020810.762384440@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: c6afad49d127f ("tracing: Add hist trigger 'sym' and 'sym-offset' modifiers")
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3e4272b9954094907f16861199728f14002fcaf6 ]
In a previous commit 7433632c9ff6, buffer, buffer->buffers and
buffer->buffers[cpu] in ring_buffer_wake_waiters() can be NULL,
and thus the related checks are added.
However, in the same call stack, these variables are also used in
ring_buffer_free_read_page():
tracing_buffers_release()
ring_buffer_wake_waiters(iter->array_buffer->buffer)
cpu_buffer = buffer->buffers[cpu] -> Add checks by previous commit
ring_buffer_free_read_page(iter->array_buffer->buffer)
cpu_buffer = buffer->buffers[cpu] -> No check
Thus, to avod possible null-pointer derefernces, the related checks
should be added.
These results are reported by a static tool designed by myself.
Link: https://lkml.kernel.org/r/20230113125501.760324-1-baijiaju1990@gmail.com
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8b152e9150d07a885f95e1fd401fc81af202d9a4 upstream.
Function 'create_hist_field' is called recursively at
trace_events_hist.c:1954 and can return NULL-value that's why we have
to check it to avoid null pointer dereference.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lkml.kernel.org/r/20230111120409.4111-1-n.petrova@fintech.ru
Cc: stable@vger.kernel.org
Fixes: 30350d65ac56 ("tracing: Add variable support to hist triggers")
Signed-off-by: Natalia Petrova <n.petrova@fintech.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3bb06eb6e9acf7c4a3e1b5bc87aed398ff8e2253 upstream.
Currently trace_printk() can be used as soon as early_trace_init() is
called from start_kernel(). But if a crash happens, and
"ftrace_dump_on_oops" is set on the kernel command line, all you get will
be:
[ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6
[ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6
[ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6
This is because the trace_printk() event (type 6) hasn't been registered
yet. That gets done via an early_initcall(), which may be early, but not
early enough.
Instead of registering the trace_printk() event (and other ftrace events,
which are not trace events) via an early_initcall(), have them registered at
the same time that trace_printk() can be used. This way, if there is a
crash before early_initcall(), then the trace_printk()s will actually be
useful.
Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: e725c731e3bb1 ("tracing: Split tracing initialization into two for early initialization")
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c1ac03af6ed45d05786c219d102f37eb44880f28 upstream.
print_trace_line may overflow seq_file buffer. If the event is not
consumed, the while loop keeps peeking this event, causing a infinite loop.
Link: https://lkml.kernel.org/r/20221129113009.182425-1-yangjihong1@huawei.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 088b1e427dbba ("ftrace: pipe fixes")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 608c6ed3337850c767ab0dd6c583477922233e29 ]
When input some constructed invalid 'trigger' command, command info
in 'error_log' are lost [1].
The root cause is that there is a path that event_hist_trigger_parse()
is recursely called once and 'last_cmd' which save origin command is
cleared, then later calling of hist_err() will no longer record origin
command info:
event_hist_trigger_parse() {
last_cmd_set() // <1> 'last_cmd' save origin command here at first
create_actions() {
onmatch_create() {
action_create() {
trace_action_create() {
trace_action_create_field_var() {
create_field_var_hist() {
event_hist_trigger_parse() { // <2> recursely called once
hist_err_clear() // <3> 'last_cmd' is cleared here
}
hist_err() // <4> No longer find origin command!!!
Since 'glob' is empty string while running into the recurse call, we
can trickly check it and bypass the call of hist_err_clear() to solve it.
[1]
# cd /sys/kernel/tracing
# echo "my_synth_event int v1; int v2; int v3;" >> synthetic_events
# echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
# echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\
pid,pid1)" >> events/sched/sched_switch/trigger
# cat error_log
[ 8.405018] hist:sched:sched_switch: error: Couldn't find synthetic event
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.816902] hist:sched:sched_switch: error: Couldn't find field
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.816902] hist:sched:sched_switch: error: Couldn't parse field variable
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.999880] : error: Couldn't find field
Command:
^
[ 8.999880] : error: Couldn't parse field variable
Command:
^
[ 8.999880] : error: Couldn't find field
Command:
^
[ 8.999880] : error: Couldn't create histogram for field
Command:
^
Link: https://lore.kernel.org/linux-trace-kernel/20221207135326.3483216-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Fixes: f404da6e1d46 ("tracing: Add 'last error' error facility for hist triggers")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f596da3efaf4130ff61cd029558845808df9bf99 ]
When the blk_classic option is enabled, non-blktrace events must be
filtered out. Otherwise, events of other types are output in the blktrace
classic format, which is unexpected.
The problem can be triggered in the following ways:
# echo 1 > /sys/kernel/debug/tracing/options/blk_classic
# echo 1 > /sys/kernel/debug/tracing/events/enable
# echo blk > /sys/kernel/debug/tracing/current_tracer
# cat /sys/kernel/debug/tracing/trace_pipe
Fixes: c71a89615411 ("blktrace: add ftrace plugin")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 56f4ca0a79a9f1af98f26c54b9b89ba1f9bcc6bd upstream.
rb_head_page_deactivate() expects cpu_buffer to contain a valid list of
->pages, so verify that the list is actually present before calling it.
Found by Linux Verification Center (linuxtesting.org) with the SVACE
static analysis tool.
Link: https://lkml.kernel.org/r/20221114143129.3534443-1-d-tatianin@yandex-team.ru
Cc: stable@vger.kernel.org
Fixes: 77ae365eca895 ("ring-buffer: make lockless")
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 19ba6c8af9382c4c05dc6a0a79af3013b9a35cd0 upstream.
The @ftrace_mod is allocated by kzalloc(), so both the members {prev,next}
of @ftrace_mode->list are NULL, it's not a valid state to call list_del().
If kstrdup() for @ftrace_mod->{func|module} fails, it goes to @out_free
tag and calls free_ftrace_mod() to destroy @ftrace_mod, then list_del()
will write prev->next and next->prev, where null pointer dereference
happens.
BUG: kernel NULL pointer dereference, address: 0000000000000008
Oops: 0002 [#1] PREEMPT SMP NOPTI
Call Trace:
<TASK>
ftrace_mod_callback+0x20d/0x220
? do_filp_open+0xd9/0x140
ftrace_process_regex.isra.51+0xbf/0x130
ftrace_regex_write.isra.52.part.53+0x6e/0x90
vfs_write+0xee/0x3a0
? __audit_filter_op+0xb1/0x100
? auditd_test_task+0x38/0x50
ksys_write+0xa5/0xe0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Kernel panic - not syncing: Fatal exception
So call INIT_LIST_HEAD() to initialize the list member to fix this issue.
Link: https://lkml.kernel.org/r/20221116015207.30858-1-xiujianfeng@huawei.com
Cc: stable@vger.kernel.org
Fixes: 673feb9d76ab ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bcea02b096333dc74af987cb9685a4dbdd820840 upstream.
If we can't allocate this size, try something smaller with half of the
size. Its order should be decreased by one instead of divided by two.
Link: https://lkml.kernel.org/r/20221109094434.84046-3-wangwensheng4@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes: a79008755497d ("ftrace: Allocate the mcount record pages as groups")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 08948caebe93482db1adfd2154eba124f66d161d upstream.
If the number of mcount entries is an integer multiple of
ENTRIES_PER_PAGE, the page count showing on the console would be wrong.
Link: https://lkml.kernel.org/r/20221109094434.84046-2-wangwensheng4@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes: 5821e1b74f0d0 ("function tracing: fix wrong pos computing when read buffer has been fulfilled")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a0fcaaed0c46cf9399d3a2d6e0c87ddb3df0e044 upstream.
The ring buffer is broken up into sub buffers (currently of page size).
Each sub buffer has a pointer to its "tail" (the last event written to the
sub buffer). When a new event is requested, the tail is locally
incremented to cover the size of the new event. This is done in a way that
there is no need for locking.
If the tail goes past the end of the sub buffer, the process of moving to
the next sub buffer takes place. After setting the current sub buffer to
the next one, the previous one that had the tail go passed the end of the
sub buffer needs to be reset back to the original tail location (before
the new event was requested) and the rest of the sub buffer needs to be
"padded".
The race happens when a reader takes control of the sub buffer. As readers
do a "swap" of sub buffers from the ring buffer to get exclusive access to
the sub buffer, it replaces the "head" sub buffer with an empty sub buffer
that goes back into the writable portion of the ring buffer. This swap can
happen as soon as the writer moves to the next sub buffer and before it
updates the last sub buffer with padding.
Because the sub buffer can be released to the reader while the writer is
still updating the padding, it is possible for the reader to see the event
that goes past the end of the sub buffer. This can cause obvious issues.
To fix this, add a few memory barriers so that the reader definitely sees
the updates to the sub buffer, and also waits until the writer has put
back the "tail" of the sub buffer back to the last event that was written
on it.
To be paranoid, it will only spin for 1 second, otherwise it will
warn and shutdown the ring buffer code. 1 second should be enough as
the writer does have preemption disabled. If the writer doesn't move
within 1 second (with preemption disabled) something is horribly
wrong. No interrupt should last 1 second!
Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369
Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: c7b0930857e22 ("ring-buffer: prevent adding write in discarded area")
Reported-by: Jiazi.Li <jiazi.li@transsion.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ec0bbc5ec5664dcee344f79373852117dc672c86 upstream.
The wake up waiters only checks the "wakeup_full" variable and not the
"full_waiters_pending". The full_waiters_pending is set when a waiter is
added to the wait queue. The wakeup_full is only set when an event is
triggered, and it clears the full_waiters_pending to avoid multiple calls
to irq_work_queue().
The irq_work callback really needs to check both wakeup_full as well as
full_waiters_pending such that this code can be used to wake up waiters
when a file is closed that represents the ring buffer and the waiters need
to be woken up.
Link: https://lkml.kernel.org/r/20220927231824.209460321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fa8f4a89736b654125fb254b0db753ac68a5fced upstream.
If a page is partially read, and then the splice system call is run
against the ring buffer, it will always fail to read, no matter how much
is in the ring buffer. That's because the code path for a partial read of
the page does will fail if the "full" flag is set.
The splice system call wants full pages, so if the read of the ring buffer
is not yet full, it should return zero, and the splice will block. But if
a previous read was done, where the beginning has been consumed, it should
still be given to the splice caller if the rest of the page has been
written to.
This caused the splice command to never consume data in this scenario, and
let the ring buffer just fill up and lose events.
Link: https://lkml.kernel.org/r/20220927144317.46be6b80@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 8789a9e7df6bf ("ring-buffer: read page interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0ce0638edf5ec83343302b884fa208179580700a upstream.
When executing following commands like what document said, but the log
"#### all functions enabled ####" was not shown as expect:
1. Set a 'mod' filter:
$ echo 'write*:mod:ext3' > /sys/kernel/tracing/set_ftrace_filter
2. Invert above filter:
$ echo '!write*:mod:ext3' >> /sys/kernel/tracing/set_ftrace_filter
3. Read the file:
$ cat /sys/kernel/tracing/set_ftrace_filter
By some debugging, I found that flag FTRACE_HASH_FL_MOD was not unset
after inversion like above step 2 and then result of ftrace_hash_empty()
is incorrect.
Link: https://lkml.kernel.org/r/20220926152008.2239274-1-zhengyejian1@huawei.com
Cc: <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 8c08f0d5c6fb ("ftrace: Have cached module filters be an active filter")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c3b0f72e805f0801f05fa2aa52011c4bfc694c44 upstream.
ftrace_startup does not remove ops from ftrace_ops_list when
ftrace_startup_enable fails:
register_ftrace_function
ftrace_startup
__register_ftrace_function
...
add_ftrace_ops(&ftrace_ops_list, ops)
...
...
ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1
...
return 0 // ops is in the ftrace_ops_list.
When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything:
unregister_ftrace_function
ftrace_shutdown
if (unlikely(ftrace_disabled))
return -ENODEV; // return here, __unregister_ftrace_function is not executed,
// as a result, ops is still in the ftrace_ops_list
__unregister_ftrace_function
...
If ops is dynamically allocated, it will be free later, in this case,
is_ftrace_trampoline accesses NULL pointer:
is_ftrace_trampoline
ftrace_ops_trampoline
do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL!
Syzkaller reports as follows:
[ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b
[ 1203.508039] #PF: supervisor read access in kernel mode
[ 1203.508798] #PF: error_code(0x0000) - not-present page
[ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0
[ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI
[ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8
[ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0
[ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00
[ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246
[ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866
[ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b
[ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07
[ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399
[ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008
[ 1203.525634] FS: 00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000
[ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0
[ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Therefore, when ftrace_startup_enable fails, we need to rollback registration
process and remove ops from ftrace_ops_list.
Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ab8384442ee512fc0fc72deeb036110843d0e7ff upstream.
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.
Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org
Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b2380577d4fe1c0ef3fa50417f1e441c016e4cbe upstream.
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.
Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org
Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 1e3bac71c5053 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7edc3945bdce9c39198a10d6129377a5c53559c2 upstream.
This reverts commit 46bbe5c671e06f070428b9be142cc4ee5cedebac.
As commit 46bbe5c671e0 ("tracing: fix double free") said, the
"double free" problem reported by clang static analyzer is:
> In parse_var_defs() if there is a problem allocating
> var_defs.expr, the earlier var_defs.name is freed.
> This free is duplicated by free_var_defs() which frees
> the rest of the list.
However, if there is a problem allocating N-th var_defs.expr:
+ in parse_var_defs(), the freed 'earlier var_defs.name' is
actually the N-th var_defs.name;
+ then in free_var_defs(), the names from 0th to (N-1)-th are freed;
IF ALLOCATING PROBLEM HAPPENED HERE!!! -+
\
|
0th 1th (N-1)-th N-th V
+-------------+-------------+-----+-------------+-----------
var_defs: | name | expr | name | expr | ... | name | expr | name | ///
+-------------+-------------+-----+-------------+-----------
These two frees don't act on same name, so there was no "double free"
problem before. Conversely, after that commit, we get a "memory leak"
problem because the above "N-th var_defs.name" is not freed.
If enable CONFIG_DEBUG_KMEMLEAK and inject a fault at where the N-th
var_defs.expr allocated, then execute on shell like:
$ echo 'hist:key=call_site:val=$v1,$v2:v1=bytes_req,v2=bytes_alloc' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
Then kmemleak reports:
unreferenced object 0xffff8fb100ef3518 (size 8):
comm "bash", pid 196, jiffies 4295681690 (age 28.538s)
hex dump (first 8 bytes):
76 31 00 00 b1 8f ff ff v1......
backtrace:
[<0000000038fe4895>] kstrdup+0x2d/0x60
[<00000000c99c049a>] event_hist_trigger_parse+0x206f/0x20e0
[<00000000ae70d2cc>] trigger_process_regex+0xc0/0x110
[<0000000066737a4c>] event_trigger_write+0x75/0xd0
[<000000007341e40c>] vfs_write+0xbb/0x2a0
[<0000000087fde4c2>] ksys_write+0x59/0xd0
[<00000000581e9cdf>] do_syscall_64+0x3a/0x80
[<00000000cf3b065c>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
Link: https://lkml.kernel.org/r/20220711014731.69520-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Fixes: 46bbe5c671e0 ("tracing: fix double free")
Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ef9188bcc6ca1d8a2ad83e826b548e6820721061 ]
To prepare for support asynchronous tracer_init_tracefs initcall,
avoid calling create_trace_option_files before __update_tracer_options.
Otherwise, create_trace_option_files will show warning because
some tracers in trace_types list are already in tr->topts.
For example, hwlat_tracer call register_tracer in late_initcall,
and global_trace.dir is already created in tracing_init_dentry,
hwlat_tracer will be put into tr->topts.
Then if the __update_tracer_options is executed after hwlat_tracer
registered, create_trace_option_files find that hwlat_tracer is
already in tr->topts.
Link: https://lkml.kernel.org/r/20220426122407.17042-2-mark-pk.tsai@mediatek.com
Link: https://lore.kernel.org/lkml/20220322133339.GA32582@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 12025abdc8539ed9d5014e2d647a3fd1bd3de5cd ]
When setting bootparams="trace_event=initcall:initcall_start tp_printk=1" in the
cmdline, the output_printk() was called, and the spin_lock_irqsave() was called in the
atomic and irq disable interrupt context suitation. On the PREEMPT_RT kernel,
these locks are replaced with sleepable rt-spinlock, so the stack calltrace will
be triggered.
Fix it by raw_spin_lock_irqsave when PREEMPT_RT and "trace_event=initcall:initcall_start
tp_printk=1" enabled.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 2, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffffffff8992303e>] try_to_wake_up+0x7e/0xba0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.1-rt17+ #19 34c5812404187a875f32bee7977f7367f9679ea7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x60/0x8c
dump_stack+0x10/0x12
__might_resched.cold+0x11d/0x155
rt_spin_lock+0x40/0x70
trace_event_buffer_commit+0x2fa/0x4c0
? map_vsyscall+0x93/0x93
trace_event_raw_event_initcall_start+0xbe/0x110
? perf_trace_initcall_finish+0x210/0x210
? probe_sched_wakeup+0x34/0x40
? ttwu_do_wakeup+0xda/0x310
? trace_hardirqs_on+0x35/0x170
? map_vsyscall+0x93/0x93
do_one_initcall+0x217/0x3c0
? trace_event_raw_event_initcall_level+0x170/0x170
? push_cpu_stop+0x400/0x400
? cblist_init_generic+0x241/0x290
kernel_init_freeable+0x1ac/0x347
? _raw_spin_unlock_irq+0x65/0x80
? rest_init+0xf0/0xf0
kernel_init+0x1e/0x150
ret_from_fork+0x22/0x30
</TASK>
Link: https://lkml.kernel.org/r/20220419013910.894370-1-jun.miao@intel.com
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 99696a2592bca641eb88cc9a80c90e591afebd0f upstream.
In create_var_ref(), init_var_ref() is called to initialize the fields
of variable ref_field, which is allocated in the previous function call
to create_hist_field(). Function init_var_ref() allocates the
corresponding fields such as ref_field->system, but frees these fields
when the function encounters an error. The caller later calls
destroy_hist_field() to conduct error handling, which frees the fields
and the variable itself. This results in double free of the fields which
are already freed in the previous function.
Fix this by storing NULL to the corresponding fields when they are freed
in init_var_ref().
Link: https://lkml.kernel.org/r/20220425063739.3859998-1-keitasuzuki.park@sslab.ics.keio.ac.jp
Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
CC: stable@vger.kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Keita Suzuki <keitasuzuki.park@sslab.ics.keio.ac.jp>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ce33c845b030c9cf768370c951bc699470b09fa7 upstream.
The stacktrace event trigger is not dumping the stacktrace to the instance
where it was enabled, but to the global "instance."
Use the private_data, pointing to the trigger file, to figure out the
corresponding trace instance, and use it in the trigger action, like
snapshot_trigger does.
Link: https://lkml.kernel.org/r/afbb0b4f18ba92c276865bc97204d438473f4ebc.1645396236.git.bristot@kernel.org
Cc: stable@vger.kernel.org
Fixes: ae63b31e4d0e2 ("tracing: Separate out trace events from global variables")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7acf3a127bb7c65ff39099afd78960e77b2ca5de ]
Booting the kernel with 'trace_buf_size=1' give a warning at
boot during the ftrace selftests:
[ 0.892809] Running postponed tracer tests:
[ 0.892893] Testing tracer function:
[ 0.901899] Callback from call_rcu_tasks_trace() invoked.
[ 0.983829] Callback from call_rcu_tasks_rude() invoked.
[ 1.072003] .. bad ring buffer .. corrupted trace buffer ..
[ 1.091944] Callback from call_rcu_tasks() invoked.
[ 1.097695] PASSED
[ 1.097701] Testing dynamic ftrace: .. filter failed count=0 ..FAILED!
[ 1.353474] ------------[ cut here ]------------
[ 1.353478] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:1951 run_tracer_selftest+0x13c/0x1b0
Therefore enforce a minimum of 4096 bytes to make the selftest pass.
Link: https://lkml.kernel.org/r/20220214134456.1751749-1-svens@linux.ibm.com
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 1d1898f65616c4601208963c3376c1d828cbf2c7 upstream.
When trying to add a histogram against an event with the "cpu" field, it
was impossible due to "cpu" being a keyword to key off of the running CPU.
So to fix this, it was changed to "common_cpu" to match the other generic
fields (like "common_pid"). But since some scripts used "cpu" for keying
off of the CPU (for events that did not have "cpu" as a field, which is
most of them), a backward compatibility trick was added such that if "cpu"
was used as a key, and the event did not have "cpu" as a field name, then
it would fallback and switch over to "common_cpu".
This fix has a couple of subtle bugs. One was that when switching over to
"common_cpu", it did not change the field name, it just set a flag. But
the code still found a "cpu" field. The "cpu" field is used for filtering
and is returned when the event does not have a "cpu" field.
This was found by:
# cd /sys/kernel/tracing
# echo hist:key=cpu,pid:sort=cpu > events/sched/sched_wakeup/trigger
# cat events/sched/sched_wakeup/hist
Which showed the histogram unsorted:
{ cpu: 19, pid: 1175 } hitcount: 1
{ cpu: 6, pid: 239 } hitcount: 2
{ cpu: 23, pid: 1186 } hitcount: 14
{ cpu: 12, pid: 249 } hitcount: 2
{ cpu: 3, pid: 994 } hitcount: 5
Instead of hard coding the "cpu" checks, take advantage of the fact that
trace_event_field_field() returns a special field for "cpu" and "CPU" if
the event does not have "cpu" as a field. This special field has the
"filter_type" of "FILTER_CPU". Check that to test if the returned field is
of the CPU type instead of doing the string compare.
Also, fix the sorting bug by testing for the hist_field flag of
HIST_FIELD_FL_CPU when setting up the sort routine. Otherwise it will use
the special CPU field to know what compare routine to use, and since that
special field does not have a size, it returns tracing_map_cmp_none.
Cc: stable@vger.kernel.org
Fixes: 1e3bac71c505 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Reported-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 302e9edd54985f584cfc180098f3554774126969 upstream.
If a trigger is set on an event to disable or enable tracing within an
instance, then tracing should be disabled or enabled in the instance and
not at the top level, which is confusing to users.
Link: https://lkml.kernel.org/r/20220223223837.14f94ec3@rorschach.local.home
Cc: stable@vger.kernel.org
Fixes: ae63b31e4d0e2 ("tracing: Separate out trace events from global variables")
Tested-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3203ce39ac0b2a57a84382ec184c7d4a0bede175 ]
The kernel parameter "tp_printk_stop_on_boot" starts with "tp_printk" which is
the same as another kernel parameter "tp_printk". If "tp_printk" setup is
called before the "tp_printk_stop_on_boot", it will override the latter
and keep it from being set.
This is similar to other kernel parameter issues, such as:
Commit 745a600cf1a6 ("um: console: Ignore console= option")
or init/do_mounts.c:45 (setup function of "ro" kernel param)
Fix it by checking for a "_" right after the "tp_printk" and if that
exists do not process the parameter.
Link: https://lkml.kernel.org/r/20220208195421.969326-1-jsyoo5b@gmail.com
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
[ Fixed up change log and added space after if condition ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f28439db470cca8b6b082239314e9fd10bd39034 upstream.
Tag trace_percpu_buffer as a percpu pointer to resolve warnings
reported by sparse:
/linux/kernel/trace/trace.c:3218:46: warning: incorrect type in initializer (different address spaces)
/linux/kernel/trace/trace.c:3218:46: expected void const [noderef] __percpu *__vpp_verify
/linux/kernel/trace/trace.c:3218:46: got struct trace_buffer_struct *
/linux/kernel/trace/trace.c:3234:9: warning: incorrect type in initializer (different address spaces)
/linux/kernel/trace/trace.c:3234:9: expected void const [noderef] __percpu *__vpp_verify
/linux/kernel/trace/trace.c:3234:9: got int *
Link: https://lkml.kernel.org/r/ebabd3f23101d89cb75671b68b6f819f5edc830b.1640255304.git.naveen.n.rao@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 07d777fe8c398 ("tracing: Add percpu buffers for trace_printk()")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 823e670f7ed616d0ce993075c8afe0217885f79d upstream.
With the new osnoise tracer, we are seeing the below splat:
Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0)
BUG: Unable to handle kernel data access on read at 0xc7d880000
Faulting instruction address: 0xc0000000002ffa10
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
...
NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0
LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0
Call Trace:
[c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable)
[c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90
[c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290
[c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710
[c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130
[c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270
[c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180
[c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278
osnoise tracer on ppc64le is triggering osnoise_taint() for negative
duration in get_int_safe_duration() called from
trace_sched_switch_callback()->thread_exit().
The problem though is that the check for a valid trace_percpu_buffer is
incorrect in get_trace_buf(). The check is being done after calculating
the pointer for the current cpu, rather than on the main percpu pointer.
Fix the check to be against trace_percpu_buffer.
Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Fixes: e2ace001176dc9 ("tracing: Choose static tp_printk buffer by explicit nesting count")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f25667e5980a4333729cac3101e5de1bb851f71a ]
Doing the command:
echo 'hist:key=common_pid.execname,common_timestamp' > /sys/kernel/debug/tracing/events/xxx/trigger
Triggers many kmemleak reports:
unreferenced object 0xffff0000c7ea4980 (size 128):
comm "bash", pid 338, jiffies 4294912626 (age 9339.324s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0
[<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178
[<00000000633bd154>] tracing_map_init+0x1f8/0x268
[<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0
[<00000000bf8520ed>] trigger_process_regex+0xd4/0x128
[<00000000f549355a>] event_trigger_write+0x7c/0x120
[<00000000b80f898d>] vfs_write+0xc4/0x380
[<00000000823e1055>] ksys_write+0x74/0xf8
[<000000008a9374aa>] __arm64_sys_write+0x24/0x30
[<0000000087124017>] do_el0_svc+0x88/0x1c0
[<00000000efd0dcd1>] el0_svc+0x1c/0x28
[<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0
[<00000000e7399680>] el0_sync+0x148/0x180
unreferenced object 0xffff0000c7ea4980 (size 128):
comm "bash", pid 338, jiffies 4294912626 (age 9339.324s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0
[<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178
[<00000000633bd154>] tracing_map_init+0x1f8/0x268
[<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0
[<00000000bf8520ed>] trigger_process_regex+0xd4/0x128
[<00000000f549355a>] event_trigger_write+0x7c/0x120
[<00000000b80f898d>] vfs_write+0xc4/0x380
[<00000000823e1055>] ksys_write+0x74/0xf8
[<000000008a9374aa>] __arm64_sys_write+0x24/0x30
[<0000000087124017>] do_el0_svc+0x88/0x1c0
[<00000000efd0dcd1>] el0_svc+0x1c/0x28
[<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0
[<00000000e7399680>] el0_sync+0x148/0x180
The reason is elts->pages[i] is alloced by get_zeroed_page.
and kmemleak will not scan the area alloced by get_zeroed_page.
The address stored in elts->pages will be regarded as leaked.
That is, the elts->pages[i] will have pointers loaded onto it as well, and
without telling kmemleak about it, those pointers will look like memory
without a reference.
To fix this, call kmemleak_alloc to tell kmemleak to scan elts->pages[i]
Link: https://lkml.kernel.org/r/20211124140801.87121-1-chenjun102@huawei.com
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 6cb206508b621a9a0a2c35b60540e399225c8243 upstream.
When pid filtering is activated in an instance, all of the events trace
files for that instance has the PID_FILTER flag set. This determines
whether or not pid filtering needs to be done on the event, otherwise the
event is executed as normal.
If pid filtering is enabled when an event is created (via a dynamic event
or modules), its flag is not updated to reflect the current state, and the
events are not filtered properly.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a55f224ff5f238013de8762c4287117e47b86e22 upstream.
If a event is filtered by pid and a trigger that requires processing of
the event to happen is a attached to the event, the discard portion does
not take the pid filtering into account, and the event will then be
recorded when it should not have been.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7ce1bb83a14019f8c396d57ec704d19478747716 ]
If CONFIG_CFI_CLANG=y, attempting to read an event histogram will cause
the kernel to panic due to failed CFI check.
1. echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger
2. cat events/sched/sched_switch/hist
3. kernel panics on attempting to read hist
This happens because the sort() function expects a generic
int (*)(const void *, const void *) pointer for the compare function.
To prevent this CFI failure, change tracing map cmp_entries_* function
signatures to match this.
Also, fix the build error reported by the kernel test robot [1].
[1] https://lore.kernel.org/r/202110141140.zzi4dRh4-lkp@intel.com/
Link: https://lkml.kernel.org/r/20211014045217.3265162-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit ed65df63a39a3f6ed04f7258de8b6789e5021c18 upstream.
While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" unsafe.
The recursion logic is done in hot paths like the function tracer. Thus,
any code executed causes noticeable overhead. Thus, tricks are done to try
to limit the amount of code executed. This included the recursion testing
logic.
Having recursion testing is important, as there are many paths that can
end up in an infinite recursion cycle when tracing every function in the
kernel. Thus protection is needed to prevent that from happening.
Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits are used to know which context one is in (normal,
softirq, irq and NMI). If a recursion occurs in the same level, it is
prevented*.
Then there are infrastructure levels of recursion as well. When more than
one callback is attached to the same function to trace, it calls a loop
function to iterate over all the callbacks. Both the callbacks and the
loop function have recursion protection. The callbacks use the
"ftrace_test_recursion_trylock()" which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.
If an architecture does not implement all the features supported by ftrace
then the callbacks are never called directly, and the loop function is
called instead, which will implement the features of ftrace.
Since both the loop function and the callbacks do recursion protection, it
was seemed unnecessary to do it in both locations. Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits. Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.
This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before going to the callbacks. But this is not true for
architectures that support all the ftrace features. That's because the
loop function could be called due to two callbacks attached to the same
function, but then a recursion function inside the callback could be
called that does not share any other callback, and it will be called
directly.
i.e.
traced_function_1: [ more than one callback tracing it ]
call loop_func
loop_func:
trace_recursion set internal bit
call callback
callback:
trace_recursion [ skipped because internal bit is set, return 0 ]
call traced_function_2
traced_function_2: [ only traced by above callback ]
call callback
callback:
trace_recursion [ skipped because internal bit is set, return 0 ]
call traced_function_2
[ wash, rinse, repeat, BOOM! out of shampoo! ]
Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
call for all functions.
Since we want to encourage architectures to implement all ftrace features,
having them slow down due to this extra logic may encourage the
maintainers to update to the latest ftrace features. And because this
logic is only safe for them, remove it completely.
[*] There is on layer of recursion that is allowed, and that is to allow
for the transition between interrupt context (normal -> softirq ->
irq -> NMI), because a trace may occur before the context update is
visible to the trace recursion logic.
Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
Cc: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: stable@vger.kernel.org
Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5afedf670caf30a2b5a52da96eb7eac7dee6a9c9 ]
There is an use-after-free problem triggered by following process:
P1(sda) P2(sdb)
echo 0 > /sys/block/sdb/trace/enable
blk_trace_remove_queue
synchronize_rcu
blk_trace_free
relay_close
rcu_read_lock
__blk_add_trace
trace_note_tsk
(Iterate running_trace_list)
relay_close_buf
relay_destroy_buf
kfree(buf)
trace_note(sdb's bt)
relay_reserve
buf->offset <- nullptr deference (use-after-free) !!!
rcu_read_unlock
[ 502.714379] BUG: kernel NULL pointer dereference, address:
0000000000000010
[ 502.715260] #PF: supervisor read access in kernel mode
[ 502.715903] #PF: error_code(0x0000) - not-present page
[ 502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0
[ 502.717252] Oops: 0000 [#1] SMP
[ 502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360
[ 502.732872] Call Trace:
[ 502.733193] __blk_add_trace.cold+0x137/0x1a3
[ 502.733734] blk_add_trace_rq+0x7b/0xd0
[ 502.734207] blk_add_trace_rq_issue+0x54/0xa0
[ 502.734755] blk_mq_start_request+0xde/0x1b0
[ 502.735287] scsi_queue_rq+0x528/0x1140
...
[ 502.742704] sg_new_write.isra.0+0x16e/0x3e0
[ 502.747501] sg_ioctl+0x466/0x1100
Reproduce method:
ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
ioctl(/dev/sda, BLKTRACESTART)
ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
ioctl(/dev/sdb, BLKTRACESTART)
echo 0 > /sys/block/sdb/trace/enable &
// Add delay(mdelay/msleep) before kernel enters blk_trace_free()
ioctl$SG_IO(/dev/sda, SG_IO, ...)
// Enters trace_note_tsk() after blk_trace_free() returned
// Use mdelay in rcu region rather than msleep(which may schedule out)
Remove blk_trace from running_list before calling blk_trace_free() by
sysfs if blk_trace is at Blktrace_running state.
Fixes: c71a896154119f ("blktrace: add ftrace plugin")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|