<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/trace, branch v5.10.48</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.10.48</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.10.48'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2021-06-23T12:42:50Z</updated>
<entry>
<title>tracing: Do no increment trace_clock_global() by one</title>
<updated>2021-06-23T12:42:50Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-06-17T21:12:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c9fd0ab39f0ae4143f6d57ed8a6c3416d772e408'/>
<id>urn:sha1:c9fd0ab39f0ae4143f6d57ed8a6c3416d772e408</id>
<content type='text'>
commit 89529d8b8f8daf92d9979382b8d2eb39966846ea upstream.

The trace_clock_global() tries to make sure the events between CPUs is
somewhat in order. A global value is used and updated by the latest read
of a clock. If one CPU is ahead by a little, and is read by another CPU, a
lock is taken, and if the timestamp of the other CPU is behind, it will
simply use the other CPUs timestamp.

The lock is also only taken with a "trylock" due to tracing, and strange
recursions can happen. The lock is not taken at all in NMI context.

In the case where the lock is not able to be taken, the non synced
timestamp is returned. But it will not be less than the saved global
timestamp.

The problem arises because when the time goes "backwards" the time
returned is the saved timestamp plus 1. If the lock is not taken, and the
plus one to the timestamp is returned, there's a small race that can cause
the time to go backwards!

	CPU0				CPU1
	----				----
				trace_clock_global() {
				    ts = clock() [ 1000 ]
				    trylock(clock_lock) [ success ]
				    global_ts = ts; [ 1000 ]

				    &lt;interrupted by NMI&gt;
 trace_clock_global() {
    ts = clock() [ 999 ]
    if (ts &lt; global_ts)
	ts = global_ts + 1 [ 1001 ]

    trylock(clock_lock) [ fail ]

    return ts [ 1001]
 }
				    unlock(clock_lock);
				    return ts; [ 1000 ]
				}

 trace_clock_global() {
    ts = clock() [ 1000 ]
    if (ts &lt; global_ts) [ false 1000 == 1000 ]

    trylock(clock_lock) [ success ]
    global_ts = ts; [ 1000 ]
    unlock(clock_lock)

    return ts; [ 1000 ]
 }

The above case shows to reads of trace_clock_global() on the same CPU, but
the second read returns one less than the first read. That is, time when
backwards, and this is not what is allowed by trace_clock_global().

This was triggered by heavy tracing and the ring buffer checker that tests
for the clock going backwards:

 Ring buffer clock went backwards: 20613921464 -&gt; 20613921463
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0
 Modules linked in:
 [..]
 [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463
   [20613915818] PAGE TIME STAMP
   [20613915818] delta:0
   [20613915819] delta:1
   [20613916035] delta:216
   [20613916465] delta:430
   [20613916575] delta:110
   [20613916749] delta:174
   [20613917248] delta:499
   [20613917333] delta:85
   [20613917775] delta:442
   [20613917921] delta:146
   [20613918321] delta:400
   [20613918568] delta:247
   [20613918768] delta:200
   [20613919306] delta:538
   [20613919353] delta:47
   [20613919980] delta:627
   [20613920296] delta:316
   [20613920571] delta:275
   [20613920862] delta:291
   [20613921152] delta:290
   [20613921464] delta:312
   [20613921464] delta:0 TIME EXTEND
   [20613921464] delta:0

This happened more than once, and always for an off by one result. It also
started happening after commit aafe104aa9096 was added.

Cc: stable@vger.kernel.org
Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block")
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tracing: Do not stop recording comms if the trace file is being read</title>
<updated>2021-06-23T12:42:50Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-06-17T18:32:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b313bd944ddd419f69a3c32ccac77c1883dc4aa7'/>
<id>urn:sha1:b313bd944ddd419f69a3c32ccac77c1883dc4aa7</id>
<content type='text'>
commit 4fdd595e4f9a1ff6d93ec702eaecae451cfc6591 upstream.

A while ago, when the "trace" file was opened, tracing was stopped, and
code was added to stop recording the comms to saved_cmdlines, for mapping
of the pids to the task name.

Code has been added that only records the comm if a trace event occurred,
and there's no reason to not trace it if the trace file is opened.

Cc: stable@vger.kernel.org
Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tracing: Do not stop recording cmdlines when tracing is off</title>
<updated>2021-06-23T12:42:50Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-06-17T17:47:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=adb3849ed8d52af288ea3d384ec3fd00163a94e7'/>
<id>urn:sha1:adb3849ed8d52af288ea3d384ec3fd00163a94e7</id>
<content type='text'>
commit 85550c83da421fb12dc1816c45012e1e638d2b38 upstream.

The saved_cmdlines is used to map pids to the task name, such that the
output of the tracing does not just show pids, but also gives a human
readable name for the task.

If the name is not mapped, the output looks like this:

    &lt;...&gt;-1316          [005] ...2   132.044039: ...

Instead of this:

    gnome-shell-1316    [005] ...2   132.044039: ...

The names are updated when tracing is running, but are skipped if tracing
is stopped. Unfortunately, this stops the recording of the names if the
top level tracer is stopped, and not if there's other tracers active.

The recording of a name only happens when a new event is written into a
ring buffer, so there is no need to test if tracing is on or not. If
tracing is off, then no event is written and no need to test if tracing is
off or not.

Remove the check, as it hides the names of tasks for events in the
instance buffers.

Cc: stable@vger.kernel.org
Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tracing: Correct the length check which causes memory corruption</title>
<updated>2021-06-16T10:01:47Z</updated>
<author>
<name>Liangyan</name>
<email>liangyan.peng@linux.alibaba.com</email>
</author>
<published>2021-06-07T12:57:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=43c32c22254b9328d7abb1c2b0f689dc67838e60'/>
<id>urn:sha1:43c32c22254b9328d7abb1c2b0f689dc67838e60</id>
<content type='text'>
commit 3e08a9f9760f4a70d633c328a76408e62d6f80a3 upstream.

We've suffered from severe kernel crashes due to memory corruption on
our production environment, like,

Call Trace:
[1640542.554277] general protection fault: 0000 [#1] SMP PTI
[1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G
[1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190
[1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286
[1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX:
0000000006e931bf
[1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI:
ffff9a45ff004300
[1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09:
0000000000000000
[1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12:
ffffffff9a20608d
[1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15:
696c662f65636976
[1640542.563128] FS:  00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000)
knlGS:0000000000000000
[1640542.563937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4:
00000000003606e0
[1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1640542.566742] Call Trace:
[1640542.567009]  anon_vma_clone+0x5d/0x170
[1640542.567417]  __split_vma+0x91/0x1a0
[1640542.567777]  do_munmap+0x2c6/0x320
[1640542.568128]  vm_munmap+0x54/0x70
[1640542.569990]  __x64_sys_munmap+0x22/0x30
[1640542.572005]  do_syscall_64+0x5b/0x1b0
[1640542.573724]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1640542.575642] RIP: 0033:0x7f45d6e61e27

James Wang has reproduced it stably on the latest 4.19 LTS.
After some debugging, we finally proved that it's due to ftrace
buffer out-of-bound access using a debug tool as follows:
[   86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000
[   86.780806]  no_context+0xdf/0x3c0
[   86.784327]  __do_page_fault+0x252/0x470
[   86.788367]  do_page_fault+0x32/0x140
[   86.792145]  page_fault+0x1e/0x30
[   86.795576]  strncpy_from_unsafe+0x66/0xb0
[   86.799789]  fetch_memory_string+0x25/0x40
[   86.804002]  fetch_deref_string+0x51/0x60
[   86.808134]  kprobe_trace_func+0x32d/0x3a0
[   86.812347]  kprobe_dispatcher+0x45/0x50
[   86.816385]  kprobe_ftrace_handler+0x90/0xf0
[   86.820779]  ftrace_ops_assist_func+0xa1/0x140
[   86.825340]  0xffffffffc00750bf
[   86.828603]  do_sys_open+0x5/0x1f0
[   86.832124]  do_syscall_64+0x5b/0x1b0
[   86.835900]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

commit b220c049d519 ("tracing: Check length before giving out
the filter buffer") adds length check to protect trace data
overflow introduced in 0fc1b09ff1ff, seems that this fix can't prevent
overflow entirely, the length check should also take the sizeof
entry-&gt;array[0] into account, since this array[0] is filled the
length of trace data and occupy addtional space and risk overflow.

Link: https://lkml.kernel.org/r/20210607125734.1770447-1-liangyan.peng@linux.alibaba.com

Cc: stable@vger.kernel.org
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Xunlei Pang &lt;xlpang@linux.alibaba.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Fixes: b220c049d519 ("tracing: Check length before giving out the filter buffer")
Reviewed-by: Xunlei Pang &lt;xlpang@linux.alibaba.com&gt;
Reviewed-by: yinbinbin &lt;yinbinbin@alibabacloud.com&gt;
Reviewed-by: Wetp Zhang &lt;wetp.zy@linux.alibaba.com&gt;
Tested-by: James Wang &lt;jnwang@linux.alibaba.com&gt;
Signed-off-by: Liangyan &lt;liangyan.peng@linux.alibaba.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ftrace: Do not blindly read the ip address in ftrace_bug()</title>
<updated>2021-06-16T10:01:40Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-06-08T01:39:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=97524384762c1fb9b3ded931498dd2047bd0de81'/>
<id>urn:sha1:97524384762c1fb9b3ded931498dd2047bd0de81</id>
<content type='text'>
commit 6c14133d2d3f768e0a35128faac8aa6ed4815051 upstream.

It was reported that a bug on arm64 caused a bad ip address to be used for
updating into a nop in ftrace_init(), but the error path (rightfully)
returned -EINVAL and not -EFAULT, as the bug caused more than one error to
occur. But because -EINVAL was returned, the ftrace_bug() tried to report
what was at the location of the ip address, and read it directly. This
caused the machine to panic, as the ip was not pointing to a valid memory
address.

Instead, read the ip address with copy_from_kernel_nofault() to safely
access the memory, and if it faults, report that the address faulted,
otherwise report what was in that location.

Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/

Cc: stable@vger.kernel.org
Fixes: 05736a427f7e1 ("ftrace: warn on failure to disable mcount callers")
Reported-by: Mark-PK Tsai &lt;mark-pk.tsai@mediatek.com&gt;
Tested-by: Mark-PK Tsai &lt;mark-pk.tsai@mediatek.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks</title>
<updated>2021-06-10T11:39:19Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2021-05-28T09:16:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ff5039ec75c83d2ed5b781dc7733420ee8c985fc'/>
<id>urn:sha1:ff5039ec75c83d2ed5b781dc7733420ee8c985fc</id>
<content type='text'>
[ Upstream commit ff40e51043af63715ab413995ff46996ecf9583f ]

Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

  1) The audit events that are triggered due to calls to security_locked_down()
     can OOM kill a machine, see below details [0].

  2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
     when trying to wake up kauditd, for example, when using trace_sched_switch()
     tracepoint, see details in [1]. Triggering this was not via some hypothetical
     corner case, but with existing tools like runqlat &amp; runqslower from bcc, for
     example, which make use of this tracepoint. Rough call sequence goes like:

     rq_lock(rq) -&gt; -------------------------+
       trace_sched_switch() -&gt;               |
         bpf_prog_xyz() -&gt;                   +-&gt; deadlock
           selinux_lockdown() -&gt;             |
             audit_log_end() -&gt;              |
               wake_up_interruptible() -&gt;    |
                 try_to_wake_up() -&gt;         |
                   rq_lock(rq) --------------+

What's worse is that the intention of 59438b46471a to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:

  allow &lt;who&gt; &lt;who&gt; : lockdown { &lt;reason&gt; };

However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:

  bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current-&gt;comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.

Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.

The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b46471a where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

  I starting seeing this with F-34. When I run a container that is traced with
  BPF to record the syscalls it is doing, auditd is flooded with messages like:

  type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
    for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
      scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
        tclass=lockdown permissive=0

  This seems to be leading to auditd running out of space in the backlog buffer
  and eventually OOMs the machine.

  [...]
  auditd running at 99% CPU presumably processing all the messages, eventually I get:
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 &gt; audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 &gt; audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 &gt; audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
  Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
  Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
  Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  [...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
    Serhei Makarov says:

  Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
  bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
  is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
  testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
  ppc64le. Example stack trace:

  [...]
  [  730.868702] stack backtrace:
  [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
  [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [  730.873278] Call Trace:
  [  730.873770]  dump_stack+0x7f/0xa1
  [  730.874433]  check_noncircular+0xdf/0x100
  [  730.875232]  __lock_acquire+0x1202/0x1e10
  [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
  [  730.876844]  lock_acquire+0xc2/0x3a0
  [  730.877551]  ? __wake_up_common_lock+0x52/0x90
  [  730.878434]  ? lock_acquire+0xc2/0x3a0
  [  730.879186]  ? lock_is_held_type+0xa7/0x120
  [  730.880044]  ? skb_queue_tail+0x1b/0x50
  [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
  [  730.881656]  ? __wake_up_common_lock+0x52/0x90
  [  730.882532]  __wake_up_common_lock+0x52/0x90
  [  730.883375]  audit_log_end+0x5b/0x100
  [  730.884104]  slow_avc_audit+0x69/0x90
  [  730.884836]  avc_has_perm+0x8b/0xb0
  [  730.885532]  selinux_lockdown+0xa5/0xd0
  [  730.886297]  security_locked_down+0x20/0x40
  [  730.887133]  bpf_probe_read_compat+0x66/0xd0
  [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
  [  730.888917]  trace_call_bpf+0xe9/0x240
  [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
  [  730.890579]  perf_trace_sched_switch+0x142/0x180
  [  730.891485]  ? __schedule+0x6d8/0xb20
  [  730.892209]  __schedule+0x6d8/0xb20
  [  730.892899]  schedule+0x5b/0xc0
  [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
  [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
  [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek &lt;omosnace@redhat.com&gt;
Reported-by: Jakub Hrozek &lt;jhrozek@redhat.com&gt;
Reported-by: Serhei Makarov &lt;smakarov@redhat.com&gt;
Reported-by: Jiri Olsa &lt;jolsa@redhat.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Tested-by: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Paul Moore &lt;paul@paul-moore.com&gt;
Cc: James Morris &lt;jamorris@linux.microsoft.com&gt;
Cc: Jerome Marchand &lt;jmarchan@redhat.com&gt;
Cc: Frank Eigler &lt;fche@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>tracing: Restructure trace_clock_global() to never block</title>
<updated>2021-05-11T12:47:40Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-04-30T16:17:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a33614d52e97fc8077eb0b292189ca7d964cc534'/>
<id>urn:sha1:a33614d52e97fc8077eb0b292189ca7d964cc534</id>
<content type='text'>
commit aafe104aa9096827a429bc1358f8260ee565b7cc upstream.

It was reported that a fix to the ring buffer recursion detection would
cause a hung machine when performing suspend / resume testing. The
following backtrace was extracted from debugging that case:

Call Trace:
 trace_clock_global+0x91/0xa0
 __rb_reserve_next+0x237/0x460
 ring_buffer_lock_reserve+0x12a/0x3f0
 trace_buffer_lock_reserve+0x10/0x50
 __trace_graph_return+0x1f/0x80
 trace_graph_return+0xb7/0xf0
 ? trace_clock_global+0x91/0xa0
 ftrace_return_to_handler+0x8b/0xf0
 ? pv_hash+0xa0/0xa0
 return_to_handler+0x15/0x30
 ? ftrace_graph_caller+0xa0/0xa0
 ? trace_clock_global+0x91/0xa0
 ? __rb_reserve_next+0x237/0x460
 ? ring_buffer_lock_reserve+0x12a/0x3f0
 ? trace_event_buffer_lock_reserve+0x3c/0x120
 ? trace_event_buffer_reserve+0x6b/0xc0
 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0
 ? dpm_run_callback+0x3b/0xc0
 ? pm_ops_is_empty+0x50/0x50
 ? platform_get_irq_byname_optional+0x90/0x90
 ? trace_device_pm_callback_start+0x82/0xd0
 ? dpm_run_callback+0x49/0xc0

With the following RIP:

RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200

Since the fix to the recursion detection would allow a single recursion to
happen while tracing, this lead to the trace_clock_global() taking a spin
lock and then trying to take it again:

ring_buffer_lock_reserve() {
  trace_clock_global() {
    arch_spin_lock() {
      queued_spin_lock_slowpath() {
        /* lock taken */
        (something else gets traced by function graph tracer)
          ring_buffer_lock_reserve() {
            trace_clock_global() {
              arch_spin_lock() {
                queued_spin_lock_slowpath() {
                /* DEAD LOCK! */

Tracing should *never* block, as it can lead to strange lockups like the
above.

Restructure the trace_clock_global() code to instead of simply taking a
lock to update the recorded "prev_time" simply use it, as two events
happening on two different CPUs that calls this at the same time, really
doesn't matter which one goes first. Use a trylock to grab the lock for
updating the prev_time, and if it fails, simply try again the next time.
If it failed to be taken, that means something else is already updating
it.

Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home

Cc: stable@vger.kernel.org
Tested-by: Konstantin Kharlamov &lt;hi-angel@yandex.ru&gt;
Tested-by: Todd Brandt &lt;todd.e.brandt@linux.intel.com&gt;
Fixes: b02414c8f045 ("ring-buffer: Fix recursion protection transitions between interrupt context") # started showing the problem
Fixes: 14131f2f98ac3 ("tracing: implement trace_clock_*() APIs") # where the bug happened
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212761
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tracing: Map all PIDs to command lines</title>
<updated>2021-05-11T12:47:40Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-04-27T15:32:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9e40ef5391df9f3885d07e952b24b5bc04fe8ddf'/>
<id>urn:sha1:9e40ef5391df9f3885d07e952b24b5bc04fe8ddf</id>
<content type='text'>
commit 785e3c0a3a870e72dc530856136ab4c8dd207128 upstream.

The default max PID is set by PID_MAX_DEFAULT, and the tracing
infrastructure uses this number to map PIDs to the comm names of the
tasks, such output of the trace can show names from the recorded PIDs in
the ring buffer. This mapping is also exported to user space via the
"saved_cmdlines" file in the tracefs directory.

But currently the mapping expects the PIDs to be less than
PID_MAX_DEFAULT, which is the default maximum and not the real maximum.
Recently, systemd will increases the maximum value of a PID on the system,
and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, its
comm is not recorded. This leads to the entire trace to have "&lt;...&gt;" as
the comm name, which is pretty useless.

Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead
of just mapping the index to the comm, map a mask of the PID
(PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the
map_cmdline_to_pid array (that already exists).

This bug goes back to the beginning of ftrace, but hasn't been an issue
until user space started increasing the maximum value of PIDs.

Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure")
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ftrace: Handle commands when closing set_ftrace_filter file</title>
<updated>2021-05-11T12:47:12Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-05-05T14:38:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=634684d79733124f7470b226b0f42aada4426b07'/>
<id>urn:sha1:634684d79733124f7470b226b0f42aada4426b07</id>
<content type='text'>
commit 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 upstream.

 # echo switch_mm:traceoff &gt; /sys/kernel/tracing/set_ftrace_filter

will cause switch_mm to stop tracing by the traceoff command.

 # echo -n switch_mm:traceoff &gt; /sys/kernel/tracing/set_ftrace_filter

does nothing.

The reason is that the parsing in the write function only processes
commands if it finished parsing (there is white space written after the
command). That's to handle:

 write(fd, "switch_mm:", 10);
 write(fd, "traceoff", 8);

cases, where the command is broken over multiple writes.

The problem is if the file descriptor is closed, then the write call is
not processed, and the command needs to be processed in the release code.
The release code can handle matching of functions, but does not handle
commands.

Cc: stable@vger.kernel.org
Fixes: eda1e32855656 ("tracing: handle broken names in ftrace filter")
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ftrace: Check if pages were allocated before calling free_pages()</title>
<updated>2021-04-16T09:43:20Z</updated>
<author>
<name>Steven Rostedt (VMware)</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2021-03-30T13:58:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1dcb3ebc24164c0b5d3b13696d80bf163e16b664'/>
<id>urn:sha1:1dcb3ebc24164c0b5d3b13696d80bf163e16b664</id>
<content type='text'>
[ Upstream commit 59300b36f85f254260c81d9dd09195fa49eb0f98 ]

It is possible that on error pg-&gt;size can be zero when getting its order,
which would return a -1 value. It is dangerous to pass in an order of -1
to free_pages(). Check if order is greater than or equal to zero before
calling free_pages().

Link: https://lore.kernel.org/lkml/20210330093916.432697c7@gandalf.local.home/

Reported-by: Abaci Robot &lt;abaci@linux.alibaba.com&gt;
Signed-off-by: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
