user/sven/linux.git/kernel, branch v6.1.53

tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY

2023-09-13T07:43:05Z

commit 3d07fa1dd19035eb0b13ae6697efd5caa9033e74 upstream. The pipe cpumask used to serialize opens between the main and percpu trace pipes is not zeroed or initialized. This can result in spurious -EBUSY returns if underlying memory is not fully zeroed. This has been observed by immediate failure to read the main trace_pipe file on an otherwise newly booted and idle system: # cat /sys/kernel/debug/tracing/trace_pipe cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy Zero the allocation of pipe_cpumask to avoid the problem. Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com Cc: stable@vger.kernel.org Fixes: c2489bb7e6be ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes") Reviewed-by: Zheng Yejian Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Brian Foster Signed-off-by: Steven Rostedt (Google) Signed-off-by: Greg Kroah-Hartman

bpf: Fix issue in verifying allow_ptr_leaks

2023-09-13T07:43:02Z

commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2 upstream. After we converted the capabilities of our networking-bpf program from cap_sys_admin to cap_net_admin+cap_bpf, our networking-bpf program failed to start. Because it failed the bpf verifier, and the error log is "R3 pointer comparison prohibited". A simple reproducer as follows, SEC("cls-ingress") int ingress(struct __sk_buff *skb) { struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr); if ((long)(iph + 1) > (long)skb->data_end) return TC_ACT_STOLEN; return TC_ACT_OK; } Per discussion with Yonghong and Alexei [1], comparison of two packet pointers is not a pointer leak. This patch fixes it. Our local kernel is 6.1.y and we expect this fix to be backported to 6.1.y, so stable is CCed. [1]. https://lore.kernel.org/bpf/CAADnVQ+Nmspr7Si+pxWn8zkE7hX-7s93ugwC+94aXSy4uQ9vBg@mail.gmail.com/ Suggested-by: Yonghong Song Suggested-by: Alexei Starovoitov Signed-off-by: Yafang Shao Acked-by: Eduard Zingerman Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230823020703.3790-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman

cpu/hotplug: Prevent self deadlock on CPU hot-unplug

2023-09-13T07:43:00Z

commit 2b8272ff4a70b866106ae13c36be7ecbef5d5da2 upstream. Xiongfeng reported and debugged a self deadlock of the task which initiates and controls a CPU hot-unplug operation vs. the CFS bandwidth timer. CPU1 CPU2 T1 sets cfs_quota starts hrtimer cfs_bandwidth 'period_timer' T1 is migrated to CPU2 T1 initiates offlining of CPU1 Hotplug operation starts ... 'period_timer' expires and is re-enqueued on CPU1 ... take_cpu_down() CPU1 shuts down and does not handle timers anymore. They have to be migrated in the post dead hotplug steps by the control task. T1 runs the post dead offline operation T1 is scheduled out T1 waits for 'period_timer' to expire T1 waits there forever if it is scheduled out before it can execute the hrtimer offline callback hrtimers_dead_cpu(). Cure this by delegating the hotplug control operation to a worker thread on an online CPU. This takes the initiating user space task, which might be affected by the bandwidth timer, completely out of the picture. Reported-by: Xiongfeng Wang Signed-off-by: Thomas Gleixner Tested-by: Yu Liao Acked-by: Vincent Guittot Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx Signed-off-by: Greg Kroah-Hartman

printk: ringbuffer: Fix truncating buffer size min_t cast

2023-09-13T07:43:00Z

commit 53e9e33ede37a247d926db5e4a9e56b55204e66c upstream. If an output buffer size exceeded U16_MAX, the min_t(u16, ...) cast in copy_data() was causing writes to truncate. This manifested as output bytes being skipped, seen as %NUL bytes in pstore dumps when the available record size was larger than 65536. Fix the cast to no longer truncate the calculation. Cc: Petr Mladek Cc: Sergey Senozhatsky Cc: Steven Rostedt Cc: John Ogness Reported-by: Vijay Balakrishna Link: https://lore.kernel.org/lkml/d8bb1ec7-a4c5-43a2-9de0-9643a70b899f@linux.microsoft.com/ Fixes: b6cf8b3f3312 ("printk: add lockless ringbuffer") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook Tested-by: Vijay Balakrishna Tested-by: Guilherme G. Piccoli # Steam Deck Reviewed-by: Tyler Hicks (Microsoft) Tested-by: Tyler Hicks (Microsoft) Reviewed-by: John Ogness Reviewed-by: Sergey Senozhatsky Reviewed-by: Petr Mladek Signed-off-by: Petr Mladek Link: https://lore.kernel.org/r/20230811054528.never.165-kees@kernel.org Signed-off-by: Greg Kroah-Hartman

tracing: Fix race issue between cpu buffer write and swap

2023-09-13T07:42:57Z

[ Upstream commit 3163f635b20e9e1fb4659e74f47918c9dddfe64e ] Warning happened in rb_end_commit() at code: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142 rb_commit+0x402/0x4a0 Call Trace: ring_buffer_unlock_commit+0x42/0x250 trace_buffer_unlock_commit_regs+0x3b/0x250 trace_event_buffer_commit+0xe5/0x440 trace_event_buffer_reserve+0x11c/0x150 trace_event_raw_event_sched_switch+0x23c/0x2c0 __traceiter_sched_switch+0x59/0x80 __schedule+0x72b/0x1580 schedule+0x92/0x120 worker_thread+0xa0/0x6f0 It is because the race between writing event into cpu buffer and swapping cpu buffer through file per_cpu/cpu0/snapshot: Write on CPU 0 Swap buffer by per_cpu/cpu0/snapshot on CPU 1 -------- -------- tracing_snapshot_write() [...] ring_buffer_lock_reserve() cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a'; [...] rb_reserve_next_event() [...] ring_buffer_swap_cpu() if (local_read(&cpu_buffer_a->committing)) goto out_dec; if (local_read(&cpu_buffer_b->committing)) goto out_dec; buffer_a->buffers[cpu] = cpu_buffer_b; buffer_b->buffers[cpu] = cpu_buffer_a; // 2. cpu_buffer has swapped here. rb_start_commit(cpu_buffer); if (unlikely(READ_ONCE(cpu_buffer->buffer) != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer' [...] // has not changed here. return NULL; } cpu_buffer_b->buffer = buffer_a; cpu_buffer_a->buffer = buffer_b; [...] // 4. Reserve event from 'cpu_buffer_a'. ring_buffer_unlock_commit() [...] cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!! rb_commit(cpu_buffer) rb_end_commit() // 6. WARN for the wrong 'committing' state !!! Based on above analysis, we can easily reproduce by following testcase: ``` bash #!/bin/bash dmesg -n 7 sysctl -w kernel.panic_on_warn=1 TR=/sys/kernel/tracing echo 7 > ${TR}/buffer_size_kb echo "sched:sched_switch" > ${TR}/set_event while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & ``` To fix it, IIUC, we can use smp_call_function_single() to do the swap on the target cpu where the buffer is located, so that above race would be avoided. Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com Cc: Fixes: f1affcaaa861 ("tracing: Add snapshot in the per_cpu trace directories") Signed-off-by: Zheng Yejian Signed-off-by: Steven Rostedt (Google) Signed-off-by: Sasha Levin

tracing: Remove extra space at the end of hwlat_detector/mode

2023-09-13T07:42:57Z

[ Upstream commit 2cf0dee989a8b2501929eaab29473b6b1fa11057 ] Space is printed after each mode value including the last one: $ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\" "none [round-robin] per-cpu " Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru Cc: Masami Hiramatsu Fixes: 8fa826b7344d ("trace/hwlat: Implement the mode config option") Signed-off-by: Mikhail Kobuk Reviewed-by: Alexey Khoroshilov Acked-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (Google) Signed-off-by: Sasha Levin

tick/rcu: Fix false positive "softirq work is pending" messages

2023-09-13T07:42:57Z

[ Upstream commit 96c1fa04f089a7e977a44e4e8fdc92e81be20bef ] In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from a one conditional: if (a && b && c) warn(); to a three conditional: if (!a) return; if (!b) return; if (!c) return; warn(); But that conversion got the condition for the RT specific local_bh_blocked() wrong. The original condition was: !local_bh_blocked() but the conversion failed to negate it so it ended up as: if (!local_bh_blocked()) return false; This issue lay dormant until another fixup for the same commit was added in commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition"). This commit realized the ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued. Once this commit was backported via linux-stable, both the v6.1 and v6.4 preempt-rt kernels started printing out 10 instances of this at boot: NOHZ tick-stop error: local softirq work is pending, handler #80!!! Remove the negation and return when local_bh_blocked() evaluates to true to bring the correct behaviour back. Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Signed-off-by: Paul Gortmaker Signed-off-by: Thomas Gleixner Tested-by: Ahmad Fatoum Reviewed-by: Wen Yang Acked-by: Frederic Weisbecker Link: https://lore.kernel.org/r/20230818200757.1808398-1-paul.gortmaker@windriver.com Signed-off-by: Sasha Levin

cgroup:namespace: Remove unused cgroup_namespaces_init()

2023-09-13T07:42:56Z

[ Upstream commit 82b90b6c5b38e457c7081d50dff11ecbafc1e61a ] cgroup_namspace_init() just return 0. Therefore, there is no need to call it during start_kernel. Just remove it. Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: Lu Jialin Signed-off-by: Tejun Heo Signed-off-by: Sasha Levin

cgroup/cpuset: Inherit parent's load balance state in v2

2023-09-13T07:42:49Z

[ Upstream commit c8c926200c55454101f072a4b16c9ff5b8c9e56f ] Since commit f28e22441f35 ("cgroup/cpuset: Add a new isolated cpus.partition type"), the CS_SCHED_LOAD_BALANCE bit of a v2 cpuset can be on or off. The child cpusets of a partition root must have the same setting as its parent or it may screw up the rebuilding of sched domains. Fix this problem by making sure the a child v2 cpuset will follows its parent cpuset load balance state unless the child cpuset is a new partition root itself. Fixes: f28e22441f35 ("cgroup/cpuset: Add a new isolated cpus.partition type") Signed-off-by: Waiman Long Signed-off-by: Tejun Heo Signed-off-by: Sasha Levin

PCI: Allow drivers to request exclusive config regions

2023-09-13T07:42:46Z

[ Upstream commit 278294798ac9118412c9624a801d3f20f2279363 ] PCI config space access from user space has traditionally been unrestricted with writes being an understood risk for device operation. Unfortunately, device breakage or odd behavior from config writes lacks indicators that can leave driver writers confused when evaluating failures. This is especially true with the new PCIe Data Object Exchange (DOE) mailbox protocol where backdoor shenanigans from user space through things such as vendor defined protocols may affect device operation without complete breakage. A prior proposal restricted read and writes completely.[1] Greg and Bjorn pointed out that proposal is flawed for a couple of reasons. First, lspci should always be allowed and should not interfere with any device operation. Second, setpci is a valuable tool that is sometimes necessary and it should not be completely restricted.[2] Finally methods exist for full lock of device access if required. Even though access should not be restricted it would be nice for driver writers to be able to flag critical parts of the config space such that interference from user space can be detected. Introduce pci_request_config_region_exclusive() to mark exclusive config regions. Such regions trigger a warning and kernel taint if accessed via user space. Create pci_warn_once() to restrict the user from spamming the log. [1] https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/ [2] https://lore.kernel.org/all/YF8NGeGv9vYcMfTV@kroah.com/ Cc: Bjorn Helgaas Cc: Greg Kroah-Hartman Reviewed-by: Jonathan Cameron Suggested-by: Dan Williams Signed-off-by: Ira Weiny Acked-by: Greg Kroah-Hartman Acked-by: Bjorn Helgaas Link: https://lore.kernel.org/r/20220926215711.2893286-2-ira.weiny@intel.com Signed-off-by: Dan Williams Stable-dep-of: 5e70d0acf082 ("PCI: Add locking to RMW PCI Express Capability Register accessors") Signed-off-by: Sasha Levin