path: root/kernel
Age        Commit message        Author
2023-03-11  kernel/fail_function: fix memory leak with using debugfs_lookup()  [Greg Kroah-Hartman]
[ Upstream commit 2bb3669f576559db273efe49e0e69f82450efbca ] When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. To make things simpler, just call debugfs_lookup_and_remove() instead which handles all of the logic at once. Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20230202151633.2310897-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
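As an illustration of the pattern this commit describes (a sketch with generic names, not the actual fail_function diff):

```c
#include <linux/debugfs.h>

/* Leaky shape: debugfs_lookup() grabs a reference on the dentry; without a
 * matching dput() that reference (and the dentry) leaks over time. */
static void remove_entry_leaky(struct dentry *dir, const char *name)
{
	struct dentry *de = debugfs_lookup(name, dir);

	debugfs_remove(de);	/* file removed, but the lookup ref is never dropped */
}

/* Fixed shape used by the commit: one helper looks up, removes and drops
 * the reference in a single call. */
static void remove_entry_fixed(struct dentry *dir, const char *name)
{
	debugfs_lookup_and_remove(name, dir);
}
```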
2023-03-11  tracing: Add NULL checks for buffer in ring_buffer_free_read_page()  [Jia-Ju Bai]
[ Upstream commit 3e4272b9954094907f16861199728f14002fcaf6 ]

Commit 7433632c9ff6 observed that buffer, buffer->buffers and buffer->buffers[cpu] in ring_buffer_wake_waiters() can be NULL, and added the corresponding checks. However, in the same call stack, these variables are also used in ring_buffer_free_read_page():

  tracing_buffers_release()
    ring_buffer_wake_waiters(iter->array_buffer->buffer)
      cpu_buffer = buffer->buffers[cpu]  -> checks added by the previous commit
    ring_buffer_free_read_page(iter->array_buffer->buffer)
      cpu_buffer = buffer->buffers[cpu]  -> no check

Thus, to avoid possible null-pointer dereferences, the same checks should be added there as well. These results are reported by a static tool designed by myself.

Link: https://lkml.kernel.org/r/20230113125501.760324-1-baijiaju1990@gmail.com
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
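A rough sketch of the kind of guard being added (simplified; the real function also returns the read page and handles other per-cpu state):

```c
void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void *data)
{
	struct ring_buffer_per_cpu *cpu_buffer;

	/* Same guards as the earlier ring_buffer_wake_waiters() fix:
	 * bail out if the buffer or its per-cpu array is already gone. */
	if (!buffer || !buffer->buffers || !buffer->buffers[cpu])
		return;

	cpu_buffer = buffer->buffers[cpu];

	/* ... existing logic that returns the page to cpu_buffer ... */
}
```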
2023-03-11  ring-buffer: Handle race between rb_move_tail and rb_check_pages  [Mukesh Ojha]
commit 8843e06f67b14f71c044bf6267b2387784c7e198 upstream. It seems a data race between ring_buffer writing and integrity check. That is, RB_FLAG of head_page is been updating, while at same time RB_FLAG was cleared when doing integrity check rb_check_pages(): rb_check_pages() rb_handle_head_page(): -------- -------- rb_head_page_deactivate() rb_head_page_set_normal() rb_head_page_activate() We do intergrity test of the list to check if the list is corrupted and it is still worth doing it. So, let's refactor rb_check_pages() such that we no longer clear and set flag during the list sanity checking. [1] and [2] are the test to reproduce and the crash report respectively. 1: ``` read_trace.sh while true; do # the "trace" file is closed after read head -1 /sys/kernel/tracing/trace > /dev/null done ``` ``` repro.sh sysctl -w kernel.panic_on_warn=1 # function tracer will writing enough data into ring_buffer echo function > /sys/kernel/tracing/current_tracer ./read_trace.sh & ./read_trace.sh & ./read_trace.sh & ./read_trace.sh & ./read_trace.sh & ./read_trace.sh & ./read_trace.sh & ./read_trace.sh & ``` 2: ------------[ cut here ]------------ WARNING: CPU: 9 PID: 62 at kernel/trace/ring_buffer.c:2653 rb_move_tail+0x450/0x470 Modules linked in: CPU: 9 PID: 62 Comm: ksoftirqd/9 Tainted: G W 6.2.0-rc6+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 RIP: 0010:rb_move_tail+0x450/0x470 Code: ff ff 4c 89 c8 f0 4d 0f b1 02 48 89 c2 48 83 e2 fc 49 39 d0 75 24 83 e0 03 83 f8 02 0f 84 e1 fb ff ff 48 8b 57 10 f0 ff 42 08 <0f> 0b 83 f8 02 0f 84 ce fb ff ff e9 db RSP: 0018:ffffb5564089bd00 EFLAGS: 00000203 RAX: 0000000000000000 RBX: ffff9db385a2bf81 RCX: ffffb5564089bd18 RDX: ffff9db281110100 RSI: 0000000000000fe4 RDI: ffff9db380145400 RBP: ffff9db385a2bf80 R08: ffff9db385a2bfc0 R09: ffff9db385a2bfc2 R10: ffff9db385a6c000 R11: ffff9db385a2bf80 R12: 0000000000000000 R13: 00000000000003e8 R14: ffff9db281110100 R15: ffffffffbb006108 FS: 0000000000000000(0000) GS:ffff9db3bdcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005602323024c8 CR3: 0000000022e0c000 CR4: 00000000000006e0 Call Trace: <TASK> ring_buffer_lock_reserve+0x136/0x360 ? __do_softirq+0x287/0x2df ? __pfx_rcu_softirq_qs+0x10/0x10 trace_function+0x21/0x110 ? __pfx_rcu_softirq_qs+0x10/0x10 ? __do_softirq+0x287/0x2df function_trace_call+0xf6/0x120 0xffffffffc038f097 ? rcu_softirq_qs+0x5/0x140 rcu_softirq_qs+0x5/0x140 __do_softirq+0x287/0x2df run_ksoftirqd+0x2a/0x30 smpboot_thread_fn+0x188/0x220 ? __pfx_smpboot_thread_fn+0x10/0x10 kthread+0xe7/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2c/0x50 </TASK> ---[ end trace 0000000000000000 ]--- [ crash report and test reproducer credit goes to Zheng Yejian] Link: https://lore.kernel.org/linux-trace-kernel/1676376403-16462-1-git-send-email-quic_mojha@quicinc.com Cc: <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 1039221cc278 ("ring-buffer: Do not disable recording when there is an iterator") Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  dax/kmem: Fix leak of memory-hotplug resources  [Dan Williams]
commit e686c32590f40bffc45f105c04c836ffad3e531a upstream. While experimenting with CXL region removal the following corruption of /proc/iomem appeared. Before: f010000000-f04fffffff : CXL Window 0 f010000000-f02fffffff : region4 f010000000-f02fffffff : dax4.0 f010000000-f02fffffff : System RAM (kmem) After (modprobe -r cxl_test): f010000000-f02fffffff : **redacted binary garbage** f010000000-f02fffffff : System RAM (kmem) ...and testing further the same is visible with persistent memory assigned to kmem: Before: 480000000-243fffffff : Persistent Memory 480000000-57e1fffff : namespace3.0 580000000-243fffffff : dax3.0 580000000-243fffffff : System RAM (kmem) After (ndctl disable-region all): 480000000-243fffffff : Persistent Memory 580000000-243fffffff : ***redacted binary garbage*** 580000000-243fffffff : System RAM (kmem) The corrupted data is from a use-after-free of the "dax4.0" and "dax3.0" resources, and it also shows that the "System RAM (kmem)" resource is not being removed. The bug does not appear after "modprobe -r kmem", it requires the parent of "dax4.0" and "dax3.0" to be removed which re-parents the leaked "System RAM (kmem)" instances. Those in turn reference the freed resource as a parent. First up for the fix is release_mem_region_adjustable() needs to reliably delete the resource inserted by add_memory_driver_managed(). That is thwarted by a check for IORESOURCE_SYSRAM that predates the dax/kmem driver, from commit: 65c78784135f ("kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable") That appears to be working around the behavior of HMM's "MEMORY_DEVICE_PUBLIC" facility that has since been deleted. With that check removed the "System RAM (kmem)" resource gets removed, but corruption still occurs occasionally because the "dax" resource is not reliably removed. The dax range information is freed before the device is unregistered, so the driver can not reliably recall (another use after free) what it is meant to release. Lastly if that use after free got lucky, the driver was covering up the leak of "System RAM (kmem)" due to its use of release_resource() which detaches, but does not free, child resources. The switch to remove_resource() forces remove_memory() to be responsible for the deletion of the resource added by add_memory_driver_managed(). Fixes: c2f3011ee697 ("device-dax: add an allocation interface for device-dax instances") Cc: <stable@vger.kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/167653656244.3147810.5705900882794040229.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  irqdomain: Drop bogus fwspec-mapping error handling  [Johan Hovold]
commit e3b7ab025e931accdc2c12acf9b75c6197f1c062 upstream. In case a newly allocated IRQ ever ends up not having any associated struct irq_data it would not even be possible to dispose the mapping. Replace the bogus disposal with a WARN_ON(). This will also be used to fix a shared-interrupt mapping race, hence the CC-stable tag. Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") Cc: stable@vger.kernel.org # 4.8 Tested-by: Hsin-Yi Wang <hsinyi@chromium.org> Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230213104302.17307-4-johan+linaro@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  irqdomain: Fix disassociation race  [Johan Hovold]
commit 3f883c38f5628f46b30bccf090faec054088e262 upstream.

The global irq_domain_mutex is held when mapping interrupts from non-hierarchical domains but currently not when disposing them.

This specifically means that updates of the domain mapcount are racy (currently only used for statistics in debugfs).

Make sure to hold the global irq_domain_mutex also when disposing mappings from non-hierarchical domains.

Fixes: 9dc6be3d4193 ("genirq/irqdomain: Add map counter")
Cc: stable@vger.kernel.org # 4.13
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-3-johan+linaro@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
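The shape of the change, as a hedged sketch (the wrapper name here is illustrative; in the tree the locking lives in the irq_dispose_mapping() path of kernel/irq/irqdomain.c):

```c
/* Disposal path for non-hierarchical domains: take the same global mutex
 * that the mapping path already holds, so mapcount updates stay consistent. */
static void dispose_nonhier_mapping(struct irq_domain *domain, unsigned int virq)
{
	mutex_lock(&irq_domain_mutex);
	irq_domain_disassociate(domain, virq);
	mutex_unlock(&irq_domain_mutex);
}
```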
2023-03-11  irqdomain: Fix association race  [Johan Hovold]
commit b06730a571a9ff1ba5bd6b20bf9e50e5a12f1ec6 upstream. The sanity check for an already mapped virq is done outside of the irq_domain_mutex-protected section which means that an (unlikely) racing association may not be detected. Fix this by factoring out the association implementation, which will also be used in a follow-on change to fix a shared-interrupt mapping race. Fixes: ddaf144c61da ("irqdomain: Refactor irq_domain_associate_many()") Cc: stable@vger.kernel.org # 3.11 Tested-by: Hsin-Yi Wang <hsinyi@chromium.org> Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230213104302.17307-2-johan+linaro@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range  [Yang Jihong]
commit f1c97a1b4ef709e3f066f82e3ba3108c3b133ae6 upstream. When arch_prepare_optimized_kprobe calculating jump destination address, it copies original instructions from jmp-optimized kprobe (see __recover_optprobed_insn), and calculated based on length of original instruction. arch_check_optimized_kprobe does not check KPROBE_FLAG_OPTIMATED when checking whether jmp-optimized kprobe exists. As a result, setup_detour_execution may jump to a range that has been overwritten by jump destination address, resulting in an inval opcode error. For example, assume that register two kprobes whose addresses are <func+9> and <func+11> in "func" function. The original code of "func" function is as follows: 0xffffffff816cb5e9 <+9>: push %r12 0xffffffff816cb5eb <+11>: xor %r12d,%r12d 0xffffffff816cb5ee <+14>: test %rdi,%rdi 0xffffffff816cb5f1 <+17>: setne %r12b 0xffffffff816cb5f5 <+21>: push %rbp 1.Register the kprobe for <func+11>, assume that is kp1, corresponding optimized_kprobe is op1. After the optimization, "func" code changes to: 0xffffffff816cc079 <+9>: push %r12 0xffffffff816cc07b <+11>: jmp 0xffffffffa0210000 0xffffffff816cc080 <+16>: incl 0xf(%rcx) 0xffffffff816cc083 <+19>: xchg %eax,%ebp 0xffffffff816cc084 <+20>: (bad) 0xffffffff816cc085 <+21>: push %rbp Now op1->flags == KPROBE_FLAG_OPTIMATED; 2. Register the kprobe for <func+9>, assume that is kp2, corresponding optimized_kprobe is op2. register_kprobe(kp2) register_aggr_kprobe alloc_aggr_kprobe __prepare_optimized_kprobe arch_prepare_optimized_kprobe __recover_optprobed_insn // copy original bytes from kp1->optinsn.copied_insn, // jump address = <func+14> 3. disable kp1: disable_kprobe(kp1) __disable_kprobe ... if (p == orig_p || aggr_kprobe_disabled(orig_p)) { ret = disarm_kprobe(orig_p, true) // add op1 in unoptimizing_list, not unoptimized orig_p->flags |= KPROBE_FLAG_DISABLED; // op1->flags == KPROBE_FLAG_OPTIMATED | KPROBE_FLAG_DISABLED ... 4. unregister kp2 __unregister_kprobe_top ... if (!kprobe_disabled(ap) && !kprobes_all_disarmed) { optimize_kprobe(op) ... if (arch_check_optimized_kprobe(op) < 0) // because op1 has KPROBE_FLAG_DISABLED, here not return return; p->kp.flags |= KPROBE_FLAG_OPTIMIZED; // now op2 has KPROBE_FLAG_OPTIMIZED } "func" code now is: 0xffffffff816cc079 <+9>: int3 0xffffffff816cc07a <+10>: push %rsp 0xffffffff816cc07b <+11>: jmp 0xffffffffa0210000 0xffffffff816cc080 <+16>: incl 0xf(%rcx) 0xffffffff816cc083 <+19>: xchg %eax,%ebp 0xffffffff816cc084 <+20>: (bad) 0xffffffff816cc085 <+21>: push %rbp 5. if call "func", int3 handler call setup_detour_execution: if (p->flags & KPROBE_FLAG_OPTIMIZED) { ... regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX; ... } The code for the destination address is 0xffffffffa021072c: push %r12 0xffffffffa021072e: xor %r12d,%r12d 0xffffffffa0210731: jmp 0xffffffff816cb5ee <func+14> However, <func+14> is not a valid start instruction address. As a result, an error occurs. Link: https://lore.kernel.org/all/20230216034247.32348-3-yangjihong1@huawei.com/ Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  x86/kprobes: Fix __recover_optprobed_insn check optimizing logic  [Yang Jihong]
commit 868a6fc0ca2407622d2833adefe1c4d284766c4c upstream.

Since commit f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") modified the update timing of KPROBE_FLAG_OPTIMIZED, an optimized_kprobe may be in the optimizing or unoptimizing state when op.kp->flags has KPROBE_FLAG_OPTIMIZED and op->list is not empty.

The __recover_optprobed_insn check logic is incorrect: a kprobe in the unoptimizing state may be incorrectly determined as unoptimizing. As a result, incorrect instructions are copied.

The optprobe_queued_unopt function needs to be exported so it can be invoked from the arch directory.

Link: https://lore.kernel.org/all/20230216034247.32348-2-yangjihong1@huawei.com/
Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")
Cc: stable@vger.kernel.org
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  PM: EM: fix memory leak with using debugfs_lookup()  [Greg Kroah-Hartman]
[ Upstream commit a0e8c13ccd6a9a636d27353da62c2410c4eca337 ] When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. To make things simpler, just call debugfs_lookup_and_remove() instead which handles all of the logic at once. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  clocksource: Suspend the watchdog temporarily when high read latency detected  [Feng Tang]
[ Upstream commit b7082cdfc464bf9231300605d03eebf943dda307 ]

Bugs have been reported on 8-socket x86 machines in which the TSC was wrongly disabled when the system is under heavy workload.

 [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
 [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
 [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
 [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
 [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
 [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
 [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
 [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
 [ 821.067990] clocksource: Switched to clocksource hpet

This can be reproduced by running memory-intensive 'stream' tests, or some of the stress-ng subcases such as 'ioport'.

The reason for these issues is that when the system is under heavy load, the read latency of the clocksources can be very high. Even lightweight TSC reads can show high latencies, and latencies are much worse for external clocksources such as HPET or the APIC PM timer. These latencies can result in false-positive clocksource-unstable determinations.

These issues were initially reported by a customer running on a production system, and this problem was reproduced on several generations of Xeon servers, especially when running the stress-ng test. These Xeon servers were not production systems, but they did have the latest steppings and firmware.

Given that the clocksource watchdog is a continual diagnostic check with a frequency of twice a second, there is no need to rush it when the system is under heavy load. Therefore, when high clocksource read latencies are detected, suspend the watchdog timer for 5 minutes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  timers: Prevent union confusion from unexpected restart_syscall()  [Jann Horn]
[ Upstream commit 9f76d59173d9d146e96c66886b671c1915a5c5e5 ] The nanosleep syscalls use the restart_block mechanism, with a quirk: The `type` and `rmtp`/`compat_rmtp` fields are set up unconditionally on syscall entry, while the rest of the restart_block is only set up in the unlikely case that the syscall is actually interrupted by a signal (or pseudo-signal) that doesn't have a signal handler. If the restart_block was set up by a previous syscall (futex(..., FUTEX_WAIT, ...) or poll()) and hasn't been invalidated somehow since then, this will clobber some of the union fields used by futex_wait_restart() and do_restart_poll(). If userspace afterwards wrongly calls the restart_syscall syscall, futex_wait_restart()/do_restart_poll() will read struct fields that have been clobbered. This doesn't actually lead to anything particularly interesting because none of the union fields contain trusted kernel data, and futex(..., FUTEX_WAIT, ...) and poll() aren't syscalls where it makes much sense to apply seccomp filters to their arguments. So the current consequences are just of the "if userspace does bad stuff, it can damage itself, and that's not a problem" flavor. But still, it seems like a hazard for future developers, so invalidate the restart_block when partly setting it up in the nanosleep syscalls. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230105134403.754986-1-jannh@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
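A minimal sketch of the described invalidation, assuming the restart_block field names from include/linux/restart_block.h (not the exact upstream diff; `type` and `rmtp` stand for the values already computed on syscall entry):

```c
/* On nanosleep entry: the type/rmtp fields are filled in unconditionally,
 * so first invalidate whatever a previous futex()/poll() left in the union. */
struct restart_block *restart = &current->restart_block;

restart->fn = do_no_restart_syscall;	/* a stray restart_syscall() now just returns -EINTR */
restart->nanosleep.type = type;
restart->nanosleep.rmtp = rmtp;
```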
2023-03-11  rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug  [Zqiang]
[ Upstream commit ea5c8987fef20a8cca07e428aa28bc64649c5104 ] The synchronize_rcu_tasks_rude() function invokes rcu_tasks_rude_wait_gp() to wait one rude RCU-tasks grace period. The rcu_tasks_rude_wait_gp() function in turn checks if there is only a single online CPU. If so, it will immediately return, because a call to synchronize_rcu_tasks_rude() is by definition a grace period on a single-CPU system. (We could have blocked!) Unfortunately, this check uses num_online_cpus() without synchronization, which can result in too-short grace periods. To see this, consider the following scenario: CPU0 CPU1 (going offline) migration/1 task: cpu_stopper_thread -> take_cpu_down -> _cpu_disable (dec __num_online_cpus) ->cpuhp_invoke_callback preempt_disable access old_data0 task1 del old_data0 ..... synchronize_rcu_tasks_rude() task1 schedule out .... task2 schedule in rcu_tasks_rude_wait_gp() ->__num_online_cpus == 1 ->return .... task1 schedule in ->free old_data0 preempt_enable When CPU1 decrements __num_online_cpus, its value becomes 1. However, CPU1 has not finished going offline, and will take one last trip through the scheduler and the idle loop before it actually stops executing instructions. Because synchronize_rcu_tasks_rude() is mostly used for tracing, and because both the scheduler and the idle loop can be traced, this means that CPU0's prematurely ended grace period might disrupt the tracing on CPU1. Given that this disruption might include CPU1 executing instructions in memory that was just now freed (and maybe reallocated), this is a matter of some concern. This commit therefore removes that problematic single-CPU check from the rcu_tasks_rude_wait_gp() function. This dispenses with the single-CPU optimization, but there is no evidence indicating that this optimization is important. In addition, synchronize_rcu_tasks_generic() contains a similar optimization (albeit only for early boot), which also splats. (As in exactly why are you invoking synchronize_rcu_tasks_rude() so early in boot, anyway???) It is OK for the synchronize_rcu_tasks_rude() function's check to be unsynchronized because the only times that this check can evaluate to true is when there is only a single CPU running with preemption disabled. While in the area, this commit also fixes a minor bug in which a call to synchronize_rcu_tasks_rude() would instead be attributed to synchronize_rcu_tasks(). [ paulmck: Add "synchronize_" prefix and "()" suffix. ] Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()  [Paul E. McKenney]
[ Upstream commit 2d7f00b2f01301d6e41fd4a28030dab0442265be ] The normal grace period's RCU CPU stall warnings are invoked from the scheduling-clock interrupt handler, and can thus invoke smp_processor_id() with impunity, which allows them to directly invoke dump_cpu_task(). In contrast, the expedited grace period's RCU CPU stall warnings are invoked from process context, which causes the dump_cpu_task() function's calls to smp_processor_id() to complain bitterly in debug kernels. This commit therefore causes synchronize_rcu_expedited_wait() to disable preemption around its call to dump_cpu_task(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
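The change amounts to wrapping the stall-warning dump in a preemption-disabled region, roughly:

```c
/* In synchronize_rcu_expedited_wait(): dump_cpu_task() may call
 * smp_processor_id(), so keep preemption off around it to silence the
 * debug-kernel complaint when running in process context. */
preempt_disable();
dump_cpu_task(cpu);
preempt_enable();
```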
2023-03-11  bpf: Fix global subprog context argument resolution logic  [Andrii Nakryiko]
[ Upstream commit d384dce281ed1b504fae2e279507827638d56fa3 ]

A KPROBE program's user-facing context type is defined as typedef bpf_user_pt_regs_t. This leads to a problem when trying to pass a kprobe/uprobe/usdt context argument into a global subprog, as the kernel always strips away mods and typedefs of the user-supplied type, but takes the expected type from bpf_ctx_convert as is, which causes a mismatch.

The current way to work around this is to define a fake struct with the same name as the expected typedef:

  struct bpf_user_pt_regs_t {};

  __noinline my_global_subprog(struct bpf_user_pt_regs_t *ctx) { ... }

This patch fixes the issue by resolving the expected type if it's not a struct. It still leaves the above work-around working for backwards compatibility.

Fixes: 91cc1a99740e ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230216045954.3002473-2-andrii@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()  [Frederic Weisbecker]
[ Upstream commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0 ]

RCU Tasks and PID-namespace unshare can interact in do_exit() in a complicated circular dependency:

1) TASK A calls unshare(CLONE_NEWPID); this creates a new PID namespace that every subsequent child of TASK A will belong to. But TASK A doesn't itself belong to that new PID namespace.

2) TASK A forks() and creates TASK B. TASK A stays attached to its PID namespace (let's say PID_NS1) and TASK B is the first task belonging to the new PID namespace created by unshare() (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2 child reaper.

4) TASK A forks() again and creates TASK C which gets attached to PID_NS2. Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and since it is the child reaper for PID_NS2, it has to kill all other tasks attached to PID_NS2, and wait for all of them to die before getting reaped itself (zap_pid_ns_processes()).

6) TASK A calls synchronize_rcu_tasks() which leads to synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped. But TASK B is under a tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C, but it can't because TASK A waits for TASK B that waits for TASK C.

Pid_namespace semantics can hardly be changed at this point. But the coverage of tasks_rcu_exit_srcu can be reduced instead. The current task is assumed not to be concurrently reapable at this stage of exit_notify() and therefore tasks_rcu_exit_srcu can be temporarily relaxed without breaking its constraints, providing a way out of the deadlock scenario.

[ paulmck: Fix build failure by adding additional declaration. ]

Fixes: 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  rcu-tasks: Remove preemption disablement around srcu_read_[un]lock() calls  [Frederic Weisbecker]
[ Upstream commit 44757092958bdd749775022f915b7ac974384c2a ] Ever since the following commit: 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via this_cpu_dec()") SRCU doesn't rely anymore on preemption to be disabled in order to modify the per-CPU counter. And even then it used to be done from the API itself. Therefore and after checking further, it appears to be safe to remove the preemption disablement around __srcu_read_[un]lock() in exit_tasks_rcu_start() and exit_tasks_rcu_finish() Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Stable-dep-of: 28319d6dc5e2 ("rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose  [Frederic Weisbecker]
[ Upstream commit e4e1e8089c5fd948da12cb9f4adc93821036945f ] Make sure we don't need to look again into the depths of git blame in order not to miss a subtle part about how rcu-tasks is dealing with exiting tasks. Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Stable-dep-of: 28319d6dc5e2 ("rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  sched/rt: pick_next_rt_entity(): check list_entry  [Pietro Borrello]
[ Upstream commit 7c4a5b89a0b5a57a64b601775b296abf77a9fe97 ] Commit 326587b84078 ("sched: fix goto retry in pick_next_task_rt()") removed any path which could make pick_next_rt_entity() return NULL. However, BUG_ON(!rt_se) in _pick_next_task_rt() (the only caller of pick_next_rt_entity()) still checks the error condition, which can never happen, since list_entry() never returns NULL. Remove the BUG_ON check, and instead emit a warning in the only possible error condition here: the queue being empty which should never happen. Fixes: 326587b84078 ("sched: fix goto retry in pick_next_task_rt()") Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230128-list-entry-null-check-sched-v3-1-b1a71bd1ac6b@diag.uniroma1.it Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11  sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()  [Dietmar Eggemann]
[ Upstream commit 821aecd09e5ad2f8d4c3d8195333d272b392f7d3 ] The `struct rq *rq` parameter isn't used. Remove it. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-7-dietmar.eggemann@arm.com Stable-dep-of: 7c4a5b89a0b5 ("sched/rt: pick_next_rt_entity(): check list_entry") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-25  bpf: add missing header file include  [Linus Torvalds]
commit f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc upstream. Commit 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()") built fine on x86-64 and arm64, and that's the extent of my local build testing. It turns out those got the <linux/nospec.h> include incidentally through other header files (<linux/kvm_host.h> in particular), but that was not true of other architectures, resulting in build errors kernel/bpf/core.c: In function ‘___bpf_prog_run’: kernel/bpf/core.c:1913:3: error: implicit declaration of function ‘barrier_nospec’ so just make sure to explicitly include the proper <linux/nospec.h> header file to make everybody see it. Fixes: 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Reported-by: Huacai Chen <chenhuacai@loongson.cn> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25  uaccess: Add speculation barrier to copy_from_user()  [Dave Hansen]
commit 74e19ef0ff8061ef55957c3abd71614ef0f42f47 upstream.

The results of "access_ok()" can be mis-speculated. The result is that you can end up speculatively:

	if (access_ok(from, size))
		// Right here

even for bad from/size combinations. On first glance, it would be ideal to just add a speculation barrier to "access_ok()" so that its results can never be mis-speculated.

But there are lots of system calls just doing access_ok() via "copy_to_user()" and friends (example: fstat() and friends). Those are generally not problematic because they do not _consume_ data from userspace other than the pointer. They are also very quick and common system calls that should not be needlessly slowed down.

"copy_from_user()" on the other hand uses a user-controlled pointer and is frequently followed up with code that might affect caches. Take something like this:

	if (!copy_from_user(&kernelvar, uptr, size))
		do_something_with(kernelvar);

If userspace passes in an evil 'uptr' that *actually* points to a kernel address, and then do_something_with() has cache (or other) side-effects, it could allow userspace to infer kernel data values.

Add a barrier to the common copy_from_user() code to prevent mis-speculated values which happen after the copy. Also add a stub for architectures that do not define barrier_nospec(). This makes the macro usable in generic code.

Since the barrier is now usable in generic code, the x86 #ifdef in the BPF code can also go away.

Reported-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net> # BPF bits
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
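A hedged sketch of what the generic helper looks like with the added barrier (simplified from include/linux/uaccess.h; the real code carries extra instrumentation and fault-injection hooks):

```c
static __always_inline unsigned long
_copy_from_user(void *to, const void __user *from, unsigned long n)
{
	unsigned long res = n;

	might_fault();
	if (likely(access_ok(from, n))) {
		/* Keep dependent loads from running on a mis-speculated
		 * access_ok() result. */
		barrier_nospec();
		res = raw_copy_from_user(to, from, n);
	}
	if (unlikely(res))
		memset(to + (n - res), 0, res);
	return res;
}
```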
2023-02-22  alarmtimer: Prevent starvation by small intervals and SIG_IGN  [Thomas Gleixner]
commit d125d1349abeb46945dc5e98f7824bf688266f13 upstream. syzbot reported a RCU stall which is caused by setting up an alarmtimer with a very small interval and ignoring the signal. The reproducer arms the alarm timer with a relative expiry of 8ns and an interval of 9ns. Not a problem per se, but that's an issue when the signal is ignored because then the timer is immediately rearmed because there is no way to delay that rearming to the signal delivery path. See posix_timer_fn() and commit 58229a189942 ("posix-timers: Prevent softirq starvation by small intervals and SIG_IGN") for details. The reproducer does not set SIG_IGN explicitely, but it sets up the timers signal with SIGCONT. That has the same effect as explicitely setting SIG_IGN for a signal as SIGCONT is ignored if there is no handler set and the task is not ptraced. The log clearly shows that: [pid 5102] --- SIGCONT {si_signo=SIGCONT, si_code=SI_TIMER, si_timerid=0, si_overrun=316014, si_int=0, si_ptr=NULL} --- It works because the tasks are traced and therefore the signal is queued so the tracer can see it, which delays the restart of the timer to the signal delivery path. But then the tracer is killed: [pid 5087] kill(-5102, SIGKILL <unfinished ...> ... ./strace-static-x86_64: Process 5107 detached and after it's gone the stall can be observed: syzkaller login: [ 79.439102][ C0] hrtimer: interrupt took 68471 ns [ 184.460538][ C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: ... [ 184.658237][ C1] rcu: Stack dump where RCU GP kthread last ran: [ 184.664574][ C1] Sending NMI from CPU 1 to CPUs 0: [ 184.669821][ C0] NMI backtrace for cpu 0 [ 184.669831][ C0] CPU: 0 PID: 5108 Comm: syz-executor192 Not tainted 6.2.0-rc6-next-20230203-syzkaller #0 ... [ 184.670036][ C0] Call Trace: [ 184.670041][ C0] <IRQ> [ 184.670045][ C0] alarmtimer_fired+0x327/0x670 posix_timer_fn() prevents that by checking whether the interval for timers which have the signal ignored is smaller than a jiffie and artifically delay it by shifting the next expiry out by a jiffie. That's accurate vs. the overrun accounting, but slightly inaccurate vs. timer_gettimer(2). The comment in that function says what needs to be done and there was a fix available for the regular userspace induced SIG_IGN mechanism, but that did not work due to the implicit ignore for SIGCONT and similar signals. This needs to be worked on, but for now the only available workaround is to do exactly what posix_timer_fn() does: Increase the interval of self-rearming timers, which have their signal ignored, to at least a jiffie. Interestingly this has been fixed before via commit ff86bf0c65f1 ("alarmtimer: Rate limit periodic intervals") already, but that fix got lost in a later rework. Reported-by: syzbot+b9564ba6e8e00694511b@syzkaller.appspotmail.com Fixes: f2c45807d399 ("alarmtimer: Switch over to generic set/get/rearm routine") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87k00q1no2.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
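The workaround described above boils down to a clamp in the alarmtimer rearm path; a sketch under the assumption that the k_itimer it_interval field is being adjusted (not the exact diff):

```c
/* If the signal is ignored, the timer is rearmed immediately on expiry,
 * so never allow a self-rearming interval shorter than one jiffy. */
if (ptr->it_interval < TICK_NSEC)
	ptr->it_interval = TICK_NSEC;
```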
2023-02-22  sched/psi: Fix use-after-free in ep_remove_wait_queue()  [Munehisa Kamata]
commit c2dbe32d5db5c4ead121cf86dabd5ab691fb47fe upstream. If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path: do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit: fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue This results in use-after-free as pasted below. The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it is in line with the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's one and can be considered as another special case. While this would be fixable by somehow making cgroup_file_release() be tied to the fput(), it would require sizable refactoring at cgroups or higher layer which might be more justifiable if we identify more cases like this. BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0 Write of size 4 at addr ffff88810e625328 by task a.out/4404 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017 Call Trace: <TASK> dump_stack_lvl+0x73/0xa0 print_report+0x16c/0x4e0 kasan_report+0xc3/0xf0 kasan_check_range+0x2d2/0x310 _raw_spin_lock_irqsave+0x60/0xc0 remove_wait_queue+0x1a/0xa0 ep_free+0x12c/0x170 ep_eventpoll_release+0x26/0x30 __fput+0x202/0x400 task_work_run+0x11d/0x170 do_exit+0x495/0x1130 do_group_exit+0x100/0x100 get_signal+0xd67/0xde0 arch_do_signal_or_restart+0x2a/0x2b0 exit_to_user_mode_prepare+0x94/0x100 syscall_exit_to_user_mode+0x20/0x40 do_syscall_64+0x52/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Allocated by task 4404: kasan_set_track+0x3d/0x60 __kasan_kmalloc+0x85/0x90 psi_trigger_create+0x113/0x3e0 pressure_write+0x146/0x2e0 cgroup_file_write+0x11c/0x250 kernfs_fop_write_iter+0x186/0x220 vfs_write+0x3d8/0x5c0 ksys_write+0x90/0x110 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 4407: kasan_set_track+0x3d/0x60 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x11d/0x170 slab_free_freelist_hook+0x87/0x150 __kmem_cache_free+0xcb/0x180 psi_trigger_destroy+0x2e8/0x310 cgroup_file_release+0x4f/0xb0 kernfs_drain_open_files+0x165/0x1f0 kernfs_drain+0x162/0x1a0 __kernfs_remove+0x1fb/0x310 kernfs_remove_by_name_ns+0x95/0xe0 cgroup_addrm_files+0x67f/0x700 cgroup_destroy_locked+0x283/0x3c0 cgroup_rmdir+0x29/0x100 kernfs_iop_rmdir+0xd1/0x140 vfs_rmdir+0xfe/0x240 do_rmdir+0x13d/0x280 __x64_sys_rmdir+0x2c/0x30 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 0e94682b73bf ("psi: introduce psi monitor") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Signed-off-by: Mengchi Cheng <mengcc@amazon.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/ Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-15  tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw  [Shiju Jose]
commit 3e46d910d8acf94e5360126593b68bf4fee4c4a1 upstream.

poll() and select() on per_cpu trace_pipe and trace_pipe_raw do not work since kernel 6.1-rc6. This issue is seen after commit 42fb0a1e84ff525ebe560e2baf9451ab69127e2b ("tracing/ring-buffer: Have polling block on watermark").

The issue was first detected and reported when testing the CXL error events in rasdaemon, and was also verified using a test application for poll() and select().

The issue occurs in the per_cpu case when ring_buffer_poll_wait() in kernel/trace/ring_buffer.c is called with buffer_percent > 0: it then waits until that percentage of pages is available. The default value for buffer_percent is 50 in kernel/trace/trace.c.

As a fix, allow a userspace application to set buffer_percent to 0 through buffer_percent_fops, so that the task will wake up as soon as data is added to any of the specific cpu buffers.

Link: https://lore.kernel.org/linux-trace-kernel/20230202182309.742-2-shiju.jose@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mchehab@kernel.org>
Cc: <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-15  bpf: Do not reject when the stack read size is different from the tracked scalar size  [Martin KaFai Lau]
[ Upstream commit f30d4968e9aee737e174fc97942af46cfb49b484 ]

Below is a simplified case from a report in bcc [0]:

  r4 = 20
  *(u32 *)(r10 -4) = r4
  *(u32 *)(r10 -8) = r4
  /* r4 state is tracked */
  r4 = *(u64 *)(r10 -8)
  /* Read more than the tracked 32bit scalar.
   * verifier rejects as 'corrupted spill memory'.
   */

After commit 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill"), the 8-byte aligned 32bit spill is also tracked by the verifier and the register state is stored. However, if 8 bytes are read from the stack instead of the tracked 4 byte scalar, the verifier currently rejects the program as "corrupted spill memory".

This patch fixes this case by allowing the read but marking the register as unknown.

Also note that, if the prog is trying to corrupt/leak an earlier spilled pointer by spilling another <8 bytes register on top, this has already been rejected in check_stack_write_fixed_off().

[0] https://github.com/iovisor/bcc/pull/3683

Fixes: 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill")
Reported-by: Hengqi Chen <hengqi.chen@gmail.com>
Reported-by: Yonghong Song <yhs@gmail.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211102064535.316018-1-kafai@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-15  bpf: Fix to preserve reg parent/live fields when copying range info  [Eduard Zingerman]
[ Upstream commit 71f656a50176915d6813751188b5758daa8d012b ] Register range information is copied in several places. The intent is to transfer range/id information from one register/stack spill to another. Currently this is done using direct register assignment, e.g.: static void find_equal_scalars(..., struct bpf_reg_state *known_reg) { ... struct bpf_reg_state *reg; ... *reg = *known_reg; ... } However, such assignments also copy the following bpf_reg_state fields: struct bpf_reg_state { ... struct bpf_reg_state *parent; ... enum bpf_reg_liveness live; ... }; Copying of these fields is accidental and incorrect, as could be demonstrated by the following example: 0: call ktime_get_ns() 1: r6 = r0 2: call ktime_get_ns() 3: r7 = r0 4: if r0 > r6 goto +1 ; r0 & r6 are unbound thus generated ; branch states are identical 5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8] --- checkpoint --- 6: r1 = 42 ; r1 marked as written 7: *(u8 *)(r10 - 8) = r1 ; 8-bit write, fp[-8] parent & live ; overwritten 8: r2 = *(u64 *)(r10 - 8) 9: r0 = 0 10: exit This example is unsafe because 64-bit write to fp[-8] at (5) is conditional, thus not all bytes of fp[-8] are guaranteed to be set when it is read at (8). However, currently the example passes verification. First, the execution path 1-10 is examined by verifier. Suppose that a new checkpoint is created by is_state_visited() at (6). After checkpoint creation: - r1.parent points to checkpoint.r1, - fp[-8].parent points to checkpoint.fp[-8]. At (6) the r1.live is set to REG_LIVE_WRITTEN. At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to REG_LIVE_WRITTEN, because of the following code called in check_stack_write_fixed_off(): static void save_register_state(struct bpf_func_state *state, int spi, struct bpf_reg_state *reg, int size) { ... state->stack[spi].spilled_ptr = *reg; // <--- parent & live copied if (size == BPF_REG_SIZE) state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; ... } Note the intent to mark stack spill as written only if 8 bytes are spilled to a slot, however this intent is spoiled by a 'live' field copy. At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but this does not happen: - fp[-8] in a current state is already marked as REG_LIVE_WRITTEN; - fp[-8].parent points to checkpoint.r1, parentage chain is used by mark_reg_read() to mark checkpoint states. At (10) the verification is finished for path 1-10 and jump 4-6 is examined. The checkpoint.fp[-8] never gets REG_LIVE_READ mark and this spill is pruned from the cached states by clean_live_states(). Hence verifier state obtained via path 1-4,6 is deemed identical to one obtained via path 1-6 and program marked as safe. Note: the example should be executed with BPF_F_TEST_STATE_FREQ flag set to force creation of intermediate verifier states. This commit revisits the locations where bpf_reg_state instances are copied and replaces the direct copies with a call to a function copy_register_state(dst, src) that preserves 'parent' and 'live' fields of the 'dst'. Fixes: 679c782de14b ("bpf/verifier: per-register parent pointers") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230106142214.1040390-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
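The helper this commit describes can be sketched as follows: copy the range/id information of the source register, but keep the destination register's liveness and parentage untouched:

```c
static void copy_register_state(struct bpf_reg_state *dst,
				const struct bpf_reg_state *src)
{
	struct bpf_reg_state *parent = dst->parent;
	enum bpf_reg_liveness live = dst->live;

	*dst = *src;		/* range, id, var_off, ... */

	dst->parent = parent;	/* preserve the parentage chain */
	dst->live = live;	/* preserve READ/WRITTEN marks */
}
```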
2023-02-15  bpf: Support <8-byte scalar spill and refill  [Martin KaFai Lau]
[ Upstream commit 354e8f1970f821d4952458f77b1ab6c3eb24d530 ] The verifier currently does not save the reg state when spilling <8byte bounded scalar to the stack. The bpf program will be incorrectly rejected when this scalar is refilled to the reg and then used to offset into a packet header. The later patch has a simplified bpf prog from a real use case to demonstrate this case. The current work around is to reparse the packet again such that this offset scalar is close to where the packet data will be accessed to avoid the spill. Thus, the header is parsed twice. The llvm patch [1] will align the <8bytes spill to the 8-byte stack address. This can simplify the verifier support by avoiding to store multiple reg states for each 8 byte stack slot. This patch changes the verifier to save the reg state when spilling <8bytes scalar to the stack. This reg state saving is limited to spill aligned to the 8-byte stack address. The current refill logic has already called coerce_reg_to_size(), so coerce_reg_to_size() is not called on state->stack[spi].spilled_ptr during spill. When refilling in check_stack_read_fixed_off(), it checks the refill size is the same as the number of bytes marked with STACK_SPILL before restoring the reg state. When restoring the reg state to state->regs[dst_regno], it needs to avoid the state->regs[dst_regno].subreg_def being over written because it has been marked by the check_reg_arg() earlier [check_mem_access() is called after check_reg_arg() in do_check()]. Reordering check_mem_access() and check_reg_arg() will need a lot of changes in test_verifier's tests because of the difference in verifier's error message. Thus, the patch here is to save the state->regs[dst_regno].subreg_def first in check_stack_read_fixed_off(). There are cases that the verifier needs to scrub the spilled slot from STACK_SPILL to STACK_MISC. After this patch the spill is not always in 8 bytes now, so it can no longer assume the other 7 bytes are always marked as STACK_SPILL. In particular, the scrub needs to avoid marking an uninitialized byte from STACK_INVALID to STACK_MISC. Otherwise, the verifier will incorrectly accept bpf program reading uninitialized bytes from the stack. A new helper scrub_spilled_slot() is created for this purpose. [1]: https://reviews.llvm.org/D109073 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210922004941.625398-1-kafai@fb.com Stable-dep-of: 71f656a50176 ("bpf: Fix to preserve reg parent/live fields when copying range info") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-15  bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers  [Yonghong Song]
[ Upstream commit bdb7fdb0aca8b96cef9995d3a57e251c2289322f ]

In the current bpf_send_signal() and bpf_send_signal_thread() helper implementation, irq_work is used to handle the nmi context. Hao Sun reported in [1] that the current task at the entry of the helper might be gone during irq_work callback processing.

To fix the issue, a reference is acquired for the current task before enqueuing into the irq_work so that the queued task is still available during irq_work callback processing.

[1] https://lore.kernel.org/bpf/20230109074425.12556-1-sunhao.th@gmail.com/

Fixes: 8b401f9ed244 ("bpf: implement bpf_send_signal() helper")
Tested-by: Hao Sun <sunhao.th@gmail.com>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230118204815.3331855-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
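In sketch form, the fix pins the task before the irq_work is queued and drops the reference in the callback (simplified from kernel/trace/bpf_trace.c; field names assumed from the send_signal_irq_work struct described there):

```c
/* Enqueue side (inside the helper, NMI-safe path):
 *
 *	work->task = get_task_struct(current);   // pin the task
 *	work->sig  = sig;
 *	work->type = type;
 *	irq_work_queue(&work->irq_work);
 */

/* irq_work callback: deliver the signal, then drop the pinned reference. */
static void do_bpf_send_signal(struct irq_work *entry)
{
	struct send_signal_irq_work *work =
		container_of(entry, struct send_signal_irq_work, irq_work);

	group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, work->type);
	put_task_struct(work->task);
}
```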
2023-02-15  bpf: Fix incorrect state pruning for <8B spill/fill  [Paul Chaignon]
commit 345e004d023343d38088fdfea39688aa11e06ccf upstream.

Commit 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill") introduced support in the verifier to track <8B spill/fills of scalars. The backtracking logic for the precision bit was however skipping spill/fills of less than 8B. That could cause state pruning to consider two states equivalent when they shouldn't be. As an example, consider the following bytecode snippet:

  0: r7 = r1
  1: call bpf_get_prandom_u32
  2: r6 = 2
  3: if r0 == 0 goto pc+1
  4: r6 = 3
  ...
  8: [state pruning point]
  ...
  /* u32 spill/fill */
  10: *(u32 *)(r10 - 8) = r6
  11: r8 = *(u32 *)(r10 - 8)
  12: r0 = 0
  13: if r8 == 3 goto pc+1
  14: r0 = 1
  15: exit

The verifier first walks the path with R6=3. Given the support for <8B spill/fills, at instruction 13, it knows the condition is true and skips instruction 14. At that point, the backtracking logic kicks in but stops at the fill instruction since it only propagates the precision bit for 8B spill/fill. When the verifier then walks the path with R6=2, it will consider it safe at instruction 8 because R6 is not marked as needing precision. Instruction 14 is thus never walked and is then incorrectly removed as 'dead code'.

It's also possible to lead the verifier to accept e.g. an out-of-bound memory access instead of causing an incorrect dead code elimination.

This regression was found via Cilium's bpf-next CI where it was causing a conntrack map update to be silently skipped because the code had been removed by the verifier.

This commit fixes it by enabling support for <8B spill/fills in the backtracking logic. In case of a <8B spill/fill, the full 8B stack slot will be marked as needing precision. Then, in __mark_chain_precision, any tracked register spilled in a marked slot will itself be marked as needing precision, regardless of the spill size. This logic makes two assumptions: (1) only 8B-aligned spill/fill are tracked and (2) spilled registers are only tracked if the spill and fill sizes are equal. Commit ef979017b837 ("bpf: selftest: Add verifier tests for <8-byte scalar spill and refill") covers the first assumption and the next commit in this patchset covers the second.

Fixes: 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill")
Signed-off-by: Paul Chaignon <paul@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-06  bpf: Skip task with pid=1 in send_signal_common()  [Hao Sun]
[ Upstream commit a3d81bc1eaef48e34dd0b9b48eefed9e02a06451 ] The following kernel panic can be triggered when a task with pid=1 attaches a prog that attempts to send killing signal to itself, also see [1] for more details: Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b CPU: 3 PID: 1 Comm: systemd Not tainted 6.1.0-09652-g59fe41b5255f #148 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x100/0x178 lib/dump_stack.c:106 panic+0x2c4/0x60f kernel/panic.c:275 do_exit.cold+0x63/0xe4 kernel/exit.c:789 do_group_exit+0xd4/0x2a0 kernel/exit.c:950 get_signal+0x2460/0x2600 kernel/signal.c:2858 arch_do_signal_or_restart+0x78/0x5d0 arch/x86/kernel/signal.c:306 exit_to_user_mode_loop kernel/entry/common.c:168 [inline] exit_to_user_mode_prepare+0x15f/0x250 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:296 do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd So skip task with pid=1 in bpf_send_signal_common() to avoid the panic. [1] https://lore.kernel.org/bpf/20221222043507.33037-1-sunhao.th@gmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230106084838.12690-1-sunhao.th@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
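The guard itself is small; a sketch of what it looks like inside bpf_send_signal_common() (is_global_init() is the usual way to test for the init task; the exact error code is an assumption):

```c
/* Never let a BPF program signal-kill init (pid 1): that would panic the
 * kernel, as shown in the report above. */
if (unlikely(is_global_init(current)))
	return -EPERM;
```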
2023-02-01  trace_events_hist: add check for return value of 'create_hist_field'  [Natalia Petrova]
commit 8b152e9150d07a885f95e1fd401fc81af202d9a4 upstream.

Function 'create_hist_field' is called recursively at trace_events_hist.c:1954 and can return a NULL value, which is why we have to check it to avoid a null pointer dereference.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lkml.kernel.org/r/20230111120409.4111-1-n.petrova@fintech.ru
Cc: stable@vger.kernel.org
Fixes: 30350d65ac56 ("tracing: Add variable support to hist triggers")
Signed-off-by: Natalia Petrova <n.petrova@fintech.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
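The added check follows the usual pattern around the recursive call; a hedged sketch with illustrative surrounding context (variable names are not the exact trace_events_hist.c lines):

```c
hist_field->operands[0] = create_hist_field(hist_data, field, fl, NULL);
if (!hist_field->operands[0]) {
	/* recursive allocation failed: unwind instead of dereferencing NULL */
	destroy_hist_field(hist_field, level);
	return NULL;
}
```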
2023-02-01  tracing: Make sure trace_printk() can output as soon as it can be used  [Steven Rostedt (Google)]
commit 3bb06eb6e9acf7c4a3e1b5bc87aed398ff8e2253 upstream. Currently trace_printk() can be used as soon as early_trace_init() is called from start_kernel(). But if a crash happens, and "ftrace_dump_on_oops" is set on the kernel command line, all you get will be: [ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6 [ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6 [ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6 This is because the trace_printk() event (type 6) hasn't been registered yet. That gets done via an early_initcall(), which may be early, but not early enough. Instead of registering the trace_printk() event (and other ftrace events, which are not trace events) via an early_initcall(), have them registered at the same time that trace_printk() can be used. This way, if there is a crash before early_initcall(), then the trace_printk()s will actually be useful. Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: e725c731e3bb1 ("tracing: Split tracing initialization into two for early initialization") Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01module: Don't wait for GOING modulesPetr Pavlu
commit 0254127ab977e70798707a7a2b757c9f3c971210 upstream. During a system boot, it can happen that the kernel receives a burst of requests to insert the same module, but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded, which causes the init function of the new module to return an error. Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel waits for modules in MODULE_STATE_GOING state to finish unloading before making another attempt to load the same module. This creates unnecessary work in the described scenario and delays the boot. In the worst case, it can prevent udev from loading drivers for other devices and might cause timeouts of services waiting on them and subsequently a failed boot. This patch attempts a different solution for the problem 6e6de3dee51a was trying to solve. Rather than waiting for the unloading to complete, it returns a different error code (-EBUSY) for modules in the GOING state. This should avoid the error situation that was described in 6e6de3dee51a (user space attempting to load a dependent module because the -EEXIST error code would suggest to user space that the first module had been loaded successfully), while avoiding the delay situation too. This has been tested on linux-next since December 2022 and passes all kmod selftests except test 0009 with module compression enabled, but it has been confirmed that this issue has existed and has gone unnoticed since prior to this commit and can also be reproduced without module compression with a simple usleep(5000000) on tools/modprobe.c [0]. These failures are caused by hitting the kernel mod_concurrent_max and can happen either due to a self-inflicted kernel module auto-load DoS of some kind, or on a system with a large CPU count where each CPU incorrectly triggers many module auto-loads. Both of those issues need to be fixed in-kernel. [0] https://lore.kernel.org/all/Y9A4fiobL6IHp%2F%2FP@bombadil.infradead.org/ Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") Co-developed-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Cc: stable@vger.kernel.org Reviewed-by: Petr Mladek <pmladek@suse.com> [mcgrof: enhance commit log with testing and kmod test result interpretation ] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
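A hedged sketch of the behavioural change (the lookup helper and error-path label are simplifications of what kernel/module/main.c actually does):
```
        mutex_lock(&module_mutex);
        old = find_module_all(mod->name, strlen(mod->name), true);
        if (old) {
                /*
                 * Sketch: a module that is still on its way out no longer
                 * makes this request sleep; userspace sees -EBUSY and can
                 * simply retry, while a live module still yields -EEXIST.
                 */
                err = old->state == MODULE_STATE_GOING ? -EBUSY : -EEXIST;
                goto out_unlock;
        }
```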
2023-02-01exit: Use READ_ONCE() for all oops/warn limit readsKees Cook
commit 7535b832c6399b5ebfc5b53af5c51dd915ee2538 upstream. Use a temporary variable to take full advantage of READ_ONCE() behavior. Without this, the report (and even the test) might be out of sync with the initial test. Reported-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/lkml/Y5x7GXeluFmZ8E0E@hirez.programming.kicks-ass.net Fixes: 9fc9e278a5c0 ("panic: Introduce warn_limit") Fixes: d4ccd54d28d3 ("exit: Put an upper limit on how often we can oops") Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
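The pattern in question, as a small sketch using the oops counter named above (the warn-side code follows the same shape):
```
        /*
         * Snapshot the limit once; the comparison and the panic message
         * then agree even if the sysctl is changed concurrently.
         */
        unsigned int limit = READ_ONCE(oops_limit);

        if (atomic_inc_return(&oops_count) >= limit && limit)
                panic("Oopsed too often (kernel.oops_limit is %d)", limit);
```
The trailing "&& limit" test is the "oops_limit == 0 disables checking" behaviour covered by a separate entry further down this log.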
2023-02-01panic: Expose "warn_count" to sysfsKees Cook
commit 8b05aa26336113c4cea25f1c333ee8cd4fc212a6 upstream. Since Warn count is now tracked and is a fairly interesting signal, add the entry /sys/kernel/warn_count to expose it to userspace. Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-6-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
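A hedged sketch of how such a read-only entry under /sys/kernel/ is typically created; the attribute and init-function names below follow the usual kobject pattern rather than the exact patch:
```
static ssize_t warn_count_show(struct kobject *kobj,
                               struct kobj_attribute *attr, char *page)
{
        /* warn_count is the atomic_t bumped on every WARN(). */
        return sysfs_emit(page, "%d\n", atomic_read(&warn_count));
}

static struct kobj_attribute warn_count_attr = __ATTR_RO(warn_count);

static __init int warn_count_sysfs_init(void)
{
        /* Creates /sys/kernel/warn_count. */
        return sysfs_create_file(kernel_kobj, &warn_count_attr.attr);
}
late_initcall(warn_count_sysfs_init);
```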
2023-02-01panic: Introduce warn_limitKees Cook
commit 9fc9e278a5c0b708eeffaf47d6eb0c82aa74ed78 upstream. Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
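A hedged sketch of the sysctl side of such a knob as it might appear in the panic sysctl table in kernel/panic.c (handler choice assumed); the counting logic itself follows the same READ_ONCE()/atomic pattern shown for oops_limit above:
```
        {
                .procname       = "warn_limit",
                .data           = &warn_limit,
                .maxlen         = sizeof(warn_limit),
                .mode           = 0644,
                .proc_handler   = proc_douintvec,
        },
```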
2023-02-01panic: Consolidate open-coded panic_on_warn checksKees Cook
commit 79cc1ba7badf9e7a12af99695a557e9ce27ee967 upstream. Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
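A minimal sketch of the consolidated helper, assuming it keeps the semantics of the open-coded checks it replaces:
```
void check_panic_on_warn(const char *origin)
{
        /* One place to decide whether a warning should escalate to a panic. */
        if (panic_on_warn)
                panic("%s: panic_on_warn set ...\n", origin);
}
```
Callers such as the KASAN, KFENCE and scheduler report paths then invoke check_panic_on_warn("KASAN") and so on instead of testing panic_on_warn themselves, which is also where later instrumentation (like warn_limit) can be added once.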
2023-02-01exit: Allow oops_limit to be disabledKees Cook
commit de92f65719cd672f4b48397540b9f9eff67eca40 upstream. In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01exit: Expose "oops_count" to sysfsKees Cook
commit 9db89b41117024f80b38b15954017fb293133364 upstream. Since Oops count is now tracked and is a fairly interesting signal, add the entry /sys/kernel/oops_count to expose it to userspace. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01exit: Put an upper limit on how often we can oopsJann Horn
commit d4ccd54d28d3c8598e2354acc13e28c060961dbb upstream. Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
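As a rough sanity check on those figures: 2^32 oopses at ~510 microseconds each is about 2.2 million seconds, i.e. roughly 25 days single-threaded, so the reported 8-12 days on a multi-core machine with a graphical console is the expected order of magnitude, and the ~87x slowdown of a text console pushes the same attack out to several years.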
2023-02-01panic: Separate sysctl logic from CONFIG_SMPKees Cook
commit 9360d035a579d95d1e76c471061b9065b18a0eb1 upstream. In preparation for adding more sysctls directly in kernel/panic.c, split CONFIG_SMP from the logic that adds sysctls. Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-1-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
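A hedged sketch of the resulting structure in kernel/panic.c: only the SMP-specific knob stays behind #ifdef CONFIG_SMP, while the table and its registration no longer depend on it (table contents abbreviated):
```
static struct ctl_table kern_panic_table[] = {
#ifdef CONFIG_SMP
        {
                .procname       = "oops_all_cpu_backtrace",
                .data           = &sysctl_oops_all_cpu_backtrace,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = SYSCTL_ONE,
        },
#endif
        /* ... further panic sysctls can be added here unconditionally ... */
        { }
};

static __init int kernel_panic_sysctls_init(void)
{
        register_sysctl_init("kernel", kern_panic_table);
        return 0;
}
late_initcall(kernel_panic_sysctls_init);
```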
2023-02-01exit: Add and use make_task_dead.Eric W. Biederman
commit 0e25498f8cd43c1b5aa327f373dd094e9a006da7 upstream. There are two big uses of do_exit. The first is its designed use, to be the guts of the exit(2) system call. The second use is to terminate a task after something catastrophic has happened, like a NULL pointer dereference in kernel code. Add a function make_task_dead that is initially exactly the same as do_exit to cover the cases where do_exit is called to handle catastrophic failure. In time this can probably be reduced to just a light wrapper around do_task_dead. For now keep it exactly the same so that there will be no behavioral differences introducing this new concept. Replace all of the uses of do_exit that use it for catastrophic task cleanup with make_task_dead to make it clear what the code is doing. As part of this, rename rewind_stack_do_exit to rewind_stack_and_make_dead. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
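A hedged sketch of the new entry point; initially it is just a thin, clearly named wrapper:
```
/*
 * Sketch: called instead of do_exit() when the kernel kills a task after
 * a catastrophic error (bad trap, corrupted stack, ...), never for exit(2).
 */
void __noreturn make_task_dead(int signr)
{
        do_exit(signr);
}
```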
2023-02-01panic: unset panic_on_warn inside panic()Tiezhu Yang
commit 1a2383e8b84c0451fd9b1eec3b9aab16f30b597c upstream. In the current code, the following three places need to unset panic_on_warn before calling panic() to avoid recursive panics: kernel/kcsan/report.c: print_report() kernel/sched/core.c: __schedule_bug() mm/kfence/report.c: kfence_report_error() In order to avoid copy-pasting "panic_on_warn = 0" all over the places, it is better to move it inside panic() and then remove it from the other places. Link: https://lkml.kernel.org/r/1644324666-15947-4-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Marco Elver <elver@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Xuefeng Li <lixuefeng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
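The relevant fragment of panic(), sketched with the surrounding logic elided:
```
void panic(const char *fmt, ...)
{
        /* ... */

        /*
         * Sketch: clear panic_on_warn early, so a warning hit while we are
         * already panicking (e.g. from a notifier) cannot recurse into panic().
         */
        if (panic_on_warn)
                panic_on_warn = 0;

        /* ... disable the other CPUs, print the message, dump the stack ... */
}
```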
2023-02-01kernel/panic: move panic sysctls to its own filetangmeng
commit 9df918698408fd914493aba0b7858fef50eba63a upstream. kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes; this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code; we just care about the core logic. All filesystem sysctls now get reviewed by fs folks. This commit follows the fs precedent and moves the oops_all_cpu_backtrace sysctl to its own file, kernel/panic.c. Signed-off-by: tangmeng <tangmeng@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01kcsan: test: don't put the expect array on the stackMax Filippov
[ Upstream commit 5b24ac2dfd3eb3e36f794af3aa7f2828b19035bd ] The size of the 'expect' array in __report_matches() is 1536 bytes, which is exactly the default frame size warning limit of the xtensa architecture. As a result, allmodconfig xtensa kernel builds with a gcc that does not support the compiler plugins (which would otherwise push that warning limit to 2K) fail with the following message: kernel/kcsan/kcsan_test.c:257:1: error: the frame size of 1680 bytes is larger than 1536 bytes Fix it by dynamically allocating the 'expect' array. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Tested-by: Marco Elver <elver@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
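A hedged sketch of the change in kernel/kcsan/kcsan_test.c: the large scratch buffer moves from the stack to kmalloc() (layout and error handling simplified):
```
        /* Sketch: the ~1.5 KiB scratch buffer no longer lives on the stack. */
        char *expect;
        bool ret = false;

        expect = kmalloc(1536, GFP_KERNEL); /* size taken from the build error above */
        if (WARN_ON(!expect))
                return false;

        /* ... format the expected report text into 'expect' and compare ... */

        kfree(expect);
        return ret;
```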
2023-02-01bpf: Fix pointer-leak due to insufficient speculative store bypass mitigationLuis Gerhorst
[ Upstream commit e4f4db47794c9f474b184ee1418f42e6a07412b6 ] To mitigate Spectre v4, 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") inserts lfence instructions after 1) initializing a stack slot and 2) spilling a pointer to the stack. However, this does not cover cases where a stack slot is first initialized with a pointer (subject to sanitization) but then overwritten with a scalar (not subject to sanitization because the slot was already initialized). In this case, the second write may be subject to speculative store bypass (SSB) creating a speculative pointer-as-scalar type confusion. This allows the program to subsequently leak the numerical pointer value using, for example, a branch-based cache side channel. To fix this, also sanitize scalars if they write a stack slot that previously contained a pointer. Assuming that pointer-spills are only generated by LLVM under register pressure, the performance impact on most real-world BPF programs should be small. The following unprivileged BPF bytecode drafts a minimal exploit and the mitigation: [...] // r6 = 0 or 1 (scalar, unknown user input) // r7 = accessible ptr for side channel // r10 = frame pointer (fp), to be leaked // r9 = r10 # fp alias to encourage ssb *(u64 *)(r9 - 8) = r10 // fp[-8] = ptr, to be leaked // lfence added here because of pointer spill to stack. // // Omitted: Dummy bpf_ringbuf_output() here to train alias predictor // for no r9-r10 dependency. // *(u64 *)(r10 - 8) = r6 // fp[-8] = scalar, overwrites ptr // 2039f26f3aca: no lfence added because stack slot was not STACK_INVALID, // store may be subject to SSB // // fix: also add an lfence when the slot contained a ptr // r8 = *(u64 *)(r9 - 8) // r8 = architecturally a scalar, speculatively a ptr // // leak ptr using branch-based cache side channel: r8 &= 1 // choose bit to leak if r8 == 0 goto SLOW // no mispredict // architecturally dead code if input r6 is 0, // only executes speculatively iff ptr bit is 1 r8 = *(u64 *)(r7 + 0) # encode bit in cache (0: slow, 1: fast) SLOW: [...] After running this, the program can time the access to *(r7 + 0) to determine whether the chosen pointer bit was 0 or 1. Repeat this 64 times to recover the whole address on amd64. In summary, sanitization can only be skipped if one scalar is overwritten with another scalar. Scalar confusion due to speculative store bypass cannot lead to invalid accesses because the pointer bounds deduced during verification are enforced using branchless logic. See 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") for details. Do not make the mitigation depend on !env->allow_{uninit_stack,ptr_leaks} because speculative leaks are likely unexpected if these were enabled. For example, leaking the address to a protected log file may be acceptable, while disabling the mitigation might unintentionally leak the address into the cached state of a map that is accessible to unprivileged processes. Fixes: 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Henriette Hofmeier <henriette.hofmeier@rub.de> Link: https://lore.kernel.org/bpf/edc95bad-aada-9cfc-ffe2-fa9bb206583c@cs.fau.de Link: https://lore.kernel.org/bpf/20230109150544.41465-1-gerhorst@cs.fau.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-24prlimit: do_prlimit needs to have a speculation checkGreg Kroah-Hartman
commit 739790605705ddcf18f21782b9c99ad7d53a8c11 upstream. do_prlimit() adds the user-controlled resource value to a pointer that will subsequently be dereferenced. In order to help prevent this codepath from being used as a spectre "gadget" a barrier needs to be added after checking the range. Reported-by: Jordy Zomer <jordyzomer@google.com> Tested-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
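A hedged sketch of the kind of barrier meant here, using the generic array_index_nospec() helper on the range-checked resource index in do_prlimit():
```
        if (resource >= RLIM_NLIMITS)
                return -EINVAL;
        /*
         * The bounds check above protects the architectural path; clamping
         * the index under speculation keeps a mispredicted branch from
         * indexing past the rlim array.
         */
        resource = array_index_nospec(resource, RLIM_NLIMITS);

        rlim = tsk->signal->rlim + resource;
```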
2023-01-14tracing: Fix infinite loop in tracing_read_pipe on overflowed print_trace_lineYang Jihong
commit c1ac03af6ed45d05786c219d102f37eb44880f28 upstream. print_trace_line() may overflow the seq_file buffer. If the event is not consumed, the while loop keeps peeking at this event, causing an infinite loop. Link: https://lkml.kernel.org/r/20221129113009.182425-1-yangjihong1@huawei.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 088b1e427dbba ("ftrace: pipe fixes") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
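A hedged sketch of the shape of the fix in tracing_read_pipe(); the placeholder text and the exact field names are assumptions:
```
                ret = print_trace_line(iter);
                if (ret == TRACE_TYPE_PARTIAL_LINE) {
                        /*
                         * Sketch: a line that cannot fit even in an empty seq
                         * buffer must be consumed, otherwise the loop keeps
                         * peeking at the same event forever.
                         */
                        if (save_len == 0) {
                                iter->seq.full = 0;
                                trace_seq_puts(&iter->seq, "[LINE TOO BIG]\n");
                                trace_consume(iter);
                                break;
                        }

                        /* Otherwise drop the partial line; it is retried later. */
                        iter->seq.seq.len = save_len;
                        break;
                }
```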
2023-01-14tracing/hist: Fix wrong return value in parse_action_params()Zheng Yejian
commit 2cc6a528882d0e0ccbc1bca5f95b8c963cedac54 upstream. When the number of synth fields exceeds SYNTH_FIELDS_MAX, parse_action_params() should return -EINVAL. Link: https://lore.kernel.org/linux-trace-kernel/20221207034635.2253990-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: c282a386a397 ("tracing: Add 'onmatch' hist trigger action support") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
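A hedged sketch of the check, assuming it sits in the parameter-parsing loop of parse_action_params() in kernel/trace/trace_events_hist.c:
```
                /* Reject the trigger instead of silently dropping parameters. */
                if (data->n_params >= SYNTH_FIELDS_MAX) {
                        ret = -EINVAL;
                        goto out;
                }
```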