path: root/kernel/bpf
Age  Commit message  Author
2023-08-11  bpf, cpumap: Make sure kthread is running before map update returns  (Hou Tao)
commit 640a604585aa30f93e39b17d4d6ba69fcb1e66c9 upstream. The following warning was reported when running stress-mode enabled xdp_redirect_cpu with some RT threads: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 65 at kernel/bpf/cpumap.c:135 CPU: 4 PID: 65 Comm: kworker/4:1 Not tainted 6.5.0-rc2+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events cpu_map_kthread_stop RIP: 0010:put_cpu_map_entry+0xda/0x220 ...... Call Trace: <TASK> ? show_regs+0x65/0x70 ? __warn+0xa5/0x240 ...... ? put_cpu_map_entry+0xda/0x220 cpu_map_kthread_stop+0x41/0x60 process_one_work+0x6b0/0xb80 worker_thread+0x96/0x720 kthread+0x1a5/0x1f0 ret_from_fork+0x3a/0x70 ret_from_fork_asm+0x1b/0x30 </TASK> The root cause is the same as commit 436901649731 ("bpf: cpumap: Fix memory leak in cpu_map_update_elem"). The kthread is stopped prematurely by kthread_stop() in cpu_map_kthread_stop(), and kthread() doesn't call cpu_map_kthread_run() at all but XDP program has already queued some frames or skbs into ptr_ring. So when __cpu_map_ring_cleanup() checks the ptr_ring, it will find it was not emptied and report a warning. An alternative fix is to use __cpu_map_ring_cleanup() to drop these pending frames or skbs when kthread_stop() returns -EINTR, but it may confuse the user, because these frames or skbs have been handled correctly by XDP program. So instead of dropping these frames or skbs, just make sure the per-cpu kthread is running before __cpu_map_entry_alloc() returns. After apply the fix, the error handle for kthread_stop() will be unnecessary because it will always return 0, so just remove it. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230729095107.1722450-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
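A minimal sketch of the approach described above, assuming the entry carries a completion that the kthread signals as soon as its threadfn starts; the field name, init placement and exact call sites are reconstructed here, not copied from the patch:

    /* kernel-side sketch (not the literal diff): make __cpu_map_entry_alloc()
     * wait until cpu_map_kthread_run() has really started, so that a later
     * kthread_stop() can never fire before the threadfn ran.
     */
    struct bpf_cpu_map_entry {
            /* ... existing members ... */
            struct completion kthread_running;      /* assumed new member */
    };

    static int cpu_map_kthread_run(void *data)
    {
            struct bpf_cpu_map_entry *rcpu = data;

            complete(&rcpu->kthread_running);       /* threadfn is definitely running */
            /* ... existing ptr_ring consume loop ... */
            return 0;
    }

    /* in __cpu_map_entry_alloc(): */
    init_completion(&rcpu->kthread_running);
    /* ... kthread_create_on_node(cpu_map_kthread_run, rcpu, ...) ... */
    wake_up_process(rcpu->kthread);
    wait_for_completion(&rcpu->kthread_running);    /* block until the threadfn ran */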
2023-08-11  bpf, cpumap: Handle skb as well when clean up ptr_ring  (Hou Tao)
[ Upstream commit 7c62b75cd1a792e14b037fa4f61f9b18914e7de1 ] The following warning was reported when running xdp_redirect_cpu with both skb-mode and stress-mode enabled: ------------[ cut here ]------------ Incorrect XDP memory type (-2128176192) usage WARNING: CPU: 7 PID: 1442 at net/core/xdp.c:405 Modules linked in: CPU: 7 PID: 1442 Comm: kworker/7:0 Tainted: G 6.5.0-rc2+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events __cpu_map_entry_free RIP: 0010:__xdp_return+0x1e4/0x4a0 ...... Call Trace: <TASK> ? show_regs+0x65/0x70 ? __warn+0xa5/0x240 ? __xdp_return+0x1e4/0x4a0 ...... xdp_return_frame+0x4d/0x150 __cpu_map_entry_free+0xf9/0x230 process_one_work+0x6b0/0xb80 worker_thread+0x96/0x720 kthread+0x1a5/0x1f0 ret_from_fork+0x3a/0x70 ret_from_fork_asm+0x1b/0x30 </TASK> The reason for the warning is twofold. One is due to the kthread cpu_map_kthread_run() is stopped prematurely. Another one is __cpu_map_ring_cleanup() doesn't handle skb mode and treats skbs in ptr_ring as XDP frames. Prematurely-stopped kthread will be fixed by the preceding patch and ptr_ring will be empty when __cpu_map_ring_cleanup() is called. But as the comments in __cpu_map_ring_cleanup() said, handling and freeing skbs in ptr_ring as well to "catch any broken behaviour gracefully". Fixes: 11941f8a8536 ("bpf: cpumap: Implement generic cpumap") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230729095107.1722450-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27  bpf: aggressively forget precise markings during state checkpointing  (Andrii Nakryiko)
[ Upstream commit 7a830b53c17bbadcf99f778f28aaaa4e6c41df5f ] Exploit the property of about-to-be-checkpointed state to be able to forget all precise markings up to that point even more aggressively. We now clear all potentially inherited precise markings right before checkpointing and branching off into child state. If any of children states require precise knowledge of any SCALAR register, those will be propagated backwards later on before this state is finalized, preserving correctness. There is a single selftests BPF program change, but tremendous one: 25x reduction in number of verified instructions and states in trace_virtqueue_add_sgs. Cilium results are more modest, but happen across wider range of programs. SELFTESTS RESULTS ================= $ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results.csv ~/imprecise-aggressive-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------------- ----------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- loop6.bpf.linked1.o trace_virtqueue_add_sgs 398057 15114 -382943 (-96.20%) 8717 336 -8381 (-96.15%) ------------------- ----------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- CILIUM RESULTS ============== $ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results-cilium.csv ~/imprecise-aggressive-results-cilium.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------- -------------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_host.o tail_handle_nat_fwd_ipv4 23426 23221 -205 (-0.88%) 1537 1515 -22 (-1.43%) bpf_host.o tail_handle_nat_fwd_ipv6 13009 12904 -105 (-0.81%) 719 708 -11 (-1.53%) bpf_host.o tail_nodeport_nat_ingress_ipv6 5261 5196 -65 (-1.24%) 247 243 -4 (-1.62%) bpf_host.o tail_nodeport_nat_ipv6_egress 3446 3406 -40 (-1.16%) 203 198 -5 (-2.46%) bpf_lxc.o tail_handle_nat_fwd_ipv4 23426 23221 -205 (-0.88%) 1537 1515 -22 (-1.43%) bpf_lxc.o tail_handle_nat_fwd_ipv6 13009 12904 -105 (-0.81%) 719 708 -11 (-1.53%) bpf_lxc.o tail_ipv4_ct_egress 5074 4897 -177 (-3.49%) 255 248 -7 (-2.75%) bpf_lxc.o tail_ipv4_ct_ingress 5100 4923 -177 (-3.47%) 255 248 -7 (-2.75%) bpf_lxc.o tail_ipv4_ct_ingress_policy_only 5100 4923 -177 (-3.47%) 255 248 -7 (-2.75%) bpf_lxc.o tail_ipv6_ct_egress 4558 4536 -22 (-0.48%) 188 187 -1 (-0.53%) bpf_lxc.o tail_ipv6_ct_ingress 4578 4556 -22 (-0.48%) 188 187 -1 (-0.53%) bpf_lxc.o tail_ipv6_ct_ingress_policy_only 4578 4556 -22 (-0.48%) 188 187 -1 (-0.53%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 5261 5196 -65 (-1.24%) 247 243 -4 (-1.62%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 5261 5196 -65 (-1.24%) 247 243 -4 (-1.62%) bpf_overlay.o tail_nodeport_nat_ipv6_egress 3482 3442 -40 (-1.15%) 204 201 -3 (-1.47%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 17200 15619 -1581 (-9.19%) 1111 1010 -101 (-9.09%) ------------- -------------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: 
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-27  bpf: stop setting precise in current state  (Andrii Nakryiko)
[ Upstream commit f63181b6ae79fd3b034cde641db774268c2c3acf ] Setting reg->precise to true in current state is not necessary from correctness standpoint, but it does pessimise the whole precision (or rather "imprecision", because that's what we want to keep as much as possible) tracking. Why is somewhat subtle and my best attempt to explain this is recorded in an extensive comment for __mark_chain_precise() function. Some more careful thinking and code reading is probably required still to grok this completely, unfortunately. Whiteboarding and a bunch of extra handwaiving in person would be even more helpful, but is deemed impractical in Git commit. Next patch pushes this imprecision property even further, building on top of the insights described in this patch. End results are pretty nice, we get reduction in number of total instructions and states verified due to a better states reuse, as some of the states are now more generic and permissive due to less unnecessary precise=true requirements. SELFTESTS RESULTS ================= $ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results.csv ~/imprecise-early-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) --------------------------------------- ---------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_iter_ksym.bpf.linked1.o dump_ksym 347 285 -62 (-17.87%) 20 19 -1 (-5.00%) pyperf600_bpf_loop.bpf.linked1.o on_event 3678 3736 +58 (+1.58%) 276 285 +9 (+3.26%) setget_sockopt.bpf.linked1.o skops_sockopt 4038 3947 -91 (-2.25%) 347 343 -4 (-1.15%) test_l4lb.bpf.linked1.o balancer_ingress 4559 2611 -1948 (-42.73%) 118 105 -13 (-11.02%) test_l4lb_noinline.bpf.linked1.o balancer_ingress 6279 6268 -11 (-0.18%) 237 236 -1 (-0.42%) test_misc_tcp_hdr_options.bpf.linked1.o misc_estab 1307 1303 -4 (-0.31%) 100 99 -1 (-1.00%) test_sk_lookup.bpf.linked1.o ctx_narrow_access 456 447 -9 (-1.97%) 39 38 -1 (-2.56%) test_sysctl_loop1.bpf.linked1.o sysctl_tcp_mem 1389 1384 -5 (-0.36%) 26 25 -1 (-3.85%) test_tc_dtime.bpf.linked1.o egress_fwdns_prio101 518 485 -33 (-6.37%) 51 46 -5 (-9.80%) test_tc_dtime.bpf.linked1.o egress_host 519 468 -51 (-9.83%) 50 44 -6 (-12.00%) test_tc_dtime.bpf.linked1.o ingress_fwdns_prio101 842 1000 +158 (+18.76%) 73 88 +15 (+20.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 405757 373173 -32584 (-8.03%) 25735 22882 -2853 (-11.09%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 479055 371590 -107465 (-22.43%) 29145 22207 -6938 (-23.81%) --------------------------------------- ---------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- Slight regression in test_tc_dtime.bpf.linked1.o/ingress_fwdns_prio101 is left for a follow up, there might be some more precision-related bugs in existing BPF verifier logic. 
CILIUM RESULTS ============== $ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results-cilium.csv ~/imprecise-early-results-cilium.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_host.o cil_from_host 762 556 -206 (-27.03%) 43 37 -6 (-13.95%) bpf_host.o tail_handle_nat_fwd_ipv4 23541 23426 -115 (-0.49%) 1538 1537 -1 (-0.07%) bpf_host.o tail_nodeport_nat_egress_ipv4 33592 33566 -26 (-0.08%) 2163 2161 -2 (-0.09%) bpf_lxc.o tail_handle_nat_fwd_ipv4 23541 23426 -115 (-0.49%) 1538 1537 -1 (-0.07%) bpf_overlay.o tail_nodeport_nat_egress_ipv4 33581 33543 -38 (-0.11%) 2160 2157 -3 (-0.14%) bpf_xdp.o tail_handle_nat_fwd_ipv4 21659 20920 -739 (-3.41%) 1440 1376 -64 (-4.44%) bpf_xdp.o tail_handle_nat_fwd_ipv6 17084 17039 -45 (-0.26%) 907 905 -2 (-0.22%) bpf_xdp.o tail_lb_ipv4 73442 73430 -12 (-0.02%) 4370 4369 -1 (-0.02%) bpf_xdp.o tail_lb_ipv6 152114 151895 -219 (-0.14%) 6493 6479 -14 (-0.22%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 17377 17200 -177 (-1.02%) 1125 1111 -14 (-1.24%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6405 6397 -8 (-0.12%) 309 308 -1 (-0.32%) bpf_xdp.o tail_rev_nodeport_lb4 7126 6934 -192 (-2.69%) 414 402 -12 (-2.90%) bpf_xdp.o tail_rev_nodeport_lb6 18059 17905 -154 (-0.85%) 1105 1096 -9 (-0.81%) ------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- ------------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-27  bpf: allow precision tracking for programs with subprogs  (Andrii Nakryiko)
[ Upstream commit be2ef8161572ec1973124ebc50f56dafc2925e07 ] Stop forcing precise=true for SCALAR registers when BPF program has any subprograms. Current restriction means that any BPF program, as soon as it uses subprograms, will end up not getting any of the precision tracking benefits in reduction of number of verified states. This patch keeps the fallback mark_all_scalars_precise() behavior if precise marking has to cross function frames. E.g., if subprogram requires R1 (first input arg) to be marked precise, ideally we'd need to backtrack to the parent function and keep marking R1 and its dependencies as precise. But right now we give up and force all the SCALARs in any of the current and parent states to be forced to precise=true. We can lift that restriction in the future. But this patch fixes two issues identified when trying to enable precision tracking for subprogs. First, prevent "escaping" from top-most state in a global subprog. While with entry-level BPF program we never end up requesting precision for R1-R5 registers, because R2-R5 are not initialized (and so not readable in correct BPF program), and R1 is PTR_TO_CTX, not SCALAR, and so is implicitly precise. With global subprogs, though, it's different, as global subprog a) can have up to 5 SCALAR input arguments, which might get marked as precise=true and b) it is validated in isolation from its main entry BPF program. b) means that we can end up exhausting parent state chain and still not mark all registers in reg_mask as precise, which would lead to verifier bug warning. To handle that, we need to consider two cases. First, if the very first state is not immediately "checkpointed" (i.e., stored in state lookup hashtable), it will get correct first_insn_idx and last_insn_idx instruction set during state checkpointing. As such, this case is already handled and __mark_chain_precision() already handles that by just doing nothing when we reach to the very first parent state. st->parent will be NULL and we'll just stop. Perhaps some extra check for reg_mask and stack_mask is due here, but this patch doesn't address that issue. More problematic second case is when global function's initial state is immediately checkpointed before we manage to process the very first instruction. This is happening because when there is a call to global subprog from the main program the very first subprog's instruction is marked as pruning point, so before we manage to process first instruction we have to check and checkpoint state. This patch adds a special handling for such "empty" state, which is identified by having st->last_insn_idx set to -1. In such case, we check that we are indeed validating global subprog, and with some sanity checking we mark input args as precise if requested. Note that we also initialize state->first_insn_idx with correct start insn_idx offset. For main program zero is correct value, but for any subprog it's quite confusing to not have first_insn_idx set. This doesn't have any functional impact, but helps with debugging and state printing. We also explicitly initialize state->last_insns_idx instead of relying on is_state_visited() to do this with env->prev_insns_idx, which will be -1 on the very first instruction. This concludes necessary changes to handle specifically global subprog's precision tracking. Second identified problem was missed handling of BPF helper functions that call into subprogs (e.g., bpf_loop and few others). 
From precision tracking and backtracking logic's standpoint those are effectively calls into subprogs and should be called as BPF_PSEUDO_CALL calls. This patch takes the least intrusive way and just checks against a short list of current BPF helpers that do call subprogs, encapsulated in is_callback_calling_function() function. But to prevent accidentally forgetting to add new BPF helpers to this "list", we also do a sanity check in __check_func_call, which has to be called for each such special BPF helper, to validate that BPF helper is indeed recognized as callback-calling one. This should catch any missed checks in the future. Adding some special flags to be added in function proto definitions seemed like an overkill in this case. With the above changes, it's possible to remove forceful setting of reg->precise to true in __mark_reg_unknown, which turns on precision tracking both inside subprogs and entry progs that have subprogs. No warnings or errors were detected across all the selftests, but also when validating with veristat against internal Meta BPF objects and Cilium objects. Further, in some BPF programs there are noticeable reduction in number of states and instructions validated due to more effective precision tracking, especially benefiting syncookie test. $ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/subprog-precise-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ---------------------------------------- -------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- pyperf600_bpf_loop.bpf.linked1.o on_event 3966 3678 -288 (-7.26%) 306 276 -30 (-9.80%) pyperf_global.bpf.linked1.o on_event 7563 7530 -33 (-0.44%) 520 517 -3 (-0.58%) pyperf_subprogs.bpf.linked1.o on_event 36358 36934 +576 (+1.58%) 2499 2531 +32 (+1.28%) setget_sockopt.bpf.linked1.o skops_sockopt 3965 4038 +73 (+1.84%) 343 347 +4 (+1.17%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 64965 64901 -64 (-0.10%) 4619 4612 -7 (-0.15%) test_misc_tcp_hdr_options.bpf.linked1.o misc_estab 1491 1307 -184 (-12.34%) 110 100 -10 (-9.09%) test_pkt_access.bpf.linked1.o test_pkt_access 354 349 -5 (-1.41%) 25 24 -1 (-4.00%) test_sock_fields.bpf.linked1.o egress_read_sock_fields 435 375 -60 (-13.79%) 22 20 -2 (-9.09%) test_sysctl_loop2.bpf.linked1.o sysctl_tcp_mem 1508 1501 -7 (-0.46%) 29 28 -1 (-3.45%) test_tc_dtime.bpf.linked1.o egress_fwdns_prio100 468 435 -33 (-7.05%) 45 41 -4 (-8.89%) test_tc_dtime.bpf.linked1.o ingress_fwdns_prio100 398 408 +10 (+2.51%) 42 39 -3 (-7.14%) test_tc_dtime.bpf.linked1.o ingress_fwdns_prio101 1096 842 -254 (-23.18%) 97 73 -24 (-24.74%) test_tcp_hdr_options.bpf.linked1.o estab 2758 2408 -350 (-12.69%) 208 181 -27 (-12.98%) test_urandom_usdt.bpf.linked1.o urand_read_with_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_urandom_usdt.bpf.linked1.o urand_read_without_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_urandom_usdt.bpf.linked1.o urandlib_read_with_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_urandom_usdt.bpf.linked1.o urandlib_read_without_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_xdp_noinline.bpf.linked1.o balancer_ingress_v6 4302 4294 -8 (-0.19%) 257 256 -1 (-0.39%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 583722 405757 -177965 (-30.49%) 35846 25735 -10111 (-28.21%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 609123 479055 -130068 (-21.35%) 35452 29145 -6307 (-17.79%) 
---------------------------------------- -------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
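The is_callback_calling_function() check mentioned in this entry is easy to picture; a hedged sketch follows, where the exact helper list is an assumption based on the callback-taking helpers the text names (bpf_loop "and few others"):

    static bool is_callback_calling_function(enum bpf_func_id func_id)
    {
            /* helpers that invoke a BPF-provided callback (assumed list) */
            return func_id == BPF_FUNC_for_each_map_elem ||
                   func_id == BPF_FUNC_timer_set_callback ||
                   func_id == BPF_FUNC_find_vma ||
                   func_id == BPF_FUNC_loop ||
                   func_id == BPF_FUNC_user_ringbuf_drain;
    }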
2023-07-27  bpf: Repeat check_max_stack_depth for async callbacks  (Kumar Kartikeya Dwivedi)
[ Upstream commit b5e9ad522c4ccd32d322877515cff8d47ed731b9 ] While the check_max_stack_depth function explores call chains emanating from the main prog, which is typically enough to cover all possible call chains, it doesn't explore those rooted at async callbacks unless the async callback will have been directly called, since unlike non-async callbacks it skips their instruction exploration as they don't contribute to stack depth. It could be the case that the async callback leads to a callchain which exceeds the stack depth, but this is never reachable while only exploring the entry point from main subprog. Hence, repeat the check for the main subprog *and* all async callbacks marked by the symbolic execution pass of the verifier, as execution of the program may begin at any of them. Consider functions with following stack depths: main: 256 async: 256 foo: 256 main: rX = async bpf_timer_set_callback(...) async: foo() Here, async is not descended as it does not contribute to stack depth of main (since it is referenced using bpf_pseudo_func and not bpf_pseudo_call). However, when async is invoked asynchronously, it will end up breaching the MAX_BPF_STACK limit by calling foo. Hence, in addition to main, we also need to explore call chains beginning at all async callback subprogs in a program. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230717161530.1238-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
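The main/async/foo shape above can be written as an ordinary BPF C program. The following is purely illustrative (map, section and function names are invented); it only shows why the async callback's call chain is reachable solely through bpf_timer_set_callback() and not from main_prog() directly:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct elem {
            struct bpf_timer t;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, int);
            __type(value, struct elem);
    } timer_map SEC(".maps");

    static __noinline int foo(void)
    {
            volatile char buf[256] = {};            /* foo: 256 bytes of stack */

            return buf[0];
    }

    static int async_cb(void *map, int *key, struct elem *val)
    {
            volatile char buf[256] = {};            /* async: another 256 bytes */

            buf[0] = (char)foo();                   /* async -> foo */
            return 0;
    }

    SEC("tc")
    int main_prog(struct __sk_buff *skb)
    {
            volatile char buf[256] = {};            /* main: 256 bytes */
            int key = 0;
            struct elem *val = bpf_map_lookup_elem(&timer_map, &key);

            if (!val)
                    return 0;
            bpf_timer_init(&val->t, &timer_map, 0);
            bpf_timer_set_callback(&val->t, async_cb);  /* async_cb is never called directly */
            bpf_timer_start(&val->t, 0, 0);
            return buf[0];
    }

    char _license[] SEC("license") = "GPL";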
2023-07-27  bpf: Fix subprog idx logic in check_max_stack_depth  (Kumar Kartikeya Dwivedi)
[ Upstream commit ba7b3e7d5f9014be65879ede8fd599cb222901c9 ] The assignment to idx in check_max_stack_depth happens once we see a bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of the code performs a few checks and then pushes the frame to the frame stack, except the case of async callbacks. If the async callback case causes the loop iteration to be skipped, the idx assignment will be incorrect on the next iteration of the loop. The value stored in the frame stack (as the subprogno of the current subprog) will be incorrect. This leads to incorrect checks and incorrect tail_call_reachable marking. Save the target subprog in a new variable and only assign to idx once we are done with the is_async_cb check which may skip pushing of frame to the frame stack and subsequent stack depth checks and tail call markings. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27  bpf: Address KCSAN report on bpf_lru_list  (Martin KaFai Lau)
[ Upstream commit ee9fd0ac3017c4313be91a220a9ac4c99dde7ad4 ] KCSAN reported a data-race when accessing node->ref. Although node->ref does not have to be accurate, take this chance to use a more common READ_ONCE() and WRITE_ONCE() pattern instead of data_race(). There is an existing bpf_lru_node_is_ref() and bpf_lru_node_set_ref(). This patch also adds bpf_lru_node_clear_ref() to do the WRITE_ONCE(node->ref, 0) also. ================================================================== BUG: KCSAN: data-race in __bpf_lru_list_rotate / __htab_lru_percpu_map_update_elem write to 0xffff888137038deb of 1 bytes by task 11240 on cpu 1: __bpf_lru_node_move kernel/bpf/bpf_lru_list.c:113 [inline] __bpf_lru_list_rotate_active kernel/bpf/bpf_lru_list.c:149 [inline] __bpf_lru_list_rotate+0x1bf/0x750 kernel/bpf/bpf_lru_list.c:240 bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:329 [inline] bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline] bpf_lru_pop_free+0x638/0xe20 kernel/bpf/bpf_lru_list.c:499 prealloc_lru_pop kernel/bpf/hashtab.c:290 [inline] __htab_lru_percpu_map_update_elem+0xe7/0x820 kernel/bpf/hashtab.c:1316 bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313 bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687 bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534 __sys_bpf+0x338/0x810 __do_sys_bpf kernel/bpf/syscall.c:5096 [inline] __se_sys_bpf kernel/bpf/syscall.c:5094 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888137038deb of 1 bytes by task 11241 on cpu 0: bpf_lru_node_set_ref kernel/bpf/bpf_lru_list.h:70 [inline] __htab_lru_percpu_map_update_elem+0x2f1/0x820 kernel/bpf/hashtab.c:1332 bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313 bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687 bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534 __sys_bpf+0x338/0x810 __do_sys_bpf kernel/bpf/syscall.c:5096 [inline] __se_sys_bpf kernel/bpf/syscall.c:5094 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x01 -> 0x00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 11241 Comm: syz-executor.3 Not tainted 6.3.0-rc7-syzkaller-00136-g6a66fdd29ea1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 ================================================================== Reported-by: syzbot+ebe648a84e8784763f82@syzkaller.appspotmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230511043748.1384166-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
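A minimal sketch of the accessor pattern this entry describes (simplified; the real helpers may also avoid redundantly dirtying the cache line):

    static inline bool bpf_lru_node_is_ref(const struct bpf_lru_node *node)
    {
            return READ_ONCE(node->ref);
    }

    static inline void bpf_lru_node_set_ref(struct bpf_lru_node *node)
    {
            WRITE_ONCE(node->ref, 1);
    }

    /* new helper, so that clearing the flag also goes through WRITE_ONCE() */
    static inline void bpf_lru_node_clear_ref(struct bpf_lru_node *node)
    {
            WRITE_ONCE(node->ref, 0);
    }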
2023-07-27  bpf: Print a warning only if writing to unprivileged_bpf_disabled.  (Kui-Feng Lee)
[ Upstream commit fedf99200ab086c42a572fca1d7266b06cdc3e3f ] Only print the warning message if you are writing to "/proc/sys/kernel/unprivileged_bpf_disabled". The kernel may print an annoying warning when you read "/proc/sys/kernel/unprivileged_bpf_disabled" saying WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks! However, this message is only meaningful when the feature is disabled or enabled. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230502181418.308479-1-kuifeng@meta.com Signed-off-by: Sasha Levin <sashal@kernel.org>
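The shape of the change, as described, is to trigger the notification (and its warning) only on the write path of the sysctl handler. The handler below is reconstructed from memory and should be read as a sketch, not the exact patch:

    static int bpf_unpriv_handler(struct ctl_table *table, int write,
                                  void *buffer, size_t *lenp, loff_t *ppos)
    {
            int ret, unpriv_enable = *(int *)table->data;
            bool locked_state = unpriv_enable == 1;
            struct ctl_table tmp = *table;

            if (write && !capable(CAP_SYS_ADMIN))
                    return -EPERM;

            tmp.data = &unpriv_enable;
            ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
            if (write && !ret) {
                    if (locked_state && unpriv_enable != 1)
                            return -EPERM;
                    *(int *)table->data = unpriv_enable;
                    /* notify only on write, so a plain read stays silent */
                    unpriv_ebpf_notify(unpriv_enable);
            }

            return ret;
    }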
2023-07-23  bpf: cpumap: Fix memory leak in cpu_map_update_elem  (Pu Lehui)
[ Upstream commit 4369016497319a9635702da010d02af1ebb1849d ] Syzkaller reported a memory leak as follows: BUG: memory leak unreferenced object 0xff110001198ef748 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 4a 19 00 00 80 ad e3 e4 fe ff c0 00 ....J........... 00 b2 d3 0c 01 00 11 ff 28 f5 8e 19 01 00 11 ff ........(....... backtrace: [<ffffffffadd28087>] __cpu_map_entry_alloc+0xf7/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff110001198ef528 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffffadd281f0>] __cpu_map_entry_alloc+0x260/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff1100010fd93d68 (size 8): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [<ffffffffade5db3e>] kvmalloc_node+0x11e/0x170 [<ffffffffadd28280>] __cpu_map_entry_alloc+0x2f0/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 In the cpu_map_update_elem flow, when kthread_stop is called before calling the threadfn of rcpu->kthread, since the KTHREAD_SHOULD_STOP bit of kthread has been set by kthread_stop, the threadfn of rcpu->kthread will never be executed, and rcpu->refcnt will never be 0, which will lead to the allocated rcpu, rcpu->queue and rcpu->queue->queue cannot be released. Calling kthread_stop before executing kthread's threadfn will return -EINTR. We can complete the release of memory resources in this state. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Signed-off-by: Pu Lehui <pulehui@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230711115848.2701559-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23  bpf: Fix max stack depth check for async callbacks  (Kumar Kartikeya Dwivedi)
[ Upstream commit 5415ccd50a8620c8cbaa32d6f18c946c453566f5 ] The check_max_stack_depth pass happens after the verifier's symbolic execution, and attempts to walk the call graph of the BPF program, ensuring that the stack usage stays within bounds for all possible call chains. There are two cases to consider: bpf_pseudo_func and bpf_pseudo_call. In the former case, the callback pointer is loaded into a register, and is assumed that it is passed to some helper later which calls it (however there is no way to be sure), but the check remains conservative and accounts the stack usage anyway. For this particular case, asynchronous callbacks are skipped as they execute asynchronously when their corresponding event fires. The case of bpf_pseudo_call is simpler and we know that the call is definitely made, hence the stack depth of the subprog is accounted for. However, the current check still skips an asynchronous callback even if a bpf_pseudo_call was made for it. This is erroneous, as it will miss accounting for the stack usage of the asynchronous callback, which can be used to breach the maximum stack depth limit. Fix this by only skipping asynchronous callbacks when the instruction is not a pseudo call to the subprog. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230705144730.235802-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  bpf, btf: Warn but return no error for NULL btf from __register_btf_kfunc_id_set()  (SeongJae Park)
[ Upstream commit 3de4d22cc9ac7c9f38e10edcf54f9a8891a9c2aa ] __register_btf_kfunc_id_set() assumes .BTF to be part of the module's .ko file if CONFIG_DEBUG_INFO_BTF is enabled. If that's not the case, the function prints an error message and return an error. As a result, such modules cannot be loaded. However, the section could be stripped out during a build process. It would be better to let the modules loaded, because their basic functionalities have no problem [0], though the BTF functionalities will not be supported. Make the function to lower the level of the message from error to warn, and return no error. [0] https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion Fixes: c446fdacb10d ("bpf: fix register_btf_kfunc_id_set for !CONFIG_DEBUG_INFO_BTF") Reported-by: Alexander Egorenkov <Alexander.Egorenkov@ibm.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/87y228q66f.fsf@oc8242746057.ibm.com Link: https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion Link: https://lore.kernel.org/bpf/20230701171447.56464-1-sj@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  bpf: Fix memleak due to fentry attach failure  (Yafang Shao)
[ Upstream commit 108598c39eefbedc9882273ac0df96127a629220 ] If it fails to attach fentry, the allocated bpf trampoline image will be left in the system. That can be verified by checking /proc/kallsyms. This meamleak can be verified by a simple bpf program as follows: SEC("fentry/trap_init") int fentry_run() { return 0; } It will fail to attach trap_init because this function is freed after kernel init, and then we can find the trampoline image is left in the system by checking /proc/kallsyms. $ tail /proc/kallsyms ffffffffc0613000 t bpf_trampoline_6442453466_1 [bpf] ffffffffc06c3000 t bpf_trampoline_6442453466_1 [bpf] $ bpftool btf dump file /sys/kernel/btf/vmlinux | grep "FUNC 'trap_init'" [2522] FUNC 'trap_init' type_id=119 linkage=static $ echo $((6442453466 & 0x7fffffff)) 2522 Note that there are two left bpf trampoline images, that is because the libbpf will fallback to raw tracepoint if -EINVAL is returned. Fixes: e21aa341785c ("bpf: Fix fexit trampoline.") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/bpf/20230515130849.57502-2-laoar.shao@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
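The reproducer quoted in the message, laid out as a buildable skeleton (the includes and the license line are additions for completeness):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* trap_init() is freed after kernel init, so the attach fails; before the
     * fix, the bpf trampoline image prepared for it was leaked.
     */
    SEC("fentry/trap_init")
    int fentry_run()
    {
            return 0;
    }

    char _license[] SEC("license") = "GPL";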
2023-07-19  bpf: Remove bpf trampoline selector  (Yafang Shao)
[ Upstream commit 47e79cbeea4b3891ad476047f4c68543eb51c8e0 ] After commit e21aa341785c ("bpf: Fix fexit trampoline."), the selector is only used to indicate how many times the bpf trampoline image are updated and been displayed in the trampoline ksym name. After the trampoline is freed, the selector will start from 0 again. So the selector is a useless value to the user. We can remove it. If the user want to check whether the bpf trampoline image has been updated or not, the user can compare the address. Each time the trampoline image is updated, the address will change consequently. Jiri also pointed out another issue that perf is still using the old name "bpf_trampoline_%lu", so this change can fix the issue in perf. Fixes: e21aa341785c ("bpf: Fix fexit trampoline.") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/bpf/ZFvOOlrmHiY9AgXE@krava Link: https://lore.kernel.org/bpf/20230515130849.57502-3-laoar.shao@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen  (Stanislav Fomichev)
[ Upstream commit 29ebbba7d46136cba324264e513a1e964ca16c0a ] With the way the hooks implemented right now, we have a special condition: optval larger than PAGE_SIZE will expose only first 4k into BPF; any modifications to the optval are ignored. If the BPF program doesn't handle this condition by resetting optlen to 0, the userspace will get EFAULT. The intention of the EFAULT was to make it apparent to the developers that the program is doing something wrong. However, this inadvertently might affect production workloads with the BPF programs that are not too careful (i.e., returning EFAULT for perfectly valid setsockopt/getsockopt calls). Let's try to minimize the chance of BPF program screwing up userspace by ignoring the output of those BPF programs (instead of returning EFAULT to the userspace). pr_info_once those cases to the dmesg to help with figuring out what's going wrong. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
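For reference, the defensive pattern the text alludes to ("resetting optlen to 0") looks roughly like this on the BPF side; the page size is hard-coded here as an assumption:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define ASSUMED_PAGE_SIZE 4096          /* assumption: 4K pages */

    SEC("cgroup/getsockopt")
    int getsockopt_passthrough(struct bpf_sockopt *ctx)
    {
            /* An optval larger than one page is only partially visible to BPF.
             * Set optlen to 0 so the kernel keeps the original value instead of
             * failing the syscall with EFAULT (the pre-fix behaviour).
             */
            if (ctx->optlen > ASSUMED_PAGE_SIZE)
                    ctx->optlen = 0;

            return 1;                       /* allow the syscall */
    }

    char _license[] SEC("license") = "GPL";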
2023-06-28  bpf: Force kprobe multi expected_attach_type for kprobe_multi link  (Jiri Olsa)
[ Upstream commit db8eae6bc5c702d8e3ab2d0c6bb5976c131576eb ] We currently allow to create perf link for program with expected_attach_type == BPF_TRACE_KPROBE_MULTI. This will cause crash when we call helpers like get_attach_cookie or get_func_ip in such program, because it will call the kprobe_multi's version (current->bpf_ctx context setup) of those helpers while it expects perf_link's current->bpf_ctx context setup. Making sure that we use BPF_TRACE_KPROBE_MULTI expected_attach_type only for programs attaching through kprobe_multi link. Fixes: ca74823c6e16 ("bpf: Add cookie support to programs attached with kprobe multi link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230618131414.75649-1-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28  bpf/btf: Accept function names that contain dots  (Florent Revest)
[ Upstream commit 9724160b3942b0a967b91a59f81da5593f28b8ba ] When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn, pahole creates BTF_KIND_FUNC entries for these and this makes the BTF metadata validation fail because they contain a dot. In a dramatic turn of event, this BTF verification failure can cause the netfilter_bpf initialization to fail, causing netfilter_core to free the netfilter_helper hashmap and netfilter_ftp to trigger a use-after-free. The risk of u-a-f in netfilter will be addressed separately but the existence of "asan.module_ctor" debug info under some build conditions sounds like a good enough reason to accept functions that contain dots in BTF. Although using only LLVM=1 is the recommended way to compile clang-based kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still try to support that combination according to Nick. To clarify: - > v5.10 kernel, LLVM=1 (LLVM_IAS=0 is not the default) is recommended, but user can still have LLVM=1, LLVM_IAS=0 to trigger the issue - <= 5.10 kernel, LLVM=1 (LLVM_IAS=0 is the default) is recommended in which case GNU as will be used Fixes: 1dc92851849c ("bpf: kernel side support for BTF Var and DataSec") Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Yonghong Song <yhs@meta.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28  bpf: Fix verifier id tracking of scalars on spill  (Maxim Mikityanskiy)
[ Upstream commit 713274f1f2c896d37017efee333fd44149710119 ] The following scenario describes a bug in the verifier where it incorrectly concludes about equivalent scalar IDs which could lead to verifier bypass in privileged mode: 1. Prepare a 32-bit rogue number. 2. Put the rogue number into the upper half of a 64-bit register, and roll a random (unknown to the verifier) bit in the lower half. The rest of the bits should be zero (although variations are possible). 3. Assign an ID to the register by MOVing it to another arbitrary register. 4. Perform a 32-bit spill of the register, then perform a 32-bit fill to another register. Due to a bug in the verifier, the ID will be preserved, although the new register will contain only the lower 32 bits, i.e. all zeros except one random bit. At this point there are two registers with different values but the same ID, which means the integrity of the verifier state has been corrupted. 5. Compare the new 32-bit register with 0. In the branch where it's equal to 0, the verifier will believe that the original 64-bit register is also 0, because it has the same ID, but its actual value still contains the rogue number in the upper half. Some optimizations of the verifier prevent the actual bypass, so extra care is needed: the comparison must be between two registers, and both branches must be reachable (this is why one random bit is needed). Both branches are still suitable for the bypass. 6. Right shift the original register by 32 bits to pop the rogue number. 7. Use the rogue number as an offset with any pointer. The verifier will believe that the offset is 0, while in reality it's the given number. The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for SCALAR_VALUE. If the spill is narrowing the actual register value, don't keep the ID, make sure it's reset to 0. Fixes: 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill") Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Andrii Nakryiko <andrii@kernel.org> # Checked veristat delta Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230607123951.558971-2-maxtram95@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
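Steps 1-7 can be pictured as a verifier-selftest style instruction sequence. This rendering is purely illustrative: the register choices, jump offset and rogue constant are invented, r6 is assumed to hold one verifier-unknown bit and r7 a valid pointer:

    BPF_MOV64_IMM(BPF_REG_1, 0x12345678),           /* 1: rogue 32-bit number   */
    BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 32),          /* 2: into the upper half   */
    BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_6),    /*    plus one unknown bit  */
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),            /* 3: MOV shares the ID     */
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),  /* 4: 32-bit spill ...      */
    BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_10, -8),  /*    ... and fill; the ID  */
                                                    /*    was (wrongly) kept    */
    BPF_MOV64_IMM(BPF_REG_4, 0),
    BPF_JMP_REG(BPF_JNE, BPF_REG_3, BPF_REG_4, 3),  /* 5: reg-to-reg compare    */
    BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 32),          /* 6: "pop" the rogue value;*/
                                                    /*    verifier thinks r1==0 */
    BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_1),   /* 7: rogue offset applied  */
    BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_7, 0),    /*    to a pointer access   */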
2023-06-28  bpf: track immediate values written to stack by BPF_ST instruction  (Eduard Zingerman)
[ Upstream commit ecdf985d7615356b78241fdb159c091830ed0380 ] For aligned stack writes using BPF_ST instruction track stored values in a same way BPF_STX is handled, e.g. make sure that the following commands produce similar verifier knowledge: fp[-8] = 42; r1 = 42; fp[-8] = r1; This covers two cases: - non-null values written to stack are stored as spill of fake registers; - null values written to stack are stored as STACK_ZERO marks. Previously both cases above used STACK_MISC marks instead. Some verifier test cases relied on the old logic to obtain STACK_MISC marks for some stack values. These test cases are updated in the same commit to avoid failures during bisect. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 713274f1f2c8 ("bpf: Fix verifier id tracking of scalars on spill") Signed-off-by: Sasha Levin <sashal@kernel.org>
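In instruction-macro form, the equivalence being established is that a BPF_ST of an immediate:

    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 42),         /* fp[-8] = 42 */

is now tracked with the same verifier knowledge as the BPF_STX sequence:

    BPF_MOV64_IMM(BPF_REG_1, 42),                   /* r1 = 42     */
    BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_1, -8), /* fp[-8] = r1 */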
2023-06-28  bpf: ensure main program has an extable  (Krister Johansen)
commit 0108a4e9f3584a7a2c026d1601b0682ff7335d95 upstream. When subprograms are in use, the main program is not jit'd after the subprograms because jit_subprogs sets a value for prog->bpf_func upon success. Subsequent calls to the JIT are bypassed when this value is non-NULL. This leads to a situation where the main program and its func[0] counterpart are both in the bpf kallsyms tree, but only func[0] has an extable. Extables are only created during JIT. Now there are two nearly identical program ksym entries in the tree, but only one has an extable. Depending upon how the entries are placed, there's a chance that a fault will call search_extable on the aux with the NULL entry. Since jit_subprogs already copies state from func[0] to the main program, include the extable pointer in this state duplication. Additionally, ensure that the copy of the main program in func[0] is not added to the bpf_prog_kallsyms table. Instead, let the main program get added later in bpf_prog_load(). This ensures there is only a single copy of the main program in the kallsyms table, and that its tag matches the tag observed by tooling like bpftool. Cc: stable@vger.kernel.org Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21  cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers  (Kamalesh Babulal)
[ Upstream commit 4cdb91b0dea7d7f59fa84a13c7753cd434fdedcf ] Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay consistent across cgroup core and other subsystem code, while operating on the cgroup_mutex. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Stable-dep-of: 2bd110339288 ("cgroup: always put cset in cgroup_css_set_put_fork") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14  bpf: Fix elem_size not being set for inner maps  (Rhys Rustad-Elliott)
[ Upstream commit cba41bb78d70aad98d8e61e019fd48c561f7f396 ] Commit d937bc3449fa ("bpf: make uniform use of array->elem_size everywhere in arraymap.c") changed array_map_gen_lookup to use array->elem_size instead of round_up(map->value_size, 8) as the element size when generating code to access a value in an array map. array->elem_size, however, is not set by bpf_map_meta_alloc when initializing an BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS. This results in array_map_gen_lookup incorrectly outputting code that always accesses index 0 in the array (as the index will be calculated via a multiplication with the element size, which is incorrectly set to 0). Set elem_size on the bpf_array object when allocating an array or hash of maps to fix this. Fixes: d937bc3449fa ("bpf: make uniform use of array->elem_size everywhere in arraymap.c") Signed-off-by: Rhys Rustad-Elliott <me@rhysre.net> Link: https://lore.kernel.org/r/20230602190110.47068-2-me@rhysre.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30  bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps  (Anton Protopopov)
commit b34ffb0c6d23583830f9327864b9c1f486003305 upstream. The LRU and LRU_PERCPU maps allocate a new element on update before locking the target hash table bucket. Right after that the maps try to lock the bucket. If this fails, then maps return -EBUSY to the caller without releasing the allocated element. This makes the element untracked: it doesn't belong to either of free lists, and it doesn't belong to the hash table, so can't be re-used; this eventually leads to the permanent -ENOMEM on LRU map updates, which is unexpected. Fix this by returning the element to the local free list if bucket locking fails. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30  bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields  (Will Deacon)
commit 0613d8ca9ab382caabe9ed2dceb429e9781e443f upstream. A narrow load from a 64-bit context field results in a 64-bit load followed potentially by a 64-bit right-shift and then a bitwise AND operation to extract the relevant data. In the case of a 32-bit access, an immediate mask of 0xffffffff is used to construct a 64-bit BPP_AND operation which then sign-extends the mask value and effectively acts as a glorified no-op. For example: 0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0) results in the following code generation for a 64-bit field: ldr x7, [x7] // 64-bit load mov x10, #0xffffffffffffffff and x7, x7, x10 Fix the mask generation so that narrow loads always perform a 32-bit AND operation: ldr x7, [x7] // 64-bit load mov w10, #0xffffffff and w7, w7, w10 Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Krzesimir Nowak <krzesimir@kinvolk.io> Cc: Andrey Ignatov <rdna@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-24  bpf: Add preempt_count_{sub,add} into btf id deny list  (Yafang)
[ Upstream commit c11bd046485d7bf1ca200db0e7d0bdc4bafdd395 ] The recursion check in __bpf_prog_enter* and __bpf_prog_exit* leave preempt_count_{sub,add} unprotected. When attaching trampoline to them we get panic as follows, [ 867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28) [ 867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI [ 867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4 [ 867.843100] Call Trace: [ 867.843101] <TASK> [ 867.843104] asm_exc_int3+0x3a/0x40 [ 867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0 [ 867.843135] __bpf_prog_enter_recur+0x17/0x90 [ 867.843148] bpf_trampoline_6442468108_0+0x2e/0x1000 [ 867.843154] ? preempt_count_sub+0x1/0xa0 [ 867.843157] preempt_count_sub+0x5/0xa0 [ 867.843159] ? migrate_enable+0xac/0xf0 [ 867.843164] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843168] bpf_trampoline_6442468108_0+0x55/0x1000 ... [ 867.843788] preempt_count_sub+0x5/0xa0 [ 867.843793] ? migrate_enable+0xac/0xf0 [ 867.843829] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35) [ 867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c) [ 867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec) [ 867.843842] bpf_trampoline_6442468108_0+0x55/0x1000 ... That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are called after prog->active is decreased. Fixing this by adding these two functions into btf ids deny list. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang <laoar.shao@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
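The change amounts to two more entries in the verifier's BTF ID deny set. A sketch of the shape (the neighbouring entries are examples of what already lives there, not a verbatim copy):

    BTF_SET_START(btf_id_deny)
    BTF_ID_UNUSED
    #ifdef CONFIG_SMP
    BTF_ID(func, migrate_disable)
    BTF_ID(func, migrate_enable)
    #endif
    /* new: tracing these recurses through __bpf_prog_enter/exit themselves */
    BTF_ID(func, preempt_count_add)
    BTF_ID(func, preempt_count_sub)
    BTF_SET_END(btf_id_deny)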
2023-05-24  bpf: Annotate data races in bpf_local_storage  (Kumar Kartikeya Dwivedi)
[ Upstream commit 0a09a2f933c73dc76ab0b72da6855f44342a8903 ] There are a few cases where hlist_node is checked to be unhashed without holding the lock protecting its modification. In this case, one must use hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix this by using lockless variant in places not protected by the lock. Since this is not prompted by any actual KCSAN reports but only from code review, I have not included a fixes tag. Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230221200646.2500777-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
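The pattern boils down to using the lockless hlist helper for checks done outside the bucket lock. A minimal sketch, with the helper names treated as assumptions:

    /* callers hold the bucket lock: a plain check is fine */
    static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
    {
            return !hlist_unhashed(&selem->map_node);
    }

    /* racy check without the lock: avoid load tearing and KCSAN reports */
    static bool selem_linked_to_map_lockless(const struct bpf_local_storage_elem *selem)
    {
            return !hlist_unhashed_lockless(&selem->map_node);
    }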
2023-05-11  bpf: Don't EFAULT for getsockopt with optval=NULL  (Stanislav Fomichev)
[ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ] Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7 Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11  bpf: Fix race between btf_put and btf_idr walk.  (Alexei Starovoitov)
[ Upstream commit acf1c3d68e9a31f10d92bc67ad4673cdae5e8d92 ] Florian and Eduard reported hard dead lock: [ 58.433327] _raw_spin_lock_irqsave+0x40/0x50 [ 58.433334] btf_put+0x43/0x90 [ 58.433338] bpf_find_btf_id+0x157/0x240 [ 58.433353] btf_parse_fields+0x921/0x11c0 This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock. Avoid the issue by doing btf_put() without locking. Fixes: 3d78417b60fb ("bpf: Add bpf_btf_find_by_name_kind() helper.") Fixes: 1e89106da253 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11  bpf/btf: Fix is_int_ptr()  (Feng Zhou)
[ Upstream commit 91f2dc6838c19342f7f2993627c622835cc24890 ] When tracing a kernel function with arg type is u32*, btf_ctx_access() would report error: arg2 type INT is not a struct. The commit bb6728d75611 ("bpf: Allow access to int pointer arguments in tracing programs") added support for int pointer, but did not skip modifiers before checking it's type. This patch fixes it. Fixes: bb6728d75611 ("bpf: Allow access to int pointer arguments in tracing programs") Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230410085908.98493-2-zhoufeng.zf@bytedance.com Signed-off-by: Sasha Levin <sashal@kernel.org>
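A sketch of the fix as described: resolve the pointee through any typedef/const/volatile modifiers before testing for an integer type (reconstructed, not copied from the patch):

    static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
    {
            /* t is the PTR type; skip modifiers on what it points to */
            t = btf_type_skip_modifiers(btf, t->type, NULL);

            return btf_type_is_int(t);
    }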
2023-05-11  bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation  (Daniel Borkmann)
[ Upstream commit 7be14c1c9030f73cc18b4ff23b78a0a081f16188 ] Xu reports that after commit 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking"), the following BPF program is rejected by the verifier: 0: (61) r2 = *(u32 *)(r1 +0) ; R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 3: (07) r1 += 1 4: (2d) if r1 > r3 goto pc+8 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) 6: (18) r0 = 0x7fffffffffffff10 8: (0f) r1 += r0 ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f) 9: (18) r0 = 0x8000000000000000 11: (07) r0 += 1 12: (ad) if r0 < r1 goto pc-2 13: (b7) r0 = 0 14: (95) exit And the verifier log says: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775792 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] The 64bit umin=9223372036854775810 bound continuously bumps by +1 while umax=9223372036854775823 stays as-is until the verifier complexity limit is reached and the program gets finally rejected. 
During this simulation, the umin also eventually surpasses umax. Looking at the first 'from 12 to 11' output line from the loop, R1 has the following state: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xffffffff)) The var_off has technically not an inconsistent state but it's very imprecise and far off surpassing 64bit umax bounds whereas the expected output with refined known bits in var_off should have been like: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xf)) In the above log, var_off stays as var_off=(0x8000000000000000; 0xffffffff) and does not converge into a narrower mask where more bits become known, eventually transforming R1 into a constant upon umin=9223372036854775823, umax=9223372036854775823 case where the verifier would have terminated and let the program pass. The __reg_combine_64_into_32() marks the subregister unknown and propagates 64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they are within the 32bit universe. The question came up whether __reg_combine_64_into_32() should special case the situation that when 64bit {s,u}min bounds have the same value as 64bit {s,u}max bounds to then assign the latter as well to the 32bit reg->{s,u}32_{min,max}_value. As can be seen from the above example however, that is just /one/ special case and not a /generic/ solution given above example would still not be addressed this way and remain at an imprecise var_off=(0x8000000000000000; 0xffffffff). The improvement is needed in __reg_bound_offset() to refine var32_off with the updated var64_off instead of the prior reg->var_off. The reg_bounds_sync() code first refines information about the register's min/max bounds via __update_reg_bounds() from the current var_off, then in __reg_deduce_bounds() from sign bit and with the potentially learned bits from bounds it'll update the var_off tnum in __reg_bound_offset(). For example, intersecting with the old var_off might have improved bounds slightly, e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), then new var_off will then result in (0; 0x7f...fc). The intersected var64_off holds then the universe which is a superset of var32_off. The point for the latter is not to broaden, but to further refine known bits based on the intersection of var_off with 32 bit bounds, so that we later construct the final var_off from upper and lower 32 bits. The final __update_reg_bounds() can then potentially still slightly refine bounds if more bits became known from the new var_off. 
After the improvement, we can see R1 converging successively: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=-9223372036854775806 13: safe from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775805 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775805 R1_w=-9223372036854775805 13: safe [...] from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775797 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775797 R1=-9223372036854775797 13: safe from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775796 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775796 R1=-9223372036854775796 13: safe from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775795 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775795 R1=-9223372036854775795 13: safe from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=-9223372036854775794 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 last_idx 12 first_idx 12 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 
R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=1 stack=0 before 11: (07) r0 += 1 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000 last_idx 12 first_idx 12 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=2 stack=0 before 11: (07) r0 += 1 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000 regs=2 stack=0 before 8: (0f) r1 += r0 regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10 regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0) 13: safe from 4 to 13: safe verification time 322 usec stack depth 0 processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1 This also fixes up a test case along with this improvement where we match on the verifier log. The updated log now has a refined var_off, too. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Xu Kuohai <xukuohai@huaweicloud.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230314203424.4015351-2-xukuohai@huaweicloud.com Link: https://lore.kernel.org/bpf/20230322213056.2470-1-daniel@iogearbox.net Signed-off-by: Sasha Levin <sashal@kernel.org>
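To make the var_off refinement described above concrete, here is a small stand-alone sketch modeled on the kernel's tnum_range()/tnum_intersect() helpers (simplified and not the actual kernel source): it intersects the old var_off with a tnum derived from the refined 64-bit bounds and reproduces the expected (0x8000000000000000; 0xf) result.

  #include <stdint.h>
  #include <stdio.h>

  struct tnum { uint64_t value; uint64_t mask; };   /* mask bits = unknown bits */

  static struct tnum tnum_range(uint64_t min, uint64_t max)
  {
          uint64_t chi = min ^ max, delta;
          int bits = chi ? 64 - __builtin_clzll(chi) : 0;

          if (bits > 63)                             /* whole 64-bit space unknown */
                  return (struct tnum){ 0, ~0ULL };
          delta = (1ULL << bits) - 1;                /* low 'bits' bits are unknown */
          return (struct tnum){ min & ~delta, delta };
  }

  static struct tnum tnum_intersect(struct tnum a, struct tnum b)
  {
          uint64_t mu = a.mask & b.mask;             /* unknown in both stays unknown */
          uint64_t v  = a.value | b.value;

          return (struct tnum){ v & ~mu, mu };
  }

  int main(void)
  {
          /* old var_off from the log above and the refined 64-bit bounds */
          struct tnum var64_off = { 0x8000000000000000ULL, 0xffffffffULL };
          struct tnum bounds    = tnum_range(0x8000000000000002ULL, 0x800000000000000fULL);
          struct tnum refined   = tnum_intersect(var64_off, bounds);

          printf("var_off=(%#llx; %#llx)\n",
                 (unsigned long long)refined.value, (unsigned long long)refined.mask);
          /* prints var_off=(0x8000000000000000; 0xf) */
          return 0;
  }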
2023-05-11bpf: Remove misleading spec_v1 check on var-offset stack readLuis Gerhorst
[ Upstream commit 082cdc69a4651dd2a77539d69416a359ed1214f5 ] For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals() ensures that the resulting pointer has a constant offset if bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds() which in turn calls check_stack_access_for_ptr_arithmetic(). There, -EACCESS is returned if the register's offset is not constant, thereby rejecting the program. In summary, an unprivileged user must never be able to create stack pointers with a variable offset. That is also the case, because a respective check in check_stack_write() is missing. If they were able to create a variable-offset pointer, users could still use it in a stack-write operation to trigger unsafe speculative behavior [1]. Because unprivileged users must already be prevented from creating variable-offset stack pointers, viable options are to either remove this check (replacing it with a clarifying comment), or to turn it into a "verifier BUG"-message, also adding a similar check in check_stack_write() (for consistency, as a second-level defense). This patch implements the first option to reduce verifier bloat. This check was introduced by commit 01f810ace9ed ("bpf: Allow variable-offset stack access") which correctly notes that "variable-offset reads and writes are disallowed (they were already disallowed for the indirect access case) because the speculative execution checking code doesn't support them". However, it does not further discuss why the check in check_stack_read() is necessary. The code which made this check obsolete was also introduced in this commit. I have compiled ~650 programs from the Linux selftests, Linux samples, Cilium, and libbpf/examples projects and confirmed that none of these trigger the check in check_stack_read() [2]. Instead, all of these programs are, as expected, already rejected when constructing the variable-offset pointers. Note that the check in check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the code removed by this patch does not (the error removed does not appear in the "verification_error" values). For reproducibility, the repository linked includes the raw data and scripts used to create the plot. [1] https://arxiv.org/pdf/1807.03757.pdf [2] https://gitlab.cs.fau.de/un65esoq/bpf-spectre/-/raw/53dc19fcf459c186613b1156a81504b39c8d49db/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf?inline=false Fixes: 01f810ace9ed ("bpf: Allow variable-offset stack access") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: fix precision propagation verbose loggingAndrii Nakryiko
[ Upstream commit 34f0677e7afd3a292bc1aadda7ce8e35faedb204 ] Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is. Fixes: 529409ea92d5 ("bpf: propagate precision across all frames, not just the last one") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: take into account liveness when propagating precisionAndrii Nakryiko
[ Upstream commit 52c2b005a3c18c565fc70cfd0ca49375f301e952 ] When doing state comparison, if the old state has a register that is not marked as REG_LIVE_READ, then we just skip the comparison, regardless of the state of the corresponding register in the current state. This is because a register that is not REG_LIVE_READ is irrelevant for further program execution and correctness. All good here. But when we get to precision propagation, after two states were declared equivalent, we don't take into account the old register's liveness, and thus attempt to propagate precision for a register in the current state even if that register in the old state was not REG_LIVE_READ anymore. This is bad, because the register in the current state could be anything at all and this could cause -EFAULT due to internal logic bugs. Fix by taking into account the REG_LIVE_READ liveness mark to keep the logic in state comparison in sync with precision propagation. Fixes: a3ce685dd01a ("bpf: fix precision tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230309224131.57449-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26bpf: Fix incorrect verifier pruning due to missing register precision taintsDaniel Borkmann
[ Upstream commit 71b547f561247897a0a14f3082730156c0533fed ] Juan Jose et al reported an issue found via fuzzing where the verifier's pruning logic prematurely marks a program path as safe. Consider the following program: 0: (b7) r6 = 1024 1: (b7) r7 = 0 2: (b7) r8 = 0 3: (b7) r9 = -2147483648 4: (97) r6 %= 1025 5: (05) goto pc+0 6: (bd) if r6 <= r9 goto pc+2 7: (97) r6 %= 1 8: (b7) r9 = 0 9: (bd) if r6 <= r9 goto pc+1 10: (b7) r6 = 0 11: (b7) r0 = 0 12: (63) *(u32 *)(r10 -4) = r0 13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48) 15: (bf) r1 = r4 16: (bf) r2 = r10 17: (07) r2 += -4 18: (85) call bpf_map_lookup_elem#1 19: (55) if r0 != 0x0 goto pc+1 20: (95) exit 21: (77) r6 >>= 10 22: (27) r6 *= 8192 23: (bf) r1 = r0 24: (0f) r0 += r6 25: (79) r3 = *(u64 *)(r0 +0) 26: (7b) *(u64 *)(r1 +0) = r3 27: (95) exit The verifier treats this as safe, leading to oob read/write access due to an incorrect verifier conclusion: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (b7) r6 = 1024 ; R6_w=1024 1: (b7) r7 = 0 ; R7_w=0 2: (b7) r8 = 0 ; R8_w=0 3: (b7) r9 = -2147483648 ; R9_w=-2147483648 4: (97) r6 %= 1025 ; R6_w=scalar() 5: (05) goto pc+0 6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648 7: (97) r6 %= 1 ; R6_w=scalar() 8: (b7) r9 = 0 ; R9=0 9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0 10: (b7) r6 = 0 ; R6_w=0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 9 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) 19: (55) if r0 != 0x0 goto pc+1 ; R0=0 20: (95) exit from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? 21: (77) r6 >>= 10 ; R6_w=0 22: (27) r6 *= 8192 ; R6_w=0 23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0) 24: (0f) r0 += r6 last_idx 24 first_idx 19 regs=40 stack=0 before 23: (bf) r1 = r0 regs=40 stack=0 before 22: (27) r6 *= 8192 regs=40 stack=0 before 21: (77) r6 >>= 10 regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1 parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? last_idx 18 first_idx 9 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 regs=40 stack=0 before 10: (b7) r6 = 0 25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 27: (95) exit from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 11 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 frame 0: propagating r6 last_idx 19 first_idx 11 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0 last_idx 9 first_idx 9 regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0 last_idx 8 first_idx 0 regs=40 stack=0 before 8: (b7) r9 = 0 regs=40 stack=0 before 7: (97) r6 %= 1 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=40 stack=0 before 5: (05) goto pc+0 regs=40 stack=0 before 4: (97) r6 %= 1025 regs=40 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 19: safe frame 0: propagating r6 last_idx 9 first_idx 0 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=40 stack=0 before 5: (05) goto pc+0 regs=40 stack=0 before 4: (97) r6 %= 1025 regs=40 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 from 6 to 9: safe verification time 110 usec stack depth 4 processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2 The verifier considers this program as safe by mistakenly pruning unsafe code paths. In the above func#0, code lines 0-10 are of interest. In line 0-3 registers r6 to r9 are initialized with known scalar values. In line 4 the register r6 is reset to an unknown scalar given the verifier does not track modulo operations. Due to this, the verifier can also not determine precisely which branches in line 6 and 9 are taken, therefore it needs to explore them both. As can be seen, the verifier starts with exploring the false/fall-through paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic, r6 is correctly marked for precision tracking where backtracking kicks in where it walks back the current path all the way where r6 was set to 0 in the fall-through branch. Next, the pruning logics pops the path 'from 9 to 11' from the stack. Also here, the state of the registers is the same, that is, r6=0 and r9=0, so that at line 19 the path can be pruned as it is considered safe. It is interesting to note that the conditional in line 9 turned r6 into a more precise state, that is, in the fall-through path at the beginning of line 10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed here) at the beginning of line 11, r6 turned into a known const r6=0 as r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must be 0 (**): [...] ; R6_w=scalar() 9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0 [...] from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 [...] The next path is 'from 6 to 9'. 
The verifier considers the old and current state equivalent, and therefore prunes the search incorrectly. Looking into the two states which are being compared by the pruning logic at line 9, the old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag correctly set in the old state, r9 did not. Both r6'es are considered as equivalent given the old one is a superset of the current, more precise one, however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9 did not have reg->precise flag set, the verifier does not consider the register as contributing to the precision state of r6, and therefore it considered both r9 states as equivalent. However, for this specific pruned path (which is also the actual path taken at runtime), register r6 will be 0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map. The purpose of precision tracking is to initially mark registers (including spilled ones) as imprecise to help verifier's pruning logic finding equivalent states it can then prune if they don't contribute to the program's safety aspects. For example, if registers are used for pointer arithmetic or to pass constant length to a helper, then the verifier sets reg->precise flag and backtracks the BPF program instruction sequence and chain of verifier states to ensure that the given register or stack slot including their dependencies are marked as precisely tracked scalar. This also includes any other registers and slots that contribute to a tracked state of given registers/stack slot. This backtracking relies on recorded jmp_history and is able to traverse entire chain of parent states. This process ends only when all the necessary registers/slots and their transitive dependencies are marked as precise. The backtrack_insn() is called from the current instruction up to the first instruction, and its purpose is to compute a bitmask of registers and stack slots that need precision tracking in the parent's verifier state. For example, if a current instruction is r6 = r7, then r6 needs precision after this instruction and r7 needs precision before this instruction, that is, in the parent state. Hence for the latter r7 is marked and r6 unmarked. For the class of jmp/jmp32 instructions, backtrack_insn() today only looks at call and exit instructions and for all other conditionals the masks remain as-is. However, in the given situation register r6 has a dependency on r9 (as described above in **), so also that one needs to be marked for precision tracking. In other words, if an imprecise register influences a precise one, then the imprecise register should also be marked precise. Meaning, in the parent state both dest and src register need to be tracked for precision and therefore the marking must be more conservative by setting reg->precise flag for both. The precision propagation needs to cover both for the conditional: if the src reg was marked but not the dst reg and vice versa. 
After the fix the program is correctly rejected: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (b7) r6 = 1024 ; R6_w=1024 1: (b7) r7 = 0 ; R7_w=0 2: (b7) r8 = 0 ; R8_w=0 3: (b7) r9 = -2147483648 ; R9_w=-2147483648 4: (97) r6 %= 1025 ; R6_w=scalar() 5: (05) goto pc+0 6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648 7: (97) r6 %= 1 ; R6_w=scalar() 8: (b7) r9 = 0 ; R9=0 9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0 10: (b7) r6 = 0 ; R6_w=0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 9 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) 19: (55) if r0 != 0x0 goto pc+1 ; R0=0 20: (95) exit from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? 21: (77) r6 >>= 10 ; R6_w=0 22: (27) r6 *= 8192 ; R6_w=0 23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0) 24: (0f) r0 += r6 last_idx 24 first_idx 19 regs=40 stack=0 before 23: (bf) r1 = r0 regs=40 stack=0 before 22: (27) r6 *= 8192 regs=40 stack=0 before 21: (77) r6 >>= 10 regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1 parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm???? last_idx 18 first_idx 9 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 regs=40 stack=0 before 10: (b7) r6 = 0 25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar() 27: (95) exit from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 11 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 frame 0: propagating r6 last_idx 19 first_idx 11 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0 last_idx 9 first_idx 9 regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1 parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0 last_idx 8 first_idx 0 regs=240 stack=0 before 8: (b7) r9 = 0 regs=40 stack=0 before 7: (97) r6 %= 1 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 19: safe from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0 9: (bd) if r6 <= r9 goto pc+1 last_idx 9 first_idx 0 regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 last_idx 9 first_idx 0 regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 11: R6=scalar(umax=18446744071562067968) R9=-2147483648 11: (b7) r0 = 0 ; R0_w=0 12: (63) *(u32 *)(r10 -4) = r0 last_idx 12 first_idx 11 regs=1 stack=0 before 11: (b7) r0 = 0 13: R0_w=0 R10=fp0 fp-8=0000???? 13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0) 16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 17: (07) r2 += -4 ; R2_w=fp-4 18: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0) 19: (55) if r0 != 0x0 goto pc+1 ; R0_w=0 20: (95) exit from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm???? 21: (77) r6 >>= 10 ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff)) 22: (27) r6 *= 8192 ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192) 23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0) 24: (0f) r0 += r6 last_idx 24 first_idx 21 regs=40 stack=0 before 23: (bf) r1 = r0 regs=40 stack=0 before 22: (27) r6 *= 8192 regs=40 stack=0 before 21: (77) r6 >>= 10 parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm???? 
last_idx 19 first_idx 11 regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1 regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1 regs=40 stack=0 before 17: (07) r2 += -4 regs=40 stack=0 before 16: (bf) r2 = r10 regs=40 stack=0 before 15: (bf) r1 = r4 regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00 regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0 regs=40 stack=0 before 11: (b7) r0 = 0 parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0 last_idx 9 first_idx 0 regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1 regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2 regs=240 stack=0 before 5: (05) goto pc+0 regs=240 stack=0 before 4: (97) r6 %= 1025 regs=240 stack=0 before 3: (b7) r9 = -2147483648 regs=40 stack=0 before 2: (b7) r8 = 0 regs=40 stack=0 before 1: (b7) r7 = 0 regs=40 stack=0 before 0: (b7) r6 = 1024 math between map_value pointer and register with unbounded min value is not allowed verification time 886 usec stack depth 4 processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2 Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Reported-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com> Reported-by: Meador Inge <meadori@google.com> Reported-by: Simon Scannell <simonscannell@google.com> Reported-by: Nenad Stojanovski <thenenadx@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com> Reviewed-by: Meador Inge <meadori@google.com> Reviewed-by: Simon Scannell <simonscannell@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
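The conservative marking rule can be illustrated with a tiny stand-alone model (the function name and signature below are invented; it only tracks the bitmask of registers that backtracking considers as needing precision, and it reproduces the regs=40 -> regs=240 transition visible in the fixed log above):

  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  /* Illustration only; this is not backtrack_insn() itself. */
  static void mark_cond_jmp_regs(uint32_t *precise_mask, int dst_reg,
                                 bool src_is_reg, int src_reg)
  {
          uint32_t d = 1u << dst_reg;
          uint32_t s = src_is_reg ? 1u << src_reg : 0;

          /* 'if rD <op> rS goto ...' constrains each operand through the
           * other, so if either one needs precision, conservatively demand
           * precision for both in the parent state. */
          if (*precise_mask & (d | s))
                  *precise_mask |= d | s;
  }

  int main(void)
  {
          uint32_t mask = 1u << 6;                 /* r6 already marked: regs=40 */

          mark_cond_jmp_regs(&mask, 6, true, 9);   /* insn 9: if r6 <= r9 goto ... */
          printf("regs=%x\n", mask);               /* prints regs=240 (r6 and r9) */
          return 0;
  }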
2023-03-30bpf: Adjust insufficient default bpf_jit_limitDaniel Borkmann
[ Upstream commit 10ec8ca8ec1a2f04c4ed90897225231c58c124a7 ] We've seen recent AWS EKS (Kubernetes) user reports like the following: After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters after a few days a number of the nodes have containers stuck in ContainerCreating state or liveness/readiness probes reporting the following error: Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7[...]": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads. We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528) which helped to immediately allow containers to be created and probes to execute but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing. I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem their sizes passed in as well as bpf_jit_current under tcpdump BPF filter, seccomp BPF and native (e)BPF programs, and the behavior all looks sane and expected, that is nothing "leaking" from an upstream perspective. The bpf_jit_limit knob was originally added in order to avoid a situation where unprivileged applications loading BPF programs (e.g. seccomp BPF policies) consuming all the module memory space via BPF JIT such that loading of kernel modules would be prevented. The default limit was defined back in 2018 and while good enough back then, we are generally seeing far more BPF consumers today. Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the module memory space to better reflect today's needs and avoid more users running into potentially hard to debug issues. Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") Reported-by: Stephen Haynes <sh@synk.net> Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://github.com/awslabs/amazon-eks-ami/issues/1179 Link: https://github.com/awslabs/amazon-eks-ami/issues/1219 Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
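For reference, the default limit described above is a fraction of the executable (module) memory space, rounded up to a page. A rough stand-alone model of the change (not the exact kernel code; the 1 GiB pool size below is just an example value):

  #include <stdint.h>
  #include <stdio.h>

  /* Rough model of the default bpf_jit_limit computation: a fraction of the
   * executable memory space, rounded up to a page. */
  static uint64_t jit_limit(uint64_t exec_space, uint64_t page_size, int shift)
  {
          uint64_t x = (exec_space >> shift) + page_size - 1;

          return x - (x % page_size);              /* round up to page_size */
  }

  int main(void)
  {
          uint64_t pool = 1ULL << 30;              /* example pool size: 1 GiB */

          printf("old default (1/4): %llu\n",
                 (unsigned long long)jit_limit(pool, 4096, 2));
          printf("new default (1/2): %llu\n",
                 (unsigned long long)jit_limit(pool, 4096, 1));
          return 0;
  }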
2023-03-17btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTRLorenz Bauer
[ Upstream commit 9b459804ff9973e173fabafba2a1319f771e85fa ] btf_datasec_resolve contains a bug that causes the following BTF to fail loading: [1] DATASEC a size=2 vlen=2 type_id=4 offset=0 size=1 type_id=7 offset=1 size=1 [2] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none) [3] PTR (anon) type_id=2 [4] VAR a type_id=3 linkage=0 [5] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none) [6] TYPEDEF td type_id=5 [7] VAR b type_id=6 linkage=0 This error message is printed during btf_check_all_types: [1] DATASEC a size=2 vlen=2 type_id=7 offset=1 size=1 Invalid type By tracing btf_*_resolve we can pinpoint the problem: btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_TBD) = 0 btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_TBD) = 0 btf_ptr_resolve(depth: 3, type_id: 3, mode: RESOLVE_PTR) = 0 btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_PTR) = 0 btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_PTR) = -22 The last invocation of btf_datasec_resolve should invoke btf_var_resolve by means of env_stack_push, instead it returns EINVAL. The reason is that env_stack_push is never executed for the second VAR. if (!env_type_is_resolve_sink(env, var_type) && !env_type_is_resolved(env, var_type_id)) { env_stack_set_next_member(env, i + 1); return env_stack_push(env, var_type, var_type_id); } env_type_is_resolve_sink() changes its behaviour based on resolve_mode. For RESOLVE_PTR, we can simplify the if condition to the following: (btf_type_is_modifier() || btf_type_is_ptr) && !env_type_is_resolved() Since we're dealing with a VAR the clause evaluates to false. This is not sufficient to trigger the bug however. The log output and EINVAL are only generated if btf_type_id_size() fails. if (!btf_type_id_size(btf, &type_id, &type_size)) { btf_verifier_log_vsi(env, v->t, vsi, "Invalid type"); return -EINVAL; } Most types are sized, so for example a VAR referring to an INT is not a problem. The bug is only triggered if a VAR points at a modifier. Since we skipped btf_var_resolve that modifier was also never resolved, which means that btf_resolved_type_id returns 0 aka VOID for the modifier. This in turn causes btf_type_id_size to return NULL, triggering EINVAL. To summarise, the following conditions are necessary: - VAR pointing at PTR, STRUCT, UNION or ARRAY - Followed by a VAR pointing at TYPEDEF, VOLATILE, CONST, RESTRICT or TYPE_TAG The fix is to reset resolve_mode to RESOLVE_TBD before attempting to resolve a VAR from a DATASEC. Fixes: 1dc92851849c ("bpf: kernel side support for BTF Var and DataSec") Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230306112138.155352-2-lmb@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10bpf: Fix global subprog context argument resolution logicAndrii Nakryiko
[ Upstream commit d384dce281ed1b504fae2e279507827638d56fa3 ] KPROBE program's user-facing context type is defined as typedef bpf_user_pt_regs_t. This leads to a problem when trying to pass a kprobe/uprobe/usdt context argument into a global subprog, as the kernel always strips away mods and typedefs of the user-supplied type, but takes the expected type from bpf_ctx_convert as is, which causes a mismatch. The current way to work around this is to define a fake struct with the same name as the expected typedef: struct bpf_user_pt_regs_t {}; __noinline my_global_subprog(struct bpf_user_pt_regs_t *ctx) { ... } This patch fixes the issue by resolving the expected type if it's not a struct. It still leaves the above work-around working for backwards compatibility. Fixes: 91cc1a99740e ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230216045954.3002473-2-andrii@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
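A sketch of what the fix enables on the BPF program side (the section name, attach symbol and program names are arbitrary examples, and the snippet assumes the usual libbpf vmlinux.h/bpf_helpers.h setup):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* With the fix, the typedef'ed context type can be passed to a global
   * subprog directly, without the fake-struct work-around shown above. */
  __noinline int handle_regs(bpf_user_pt_regs_t *regs)
  {
          return regs != NULL;
  }

  SEC("kprobe/do_nanosleep")
  int probe(bpf_user_pt_regs_t *ctx)
  {
          handle_regs(ctx);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";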
2023-03-10bpf: Zeroing allocated object from slab in bpf memory allocatorHou Tao
[ Upstream commit 997849c4b969034e225153f41026657def66d286 ] Currently the freed element in the bpf memory allocator may be immediately reused; for htab maps the reuse will reinitialize special fields in the map value (e.g., bpf_spin_lock), but the lookup procedure may still access these special fields, and this may lead to a hard-lockup as shown below: NMI backtrace for cpu 16 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0 ...... Call Trace: <TASK> copy_map_value_locked+0xb7/0x170 bpf_map_copy_value+0x113/0x3c0 __sys_bpf+0x1c67/0x2780 __x64_sys_bpf+0x1c/0x20 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ...... </TASK> For an htab map, just like the preallocated case, there is no need to initialize these special fields in the map value again once they have been initialized. For a preallocated htab map, these fields are initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the same thing for non-preallocated htab maps in the bpf memory allocator. And there is no need to use __GFP_ZERO for the per-cpu bpf memory allocator, because __alloc_percpu_gfp() does it implicitly. Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
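For context, the "special fields" in question are things like a bpf_spin_lock embedded in the map value. A hypothetical affected map, written in the usual libbpf map-definition syntax (assumes vmlinux.h/bpf_helpers.h):

  /* Non-preallocated hash map whose value embeds a bpf_spin_lock; its
   * elements come from the bpf memory allocator. */
  struct map_value {
          struct bpf_spin_lock lock;    /* must not carry stale bits on reuse */
          long payload;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __uint(max_entries, 128);
          __type(key, __u32);
          __type(value, struct map_value);
  } lock_map SEC(".maps");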
2023-02-25bpf: add missing header file includeLinus Torvalds
commit f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc upstream. Commit 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()") built fine on x86-64 and arm64, and that's the extent of my local build testing. It turns out those got the <linux/nospec.h> include incidentally through other header files (<linux/kvm_host.h> in particular), but that was not true of other architectures, resulting in build errors kernel/bpf/core.c: In function ‘___bpf_prog_run’: kernel/bpf/core.c:1913:3: error: implicit declaration of function ‘barrier_nospec’ so just make sure to explicitly include the proper <linux/nospec.h> header file to make everybody see it. Fixes: 74e19ef0ff80 ("uaccess: Add speculation barrier to copy_from_user()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Reported-by: Huacai Chen <chenhuacai@loongson.cn> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25uaccess: Add speculation barrier to copy_from_user()Dave Hansen
commit 74e19ef0ff8061ef55957c3abd71614ef0f42f47 upstream. The results of "access_ok()" can be mis-speculated. The result is that you can end speculatively: if (access_ok(from, size)) // Right here even for bad from/size combinations. On first glance, it would be ideal to just add a speculation barrier to "access_ok()" so that its results can never be mis-speculated. But there are lots of system calls just doing access_ok() via "copy_to_user()" and friends (example: fstat() and friends). Those are generally not problematic because they do not _consume_ data from userspace other than the pointer. They are also very quick and common system calls that should not be needlessly slowed down. "copy_from_user()" on the other hand uses a user-controller pointer and is frequently followed up with code that might affect caches. Take something like this: if (!copy_from_user(&kernelvar, uptr, size)) do_something_with(kernelvar); If userspace passes in an evil 'uptr' that *actually* points to a kernel addresses, and then do_something_with() has cache (or other) side-effects, it could allow userspace to infer kernel data values. Add a barrier to the common copy_from_user() code to prevent mis-speculated values which happen after the copy. Also add a stub for architectures that do not define barrier_nospec(). This makes the macro usable in generic code. Since the barrier is now usable in generic code, the x86 #ifdef in the BPF code can also go away. Reported-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Daniel Borkmann <daniel@iogearbox.net> # BPF bits Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09bpf: Skip invalid kfunc call in backtrack_insnHao Sun
commit d3178e8a434b58678d99257c0387810a24042fb6 upstream. The verifier skips invalid kfunc call in check_kfunc_call(), which would be captured in fixup_kfunc_call() if such insn is not eliminated by dead code elimination. However, this can lead to the following warning in backtrack_insn(), also see [1]: ------------[ cut here ]------------ verifier backtracking bug WARNING: CPU: 6 PID: 8646 at kernel/bpf/verifier.c:2756 backtrack_insn kernel/bpf/verifier.c:2756 __mark_chain_precision kernel/bpf/verifier.c:3065 mark_chain_precision kernel/bpf/verifier.c:3165 adjust_reg_min_max_vals kernel/bpf/verifier.c:10715 check_alu_op kernel/bpf/verifier.c:10928 do_check kernel/bpf/verifier.c:13821 [inline] do_check_common kernel/bpf/verifier.c:16289 [...] So make backtracking conservative with this by returning ENOTSUPP. [1] https://lore.kernel.org/bpf/CACkBjsaXNceR8ZjkLG=dT3P=4A8SBsg0Z5h5PWLryF5=ghKq=g@mail.gmail.com/ Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230104014709.9375-1-sunhao.th@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-09bpf: Fix the kernel crash caused by bpf_setsockopt().Kui-Feng Lee
[ Upstream commit 5416c9aea8323583e8696f0500b6142dfae80821 ] The kernel crash was caused by a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook, which performed a call to `bpf_setsockopt()` in order to set the TCP_NODELAY flag as an example. Flags like TCP_NODELAY can prompt the kernel to flush a socket's outgoing queue, and this hook "lsm_cgroup/socket_sock_rcv_skb" is frequently triggered by softirqs. The issue was that in certain circumstances, when `tcp_write_xmit()` was called to flush the queue, it would also allow BH (bottom-half) to run. This could lead to our program attempting to flush the same socket recursively, which caused a `skbuff` to be unlinked twice. `security_sock_rcv_skb()` is triggered by `tcp_filter()`. This occurs before the sock ownership is checked in `tcp_v4_rcv()`. Consequently, if a bpf program runs on `security_sock_rcv_skb()` while under softirq conditions, it may not possess the lock needed for `bpf_setsockopt()`, thus presenting an issue. The patch fixes this issue by ensuring that a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook is not allowed to call `bpf_setsockopt()`. The differences from v1 are - changing commit log to explain holding the lock of the sock, - emphasizing that TCP_NODELAY is not the only flag, and - adding the fixes tag. v1: https://lore.kernel.org/bpf/20230125000244.1109228-1-kuifeng@meta.com/ Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Fixes: 9113d7e48e91 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Link: https://lore.kernel.org/r/20230127001732.4162630-1-kuifeng@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09bpf: Add missing btf_put to register_btf_id_dtor_kfuncsJiri Olsa
[ Upstream commit 74bc3a5acc82f020d2e126f56c535d02d1e74e37 ] We take the BTF reference before we register dtors and we need to put it back when it's done. We probably won't see a problem with kernel BTF, but module BTF would stay loaded (because of the extra ref) even when its module is removed. Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: 5ce937d613a4 ("bpf: Populate pairs of btf_id and destructor kfunc in btf") Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230120122148.1522359-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09bpf: Fix to preserve reg parent/live fields when copying range infoEduard Zingerman
[ Upstream commit 71f656a50176915d6813751188b5758daa8d012b ] Register range information is copied in several places. The intent is to transfer range/id information from one register/stack spill to another. Currently this is done using direct register assignment, e.g.: static void find_equal_scalars(..., struct bpf_reg_state *known_reg) { ... struct bpf_reg_state *reg; ... *reg = *known_reg; ... } However, such assignments also copy the following bpf_reg_state fields: struct bpf_reg_state { ... struct bpf_reg_state *parent; ... enum bpf_reg_liveness live; ... }; Copying of these fields is accidental and incorrect, as could be demonstrated by the following example: 0: call ktime_get_ns() 1: r6 = r0 2: call ktime_get_ns() 3: r7 = r0 4: if r0 > r6 goto +1 ; r0 & r6 are unbound thus generated ; branch states are identical 5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8] --- checkpoint --- 6: r1 = 42 ; r1 marked as written 7: *(u8 *)(r10 - 8) = r1 ; 8-bit write, fp[-8] parent & live ; overwritten 8: r2 = *(u64 *)(r10 - 8) 9: r0 = 0 10: exit This example is unsafe because 64-bit write to fp[-8] at (5) is conditional, thus not all bytes of fp[-8] are guaranteed to be set when it is read at (8). However, currently the example passes verification. First, the execution path 1-10 is examined by verifier. Suppose that a new checkpoint is created by is_state_visited() at (6). After checkpoint creation: - r1.parent points to checkpoint.r1, - fp[-8].parent points to checkpoint.fp[-8]. At (6) the r1.live is set to REG_LIVE_WRITTEN. At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to REG_LIVE_WRITTEN, because of the following code called in check_stack_write_fixed_off(): static void save_register_state(struct bpf_func_state *state, int spi, struct bpf_reg_state *reg, int size) { ... state->stack[spi].spilled_ptr = *reg; // <--- parent & live copied if (size == BPF_REG_SIZE) state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; ... } Note the intent to mark stack spill as written only if 8 bytes are spilled to a slot, however this intent is spoiled by a 'live' field copy. At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but this does not happen: - fp[-8] in a current state is already marked as REG_LIVE_WRITTEN; - fp[-8].parent points to checkpoint.r1, parentage chain is used by mark_reg_read() to mark checkpoint states. At (10) the verification is finished for path 1-10 and jump 4-6 is examined. The checkpoint.fp[-8] never gets REG_LIVE_READ mark and this spill is pruned from the cached states by clean_live_states(). Hence verifier state obtained via path 1-4,6 is deemed identical to one obtained via path 1-6 and program marked as safe. Note: the example should be executed with BPF_F_TEST_STATE_FREQ flag set to force creation of intermediate verifier states. This commit revisits the locations where bpf_reg_state instances are copied and replaces the direct copies with a call to a function copy_register_state(dst, src) that preserves 'parent' and 'live' fields of the 'dst'. Fixes: 679c782de14b ("bpf/verifier: per-register parent pointers") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230106142214.1040390-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
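The helper described above boils down to saving and restoring the two fields around the struct assignment. A minimal stand-alone sketch with stub types (not the kernel structs):

  /* Stub types, reduced to what the sketch needs. */
  enum bpf_reg_liveness { REG_LIVE_NONE, REG_LIVE_READ, REG_LIVE_WRITTEN };

  struct bpf_reg_state {
          unsigned long long umin_value, umax_value;   /* ...range/id info, etc. */
          struct bpf_reg_state *parent;                /* parentage chain */
          enum bpf_reg_liveness live;                  /* liveness marks */
  };

  static void copy_register_state(struct bpf_reg_state *dst,
                                  const struct bpf_reg_state *src)
  {
          struct bpf_reg_state *parent = dst->parent;
          enum bpf_reg_liveness live = dst->live;

          *dst = *src;             /* transfer range/id information as before... */
          dst->parent = parent;    /* ...but keep dst's own parentage */
          dst->live = live;        /* ...and dst's own liveness marks */
  }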
2023-02-09bpf: Fix off-by-one error in bpf_mem_cache_idx()Hou Tao
[ Upstream commit 36024d023d139a0c8b552dc3b7f4dc7b4c139e8f ] According to the definition of sizes[NUM_CACHES], when the size passed to bpf_mem_cache_idx() is 256, it should return 6 instead of 7. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
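Assuming the cache index for sizes above 192 is derived from fls(size - 1) (an assumption about the allocator's implementation, used here only to illustrate the off-by-one), the 256-byte case looks like this:

  #include <stdio.h>

  /* fls(): 1-based index of the most significant set bit (0 for 0). */
  static int fls(unsigned int x)
  {
          return x ? 32 - __builtin_clz(x) : 0;
  }

  int main(void)
  {
          unsigned int size = 256;

          printf("buggy index: %d\n", fls(size - 1) - 1);  /* 7: one class too high */
          printf("fixed index: %d\n", fls(size - 1) - 2);  /* 6: matches sizes[] */
          return 0;
  }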
2023-02-01bpf: Fix pointer-leak due to insufficient speculative store bypass mitigationLuis Gerhorst
[ Upstream commit e4f4db47794c9f474b184ee1418f42e6a07412b6 ] To mitigate Spectre v4, 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") inserts lfence instructions after 1) initializing a stack slot and 2) spilling a pointer to the stack. However, this does not cover cases where a stack slot is first initialized with a pointer (subject to sanitization) but then overwritten with a scalar (not subject to sanitization because the slot was already initialized). In this case, the second write may be subject to speculative store bypass (SSB) creating a speculative pointer-as-scalar type confusion. This allows the program to subsequently leak the numerical pointer value using, for example, a branch-based cache side channel. To fix this, also sanitize scalars if they write a stack slot that previously contained a pointer. Assuming that pointer-spills are only generated by LLVM on register-pressure, the performance impact on most real-world BPF programs should be small. The following unprivileged BPF bytecode drafts a minimal exploit and the mitigation: [...] // r6 = 0 or 1 (skalar, unknown user input) // r7 = accessible ptr for side channel // r10 = frame pointer (fp), to be leaked // r9 = r10 # fp alias to encourage ssb *(u64 *)(r9 - 8) = r10 // fp[-8] = ptr, to be leaked // lfence added here because of pointer spill to stack. // // Ommitted: Dummy bpf_ringbuf_output() here to train alias predictor // for no r9-r10 dependency. // *(u64 *)(r10 - 8) = r6 // fp[-8] = scalar, overwrites ptr // 2039f26f3aca: no lfence added because stack slot was not STACK_INVALID, // store may be subject to SSB // // fix: also add an lfence when the slot contained a ptr // r8 = *(u64 *)(r9 - 8) // r8 = architecturally a scalar, speculatively a ptr // // leak ptr using branch-based cache side channel: r8 &= 1 // choose bit to leak if r8 == 0 goto SLOW // no mispredict // architecturally dead code if input r6 is 0, // only executes speculatively iff ptr bit is 1 r8 = *(u64 *)(r7 + 0) # encode bit in cache (0: slow, 1: fast) SLOW: [...] After running this, the program can time the access to *(r7 + 0) to determine whether the chosen pointer bit was 0 or 1. Repeat this 64 times to recover the whole address on amd64. In summary, sanitization can only be skipped if one scalar is overwritten with another scalar. Scalar-confusion due to speculative store bypass can not lead to invalid accesses because the pointer bounds deducted during verification are enforced using branchless logic. See 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") for details. Do not make the mitigation depend on !env->allow_{uninit_stack,ptr_leaks} because speculative leaks are likely unexpected if these were enabled. For example, leaking the address to a protected log file may be acceptable while disabling the mitigation might unintentionally leak the address into the cached-state of a map that is accessible to unprivileged processes. Fixes: 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Henriette Hofmeier <henriette.hofmeier@rub.de> Link: https://lore.kernel.org/bpf/edc95bad-aada-9cfc-ffe2-fa9bb206583c@cs.fau.de Link: https://lore.kernel.org/bpf/20230109150544.41465-1-gerhorst@cs.fau.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01bpf: hash map, avoid deadlock with suitable hash maskTonghao Zhang
[ Upstream commit 9f907439dc80e4a2fcfb949927b36c036468dbb3 ] The deadlock may still occur when the map is accessed in both NMI and non-NMI context, because in NMI we may still access the same bucket but with a different map_locked index. For example, on the same CPU with .max_entries = 2, we update the hash map with key = 4 while a bpf prog running in NMI nmi_handle() updates the hash map with key = 20; the two updates hit the same bucket index but different map_locked indexes. To fix this issue, hash again with the min mask to derive the map_locked index. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Tonghao Zhang <tong@infragraf.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230111092903.92389-1-tong@infragraf.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
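A small stand-alone model of the mismatch (the per-cpu map_locked table size of 8 is an assumption, and the two hash values are made up purely to show the effect):

  #include <stdio.h>
  #include <stdint.h>

  #define LOCK_COUNT 8u                     /* assumed number of map_locked slots */

  int main(void)
  {
          uint32_t n_buckets = 2;           /* .max_entries = 2 as in the example */
          uint32_t h1 = 1, h2 = 3;          /* made-up hash values */
          uint32_t lock_mask = LOCK_COUNT - 1;
          uint32_t min_mask = lock_mask < n_buckets - 1 ? lock_mask : n_buckets - 1;

          /* same bucket ... */
          printf("bucket:    %u vs %u\n", h1 & (n_buckets - 1), h2 & (n_buckets - 1));
          /* ... but, before the fix, different map_locked slots, so the NMI
           * re-entry into the same bucket is not detected */
          printf("old slots: %u vs %u\n", h1 & lock_mask, h2 & lock_mask);
          /* after the fix, hashing again with the min mask keeps them together */
          printf("new slots: %u vs %u\n", h1 & min_mask, h2 & min_mask);
          return 0;
  }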
2023-01-24bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and ↵Paul Moore
PERF_BPF_EVENT_PROG_UNLOAD commit ef01f4e25c1760920e2c94f1c232350277ace69b upstream. When changing the ebpf program put() routines to support being called from within IRQ context, the program ID was reset to zero prior to calling the perf event and audit UNLOAD record generators, which resulted in problems as the ebpf program ID was bogus (always zero). This patch addresses this problem by removing an unnecessary call to bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting __bpf_prog_put() to only call bpf_prog_free_id() after audit and perf have finished their bpf program unload tasks in bpf_prog_put_deferred(). For the record, no one can determine, or remember, why it was necessary to free the program ID, and remove it from the IDR, prior to executing bpf_prog_put_deferred(); regardless, both Stanislav and Alexei agree that the approach in this patch should be safe. It is worth noting that when moving the bpf_prog_free_id() call, the do_idr_lock parameter was forced to true as the ebpf devs determined this was correct, since do_idr_lock should always be true. The do_idr_lock parameter will be removed in a follow-up patch, but it was kept here to keep the patch small in an effort to ease any stable backports. I also modified the bpf_audit_prog() logic used to associate the AUDIT_BPF record with other associated records, e.g. @ctx != NULL. Instead of keying off the operation, it now keys off the execution context, e.g. '!in_irq && !irqs_disabled()', which is much more appropriate and should help better connect the UNLOAD operations with the associated audit state (other audit records). Cc: stable@vger.kernel.org Fixes: d809e134be7a ("bpf: Prepare bpf_prog_put() to be called from irq context.") Reported-by: Burn Alting <burn.alting@iinet.net.au> Reported-by: Jiri Olsa <olsajiri@gmail.com> Suggested-by: Stanislav Fomichev <sdf@google.com> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230106154400.74211-1-paul@paul-moore.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24bpf: keep a reference to the mm, in case the task is dead.Kui-Feng Lee
[ Upstream commit 7ff94f276f8ea05df82eb115225e9b26f47a3347 ] Fix the system crash that happens when a task iterator travels through the VMAs of tasks. Task iterators used to access the mm by following the pointer on the task_struct; however, the death of a task clears that pointer even though we still hold the task_struct, which can cause an unexpected NULL-pointer crash when an iterator visits a task that dies during the visit. Keeping a reference to the mm on the iterator ensures we always have a valid pointer to it. Co-developed-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reported-by: Nathan Slingerland <slinger@meta.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
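One common way to pin the mm so an exiting task cannot pull it out from under the iterator is sketched below; this illustrates the technique and is not necessarily the exact shape of the upstream change to the task VMA iterator:

        struct mm_struct *mm;

        /* get_task_mm() takes task_lock() and bumps mm_users only if the
         * task still has an mm, so a concurrent exit cannot free it while
         * we walk its VMAs. */
        mm = get_task_mm(task);
        if (!mm)
                return NULL;    /* task has already exited, skip it */

        mmap_read_lock(mm);
        /* ... visit the VMAs of @mm instead of dereferencing task->mm ... */
        mmap_read_unlock(mm);

        mmput(mm);              /* drop the reference once the visit is done */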
2023-01-12bpf: Fix panic due to wrong pageattr of im->imageChuang Wang
commit 9ed1d9aeef5842ecacb660fce933613b58af1e00 upstream. In the scenario where livepatch and kretfunc coexist, the pageattr of im->image is rox after arch_prepare_bpf_trampoline() in bpf_trampoline_update(). If modify_fentry() or register_fentry() then returns -EAGAIN from bpf_tramp_ftrace_ops_func(), the BPF_TRAMP_F_ORIG_STACK flag is configured and arch_prepare_bpf_trampoline() is re-executed. At this point, because the pageattr of im->image is still rox, arch_prepare_bpf_trampoline() reads and writes im->image, which causes a fault, as follows: insmod livepatch-sample.ko # samples/livepatch/livepatch-sample.c bpftrace -e 'kretfunc:cmdline_proc_show {}' BUG: unable to handle page fault for address: ffffffffa0206000 PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 2 PID: 270 Comm: bpftrace Tainted: G E K 6.1.0 #5 RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0 RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202 RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000 RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030 RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400 R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8 R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10 FS: 00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bpf_trampoline_update+0x25a/0x6b0 __bpf_trampoline_link_prog+0x101/0x240 bpf_trampoline_link_prog+0x2d/0x50 bpf_tracing_prog_attach+0x24c/0x530 bpf_raw_tp_link_attach+0x73/0x1d0 __sys_bpf+0x100e/0x2570 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With this patch, when modify_fentry() or register_fentry() returns -EAGAIN from bpf_tramp_ftrace_ops_func(), the pageattr of im->image is reset to nx+rw before the retry. Cc: stable@vger.kernel.org Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20221224133146.780578-1-nashuiliang@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
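A sketch of the retry path the fix targets, under the assumption that it sits in the -EAGAIN branch of bpf_trampoline_update() and uses set_memory_nx()/set_memory_rw(); illustrative only, not the verbatim upstream diff:

        if (err == -EAGAIN) {
                /* modify_fentry()/register_fentry() asked for a rebuild with
                 * BPF_TRAMP_F_ORIG_STACK.  Make im->image writable and
                 * non-executable again before retrying, otherwise the second
                 * arch_prepare_bpf_trampoline() faults on the rox page. */
                set_memory_nx((long)im->image, 1);
                set_memory_rw((long)im->image, 1);
                goto again;
        }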