<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/bpf, branch v6.18.13</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.18.13</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.18.13'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2026-01-02T11:56:37Z</updated>
<entry>
<title>bpf: Fix truncated dmabuf iterator reads</title>
<updated>2026-01-02T11:56:37Z</updated>
<author>
<name>T.J. Mercier</name>
<email>tjmercier@google.com</email>
</author>
<published>2025-12-04T00:03:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b7a2957ace3de743ff1bcf1295e662576b1a02ec'/>
<id>urn:sha1:b7a2957ace3de743ff1bcf1295e662576b1a02ec</id>
<content type='text'>
[ Upstream commit 234483565dbb2b264fdd165927c89fbf3ecf4733 ]

If there is a large number (hundreds) of dmabufs allocated, the text
output generated from dmabuf_iter_seq_show can exceed common user buffer
sizes (e.g. PAGE_SIZE) necessitating multiple start/stop cycles to
iterate through all dmabufs. However, the dmabuf iterator currently
returns NULL in dmabuf_iter_seq_start for all non-zero pos values, which
results in the truncation of the output before all dmabufs are handled.

After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer
is elevated so that the BPF iterator program can run without holding any
locks. When a stop occurs, instead of immediately dropping the reference
on the buffer, stash a pointer to the buffer in seq-&gt;priv until
either start is called or the iterator is released. This also enables
the resumption of iteration without first walking through the list of
dmabufs based on the pos value.
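
As a rough illustration of the stop/start handling described above (the
seq_file private struct and its field names below are assumptions, not
the upstream code):

static void dmabuf_iter_seq_stop(struct seq_file *seq, void *v)
{
        struct dmabuf_iter_priv *p = seq-&gt;private;  /* assumed layout */

        /* Keep the elevated refcount; stash the buffer for a later start. */
        if (v)
                p-&gt;dmabuf = v;
}

static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
{
        struct dmabuf_iter_priv *p = seq-&gt;private;
        struct dma_buf *dmabuf = p-&gt;dmabuf;

        p-&gt;dmabuf = NULL;
        /* Resume from the stashed buffer instead of rewalking the list. */
        return dmabuf ? dmabuf : dma_buf_iter_begin();
}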

Fixes: 76ea95534995 ("bpf: Add dmabuf iterator")
Signed-off-by: T.J. Mercier &lt;tjmercier@google.com&gt;
Link: https://lore.kernel.org/r/20251204000348.1413593-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>rqspinlock: Use trylock fallback when per-CPU rqnode is busy</title>
<updated>2025-12-18T13:03:28Z</updated>
<author>
<name>Kumar Kartikeya Dwivedi</name>
<email>memxor@gmail.com</email>
</author>
<published>2025-11-28T23:27:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=14980281ebff84f6b6ec9262ba6951c73031680c'/>
<id>urn:sha1:14980281ebff84f6b6ec9262ba6951c73031680c</id>
<content type='text'>
[ Upstream commit 81d5a6a438595e46be191d602e5c2d6d73992fdc ]

Defer to the trylock fallback in NMIs only when an rqspinlock waiter is
already queued on the current CPU. This is detected by noticing a
non-zero node index. This allows an NMI waiter to join the waiter queue
when it isn't interrupting an existing rqspinlock waiter, increasing its
chances of fairly obtaining the lock, performing deadlock detection as
the head, and not being starved while attempting the trylock.
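
Schematically, the decision looks roughly like this (a sketch only; the
per-CPU node array and the trylock helper are placeholder names, not the
upstream identifiers):

        /* Only defer an NMI to the trylock fallback when this CPU already
         * has a queued rqspinlock waiter (non-zero per-CPU node index). */
        if (in_nmi() &amp;&amp; this_cpu_read(rqnodes[0].mcs.count))
                return rqspinlock_trylock_fallback(lock);  /* placeholder */
        /* Otherwise, even in NMI, queue as a normal waiter. */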

The trylock path in particular is unlikely to succeed under contention,
as it relies on the lock word becoming 0, which indicates no contention.
This means that the most likely result for NMIs attempting a trylock is
a timeout under contention if they don't hit an AA or ABBA case.

The core problem addressed by the fixed commit was removing the
dependency edge between an NMI queue waiter and the queue waiter it is
interrupting. Whenever a circular dependency forms, and with no way to
break it (as non-head waiters don't poll for deadlocks or timeouts), we
would end up in a deadlock. A trylock breaks such an edge either by
probing for deadlocks, or by eventually terminating the waiting loop
using a timeout.

By excluding queueing for NMIs on CPUs where the node index is
non-zero, this sort of dependency is broken. The CPU enters the trylock
path in those cases, and falls back to deadlock checks and timeouts.
However, in the other case, where the NMI doesn't interrupt the CPU
while it is queued on the lock in the slow path, it can join the queue
as a normal waiter, and avoid trylock-associated starvation and
subsequent timeouts.

There are a few remaining cases here that matter: the NMI can still
preempt the owner in its critical section, and if it queues as a
non-head waiter, it can end up impeding the progress of the owner. While
this won't deadlock, since the head waiter will eventually signal the
NMI waiter to stop (due to a timeout), it can still lead to long
timeouts. These gaps will be addressed in subsequent commits.

Note that while the node count detection approach is less conservative
than simply deferring NMIs to trylock, it will still return errors when
an attempt to lock B in NMI happens while a waiter for lock A sits in a
lower context on the same CPU. However, this only occurs when the lower
context is queued in the slow path, and the NMI attempt can proceed
without failure in all other cases. To continue to prevent AA deadlocks
(or ABBA in a similar NMI-interrupting-lower-context pattern), we'd need
a more fleshed out algorithm that unlinks NMI waiters after they queue
and detects such cases. However, all that complexity isn't yet appealing
just to reduce the failure rate in the small window inside the slow
path.

It is important to note that reentrancy in the slow path can also happen
through trace_contention_{begin,end}, but in those cases, unlike an NMI,
the forward progress of the head waiter (or the predecessor in general)
is not being blocked.

Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma &lt;ritesh@superluminal.eu&gt;
Suggested-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Kumar Kartikeya Dwivedi &lt;memxor@gmail.com&gt;
Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>rqspinlock: Enclose lock/unlock within lock entry acquisitions</title>
<updated>2025-12-18T13:03:28Z</updated>
<author>
<name>Kumar Kartikeya Dwivedi</name>
<email>memxor@gmail.com</email>
</author>
<published>2025-11-28T23:27:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=43b6c312b8ff013258e21c2b4eca4887171ebbc3'/>
<id>urn:sha1:43b6c312b8ff013258e21c2b4eca4887171ebbc3</id>
<content type='text'>
[ Upstream commit beb7021a6003d9c6a463fffca0d6311efb8e0e66 ]

Ritesh reported in [0] that rqspinlock frequently timed out even on
reentrant acquisition of the same lock on the same CPU, a case that
should be caught by AA deadlock detection. This patch closes one of the
races leading to this behavior, and reduces the frequency of timeouts.

We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path. The TAS fallback is
invoked directly without being preceded by the typical fast path,
therefore we must continue to grab the deadlock detection entry in that
case.

Case on lock leading to missed AA:

cmpxchg lock A
&lt;NMI&gt;
... rqspinlock acquisition of A
... timeout
&lt;/NMI&gt;
grab_held_lock_entry(A)

There is a similar case when unlocking the lock. If the NMI lands
between the WRITE_ONCE and smp_store_release, it is possible that we end
up in a situation where the NMI fails to diagnose the AA condition,
leading to a timeout.

Case on unlock leading to missed AA:

WRITE_ONCE(rqh-&gt;locks[rqh-&gt;cnt - 1], NULL)
&lt;NMI&gt;
... rqspinlock acquisition of A
... timeout
&lt;/NMI&gt;
smp_store_release(A-&gt;locked, 0)

The patch changes the order on unlock to smp_store_release() followed
by WRITE_ONCE() of NULL. This avoids the missed AA detection described
above, but may lead to a false positive if the NMI lands between these
two statements, which is acceptable (and preferred over a timeout).
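
In code terms, the new unlock order described above is roughly (sketched
from the cases shown in this message, with the surrounding context
omitted):

        /* Release the lock first, then clear the held-locks table entry.
         * An NMI landing in between now still sees the entry and may report
         * a false-positive AA, which is preferred over a 250 ms timeout. */
        smp_store_release(&amp;lock-&gt;locked, 0);
        WRITE_ONCE(rqh-&gt;locks[rqh-&gt;cnt - 1], NULL);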

The original intention of the reverse order on unlock was to prevent the
following possible misdiagnosis of an ABBA scenario:

grab entry A
lock A
grab entry B
lock B
unlock B
   smp_store_release(B-&gt;locked, 0)
							grab entry B
							lock B
							grab entry A
							lock A
							! &lt;detect ABBA&gt;
   WRITE_ONCE(rqh-&gt;locks[rqh-&gt;cnt - 1], NULL)

If the store release were after the WRITE_ONCE, the other CPU would
not observe B in the table of the CPU unlocking the lock B.  However,
since the threads are obviously participating in an ABBA deadlock, it
is no longer appealing to use the order above since it may lead to a
250 ms timeout due to missed AA detection.

  [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com

Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma &lt;ritesh@superluminal.eu&gt;
Signed-off-by: Kumar Kartikeya Dwivedi &lt;memxor@gmail.com&gt;
Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Fix exclusive map memory leak</title>
<updated>2025-12-18T13:03:22Z</updated>
<author>
<name>Edward Adam Davis</name>
<email>eadavis@qq.com</email>
</author>
<published>2025-11-16T14:58:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f0022551745d72fc0e7bc8601234d690dee2178d'/>
<id>urn:sha1:f0022551745d72fc0e7bc8601234d690dee2178d</id>
<content type='text'>
[ Upstream commit 688b745401ab16e2e1a3b504863f0a45fd345638 ]

When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map also
needs to be freed. Otherwise, the map memory will not be reclaimed, just
like the memory leak problem reported by syzbot [1].

syzbot reported:
BUG: memory leak
  backtrace (crc 7b9fb9b4):
    map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512
    __sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131
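
A minimal sketch of the intended handling in map_create() (the attribute
field names follow this message; the error label is an assumption, not
the exact diff):

        if (!attr-&gt;excl_prog_hash &amp;&amp; attr-&gt;excl_prog_hash_size) {
                err = -EINVAL;
                /* Free the already-created map instead of leaking it. */
                goto free_map;
        }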

Fixes: baefdbdf6812 ("bpf: Implement exclusive map creation")
Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320
Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis &lt;eadavis@qq.com&gt;
Acked-by: Yonghong Song &lt;yonghong.song@linux.dev&gt;
Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: properly verify tail call behavior</title>
<updated>2025-12-18T13:03:11Z</updated>
<author>
<name>Martin Teichmann</name>
<email>martin.teichmann@xfel.eu</email>
</author>
<published>2025-11-19T16:03:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a6c22c91b6284e4713c0cf5b20ab3c9c71b77ae8'/>
<id>urn:sha1:a6c22c91b6284e4713c0cf5b20ab3c9c71b77ae8</id>
<content type='text'>
[ Upstream commit e3245f8990431950d20631c72236d4e8cb2dcde8 ]

A successful ebpf tail call does not return to the caller, but to the
caller-of-the-caller, often just finishing the ebpf program altogether.

Any restrictions that the verifier needs to take into account - notably
the fact that the tail call might have modified packet pointers - are to
be checked on the caller-of-the-caller. Checking it on the caller made
the verifier refuse perfectly fine programs that would use the packet
pointers after a tail call, which is no problem as this code is only
executed if the tail call was unsuccessful, i.e. nothing happened.
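
For illustration, a previously rejected but valid pattern might look
like the following (a hedged example, not taken from the patch; the map,
section and function names are arbitrary):

#include &lt;linux/bpf.h&gt;
#include &lt;bpf/bpf_helpers.h&gt;

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
} jmp_table SEC(".maps");

SEC("tc")
int use_pkt_after_tail_call(struct __sk_buff *skb)
{
        void *data = (void *)(long)skb-&gt;data;
        void *data_end = (void *)(long)skb-&gt;data_end;

        /* Validate packet pointers before the tail call. */
        if (data + 14 &gt; data_end)
                return 0;

        bpf_tail_call(skb, &amp;jmp_table, 0);

        /* Only reached when the tail call failed, so the packet pointers
         * validated above are still fine to use here. */
        return *(__u8 *)data;
}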

This patch simulates the behavior of a tail call in the verifier. A
conditional jump to the code after the tail call is added for the case
of an unsuccessful tail call, and a return to the caller is simulated for
a successful tail call.

For the successful case we assume that the tail call returns an int,
as tail calls are currently only allowed in functions that return an
int. We always assume that the tail call modified the packet pointers,
as we do not know what the tail call did.

For the unsuccessful case we know nothing happened, so we do not need to
add new constraints.

This approach also allows checking other problems that may occur with
tail calls, namely we are now able to check that precision is properly
propagated into subprograms using tail calls, as well as checking the
live slots in such a subprogram.

Fixes: 1a4607ffba35 ("bpf: consider that tail calls invalidate packet pointers")
Link: https://lore.kernel.org/bpf/20251029105828.1488347-1-martin.teichmann@xfel.eu/
Signed-off-by: Martin Teichmann &lt;martin.teichmann@xfel.eu&gt;
Acked-by: Eduard Zingerman &lt;eddyz87@gmail.com&gt;
Link: https://lore.kernel.org/r/20251119160355.1160932-2-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Fix invalid prog-&gt;stats access when update_effective_progs fails</title>
<updated>2025-12-18T13:03:04Z</updated>
<author>
<name>Pu Lehui</name>
<email>pulehui@huawei.com</email>
</author>
<published>2025-11-15T10:23:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2579c356ccd35d06238b176e4b460978186d804b'/>
<id>urn:sha1:2579c356ccd35d06238b176e4b460978186d804b</id>
<content type='text'>
[ Upstream commit 7dc211c1159d991db609bdf4b0fb9033c04adcbc ]

Syzkaller triggers an invalid memory access issue following fault
injection in update_effective_progs. The issue can be described as
follows:

__cgroup_bpf_detach
  update_effective_progs
    compute_effective_progs
      bpf_prog_array_alloc &lt;-- fault inject
  purge_effective_progs
    /* change to dummy_bpf_prog */
    array-&gt;items[index] = &amp;dummy_bpf_prog.prog

---softirq start---
__do_softirq
  ...
    __cgroup_bpf_run_filter_skb
      __bpf_prog_run_save_cb
        bpf_prog_run
          stats = this_cpu_ptr(prog-&gt;stats)
          /* invalid memory access */
          flags = u64_stats_update_begin_irqsave(&amp;stats-&gt;syncp)
---softirq end---

  static_branch_dec(&amp;cgroup_bpf_enabled_key[atype])

The reason is that fault injection caused update_effective_progs to fail
and then changed the original prog into dummy_bpf_prog.prog in
purge_effective_progs. Then a softirq came, and accessing the members of
dummy_bpf_prog.prog in the softirq triggers invalid mem access.

To fix it, skip updating stats when stats is NULL.
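
The idea, sketched against the stats-update path in bpf_prog_run() (an
abbreviated sketch, not the exact upstream hunk):

        /* dummy_bpf_prog has no per-CPU stats allocated, so skip the
         * update when prog-&gt;stats is NULL. */
        if (static_branch_unlikely(&amp;bpf_stats_enabled_key) &amp;&amp; prog-&gt;stats) {
                struct bpf_prog_stats *stats = this_cpu_ptr(prog-&gt;stats);
                unsigned long flags;

                flags = u64_stats_update_begin_irqsave(&amp;stats-&gt;syncp);
                u64_stats_inc(&amp;stats-&gt;cnt);
                u64_stats_update_end_irqrestore(&amp;stats-&gt;syncp, flags);
        }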

Fixes: 492ecee892c2 ("bpf: enable program stats")
Signed-off-by: Pu Lehui &lt;pulehui@huawei.com&gt;
Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Handle return value of ftrace_set_filter_ip in register_fentry</title>
<updated>2025-12-18T13:03:01Z</updated>
<author>
<name>Menglong Dong</name>
<email>menglong8.dong@gmail.com</email>
</author>
<published>2025-11-10T12:07:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7547576c6209f510550ddcd342b6c52c85e1a612'/>
<id>urn:sha1:7547576c6209f510550ddcd342b6c52c85e1a612</id>
<content type='text'>
[ Upstream commit fea3f5e83c5cd80a76d97343023a2f2e6bd862bf ]

The error returned by ftrace_set_filter_ip() in register_fentry() is
not handled properly. Just fix it.
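
A sketch of the kind of handling implied, with the surrounding
register_fentry() context and flag values assumed rather than quoted:

        ret = ftrace_set_filter_ip(tr-&gt;fops, (unsigned long)ip, 0, 1);
        if (ret)
                return ret;   /* no longer silently ignored */
        ret = register_ftrace_direct(tr-&gt;fops, (unsigned long)new_addr);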

Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Signed-off-by: Menglong Dong &lt;dongml2@chinatelecom.cn&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Prevent nesting overflow in bpf_try_get_buffers</title>
<updated>2025-12-18T13:03:01Z</updated>
<author>
<name>Sahil Chandna</name>
<email>chandna.sahil@gmail.com</email>
</author>
<published>2025-11-14T06:49:22Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4a9b0581757a97ed70805bd01b3c254890020be6'/>
<id>urn:sha1:4a9b0581757a97ed70805bd01b3c254890020be6</id>
<content type='text'>
[ Upstream commit c1da3df7191f1b4df9256bcd30d78f78201e1d17 ]

bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a
per-CPU nesting counter. This mechanism expects that buffers are not
endlessly acquired before being returned. migrate_disable() ensures that a
task remains on the same CPU, but it does not prevent the task from being
preempted by another task on that CPU.

Without preemption disabled, a task may be preempted while holding a
buffer, allowing another task to run on the same CPU and acquire an
additional buffer. Several such preemptions can cause the per-CPU
nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in
bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around
buffer acquisition and release prevents this task preemption and
preserves the intended bounded nesting behavior.
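
Schematically (a sketch; the counter and limit identifiers follow
existing names in kernel/bpf/helpers.c, but the exact placement is
assumed):

        int nest_level;

        /* Keep preemption disabled while a per-CPU buffer is held so the
         * nesting depth stays bounded by MAX_BPRINTF_NEST_LEVEL. */
        preempt_disable();
        nest_level = this_cpu_inc_return(bpf_bprintf_nest_level);
        if (nest_level &gt; MAX_BPRINTF_NEST_LEVEL) {
                this_cpu_dec(bpf_bprintf_nest_level);
                preempt_enable();
                return -EBUSY;
        }
        /* ... hand out buffer nest_level - 1 ... */

        /* On release: */
        this_cpu_dec(bpf_bprintf_nest_level);
        preempt_enable();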

Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/
Fixes: 4223bf833c849 ("bpf: Remove preempt_disable in bpf_try_get_buffers")
Suggested-by: Yonghong Song &lt;yonghong.song@linux.dev&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Sahil Chandna &lt;chandna.sahil@gmail.com&gt;
Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Free special fields when update [lru_,]percpu_hash maps</title>
<updated>2025-12-18T13:02:59Z</updated>
<author>
<name>Leon Hwang</name>
<email>leon.hwang@linux.dev</email>
</author>
<published>2025-11-05T15:14:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4a03d69cece145e4fb527464be29c3806aa3221e'/>
<id>urn:sha1:4a03d69cece145e4fb527464be29c3806aa3221e</id>
<content type='text'>
[ Upstream commit 6af6e49a76c9af7d42eb923703e7648cb2bf401a ]

As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing
calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the
memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
map gets freed.

Fix this by calling 'bpf_obj_free_fields()' after
'copy_map_value[,_long]()' in 'pcpu_copy_value()'.
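
As described, the change amounts to roughly the following inside
pcpu_copy_value(), with the per-CPU pointer handling around it
abbreviated:

        copy_map_value(&amp;htab-&gt;map, this_cpu_ptr(pptr), value);
        /* Release kptr/special fields so the referenced objects are not
         * held until the map itself is freed. */
        bpf_obj_free_fields(htab-&gt;map.record, this_cpu_ptr(pptr));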

Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap")
Signed-off-by: Leon Hwang &lt;leon.hwang@linux.dev&gt;
Acked-by: Yonghong Song &lt;yonghong.song@linux.dev&gt;
Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Fix stackmap overflow check in __bpf_get_stackid()</title>
<updated>2025-12-18T13:02:42Z</updated>
<author>
<name>Arnaud Lecomte</name>
<email>contact@arnaud-lcm.com</email>
</author>
<published>2025-10-25T19:29:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4669a8db976c8cbd5427fe9945f12c5fa5168ff3'/>
<id>urn:sha1:4669a8db976c8cbd5427fe9945f12c5fa5168ff3</id>
<content type='text'>
[ Upstream commit 23f852daa4bab4d579110e034e4d513f7d490846 ]

Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid()
when copying stack trace data. The issue occurs when the perf trace
contains more stack entries than the stack map bucket can hold, leading
to an out-of-bounds write in the bucket's data array.
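
The general shape of the fix is to bound the copy by the bucket
capacity, roughly as below (a sketch only; the identifiers are
assumptions, not the upstream diff):

        u32 trace_nr, max_depth;

        /* Never copy more entries than the stack map bucket can hold. */
        max_depth = smap-&gt;map.value_size / stack_map_data_size(&amp;smap-&gt;map);
        trace_nr = min_t(u32, trace-&gt;nr - skip, max_depth);
        memcpy(new_bucket-&gt;data, trace-&gt;ip + skip, trace_nr * sizeof(u64));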

Fixes: ee2a098851bf ("bpf: Adjust BPF stack helper functions to accommodate skip &gt; 0")
Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com
Signed-off-by: Arnaud Lecomte &lt;contact@arnaud-lcm.com&gt;
Signed-off-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Acked-by: Yonghong Song &lt;yonghong.song@linux.dev&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20251025192941.1500-1-contact@arnaud-lcm.com
Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
