<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/bpf/memalloc.c, branch v6.2-rc8</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.2-rc8</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.2-rc8'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2023-01-19T02:36:26Z</updated>
<entry>
<title>bpf: Fix off-by-one error in bpf_mem_cache_idx()</title>
<updated>2023-01-19T02:36:26Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2023-01-18T08:46:30Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=36024d023d139a0c8b552dc3b7f4dc7b4c139e8f'/>
<id>urn:sha1:36024d023d139a0c8b552dc3b7f4dc7b4c139e8f</id>
<content type='text'>
According to the definition of sizes[NUM_CACHES], when the size passed
to bpf_mem_cache_size() is 256, it should return 6 instead 7.

Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Acked-by: Yonghong Song &lt;yhs@fb.com&gt;
Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true</title>
<updated>2022-12-09T01:50:17Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2022-12-09T01:09:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=822ed78fab13d5a54f8b8c030e8c5dc0fcd2cdae'/>
<id>urn:sha1:822ed78fab13d5a54f8b8c030e8c5dc0fcd2cdae</id>
<content type='text'>
If there are pending rcu callback, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callback.

If rcu_trace_implies_rcu_gp() is true, there will be no pending
__free_rcu(), so it will be OK to skip rcu_barrier() as well.

Acked-by: Yonghong Song &lt;yhs@fb.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Link: https://lore.kernel.org/r/20221209010947.3130477-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Reuse freed element in free_by_rcu during allocation</title>
<updated>2022-12-09T01:50:17Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2022-12-09T01:09:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0893d6007db5cf397f3fc92b2a6935c3ed0c6f00'/>
<id>urn:sha1:0893d6007db5cf397f3fc92b2a6935c3ed0c6f00</id>
<content type='text'>
When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be
indirectly moved into waiting_for_gp list through free_by_rcu list.
After call_rcu_in_progress becomes false again, the remaining elements
in free_by_rcu list will be moved to waiting_for_gp list by the next
invocation of free_bulk(). However if the expiration of RCU tasks trace
grace period is relatively slow, none element in free_by_rcu list will
be moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, in alloc_bulk() just check whether or not there is
freed element in free_by_rcu list and reuse it if available.

Acked-by: Yonghong Song &lt;yhs@fb.com&gt;
Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Link: https://lore.kernel.org/r/20221209010947.3130477-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2022-10-24T20:44:11Z</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-10-24T20:44:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=96917bb3a377e1ed103fd5c420a95e5fb25ca104'/>
<id>urn:sha1:96917bb3a377e1ed103fd5c420a95e5fb25ca104</id>
<content type='text'>
include/linux/net.h
  a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag")
  e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Use __llist_del_all() whenever possbile during memory draining</title>
<updated>2022-10-22T02:17:38Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2022-10-21T11:49:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fa4447cb73b2bfe7175f1b7ffdc70580fcfcb991'/>
<id>urn:sha1:fa4447cb73b2bfe7175f1b7ffdc70580fcfcb991</id>
<content type='text'>
Except for waiting_for_gp list, there are no concurrent operations on
free_by_rcu, free_llist and free_llist_extra lists, so use
__llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
deleted by RCU callback concurrently, so still use llist_del_all().

Acked-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Wait for busy refill_work when destroying bpf memory allocator</title>
<updated>2022-10-22T02:17:38Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2022-10-21T11:49:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3d05818707bb2cf133ccdcd29f2d5585b5bc1298'/>
<id>urn:sha1:3d05818707bb2cf133ccdcd29f2d5585b5bc1298</id>
<content type='text'>
A busy irq work is an unfinished irq work and it can be either in the
pending state or in the running state. When destroying bpf memory
allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
work is invoked in a per-CPU RT-kthread. It is also possible for kernel
with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or
mips) and irq work is inovked in timer interrupt.

The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_list between
irq work and memory draining. Another one is call_rcu_in_progress will
not be reliable for the checking of pending RCU callback because
do_call_rcu() may have not been invoked by irq work yet. The other is
there will be use-after-free if irq work is freed before the callback
of irq work is invoked as shown below:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor instruction fetch in kernel mode
 #PF: error_code(0x0010) - not-present page
 PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
 Oops: 0010 [#1] PREEMPT_RT SMP
 CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 RIP: 0010:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
 ......
 Call Trace:
  &lt;TASK&gt;
  irq_work_single+0x24/0x60
  irq_work_run_list+0x24/0x30
  run_irq_workd+0x23/0x30
  smpboot_thread_fn+0x203/0x300
  kthread+0x126/0x150
  ret_from_fork+0x1f/0x30
  &lt;/TASK&gt;

Considering the ease of concurrency handling, no overhead for
irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt
kernel and the short wait time used for irq_work_sync() under PREEMPT_RT
(When running two test_maps on PREEMPT_RT kernel and 72-cpus host, the
max wait time is about 8ms and the 99th percentile is 10us), just using
irq_work_sync() to wait for busy refill_work to complete before memory
draining and memory freeing.

Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Acked-by: Stanislav Fomichev &lt;sdf@google.com&gt;
Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator</title>
<updated>2022-10-18T17:27:02Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2022-10-14T11:39:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=59be91e5e70a1aa91dfee8088b071f6d05c8a1a3'/>
<id>urn:sha1:59be91e5e70a1aa91dfee8088b071f6d05c8a1a3</id>
<content type='text'>
The memory free logic in bpf memory allocator chains a RCU Tasks Trace
grace period and a normal RCU grace period one after the other, so it
can ensure that both sleepable and non-sleepable programs have finished.

With the introduction of rcu_trace_implies_rcu_gp(),
__free_rcu_tasks_trace() can check whether or not a normal RCU grace
period has also passed after a RCU Tasks Trace grace period has passed.
If it is true, freeing these elements directly, else freeing through
call_rcu().

Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Link: https://lore.kernel.org/r/20221014113946.965131-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Check whether or not node is NULL before free it in free_bulk</title>
<updated>2022-09-20T15:06:27Z</updated>
<author>
<name>Hou Tao</name>
<email>houtao1@huawei.com</email>
</author>
<published>2022-09-19T14:48:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c31b38cb948ee7d3317139f005fa1f90de4a06b7'/>
<id>urn:sha1:c31b38cb948ee7d3317139f005fa1f90de4a06b7</id>
<content type='text'>
llnode could be NULL if there are new allocations after the checking of
c-free_cnt &gt; c-&gt;high_watermark in bpf_mem_refill() and before the
calling of __llist_del_first() in free_bulk (e.g. a PREEMPT_RT kernel
or allocation in NMI context). And it will incur oops as shown below:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: 0002 [#1] PREEMPT_RT SMP
 CPU: 39 PID: 373 Comm: irq_work/39 Tainted: G        W          6.0.0-rc6-rt9+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 RIP: 0010:bpf_mem_refill+0x66/0x130
 ......
 Call Trace:
  &lt;TASK&gt;
  irq_work_single+0x24/0x60
  irq_work_run_list+0x24/0x30
  run_irq_workd+0x18/0x20
  smpboot_thread_fn+0x13f/0x2c0
  kthread+0x121/0x140
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  &lt;/TASK&gt;

Simply fixing it by checking whether or not llnode is NULL in free_bulk().

Fixes: 8d5a8011b35d ("bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.")
Signed-off-by: Hou Tao &lt;houtao1@huawei.com&gt;
Link: https://lore.kernel.org/r/20220919144811.3570825-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Replace __ksize with ksize.</title>
<updated>2022-09-07T02:38:53Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@kernel.org</email>
</author>
<published>2022-09-07T02:38:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1e660f7ebe0ff6ac65ee0000280392d878630a67'/>
<id>urn:sha1:1e660f7ebe0ff6ac65ee0000280392d878630a67</id>
<content type='text'>
__ksize() was made private. Use ksize() instead.

Reported-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.</title>
<updated>2022-09-05T13:33:07Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@kernel.org</email>
</author>
<published>2022-09-02T21:10:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9f2c6e96c65e6fa1aebef546be0c30a5895fcb37'/>
<id>urn:sha1:9f2c6e96c65e6fa1aebef546be0c30a5895fcb37</id>
<content type='text'>
User space might be creating and destroying a lot of hash maps. Synchronous
rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets
and other map memory and may cause artificial OOM situation under stress.
Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- remove rcu_barrier from hash map, since htab doesn't use call_rcu
  directly and there are no callback to wait for.
- bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks.
  Use it to avoid barriers in fast path.
- When barriers are needed copy bpf_mem_alloc into temp structure
  and wait for rcu barrier-s in the worker to let the rest of
  hash map freeing to proceed.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
</content>
</entry>
</feed>
