path: root/tools/testing/selftests/bpf/benchs
Age    Commit message    Author
2025-11-25selftests/bpf: Call bpf_get_numa_node_id() in trigger_count()Menglong Dong
The bench test "trig-kernel-count" can be used as a baseline for comparison with fentry and other benchmarks, and the call to bpf_get_numa_node_id() should be considered part of that baseline. So, let's call it in trigger_count(). Meanwhile, rename trigger_count() to trigger_kernel_count() to make it easier to understand. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251116014242.151110-1-dongml2@chinatelecom.cn
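For reference, a minimal sketch of what the renamed baseline program could look like after this change; the "?raw_tp" section, the simple hit counter, and the surrounding boilerplate are assumptions, not the exact selftest code:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

long hits = 0;  /* hit counter; the real bench uses its own counting scheme */

SEC("?raw_tp")
int trigger_kernel_count(void *ctx)
{
    /* include the helper call so this baseline covers the same
     * bpf_get_numa_node_id() cost that fentry/kprobe benchmarks hook */
    bpf_get_numa_node_id();
    __sync_fetch_and_add(&hits, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";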
2025-10-27selftests/bpf/benchs: Add overwrite mode benchmark for BPF ring bufferXu Kuohai
Add --rb-overwrite option to benchmark BPF ring buffer in overwrite mode. Since overwrite mode is not yet supported by libbpf for consumer, also add --rb-bench-producer option to benchmark producer directly without a consumer. Benchmarks on an x86_64 and an arm64 CPU are shown below for reference.

- AMD EPYC 9654 (x86_64)

Ringbuf, multi-producer contention in overwrite mode, no consumer
=================================================================
rb-prod nr_prod  1   32.180 ± 0.033M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  2    9.617 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  3    8.810 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  4    9.272 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  8    9.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 12    3.086 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 16    2.945 ± 0.021M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 20    2.519 ± 0.021M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 24    2.545 ± 0.021M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 28    2.363 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 32    2.357 ± 0.021M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 36    2.267 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 40    2.284 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 44    2.215 ± 0.025M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 48    2.193 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 52    2.208 ± 0.024M/s (drops 0.000 ± 0.000M/s)

- HiSilicon Kunpeng 920 (arm64)

Ringbuf, multi-producer contention in overwrite mode, no consumer
=================================================================
rb-prod nr_prod  1   14.478 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  2   21.787 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  3    6.045 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  4    5.352 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod  8    4.850 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 12    3.542 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 16    3.509 ± 0.021M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 20    3.171 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 24    3.154 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 28    2.974 ± 0.015M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 32    3.167 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 36    2.903 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 40    2.866 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 44    2.914 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 48    2.806 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-prod nr_prod 52    2.840 ± 0.012M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251018035738.4039621-4-xukuohai@huaweicloud.com
2025-09-09selftests/bpf: Fix incorrect array size calculationJiayuan Chen
The loop in bench_sockmap_prog_destroy() has two issues:

1. Using 'sizeof(ctx.fds)' as the loop bound results in the number of bytes, not the number of file descriptors, causing the loop to iterate far more times than intended.

2. The condition 'ctx.fds[0] > 0' incorrectly checks only the first fd for all iterations, potentially leaving file descriptors unclosed. Change it to 'ctx.fds[i] > 0' to check each fd properly.

These fixes ensure correct cleanup of all file descriptors when the benchmark exits.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250909124721.191555-1-jiayuan.chen@linux.dev Closes: https://lore.kernel.org/bpf/aLqfWuRR9R_KTe5e@stanley.mountain/
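A minimal sketch of the corrected cleanup loop; the fds array length and the surrounding context struct are illustrative, not the real bench_sockmap.c layout:

#include <unistd.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

static struct {
    int fds[64];            /* illustrative size */
} ctx;

static void bench_sockmap_prog_destroy(void)
{
    int i;

    /* bound the loop by the number of fds, not the number of bytes,
     * and test each fd individually before closing it */
    for (i = 0; i < ARRAY_SIZE(ctx.fds); i++) {
        if (ctx.fds[i] > 0)
            close(ctx.fds[i]);
    }
}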
2025-09-04selftests/bpf: add benchmark testing for kprobe-multi-allMenglong Dong
For now, the benchmark for kprobe-multi is single-function, meaning only one kernel function is hooked during testing. Add the benchmark "kprobe-multi-all", which hooks all kernel functions during the benchmark. The "kretprobe-multi-all" benchmark is added too. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250904021011.14069-4-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-27selftests/bpf: Add LPM trie microbenchmarksMatt Fleming
Add benchmarks for the standard set of operations: LOOKUP, INSERT, UPDATE, DELETE. Also include benchmarks to measure the overhead of the bench framework itself (NOOP) as well as the overhead of generating keys (BASELINE). Lastly, this includes a benchmark for FREE (trie_free()) which is known to have terrible performance for maps with many entries. Benchmarks operate on tries without gaps in the key range, i.e. each test begins or ends with a trie with valid keys in the range [0, nr_entries). This is intended to cause maximum branching when traversing the trie. LOOKUP, UPDATE, DELETE, and FREE fill a BPF LPM trie from userspace using bpf_map_update_batch() and run the corresponding benchmark operation via bpf_loop(). INSERT starts with an empty map and fills it kernel-side from bpf_loop(). FREE records the time to free a filled LPM trie by attaching and destroying a BPF prog. NOOP measures the overhead of the test harness by running an empty function with bpf_loop(). BASELINE is similar to NOOP except that the function generates a key. Each operation runs 10,000 times using bpf_loop(). Note that this value is intentionally independent of the number of entries in the LPM trie so that the stability of the results isn't affected by the number of entries. For those benchmarks that need to reset the LPM trie once it's full (INSERT) or empty (DELETE), throughput and latency results are scaled by the fraction of a second the operation actually ran to ignore any time spent reinitialising the trie. By default, benchmarks run using sequential keys in the range [0, nr_entries). BASELINE, LOOKUP, and UPDATE can use random keys via the --random parameter but beware there is a runtime cost involved in generating random keys. Other benchmarks are prohibited from using random keys because it can skew the results, e.g. when inserting an existing key or deleting a missing one. All measurements are recorded from within the kernel to eliminate syscall overhead. Most benchmarks run an XDP program to generate stats but FREE needs to collect latencies using fentry/fexit on map_free_deferred() because it's not possible to use fentry directly on lpm_trie.c since commit c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs") and there's no way to create/destroy a map from within an XDP program. Here is example output from an AMD EPYC 9684X 96-Core machine for each of the benchmarks using a trie with 10K entries and a 32-bit prefix length, e.g. 
$ ./bench lpm-trie-$op \
    --prefix_len=32 \
    --producers=1 \
    --nr_entries=10000

noop:     throughput 74.417 ± 0.032 M ops/s ( 74.417M ops/prod), latency  13.438 ns/op
baseline: throughput 70.107 ± 0.171 M ops/s ( 70.107M ops/prod), latency  14.264 ns/op
lookup:   throughput  8.467 ± 0.047 M ops/s (  8.467M ops/prod), latency 118.109 ns/op
insert:   throughput  2.440 ± 0.015 M ops/s (  2.440M ops/prod), latency 409.290 ns/op
update:   throughput  2.806 ± 0.042 M ops/s (  2.806M ops/prod), latency 356.322 ns/op
delete:   throughput  4.625 ± 0.011 M ops/s (  4.625M ops/prod), latency 215.613 ns/op
free:     throughput  0.578 ± 0.006 K ops/s (  0.578K ops/prod), latency   1.730 ms/op

And the same benchmarks using random keys:

$ ./bench lpm-trie-$op \
    --prefix_len=32 \
    --producers=1 \
    --nr_entries=10000 \
    --random

noop:     throughput 74.259 ± 0.335 M ops/s ( 74.259M ops/prod), latency  13.466 ns/op
baseline: throughput 35.150 ± 0.144 M ops/s ( 35.150M ops/prod), latency  28.450 ns/op
lookup:   throughput  7.119 ± 0.048 M ops/s (  7.119M ops/prod), latency 140.469 ns/op
insert:   N/A
update:   throughput  2.736 ± 0.012 M ops/s (  2.736M ops/prod), latency 365.523 ns/op
delete:   N/A
free:     N/A

Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20250827140149.1001557-1-matt@readmodwrite.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-28Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextLinus Torvalds
Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko)
 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré)
 - Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri)
 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov)
 - Streamline allowed helpers across program types (Feng Yang)
 - Support atomic update for hashtab of BPF maps (Hou Tao)
 - Implement json output for BPF helpers (Ihor Solodrai)
 - Several s390 JIT fixes (Ilya Leoshkevich)
 - Various sockmap fixes (Jiayuan Chen)
 - Support mmap of vmlinux BTF data (Lorenz Bauer)
 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
 - Tests for sockmap/sockhash redirection (Michal Luczaj)
 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
 - Add support for dma-buf iterators in BPF (T.J. Mercier)
 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-04-22selftests/bpf: Close the file descriptor to avoid resource leaksMalaya Kumar Rout
Static analysis found an issue in bench_htab_mem.c and sk_assign.c.

cppcheck output before this patch:
tools/testing/selftests/bpf/benchs/bench_htab_mem.c:284:3: error: Resource leak: fd [resourceLeak]
tools/testing/selftests/bpf/prog_tests/sk_assign.c:41:3: error: Resource leak: tc [resourceLeak]

cppcheck output after this patch:
No resource leaks found

Fix the issue by closing the file descriptors fd and tc.

Signed-off-by: Malaya Kumar Rout <malayarout91@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250421174405.26080-1-malayarout91@gmail.com
2025-04-18selftests/bpf: Add 5-byte NOP uprobe trigger benchmarkJiri Olsa
Add a 5-byte NOP uprobe trigger benchmark (x86_64 specific) to measure uprobes/uretprobes on top of NOP5 instructions. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/r/20250414083647.1234007-2-jolsa@kernel.org
2025-04-15selftest/bpf/benchs: Remove duplicate sys/types.h headerJiapeng Chong
./tools/testing/selftests/bpf/benchs/bench_sockmap.c: sys/types.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=20436 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250415061459.11644-1-jiapeng.chong@linux.alibaba.com
2025-04-09selftest/bpf/benchs: Add benchmark for sockmap usageJiayuan Chen
Add TCP+sockmap-based benchmark. Since sockmap's own update and delete operations are generally less critical, the performance of the fast forwarding framework built upon it is the key aspect. Also with cgset/cgexec, we can observe the behavior of sockmap under memory pressure.

The benchmark can be run with:

'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
'''

In the future, we plan to move socket_helpers.h out of the prog_tests directory to make it accessible for the benchmark. This will enable better support for various socket types.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-5-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-04selftests/bpf: Clean up open-coded gettid syscall invocationsKumar Kartikeya Dwivedi
Availability of the gettid definition across glibc versions supported by BPF selftests is not certain. Currently, all users in the tree open-code syscall to gettid. Convert them to a common macro definition. Reviewed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241104171959.2938862-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
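A sketch of the idea; the macro name is illustrative rather than necessarily the exact one introduced by the patch:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* works on any glibc, whether or not it provides a gettid() wrapper */
#define sys_gettid() ((pid_t)syscall(SYS_gettid))

int main(void)
{
    printf("tid = %d\n", (int)sys_gettid());
    return 0;
}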
2024-09-05selftests/bpf: fix some typos in selftestsLin Yikai
Hi, fix some spelling errors in selftest, the details are as follows:

-in the codes:
 test_bpf_sk_stoarge_map_iter_fd(void) ->test_bpf_sk_storage_map_iter_fd(void)
 load BTF from btf_data.o->load BTF from btf_data.bpf.o
-in the code comments:
 preample->preamble
 multi-contollers->multi-controllers
 errono->errno
 unsighed/unsinged->unsigned
 egree->egress
 shoud->should
 regsiter->register
 assummed->assumed
 conditiona->conditional
 rougly->roughly
 timetamp->timestamp
 ingores->ignores
 null-termainted->null-terminated
 slepable->sleepable
 implemenation->implementation
 veriables->variables
 timetamps->timestamps
 substitue a costant->substitute a constant
 secton->section
 unreferened->unreferenced
 verifer->verifier
 libppf->libbpf
 ...

Signed-off-by: Lin Yikai <yikai.lin@vivo.com> Link: https://lore.kernel.org/r/20240905110354.3274546-1-yikai.lin@vivo.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-23selftests/bpf: add multi-uprobe benchmarksAndrii Nakryiko
Add multi-uprobe and multi-uretprobe benchmarks to bench tool. Multi- and classic uprobes/uretprobes have different low-level triggering code paths, so it's sometimes important to be able to benchmark both flavors of uprobes/uretprobes.

Sample examples from my dev machine below. Single-threaded performance almost doesn't differ, but with more parallel CPUs triggering the same uprobe/uretprobe the difference grows. This might be due to [0], but given the code is slightly different, there could be other sources of slowdown. Note, all these numbers will change due to ongoing work to improve uprobe/uretprobe scalability (e.g., [1]), but having a benchmark like this is useful for measurements and debugging nevertheless.

#!/bin/bash
set -eufo pipefail

for p in 1 8 16 32; do
    for i in uprobe-nop uretprobe-nop uprobe-multi-nop uretprobe-multi-nop; do
        summary=$(sudo ./bench -w1 -d3 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-21s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done

uprobe-nop            ( 1 cpus):    1.020 ± 0.005M/s  (  1.020M/s/cpu)
uretprobe-nop         ( 1 cpus):    0.515 ± 0.009M/s  (  0.515M/s/cpu)
uprobe-multi-nop      ( 1 cpus):    1.036 ± 0.004M/s  (  1.036M/s/cpu)
uretprobe-multi-nop   ( 1 cpus):    0.512 ± 0.005M/s  (  0.512M/s/cpu)
uprobe-nop            ( 8 cpus):    3.481 ± 0.030M/s  (  0.435M/s/cpu)
uretprobe-nop         ( 8 cpus):    2.222 ± 0.008M/s  (  0.278M/s/cpu)
uprobe-multi-nop      ( 8 cpus):    3.769 ± 0.094M/s  (  0.471M/s/cpu)
uretprobe-multi-nop   ( 8 cpus):    2.482 ± 0.007M/s  (  0.310M/s/cpu)
uprobe-nop            (16 cpus):    2.968 ± 0.011M/s  (  0.185M/s/cpu)
uretprobe-nop         (16 cpus):    1.870 ± 0.002M/s  (  0.117M/s/cpu)
uprobe-multi-nop      (16 cpus):    3.541 ± 0.037M/s  (  0.221M/s/cpu)
uretprobe-multi-nop   (16 cpus):    2.123 ± 0.026M/s  (  0.133M/s/cpu)
uprobe-nop            (32 cpus):    2.524 ± 0.026M/s  (  0.079M/s/cpu)
uretprobe-nop         (32 cpus):    1.572 ± 0.003M/s  (  0.049M/s/cpu)
uprobe-multi-nop      (32 cpus):    2.717 ± 0.003M/s  (  0.085M/s/cpu)
uretprobe-multi-nop   (32 cpus):    1.687 ± 0.007M/s  (  0.053M/s/cpu)

[0] https://lore.kernel.org/linux-trace-kernel/20240805202803.1813090-1-andrii@kernel.org/
[1] https://lore.kernel.org/linux-trace-kernel/20240731214256.3588718-1-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240806042935.3867862-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-24selftests: bpf: crypto: add benchmark for crypto functionsVadim Fedorenko
Some simple benchmarks are added to understand the baseline of performance. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240422225024.2847039-5-vadfed@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-28selftests/bpf: add batched tp/raw_tp/fmodret testsAndrii Nakryiko
Utilize bpf_modify_return_test_tp() kfunc to have a fast way to trigger tp/raw_tp/fmodret programs from another BPF program, which gives us comparable batched benchmarks to (batched) kprobe/fentry benchmarks. We don't switch kprobe/fentry batched benchmarks to this kfunc to make bench tool usable on older kernels as well. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240326162151.3981687-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28selftests/bpf: lazy-load trigger bench BPF programsAndrii Nakryiko
Instead of front-loading all possible benchmarking BPF programs for trigger benchmarks, explicitly specify which BPF programs are used by a specific benchmark and load only those. This allows the bench tool to be more flexible in supporting older kernels, where some program types might not be possible to load (e.g., those that rely on a newly added kfunc). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240326162151.3981687-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
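A sketch of the lazy-load pattern, under some assumptions: programs in the object are marked optional (a "?" section prefix keeps libbpf from auto-loading them), and the benchmark setup enables only the one it needs; the skeleton and program names below are illustrative:

#include <stdbool.h>
#include <string.h>
#include <bpf/libbpf.h>
#include "trigger_bench.skel.h"

static struct trigger_bench *load_one_prog(const char *which)
{
    struct trigger_bench *skel;

    skel = trigger_bench__open();
    if (!skel)
        return NULL;

    /* enable only the program this benchmark exercises */
    if (!strcmp(which, "fentry"))
        bpf_program__set_autoload(skel->progs.bench_trigger_fentry, true);

    if (trigger_bench__load(skel)) {
        trigger_bench__destroy(skel);
        return NULL;
    }
    return skel;
}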
2024-03-28selftests/bpf: remove syscall-driven benchs, keep syscall-count onlyAndrii Nakryiko
Remove "legacy" benchmarks triggered by syscalls in favor of newly added in-kernel/batched benchmarks. Drop -batched suffix now as well. Next patch will restore "feature parity" by adding back tp/raw_tp/fmodret benchmarks based on in-kernel kfunc approach. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240326162151.3981687-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarksAndrii Nakryiko
Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between one syscall execution and BPF program run. While we use a fast get_pgid() syscall, syscall overhead can still be non-trivial.

This patch adds kprobe/fentry set of benchmarks significantly amortizing the cost of syscall vs actual BPF triggering overhead. We do this by employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program which does a tight parameterized loop calling cheap BPF helper (bpf_get_numa_node_id()), to which kprobe/fentry programs are attached for benchmarking. This way 1 bpf() syscall causes N executions of the BPF program being benchmarked. N defaults to 100, but can be adjusted with --trig-batch-iters CLI argument.

For comparison we also implement a new baseline program that instead of triggering another BPF program just does N atomic per-CPU counter increments, establishing the limit for all other types of program within this batched benchmarking setup.

Taking the final set of benchmarks added in this patch set (including tp/raw_tp/fmodret, added in later patch), and keeping for now "legacy" syscall-driven benchmarks, we can capture all triggering benchmarks in one place for comparison, before we remove the legacy ones (and rename xxx-batched into just xxx).

$ benchs/run_bench_trigger.sh
usermode-count        : 79.500 ± 0.024M/s
kernel-count          : 49.949 ± 0.081M/s
syscall-count         :  9.009 ± 0.007M/s
fentry-batch          : 31.002 ± 0.015M/s
fexit-batch           : 20.372 ± 0.028M/s
fmodret-batch         : 21.651 ± 0.659M/s
rawtp-batch           : 36.775 ± 0.264M/s
tp-batch              : 19.411 ± 0.248M/s
kprobe-batch          : 12.949 ± 0.220M/s
kprobe-multi-batch    : 15.400 ± 0.007M/s
kretprobe-batch       :  5.559 ± 0.011M/s
kretprobe-multi-batch :  5.861 ± 0.003M/s
fentry-legacy         :  8.329 ± 0.004M/s
fexit-legacy          :  6.239 ± 0.003M/s
fmodret-legacy        :  6.595 ± 0.001M/s
rawtp-legacy          :  8.305 ± 0.004M/s
tp-legacy             :  6.382 ± 0.001M/s
kprobe-legacy         :  5.528 ± 0.003M/s
kprobe-multi-legacy   :  5.864 ± 0.022M/s
kretprobe-legacy      :  3.081 ± 0.001M/s
kretprobe-multi-legacy:  3.193 ± 0.001M/s

Note how xxx-batch variants are measured with significantly higher throughput, even though it's exactly the same in-kernel overhead. As such, results can be compared only between benchmarks of the same kind (syscall vs batched):

fentry-legacy         :  8.329 ± 0.004M/s
fentry-batch          : 31.002 ± 0.015M/s
kprobe-multi-legacy   :  5.864 ± 0.022M/s
kprobe-multi-batch    : 15.400 ± 0.007M/s

Note also that syscall-count is setting a theoretical limit for syscall-triggered benchmarks, while kernel-count is setting similar limits for batch variants. usermode-count is a happy and unachievable case of user space counting without doing any syscalls, and is mostly the measure of CPU speed for such a trivial benchmark.

As was mentioned, tp/raw_tp/fmodret require a kernel-side kfunc to produce a similar benchmark, which we address in a separate patch. Note that run_bench_trigger.sh allows overriding the list of benchmarks to run, which is very useful for performance work.

Cc: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
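A minimal sketch of the "driver" program described above; the global and program names are assumptions based on this description, not necessarily the exact selftest code:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* set from user space before load; const volatile lands in .rodata,
 * so the verifier sees a bounded loop */
const volatile int batch_iters = 100;

/* run via BPF_PROG_TEST_RUN: one bpf() syscall drives batch_iters calls to
 * the cheap bpf_get_numa_node_id() helper, which the benchmarked
 * fentry/kprobe programs attach to */
SEC("raw_tp")
int trigger_driver(void *ctx)
{
    int i;

    for (i = 0; i < batch_iters; i++)
        (void)bpf_get_numa_node_id();
    return 0;
}

char LICENSE[] SEC("license") = "GPL";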
2024-03-28selftests/bpf: rename and clean up userspace-triggered benchmarksAndrii Nakryiko
Rename uprobe-base to more precise usermode-count (it will match other baseline-like benchmarks, kernel-count and syscall-count). Also use BENCH_TRIG_USERMODE() macro to define all usermode-based triggering benchmarks, which include usermode-count and uprobe/uretprobe benchmarks. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240326162151.3981687-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-22selftests/bpf: Mark uprobe trigger functions with nocf_check attributeJiri Olsa
Some distros seem to enable -fcf-protection=branch by default, which breaks our setup by placing an endbr64 instruction at the first instruction of the uprobe trigger functions. Mark them with the nocf_check attribute to skip that. Ignore the unknown-attribute warning in gcc for bench objects, because nocf_check can be used only when -fcf-protection=branch is enabled; otherwise we get a warning and break compilation. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240322134936.1075395-1-jolsa@kernel.org
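Roughly, the marking looks like the following sketch; the macro and function name are illustrative:

/* nocf_check suppresses the endbr64 insertion done by -fcf-protection=branch;
 * compilers that don't know the attribute emit the (ignored) warning
 * mentioned above */
#define __nocf_check __attribute__((nocf_check))

__nocf_check __attribute__((weak)) void uprobe_target_nop(void)
{
    asm volatile ("nop");
}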
2024-03-22selftests/bpf: Use syscall(SYS_gettid) instead of gettid() wrapper in benchAlan Maguire
With glibc 2.28, selftests compilation fails for benchs/bench_trigger.c:

benchs/bench_trigger.c: In function ‘inc_counter’:
benchs/bench_trigger.c:25:23: error: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
   25 |         tid = gettid();
      |               ^~~~~~
      |               getgid
cc1: all warnings being treated as errors

It appears support for the gettid() wrapper is variable across glibc versions, so may be safer to use syscall(SYS_gettid) instead.

Fixes: 520fad2e3206 ("selftests/bpf: scale benchmark counting by using per-CPU counters") Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240322095728.95671-1-alan.maguire@oracle.com
2024-03-19selftests/bpf: scale benchmark counting by using per-CPU countersAndrii Nakryiko
When benchmarking with multiple threads (-pN, where N>1), we start contending on a single atomic counter that both BPF trigger benchmarks are using, as well as "baseline" tests in user space (trig-base and trig-uprobe-base benchmarks). As such, we start bottlenecking on something completely irrelevant to the benchmark at hand.

Scale counting up by using per-CPU counters on BPF side. On user space side we do the next best thing: hash thread ID to approximate per-CPU behavior. It seems to work quite well in practice.

To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8, 16, and 32 threads:
- trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
- trig-base (get_pgid() syscall, atomic counter in user-space);
- trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU counter on BPF side).

Command used:

for b in uprobe-base base fentry; do \
  for p in 1 2 4 8 16 32; do \
    printf "%-11s %2d: %s\n" $b $p \
      "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
  done; \
done

Before these changes, aggregate throughput across all threads doesn't scale well with number of threads, it actually even falls sharply for uprobe-base due to a very high contention:

uprobe-base  1:  138.998 ± 0.650M/s
uprobe-base  2:   70.526 ± 1.147M/s
uprobe-base  4:   63.114 ± 0.302M/s
uprobe-base  8:   54.177 ± 0.138M/s
uprobe-base 16:   45.439 ± 0.057M/s
uprobe-base 32:   37.163 ± 0.242M/s
base         1:   16.940 ± 0.182M/s
base         2:   19.231 ± 0.105M/s
base         4:   21.479 ± 0.038M/s
base         8:   23.030 ± 0.037M/s
base        16:   22.034 ± 0.004M/s
base        32:   18.152 ± 0.013M/s
fentry       1:   14.794 ± 0.054M/s
fentry       2:   17.341 ± 0.055M/s
fentry       4:   23.792 ± 0.024M/s
fentry       8:   21.557 ± 0.047M/s
fentry      16:   21.121 ± 0.004M/s
fentry      32:   17.067 ± 0.023M/s

After these changes, we see almost perfect linear scaling, as expected. The sub-linear scaling when going from 8 to 16 threads is interesting and consistent on my test machine, but I haven't investigated what is causing this peculiar slowdown (across all benchmarks, could be due to hyperthreading effects, not sure).

uprobe-base  1:  139.980 ± 0.648M/s
uprobe-base  2:  270.244 ± 0.379M/s
uprobe-base  4:  532.044 ± 1.519M/s
uprobe-base  8: 1004.571 ± 3.174M/s
uprobe-base 16: 1720.098 ± 0.744M/s
uprobe-base 32: 3506.659 ± 8.549M/s
base         1:   16.869 ± 0.071M/s
base         2:   33.007 ± 0.092M/s
base         4:   64.670 ± 0.203M/s
base         8:  121.969 ± 0.210M/s
base        16:  207.832 ± 0.112M/s
base        32:  424.227 ± 1.477M/s
fentry       1:   14.777 ± 0.087M/s
fentry       2:   28.575 ± 0.146M/s
fentry       4:   56.234 ± 0.176M/s
fentry       8:  106.095 ± 0.385M/s
fentry      16:  181.440 ± 0.032M/s
fentry      32:  369.131 ± 0.693M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Message-ID: <20240315213329.1161589-1-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
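The user-space side of that "next best thing" could look roughly like this sketch; the counter array size and hashing scheme are assumptions for illustration:

#include <sys/syscall.h>
#include <unistd.h>

#define NR_CNTRS 256                    /* >= number of CPUs in practice */

/* one cache line per counter to avoid false sharing */
static struct { long value; char pad[56]; } cntrs[NR_CNTRS];

static void inc_counter(void)
{
    static __thread int slot = -1;

    /* hash the thread ID once to pick a slot, approximating per-CPU counting */
    if (slot < 0)
        slot = (int)(syscall(SYS_gettid) % NR_CNTRS);
    __atomic_fetch_add(&cntrs[slot].value, 1, __ATOMIC_RELAXED);
}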
2024-03-15selftests/bpf: Remove second semicolonColin Ian King
There are statements with two semicolons. Remove the second one, it is redundant. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240315092654.2431062-1-colin.i.king@gmail.com
2024-03-11selftests/bpf: Add kprobe multi triggering benchmarksJiri Olsa
Adding kprobe multi triggering benchmarks. It's useful now to benchmark the new fprobe implementation and might be useful later as well. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240311211023.590321-1-jolsa@kernel.org
2024-03-11selftests/bpf: Add fexit and kretprobe triggering benchmarksAndrii Nakryiko
We already have kprobe and fentry benchmarks. Let's add kretprobe and fexit ones for completeness. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240309005124.3004446-1-andrii@kernel.org
2024-03-04selftests/bpf: Extend uprobe/uretprobe triggering benchmarksAndrii Nakryiko
Settle on three "flavors" of uprobe/uretprobe, installed on different kinds of instruction: nop, push, and ret. All three are testing different internal code paths emulating or single-stepping instructions, so are interesting to compare and benchmark separately. To ensure `push rbp` instruction we ensure that uprobe_target_push() is not a leaf function by calling (global __weak) noop function and returning something afterwards (if we don't do that, compiler will just do a tail call optimization). Also, we need to make sure that compiler isn't skipping frame pointer generation, so let's add `-fno-omit-frame-pointers` to Makefile.

Just to give an idea of where we currently stand in terms of relative performance of different uprobe/uretprobe cases vs a cheap syscall (getpgid()) baseline, here are results from my local machine:

$ benchs/run_bench_uprobes.sh
base           :    1.561 ± 0.020M/s
uprobe-nop     :    0.947 ± 0.007M/s
uprobe-push    :    0.951 ± 0.004M/s
uprobe-ret     :    0.443 ± 0.007M/s
uretprobe-nop  :    0.471 ± 0.013M/s
uretprobe-push :    0.483 ± 0.004M/s
uretprobe-ret  :    0.306 ± 0.007M/s

Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240301214551.1686095-1-andrii@kernel.org
2023-12-19selftests/bpf: Close cgrp fd before calling cleanup_cgroup_environment()Hou Tao
There is an error log when the htab-mem benchmark completes. The error log looks as follows:

$ ./bench htab-mem -d1
Setting up benchmark 'htab-mem'...
Benchmark 'htab-mem' started.
......
(cgroup_helpers.c:353: errno: Device or resource busy) umount cgroup2

Fix it by closing cgrp fd before invoking cleanup_cgroup_environment().

Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231219135727.2661527-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-14selftests/bpf: Clean up fmod_ret in bench_rename test scriptYipeng Zou
Running the bench_rename test script, the following error occurs:

# ./benchs/run_bench_rename.sh
base      :    0.819 ± 0.012M/s
kprobe    :    0.538 ± 0.009M/s
kretprobe :    0.503 ± 0.004M/s
rawtp     :    0.779 ± 0.020M/s
fentry    :    0.726 ± 0.007M/s
fexit     :    0.691 ± 0.007M/s
benchmark 'rename-fmodret' not found

The bench_rename_fmodret has been removed in commit b000def2e052 ("selftests: Remove fmod_ret from test_overhead"), thus remove it from the runners in the test script.

Fixes: b000def2e052 ("selftests: Remove fmod_ret from test_overhead") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230814030727.3010390-1-zouyipeng@huawei.com
2023-07-07selftests/bpf: Correct two typosLu Hongfei
When wrapping code, using ';' is better than using ',', which is more in line with the coding habits of most engineers. Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230707081253.34638-1-luhongfei@vivo.com
2023-07-05selftests/bpf: Add benchmark for bpf memory allocatorHou Tao
The benchmark could be used to compare the performance of hash map operations and the memory usage between different flavors of bpf memory allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also could be used to check the performance improvement or the memory saving provided by optimization.

The benchmark creates a non-preallocated hash map which uses bpf memory allocator and shows the operation performance and the memory usage of the hash map under different use cases:

(1) overwrite
Each CPU overwrites nonoverlapping part of hash map. When each CPU completes overwriting of 64 elements in hash map, it increases the op_count.

(2) batch_add_batch_del
Each CPU adds then deletes nonoverlapping part of hash map in batch. When each CPU adds and deletes 64 elements in hash map, it increases the op_count twice.

(3) add_del_on_diff_cpu
Each two-CPUs pair adds and deletes nonoverlapping part of map cooperatively. When each CPU adds or deletes 64 elements in hash map, it will increase the op_count.

The following are the benchmark results when comparing between different flavors of bpf memory allocator. These tests are conducted on a KVM guest with 8 CPUs and 16 GB memory. The command line below is used to do all the following benchmarks:

./bench htab-mem --use-case $name ${OPTS} -w3 -d10 -a -p8

These results show that preallocated hash map has both better performance and smaller memory footprint.

(1) non-preallocated + no bpf memory allocator (v6.0.19)
use kmalloc() + call_rcu
overwrite            per-prod-op: 11.24 ± 0.07k/s, avg mem: 82.64 ± 26.32MiB, peak mem: 119.18MiB
batch_add_batch_del  per-prod-op: 18.45 ± 0.10k/s, avg mem: 50.47 ± 14.51MiB, peak mem: 94.96MiB
add_del_on_diff_cpu  per-prod-op: 14.50 ± 0.03k/s, avg mem:  4.64 ±  0.73MiB, peak mem:  7.20MiB

(2) preallocated
OPTS=--preallocated
overwrite            per-prod-op: 191.42 ± 0.09k/s, avg mem: 1.24 ± 0.00MiB, peak mem: 1.49MiB
batch_add_batch_del  per-prod-op: 221.83 ± 0.17k/s, avg mem: 1.23 ± 0.00MiB, peak mem: 1.49MiB
add_del_on_diff_cpu  per-prod-op:  39.66 ± 0.31k/s, avg mem: 1.47 ± 0.13MiB, peak mem: 1.75MiB

(3) normal bpf memory allocator
overwrite            per-prod-op: 126.59 ± 0.02k/s, avg mem:  2.26 ± 0.00MiB, peak mem:  2.74MiB
batch_add_batch_del  per-prod-op:  83.37 ± 0.20k/s, avg mem:  2.14 ± 0.17MiB, peak mem:  2.74MiB
add_del_on_diff_cpu  per-prod-op:  21.25 ± 0.24k/s, avg mem: 17.50 ± 3.32MiB, peak mem: 28.87MiB

Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230704025039.938914-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-19selftests/bpf: Set the default value of consumer_cnt as 0Hou Tao
Considering that only bench_ringbufs.c supports consumer, just set the default value of consumer_cnt as 0. After that, update the validity check of consumer_cnt, remove unused consumer_thread code snippets and set consumer_cnt as 1 in run_bench_ringbufs.sh accordingly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230613080921.1623219-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-19selftests/bpf: Use producer_cnt to allocate local counter arrayHou Tao
For count-local benchmark, use producer_cnt instead of consumer_cnt when allocating local counter array. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230613080921.1623219-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-31selftests/bpf: Fix conflicts with built-in functions in bench_local_storage_createJames Hilliard
The fork function in gcc is considered a built in function due to being used by libgcov when building with gnu extensions. Rename fork to sched_process_fork to prevent this conflict.

See details:
https://github.com/gcc-mirror/gcc/commit/d1c38823924506d389ca58d02926ace21bdf82fa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82457

Fixes the following error:

In file included from progs/bench_local_storage_create.c:6:
progs/bench_local_storage_create.c:43:14: error: conflicting types for built-in function 'fork'; expected 'int(void)' [-Werror=builtin-declaration-mismatch]
   43 | int BPF_PROG(fork, struct task_struct *parent, struct task_struct *child)
      |              ^~~~

Fixes: cbe9d93d58b1 ("selftests/bpf: Add bench for task storage creation") Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230331075848.1642814-1-james.hilliard1@gmail.com
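The rename amounts to something like the following sketch; the storage-creation body is omitted:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* BPF_PROG() generates a C function with the given name, so naming the
 * program "fork" collided with gcc's built-in fork(); the attach point is
 * unchanged, only the program (function) name differs */
SEC("tp_btf/sched_process_fork")
int BPF_PROG(sched_process_fork, struct task_struct *parent,
             struct task_struct *child)
{
    /* ... create task local storage for the new task here ... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";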
2023-03-25selftests/bpf: Add bench for task storage creationMartin KaFai Lau
This patch adds a task storage benchmark to the existing local-storage-create benchmark.

For task storage, ./bench --storage-type task --batch-size 32:

   bpf_ma: Summary: creates   30.456 ± 0.507k/s ( 30.456k/prod), 6.08 kmallocs/create
no bpf_ma: Summary: creates   31.962 ± 0.486k/s ( 31.962k/prod), 6.13 kmallocs/create

./bench --storage-type task --batch-size 64:

   bpf_ma: Summary: creates   30.197 ± 1.476k/s ( 30.197k/prod), 6.08 kmallocs/create
no bpf_ma: Summary: creates   31.103 ± 0.297k/s ( 31.103k/prod), 6.13 kmallocs/create

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230322215246.1675516-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10selftests/bpf: Add local-storage-create benchmarkMartin KaFai Lau
This patch tests how many kmallocs is needed to create and free a batch of UDP sockets and each socket has a 64bytes bpf storage. It also measures how fast the UDP sockets can be created. The result is from my qemu setup.

Before bpf_mem_cache_alloc/free:
./bench -p 1 local-storage-create
Setting up benchmark 'local-storage-create'...
Benchmark 'local-storage-create' started.
Iter 0 ( 73.193us): creates 213.552k/s (213.552k/prod), 3.09 kmallocs/create
Iter 1 (-20.724us): creates 211.908k/s (211.908k/prod), 3.09 kmallocs/create
Iter 2 (  9.280us): creates 212.574k/s (212.574k/prod), 3.12 kmallocs/create
Iter 3 ( 11.039us): creates 213.209k/s (213.209k/prod), 3.12 kmallocs/create
Iter 4 (-11.411us): creates 213.351k/s (213.351k/prod), 3.12 kmallocs/create
Iter 5 ( -7.915us): creates 214.754k/s (214.754k/prod), 3.12 kmallocs/create
Iter 6 ( 11.317us): creates 210.942k/s (210.942k/prod), 3.12 kmallocs/create
Summary: creates 212.789 ± 1.310k/s (212.789k/prod), 3.12 kmallocs/create

After bpf_mem_cache_alloc/free:
./bench -p 1 local-storage-create
Setting up benchmark 'local-storage-create'...
Benchmark 'local-storage-create' started.
Iter 0 ( 68.265us): creates 243.984k/s (243.984k/prod), 1.04 kmallocs/create
Iter 1 ( 30.357us): creates 238.424k/s (238.424k/prod), 1.04 kmallocs/create
Iter 2 (-18.712us): creates 232.963k/s (232.963k/prod), 1.04 kmallocs/create
Iter 3 (-15.885us): creates 238.879k/s (238.879k/prod), 1.04 kmallocs/create
Iter 4 (  5.590us): creates 237.490k/s (237.490k/prod), 1.04 kmallocs/create
Iter 5 (  8.577us): creates 237.521k/s (237.521k/prod), 1.04 kmallocs/create
Iter 6 ( -6.263us): creates 238.508k/s (238.508k/prod), 1.04 kmallocs/create
Summary: creates 237.298 ± 2.198k/s (237.298k/prod), 1.04 kmallocs/create

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-18-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-15selftest/bpf/benchs: Add benchmark for hashmap lookupsAnton Protopopov
Add a new benchmark which measures hashmap lookup operations speed. A user can control the following parameters of the benchmark:

* key_size (max 1024): the key size to use
* max_entries: the hashmap max entries
* nr_entries: the number of entries to insert/lookup
* nr_loops: the number of loops for the benchmark
* map_flags: the hashmap flags passed to BPF_MAP_CREATE

The BPF program performing the benchmarks calls two nested bpf_loop:

bpf_loop(nr_loops/nr_entries)
  bpf_loop(nr_entries)
    bpf_map_lookup()

So the nr_loops determines the number of actual map lookups. All lookups are successful. Example (the output is generated on a AMD Ryzen 9 3950X machine):

for nr_entries in `seq 4096 4096 65536`; do
  echo -n "$((nr_entries*100/65536))% full: ";
  sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=4 --nr_entries=$nr_entries --max_entries=65536 --nr_loops=1000000 --map_flags=0x40 | grep cpu;
done

  6% full: cpu01: lookup 50.739M ± 0.018M events/sec (approximated from 32 samples of ~19ms)
 12% full: cpu01: lookup 47.751M ± 0.015M events/sec (approximated from 32 samples of ~20ms)
 18% full: cpu01: lookup 45.153M ± 0.013M events/sec (approximated from 32 samples of ~22ms)
 25% full: cpu01: lookup 43.826M ± 0.014M events/sec (approximated from 32 samples of ~22ms)
 31% full: cpu01: lookup 41.971M ± 0.012M events/sec (approximated from 32 samples of ~23ms)
 37% full: cpu01: lookup 41.034M ± 0.015M events/sec (approximated from 32 samples of ~24ms)
 43% full: cpu01: lookup 39.946M ± 0.012M events/sec (approximated from 32 samples of ~25ms)
 50% full: cpu01: lookup 38.256M ± 0.014M events/sec (approximated from 32 samples of ~26ms)
 56% full: cpu01: lookup 36.580M ± 0.018M events/sec (approximated from 32 samples of ~27ms)
 62% full: cpu01: lookup 36.252M ± 0.012M events/sec (approximated from 32 samples of ~27ms)
 68% full: cpu01: lookup 35.200M ± 0.012M events/sec (approximated from 32 samples of ~28ms)
 75% full: cpu01: lookup 34.061M ± 0.009M events/sec (approximated from 32 samples of ~29ms)
 81% full: cpu01: lookup 34.374M ± 0.010M events/sec (approximated from 32 samples of ~29ms)
 87% full: cpu01: lookup 33.244M ± 0.011M events/sec (approximated from 32 samples of ~30ms)
 93% full: cpu01: lookup 32.182M ± 0.013M events/sec (approximated from 32 samples of ~31ms)
100% full: cpu01: lookup 31.497M ± 0.016M events/sec (approximated from 32 samples of ~31ms)

Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-8-aspsk@isovalent.com
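A condensed sketch of the nested bpf_loop structure described above; the map layout, attach point, and variable names are illustrative rather than the exact benchmark program:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

const volatile u32 nr_entries = 1000;
const volatile u32 nr_loops = 1000000;

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, u32);
    __type(value, u64);
} hash_map SEC(".maps");

static long lookup_cb(u32 i, void *ctx)
{
    bpf_map_lookup_elem(&hash_map, &i);     /* every key exists */
    return 0;
}

static long outer_cb(u32 i, void *ctx)
{
    bpf_loop(nr_entries, lookup_cb, NULL, 0);
    return 0;
}

SEC("tp/syscalls/sys_enter_getpgid")
int benchmark_lookups(void *ctx)
{
    /* total lookups per trigger = nr_loops */
    bpf_loop(nr_loops / nr_entries, outer_cb, NULL, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";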
2023-02-15selftest/bpf/benchs: Make quiet option commonAnton Protopopov
The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to the list of common options, so that the main code and other benchmarks can use (new) env.quiet variable. Patch the run_bench_local_storage_rcu_tasks_trace.sh helper script accordingly. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-6-aspsk@isovalent.com
2023-02-15selftest/bpf/benchs: Remove an unused headerAnton Protopopov
The benchs/bench_bpf_hashmap_full_update.c doesn't set a custom argp, so it shouldn't include the <argp.h> header. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-5-aspsk@isovalent.com
2023-02-15selftest/bpf/benchs: Enhance argp parsingAnton Protopopov
To parse command line the bench utility uses the argp_parse() function. This function takes as an argument a parent 'struct argp' structure which defines common command line options and an array of children 'struct argp' structures which defines additional command line options for particular benchmarks. This implementation doesn't allow benchmarks to share option names, e.g., if two benchmarks want to use, say, the --option option, then only one of them will succeed (the first one encountered in the array). It would be convenient if the same option names could be used in different benchmarks (with the same semantics, e.g., --nr_loops=N).

Fix this by calling the argp_parse() function twice. The first call is the same as it was before, with all children argps, and helps to find the benchmark name and to print a combined help message if anything is wrong. Given the name, we can call argp_parse() a second time, but now the children array points only to the selected benchmark, thus always calling the correct parsers. (If there's no benchmark-specific list of arguments, then only one call to argp_parse is done.)

Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-4-aspsk@isovalent.com
2023-02-15selftest/bpf/benchs: Make a function static in bpf_hashmap_full_updateAnton Protopopov
The hashmap_report_final callback function defined in the benchs/bench_bpf_hashmap_full_update.c file should be static. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-3-aspsk@isovalent.com
2023-02-15selftest/bpf/benchs: Fix a typo in bpf_hashmap_full_updateAnton Protopopov
To call the bpf_hashmap_full_update benchmark, one should say: bench bpf-hashmap-ful-update The patch adds a missing 'l' to the benchmark name. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-2-aspsk@isovalent.com
2022-07-07selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usageDave Marchevsky
This benchmark measures grace period latency and kthread cpu usage of RCU Tasks Trace when many processes are creating/deleting BPF local_storage. Intent here is to quantify improvement on these metrics after Paul's recent RCU Tasks patches [0]. Specifically, fork 15k tasks which call a bpf prog that creates/destroys task local_storage and sleep in a loop, resulting in many call_rcu_tasks_trace calls. To determine grace period latency, trace time elapsed between rcu_tasks_trace_pregp_step and rcu_tasks_trace_postgp; for cpu usage look at rcu_task_trace_kthread's stime in /proc/PID/stat. On my virtualized test environment (Skylake, 8 cpus) benchmark results demonstrate significant improvement:

BEFORE Paul's patches:
SUMMARY tasks_trace grace period latency   avg 22298.551 us   stddev 1302.165 us
SUMMARY ticks per tasks_trace grace period avg 2.291          stddev 0.324

AFTER Paul's patches:
SUMMARY tasks_trace grace period latency   avg 16969.197 us   stddev 2525.053 us
SUMMARY ticks per tasks_trace grace period avg 1.146          stddev 0.178

Note that since these patches are not in bpf-next benchmarking was done by cherry-picking this patch onto rcu tree.

[0] https://lore.kernel.org/rcu/20220620225402.GA3842369@paulmck-ThinkPad-P17-Gen-1/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220705190018.3239050-1-davemarchevsky@fb.com
2022-06-22selftests/bpf: Add benchmark for local_storage getDave Marchevsky
Add benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond the current local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to normal 'hits' count of all gets. Goal here is to mimic scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1, ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when important map is accessed far more frequently than non-important maps.

A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold tid -> data mapping for all tasks. Size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these keys which are actually fetched as part of the benchmark is configurable.

Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so effect of future work on both is clear.

Regarding the benchmark results. On a powerful system (Skylake, 20 cores, 256gb ram):

Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s

Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s

Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it.

For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though.

To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results:

Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s

Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

hashmap_control_1k:   ~53.8ns
hashmap_control_10k:  ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1:     ~12.1ns
sequential_get_10:    ~16.0ns
sequential_get_16:    ~13.8ns
sequential_get_17:    ~16.8ns
sequential_get_24:    ~40.9ns
sequential_get_32:    ~65.2ns
sequential_get_100:   ~148.2ns
sequential_get_1000:  ~2204ns

Clearly demonstrating a cliff.

In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. The benchmark results show that local_storage is 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids. Bench numbers for hashmaps of this size are ~10x slower than sequential_get_16, but as the number of local_storage maps grows far past local_storage cache size the performance advantage shrinks and eventually reverses.

When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run.

[0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
[1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-11selftest/bpf/benchs: Add bpf_map benchmarkFeng Zhou
Add a benchmark for hash_map to reproduce the worst case of non-stop updates when the map has no free elements left. Example run:

./run_bench_bpf_hashmap_full_update.sh
Setting up benchmark 'bpf-hashmap-ful-update'...
Benchmark 'bpf-hashmap-ful-update' started.
1:hash_map_full_perf 555830 events per sec
...

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20220610023308.93798-3-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
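[Editor's illustration, not taken from the patch: a rough, hypothetical sketch of the situation being reproduced, hammering bpf_map_update_elem() against a hashmap that is already full. Map name, sizes, and hook point are illustrative only.]

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 1024);
		__type(key, u64);
		__type(value, u64);
	} demo_hash SEC(".maps");

	SEC("tp/syscalls/sys_enter_getpgid")
	int demo_full_update(void *ctx)
	{
		u64 key = bpf_ktime_get_ns();	/* effectively always a new key */
		u64 val = 1;

		/* Once all 1024 slots are taken, inserting new keys keeps failing
		 * with -E2BIG; that "no free elements" path is the worst case the
		 * commit message describes. */
		bpf_map_update_elem(&demo_hash, &key, &val, BPF_ANY);
		return 0;
	}

	char _license[] SEC("license") = "GPL";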
2022-01-26selftests/bpf: fix uprobe offset calculation in selftestsAndrii Nakryiko
Fix how selftests determine the relative offset of a function that is uprobed. Previously, there was an assumption that the uprobed function is always in the first executable region, which is not always the case (libbpf CI hits this case now), so the get_base_addr() approach in isolation doesn't work anymore. Teach get_uprobe_offset() to determine the correct memory mapping and calculate the uprobe offset correctly.

While at it, I merged together the two implementations of the get_uprobe_offset() helper, moving the powerpc64-specific logic inside (had to add an extra {} block to avoid an unused variable error for insn).

Also ensured that uprobed functions are never inlined, but are still static (and thus local to each selftest), by using a no-op asm volatile block internally. I didn't want to keep them global __weak, because some tests use uprobe's ref counter offset (to test USDT-like logic) which is not compatible with a non-refcounted uprobe. So it's nicer to have each test's uprobe target local to the file and guaranteed to not be inlined or skipped by the compiler (which can happen with static functions, especially when compiling selftests with -O2).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20220126193058.3390292-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
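[Editor's illustration: a minimal sketch of the no-op asm volatile trick described above; the function name is hypothetical. The empty asm block keeps a static function from being inlined or elided under -O2 while leaving it file-local.]

	/* Hypothetical uprobe target, kept static but never inlined. */
	static void uprobe_target_demo(void)
	{
		/* The compiler cannot see through or remove this block, so the
		 * function body and its symbol survive even at -O2. */
		asm volatile ("");
	}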
2022-01-25selftests/bpf: use preferred setter/getter APIs instead of deprecated onesAndrii Nakryiko
Switch to using preferred setters and getters instead of deprecated ones. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220124194254.2051434-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
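[Editor's illustration: the commit does not list its call sites, so the following is a purely hypothetical example of the kind of replacement, swapping a deprecated libbpf accessor for its preferred counterpart.]

	#include <bpf/libbpf.h>

	/* Hypothetical helper using the preferred setter/getter instead of the
	 * deprecated bpf_map__resize() / bpf_map__def() equivalents. */
	static int set_and_check_entries(struct bpf_map *map)
	{
		int err;

		/* deprecated: bpf_map__resize(map, 1024); */
		err = bpf_map__set_max_entries(map, 1024);
		if (err)
			return err;

		/* deprecated: bpf_map__def(map)->max_entries */
		return (int)bpf_map__max_entries(map);
	}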
2021-12-11selftests/bpf: Add benchmark for bpf_strncmp() helperHou Tao
Add a benchmark to compare the performance between a home-made strncmp() in a bpf program and the bpf_strncmp() helper. In summary, the performance win of bpf_strncmp() under x86-64 is greater than 18% when the compared string length is greater than 64, and is 179% when the length is 4095. Under arm64 the performance win is even bigger: 33% when the length is greater than 64 and 600% when the length is 4095.

The details follow:

no-helper-X: use home-made strncmp() to compare X-sized string
helper-Y: use bpf_strncmp() to compare Y-sized string

Under x86-64:

no-helper-1     3.504 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-1        3.347 ± 0.001M/s (drops 0.000 ± 0.000M/s)
no-helper-8     3.357 ± 0.001M/s (drops 0.000 ± 0.000M/s)
helper-8        3.307 ± 0.001M/s (drops 0.000 ± 0.000M/s)
no-helper-32    3.064 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-32       3.253 ± 0.001M/s (drops 0.000 ± 0.000M/s)
no-helper-64    2.563 ± 0.001M/s (drops 0.000 ± 0.000M/s)
helper-64       3.040 ± 0.001M/s (drops 0.000 ± 0.000M/s)
no-helper-128   1.975 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-128      2.641 ± 0.000M/s (drops 0.000 ± 0.000M/s)
no-helper-512   0.759 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-512      1.574 ± 0.000M/s (drops 0.000 ± 0.000M/s)
no-helper-2048  0.329 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-2048     0.602 ± 0.000M/s (drops 0.000 ± 0.000M/s)
no-helper-4095  0.117 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-4095     0.327 ± 0.000M/s (drops 0.000 ± 0.000M/s)

Under arm64:

no-helper-1     2.806 ± 0.004M/s (drops 0.000 ± 0.000M/s)
helper-1        2.819 ± 0.002M/s (drops 0.000 ± 0.000M/s)
no-helper-8     2.797 ± 0.109M/s (drops 0.000 ± 0.000M/s)
helper-8        2.786 ± 0.025M/s (drops 0.000 ± 0.000M/s)
no-helper-32    2.399 ± 0.011M/s (drops 0.000 ± 0.000M/s)
helper-32       2.703 ± 0.002M/s (drops 0.000 ± 0.000M/s)
no-helper-64    2.020 ± 0.015M/s (drops 0.000 ± 0.000M/s)
helper-64       2.702 ± 0.073M/s (drops 0.000 ± 0.000M/s)
no-helper-128   1.604 ± 0.001M/s (drops 0.000 ± 0.000M/s)
helper-128      2.516 ± 0.002M/s (drops 0.000 ± 0.000M/s)
no-helper-512   0.699 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-512      2.106 ± 0.003M/s (drops 0.000 ± 0.000M/s)
no-helper-2048  0.215 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-2048     1.223 ± 0.003M/s (drops 0.000 ± 0.000M/s)
no-helper-4095  0.112 ± 0.000M/s (drops 0.000 ± 0.000M/s)
helper-4095     0.796 ± 0.000M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211210141652.877186-4-houtao1@huawei.com
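[Editor's illustration, not the benchmark's actual code: a hypothetical sketch of the two comparison strategies in a bpf program. Buffer names, the hook point, and the 64-byte length are assumptions.]

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	#define CMP_LEN 64

	const char target[CMP_LEN];	/* read-only string, set via skeleton rodata */
	char buf[CMP_LEN];

	/* "home-made" strncmp: a plain bounded loop executed as BPF instructions */
	static int my_strncmp(const char *s1, const char *s2, unsigned int n)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			if (s1[i] != s2[i] || s1[i] == '\0')
				return s1[i] - s2[i];
		}
		return 0;
	}

	SEC("tp/syscalls/sys_enter_getpgid")
	int compare(void *ctx)
	{
		int a, b;

		a = my_strncmp(buf, target, CMP_LEN);
		/* helper variant: one call instead of a per-byte BPF loop */
		b = bpf_strncmp(buf, CMP_LEN, target);
		return a == b ? 0 : 1;
	}

	char _license[] SEC("license") = "GPL";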
2021-12-11selftests/bpf: Fix checkpatch error on empty function parameterHou Tao
Fix checkpatch error: "ERROR: Bad function definition - void foo() should probably be void foo(void)". Most replacements are done by the following command:

sed -i 's#\([a-z]\)()$#\1(void)#g' testing/selftests/bpf/benchs/*.c

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211210141652.877186-3-houtao1@huawei.com
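[Editor's illustration of the pattern that sed expression rewrites; the function name is hypothetical.]

	/* before (flagged by checkpatch):
	 *	static void setup_shared()
	 *	{
	 *	}
	 */

	/* after: explicit (void) parameter list */
	static void setup_shared(void)
	{
	}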
2021-11-30selftest/bpf/benchs: Add bpf_loop benchmarkJoanne Koong
Add benchmark to measure the throughput and latency of the bpf_loop call. Testing this on my dev machine on 1 thread, the data is as follows: nr_loops: 10 bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op nr_loops: 100 bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op nr_loops: 500 bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op nr_loops: 1000 bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op nr_loops: 5000 bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op nr_loops: 10000 bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op nr_loops: 50000 bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op nr_loops: 100000 bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op nr_loops: 500000 bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op nr_loops: 1000000 bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op >From this data, we can see that the latency per loop decreases as the number of loops increases. On this particular machine, each loop had an overhead of about ~4 ns, and we were able to run ~250 million loops per second. Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-5-joannekoong@fb.com
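[Editor's illustration: a minimal, hypothetical sketch of the bpf_loop() pattern being measured, i.e. a per-iteration callback plus the helper call. Names and the hook point are illustrative only.]

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	/* Callback invoked once per iteration; returning 0 continues the loop,
	 * returning 1 stops it early. */
	static long add_one(u64 index, void *ctx)
	{
		long *acc = ctx;

		*acc += 1;
		return 0;
	}

	SEC("tp/syscalls/sys_enter_getpgid")
	int run_loop(void *ctx)
	{
		long sum = 0;

		/* Run the callback 1000 times; bpf_loop() returns the number of
		 * iterations performed or a negative error. */
		bpf_loop(1000, add_one, &sum, 0);
		return sum > 0;
	}

	char _license[] SEC("license") = "GPL";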
2021-11-16selftests/bpf: Add uprobe triggering overhead benchmarksAndrii Nakryiko
Add benchmarks to measure the overhead of uprobes and uretprobes. Also have a baseline (no uprobe attached) benchmark.

On my dev machine, the baseline benchmark can trigger 130M user_target() invocations. When a uprobe is attached, this falls to just 700K. With a uretprobe, we get down to 520K:

$ sudo ./bench trig-uprobe-base -a
Summary: hits  131.289 ± 2.872M/s

# UPROBE
$ sudo ./bench -a trig-uprobe-without-nop
Summary: hits    0.729 ± 0.007M/s

$ sudo ./bench -a trig-uprobe-with-nop
Summary: hits    1.798 ± 0.017M/s

# URETPROBE
$ sudo ./bench -a trig-uretprobe-without-nop
Summary: hits    0.508 ± 0.012M/s

$ sudo ./bench -a trig-uretprobe-with-nop
Summary: hits    0.883 ± 0.008M/s

So there is almost a 2.5x performance difference between probing a nop vs a non-nop instruction for an entry uprobe, and a 1.7x difference for a uretprobe.

This means that non-nop uprobe overhead is around 1.4 microseconds, and non-nop uretprobe overhead is around 2 microseconds. For the nop variants, uprobe and uretprobe overhead is down to 0.556 and 1.13 microseconds, respectively.

For comparison, just doing a very low-overhead syscall (with no BPF programs attached anywhere) gives:

$ sudo ./bench trig-base -a
Summary: hits    4.830 ± 0.036M/s

So uprobes are about 2.67x slower than a pure context switch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211116013041.4072571-1-andrii@kernel.org
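[Editor's illustration: a hypothetical user-space sketch of the "with-nop" vs "without-nop" distinction above; the "with-nop" targets begin with an explicit nop instruction (the USDT-style case), while the "without-nop" targets start with an ordinary instruction. Function names and loop count are illustrative only.]

	#include <stdio.h>

	/* "with-nop" style target: the first instruction is a nop. */
	static void target_with_nop(void)
	{
		asm volatile ("nop");
	}

	/* "without-nop" style target: the probe lands on an ordinary instruction;
	 * the empty asm block only prevents inlining. */
	static void target_without_nop(void)
	{
		asm volatile ("");
	}

	int main(void)
	{
		/* Trigger loop; the benchmark counts how many calls per second
		 * survive once a uprobe or uretprobe is attached to a target. */
		for (long i = 0; i < 1000000; i++) {
			target_with_nop();
			target_without_nop();
		}
		printf("done\n");
		return 0;
	}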