<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/uapi/linux/bpf.h, branch v5.1.20</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.1.20</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.1.20'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-07-26T07:12:47Z</updated>
<entry>
<title>bpf: fix uapi bpf_prog_info fields alignment</title>
<updated>2019-07-26T07:12:47Z</updated>
<author>
<name>Baruch Siach</name>
<email>baruch@tkos.co.il</email>
</author>
<published>2019-06-28T04:08:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=91adaf0eb7bd981ca7c6d7b37823d69bd87bf3af'/>
<id>urn:sha1:91adaf0eb7bd981ca7c6d7b37823d69bd87bf3af</id>
<content type='text'>
[ Upstream commit 0472301a28f6cf53a6bc5783e48a2d0bbff4682f ]

Merge commit 1c8c5a9d38f60 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next") undid the
fix from commit 36f9814a494 ("bpf: fix uapi hole for 32 bit compat
applications") by taking the gpl_compatible 1-bit field definition from
commit b85fab0e67b162 ("bpf: Add gpl_compatible flag to struct
bpf_prog_info") as is. That breaks architectures with 16-bit alignment
like m68k. Add 31-bit pad after gpl_compatible to restore alignment of
following fields.

Thanks to Dmitry V. Levin his analysis of this bug history.

Signed-off-by: Baruch Siach &lt;baruch@tkos.co.il&gt;
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Cc: Jiri Olsa &lt;jolsa@kernel.org&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: fix unconnected udp hooks</title>
<updated>2019-07-03T11:13:43Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2019-06-06T23:48:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=591c18e3aed16fde52cfdfc4af094b2cfd5dd0f2'/>
<id>urn:sha1:591c18e3aed16fde52cfdfc4af094b2cfd5dd0f2</id>
<content type='text'>
commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.

Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e3779 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 =&gt; 8.8.8.8:53
  2    147.75.207.208:53   1 =&gt; 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; &lt;&lt;&gt;&gt; DiG 9.11.3-1ubuntu1.7-Ubuntu &lt;&lt;&gt;&gt; 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 =&gt; 8.8.8.8:53
  2    147.75.207.208:53   1 =&gt; 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; &lt;&lt;&gt;&gt; DiG 9.11.3-1ubuntu1.7-Ubuntu &lt;&lt;&gt;&gt; 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 &gt; google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 &gt; google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain &gt; foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain &gt; foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg-&gt;msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg-&gt;msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg-&gt;msg_name was
passed independent of sk-&gt;sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg-&gt;msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Andrey Ignatov &lt;rdna@fb.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Acked-by: Martynas Pumputis &lt;m@lambda.lt&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bpf: simplify definition of BPF_FIB_LOOKUP related flags</title>
<updated>2019-07-03T11:13:42Z</updated>
<author>
<name>Martynas Pumputis</name>
<email>m@lambda.lt</email>
</author>
<published>2019-06-12T16:05:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7108c83502e822e52052eea7d13f9d3d8bc8dea0'/>
<id>urn:sha1:7108c83502e822e52052eea7d13f9d3d8bc8dea0</id>
<content type='text'>
commit b1d6c15b9d824a58c5415673f374fac19e8eccdf upstream.

Previously, the BPF_FIB_LOOKUP_{DIRECT,OUTPUT} flags in the BPF UAPI
were defined with the help of BIT macro. This had the following issues:

- In order to use any of the flags, a user was required to depend
  on &lt;linux/bits.h&gt;.
- No other flag in bpf.h uses the macro, so it seems that an unwritten
  convention is to use (1 &lt;&lt; (nr)) to define BPF-related flags.

Fixes: 87f5fc7e48dd ("bpf: Provide helper to do forwarding lookups in kernel FIB table")
Signed-off-by: Martynas Pumputis &lt;m@lambda.lt&gt;
Acked-by: Andrii Nakryiko &lt;andriin@fb.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>bpf: add documentation for helpers bpf_spin_lock(), bpf_spin_unlock()</title>
<updated>2019-03-14T21:03:21Z</updated>
<author>
<name>Quentin Monnet</name>
<email>quentin.monnet@netronome.com</email>
</author>
<published>2019-03-14T12:38:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0eb0978528d47699edd091dc2c337952ad8da436'/>
<id>urn:sha1:0eb0978528d47699edd091dc2c337952ad8da436</id>
<content type='text'>
Add documentation for the BPF spinlock-related helpers to the doc in
bpf.h. I added the constraints and restrictions coming with the use of
spinlocks for BPF: not all of it is directly related to the use of the
helper, but I thought it would be nice for users to find them in the man
page.

This list of restrictions is nearly a verbatim copy of the list in
Alexei's commit log for those helpers.

Signed-off-by: Quentin Monnet &lt;quentin.monnet@netronome.com&gt;
Reviewed-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: fix documentation for eBPF helpers</title>
<updated>2019-03-14T21:03:21Z</updated>
<author>
<name>Quentin Monnet</name>
<email>quentin.monnet@netronome.com</email>
</author>
<published>2019-03-14T12:38:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=62369db2df8d1edfa040878203b446e023a16802'/>
<id>urn:sha1:62369db2df8d1edfa040878203b446e023a16802</id>
<content type='text'>
Another round of minor fixes for the documentation of the BPF helpers
located in the UAPI bpf.h header file. Changes include:

- Moving around description of some helpers, to keep the descriptions in
  the same order as helpers are declared (bpf_map_push_elem(), leftover
  from commit 90b1023f68c7 ("bpf: fix documentation for eBPF helpers"),
  bpf_rc_keydown(), and bpf_skb_ancestor_cgroup_id()).
- Fixing typos ("contex" -&gt; "context").
- Harmonising return types ("void* " -&gt; "void *", "uint64_t" -&gt; "u64").
- Addition of the "bpf_" prefix to bpf_get_storage().
- Light additions of RST markup on some keywords.
- Empty line deletion between description and return value for
  bpf_tcp_sock().
- Edit for the description for bpf_skb_ecn_set_ce() (capital letters,
  acronym expansion, no effect if ECT not set, more details on return
  value).

Signed-off-by: Quentin Monnet &lt;quentin.monnet@netronome.com&gt;
Reviewed-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Add bpf_get_listener_sock(struct bpf_sock *sk) helper</title>
<updated>2019-03-13T19:04:35Z</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2019-03-12T17:23:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=dbafd7ddd62369b2f3926ab847cbf8fc40e800b7'/>
<id>urn:sha1:dbafd7ddd62369b2f3926ab847cbf8fc40e800b7</id>
<content type='text'>
Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)"
which returns a bpf_sock in TCP_LISTEN state.  It will trace back to
the listener sk from a request_sock if possible.  It returns NULL
for all other cases.

No reference is taken because the helper ensures the sk is
in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in).
Hence, bpf_sk_release() is unnecessary and the verifier does not
allow bpf_sk_release(listen_sk) to be called either.

The following is also allowed because the bpf_prog is run under
rcu_read_lock():

	sk = bpf_sk_lookup_tcp();
	/* if (!sk) { ... } */
	listen_sk = bpf_get_listener_sock(sk);
	/* if (!listen_sk) { ... } */
	bpf_sk_release(sk);
	src_port = listen_sk-&gt;src_port; /* Allowed */

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: add bpf helper bpf_skb_ecn_set_ce</title>
<updated>2019-03-02T18:48:27Z</updated>
<author>
<name>brakmo</name>
<email>brakmo@fb.com</email>
</author>
<published>2019-03-01T20:38:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f7c917ba11a67632a8452ea99fe132f626a7a2cc'/>
<id>urn:sha1:f7c917ba11a67632a8452ea99fe132f626a7a2cc</id>
<content type='text'>
This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
"int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
be attached to the ingress and egress path. The helper is needed
because his type of bpf_prog cannot modify the skb directly.

This helper is used to set the ECN field of ECN capable IP packets to ce
(congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
used by a bpf_prog to manage egress or ingress network bandwdith limit
per cgroupv2 by inducing an ECN response in the TCP sender.
This works best when using DCTCP.

Signed-off-by: Lawrence Brakmo &lt;brakmo@fb.com&gt;
Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Acked-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: expose program stats via bpf_prog_info</title>
<updated>2019-02-27T16:22:50Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@kernel.org</email>
</author>
<published>2019-02-25T22:28:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5f8f8b93aeb8371c54af08bece2bd04bc2d48707'/>
<id>urn:sha1:5f8f8b93aeb8371c54af08bece2bd04bc2d48707</id>
<content type='text'>
Return bpf program run_time_ns and run_cnt via bpf_prog_info

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Andrii Nakryiko &lt;andriin@fb.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap</title>
<updated>2019-02-14T02:27:55Z</updated>
<author>
<name>Peter Oskolkov</name>
<email>posk@google.com</email>
</author>
<published>2019-02-13T19:53:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3e0bd37ce0e4a574df6d87a901e13bcb46e10301'/>
<id>urn:sha1:3e0bd37ce0e4a574df6d87a901e13bcb46e10301</id>
<content type='text'>
This patch adds all needed plumbing in preparation to allowing
bpf programs to do IP encapping via bpf_lwt_push_encap. Actual
implementation is added in the next patch in the patchset.

Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
  prog types in addition to BPF_PROG_TYPE_LWT_IN;
- if the skb being encapped has GSO set, encapsulation is limited
  to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
- as route lookups are different for ingress vs egress, the single
  external bpf_lwt_push_encap BPF helper is routed internally to
  either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
  depending on prog type.

v8 changes: fixed a typo.

Signed-off-by: Peter Oskolkov &lt;posk@google.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock</title>
<updated>2019-02-11T03:46:17Z</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2019-02-10T07:22:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=655a51e536c09d15ffa3603b1b6fce2b45b85a1f'/>
<id>urn:sha1:655a51e536c09d15ffa3603b1b6fce2b45b85a1f</id>
<content type='text'>
This patch adds a helper function BPF_FUNC_tcp_sock and it
is currently available for cg_skb and sched_(cls|act):

struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;
	__u32 snd_cwnd;

	sk = skb-&gt;sk;
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);
	if (!tp)
		return 1;

	snd_cwnd = tp-&gt;snd_cwnd;
	/* ... */

	return 1;
}

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock's fields
that has already been exposed by the bpf_sock_ops.
i.e. no new tcp_sock's fields are exposed in bpf.h.

This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
returns NULL.  Hence, the caller needs to check for NULL before
accessing it.

The current use case is to expose members from tcp_sock
to allow a cg_skb_bpf_prog to provide per cgroup traffic
policing/shaping.

Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
</feed>
