<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/bpf.h, branch v4.9.155</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.155</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.155'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2018-01-31T11:55:57Z</updated>
<entry>
<title>bpf: avoid false sharing of map refcount with max_entries</title>
<updated>2018-01-31T11:55:57Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2018-01-29T01:48:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5cb917aa1f1e03df9a4c29b363e3900d73508fa8'/>
<id>urn:sha1:5cb917aa1f1e03df9a4c29b363e3900d73508fa8</id>
<content type='text'>
[ upstream commit be95a845cc4402272994ce290e3ad928aff06cb9 ]

In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds
speculation") also change the layout of struct bpf_map such that
false sharing of fast-path members like max_entries is avoided
when the maps reference counter is altered. Therefore enforce
them to be placed into separate cachelines.

pahole dump after change:

  struct bpf_map {
        const struct bpf_map_ops  * ops;                 /*     0     8 */
        struct bpf_map *           inner_map_meta;       /*     8     8 */
        void *                     security;             /*    16     8 */
        enum bpf_map_type          map_type;             /*    24     4 */
        u32                        key_size;             /*    28     4 */
        u32                        value_size;           /*    32     4 */
        u32                        max_entries;          /*    36     4 */
        u32                        map_flags;            /*    40     4 */
        u32                        pages;                /*    44     4 */
        u32                        id;                   /*    48     4 */
        int                        numa_node;            /*    52     4 */
        bool                       unpriv_array;         /*    56     1 */

        /* XXX 7 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        struct user_struct *       user;                 /*    64     8 */
        atomic_t                   refcnt;               /*    72     4 */
        atomic_t                   usercnt;              /*    76     4 */
        struct work_struct         work;                 /*    80    32 */
        char                       name[16];             /*   112    16 */
        /* --- cacheline 2 boundary (128 bytes) --- */

        /* size: 128, cachelines: 2, members: 17 */
        /* sum members: 121, holes: 1, sum holes: 7 */
  };

Now all entries in the first cacheline are read only throughout
the life time of the map, set up once during map creation. Overall
struct size and number of cachelines doesn't change from the
reordering. struct bpf_map is usually first member and embedded
in map structs in specific map implementations, so also avoid those
members to sit at the end where it could potentially share the
cacheline with first map values e.g. in the array since remote
CPUs could trigger map updates just as well for those (easily
dirtying members like max_entries intentionally as well) while
having subsequent values in cache.

Quoting from Google's Project Zero blog [1]:

  Additionally, at least on the Intel machine on which this was
  tested, bouncing modified cache lines between cores is slow,
  apparently because the MESI protocol is used for cache coherence
  [8]. Changing the reference counter of an eBPF array on one
  physical CPU core causes the cache line containing the reference
  counter to be bounced over to that CPU core, making reads of the
  reference counter on all other CPU cores slow until the changed
  reference counter has been written back to memory. Because the
  length and the reference counter of an eBPF array are stored in
  the same cache line, this also means that changing the reference
  counter on one physical CPU core causes reads of the eBPF array's
  length to be slow on other physical CPU cores (intentional false
  sharing).

While this doesn't 'control' the out-of-bounds speculation through
masking the index as in commit b2157399cc98, triggering a manipulation
of the map's reference counter is really trivial, so lets not allow
to easily affect max_entries from it.

Splitting to separate cachelines also generally makes sense from
a performance perspective anyway in that fast-path won't have a
cache miss if the map gets pinned, reused in other progs, etc out
of control path, thus also avoids unintentional false sharing.

  [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bpf: prevent out-of-bounds speculation</title>
<updated>2018-01-17T08:38:55Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@kernel.org</email>
</author>
<published>2018-01-08T01:33:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a9bfac14cde2b481eeb0e64fbe15305df66ab32e'/>
<id>urn:sha1:a9bfac14cde2b481eeb0e64fbe15305df66ab32e</id>
<content type='text'>
commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index &amp; index_mask' operation.

If prog_array map is created by unpriv user replace
  bpf_tail_call(ctx, map, index);
with
  if (index &gt;= max_entries) {
    index &amp;= map-&gt;index_mask;
    bpf_tail_call(ctx, map, index);
  }
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index &gt;= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.

v2-&gt;v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Jiri Slaby &lt;jslaby@suse.cz&gt;
[ Backported to 4.9 - gregkh ]
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>bpf: don't trigger OOM killer under pressure with map alloc</title>
<updated>2017-07-05T12:40:21Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2017-01-18T14:14:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=251d00bf1309c65316f5bd3850b2ca523b46921c'/>
<id>urn:sha1:251d00bf1309c65316f5bd3850b2ca523b46921c</id>
<content type='text'>
[ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ]

This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.

Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bpf: allow access into map value arrays</title>
<updated>2016-09-29T05:35:35Z</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2016-09-28T14:54:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=484611357c19f9e19ef742ebef4505a07d243cc9'/>
<id>urn:sha1:484611357c19f9e19ef742ebef4505a07d243cc9</id>
<content type='text'>
Suppose you have a map array value that is something like this

struct foo {
	unsigned iter;
	int array[SOME_CONSTANT];
};

You can easily insert this into an array, but you cannot modify the contents of
foo-&gt;array[] after the fact.  This is because we have no way to verify we won't
go off the end of the array at verification time.  This patch provides a start
for this work.  We accomplish this by keeping track of a minimum and maximum
value a register could be while we're checking the code.  Then at the time we
try to do an access into a MAP_VALUE we verify that the maximum offset into that
region is a valid access into that memory region.  So in practice, code such as
this

unsigned index = 0;

if (foo-&gt;iter &gt;= SOME_CONSTANT)
	foo-&gt;iter = index;
else
	index = foo-&gt;iter++;
foo-&gt;array[index] = bar;

would be allowed, as we can verify that index will always be between 0 and
SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an extra
check to make sure the index isn't less than 0, or do something like index %=
SOME_CONSTANT.

Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: direct packet write and access for helpers for clsact progs</title>
<updated>2016-09-21T03:32:11Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2016-09-19T22:26:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=36bbef52c7eb646ed6247055a2acd3851e317857'/>
<id>urn:sha1:36bbef52c7eb646ed6247055a2acd3851e317857</id>
<content type='text'>
This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.

For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.

At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a22089 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb-&gt;data), so more complex programs using rewrites
can benefit from switching to direct read plus write.

For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. bpf_csum_update() helper accommodates
for the fact that we need to fixup checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs</title>
<updated>2016-09-02T17:46:44Z</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-09-02T01:37:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aa6a5f3cb2b2edc5b9aab0b4fdfdfa9c3b5096a8'/>
<id>urn:sha1:aa6a5f3cb2b2edc5b9aab0b4fdfdfa9c3b5096a8</id>
<content type='text'>
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When program is attached the overflow_handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and sampling event won't be stored in the ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inhereted
when target task is forked.
Similar to kprobe and tracepoint programs there is no ioctl to
detach the program or swap already attached program. The user space
expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event-&gt;prog, since it's
assigned only once before it's accessed.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf, events: fix offset in skb copy handler</title>
<updated>2016-07-25T17:34:11Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2016-07-21T23:19:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aa7145c16d6bf086538ad7eb20c807513bfa5efc'/>
<id>urn:sha1:aa7145c16d6bf086538ad7eb20c807513bfa5efc</id>
<content type='text'>
This patch fixes the __output_custom() routine we currently use with
bpf_skb_copy(). I missed that when len is larger than the size of the
current handle, we can issue multiple invocations of copy_func, and
__output_custom() advances destination but also source buffer by the
written amount of bytes. When we have __output_custom(), this is actually
wrong since in that case the source buffer points to a non-linear object,
in our case an skb, which the copy_func helper is supposed to walk.
Therefore, since this is non-linear we thus need to pass the offset into
the helper, so that copy_func can use it for extracting the data from
the source object.

Therefore, adjust the callback signatures properly and pass offset
into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
__DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
i) to pass in whether we should advance source buffer or not; this is
a compile-time constant condition, ii) to pass in the offset for
__output_custom(), which we do with help of __VA_ARGS__, so everything
can stay inlined as is currently. Both changes allow for adapting the
__output_* fast-path helpers w/o extra overhead.

Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: fix implicit declaration of bpf_prog_add</title>
<updated>2016-07-20T18:59:21Z</updated>
<author>
<name>Brenden Blanco</name>
<email>bblanco@plumgrid.com</email>
</author>
<published>2016-07-20T14:55:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=cc2e0b3fbcdd9667d7b7ecdf36d7b4d3647681d6'/>
<id>urn:sha1:cc2e0b3fbcdd9667d7b7ecdf36d7b4d3647681d6</id>
<content type='text'>
For the ifndef case of CONFIG_BPF_SYSCALL, an inline version of
bpf_prog_add needs to exist otherwise the build breaks on some configs.

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2544:10: error: implicit declaration of function 'bpf_prog_add'
       prog = bpf_prog_add(prog, priv-&gt;rx_ring_num - 1);

The function is introduced in
59d3656d5bf50 ("bpf: add bpf_prog_add api for bulk prog refcnt")
and first used in
47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program").

Fixes: 47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program")
Reported-by: kbuild test robot &lt;fengguang.wu@intel.com&gt;
Reported-by: Tariq Toukan &lt;ttoukan.linux@gmail.com&gt;
Signed-off-by: Brenden Blanco &lt;bblanco@plumgrid.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: add bpf_prog_add api for bulk prog refcnt</title>
<updated>2016-07-20T04:46:31Z</updated>
<author>
<name>Brenden Blanco</name>
<email>bblanco@plumgrid.com</email>
</author>
<published>2016-07-19T19:16:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=59d3656d5bf504f771fc44fdbc7a9a8590795f22'/>
<id>urn:sha1:59d3656d5bf504f771fc44fdbc7a9a8590795f22</id>
<content type='text'>
A subsystem may need to store many copies of a bpf program, each
deserving its own reference. Rather than requiring the caller to loop
one by one (with possible mid-loop failure), add a bulk bpf_prog_add
api.

Signed-off-by: Brenden Blanco &lt;bblanco@plumgrid.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: avoid stack copy and use skb ctx for event output</title>
<updated>2016-07-15T21:23:56Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2016-07-14T16:08:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=555c8a8623a3a87b3c990ba30b7fd2e5914e41d2'/>
<id>urn:sha1:555c8a8623a3a87b3c990ba30b7fd2e5914e41d2</id>
<content type='text'>
This work addresses a couple of issues bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb-&gt;len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.

We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
