<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/seccomp.h, branch v5.10.87</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.10.87</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.10.87'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-08-05T04:00:11Z</updated>
<entry>
<title>Merge tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2020-08-05T04:00:11Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-08-05T04:00:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3f0d6ecdf1ab35ac54cabb759f748fb0bffd26a5'/>
<id>urn:sha1:3f0d6ecdf1ab35ac54cabb759f748fb0bffd26a5</id>
<content type='text'>
Pull generic kernel entry/exit code from Thomas Gleixner:
 "Generic implementation of common syscall, interrupt and exception
  entry/exit functionality based on the recent X86 effort to ensure
  correctness of entry/exit vs RCU and instrumentation.

  As this functionality and the required entry/exit sequences are not
  architecture specific, sharing them allows other architectures to
  benefit instead of copying the same code over and over again.

  This branch was kept standalone to allow others to work on it. The
  conversion of x86 comes in a seperate pull request which obviously is
  based on this branch"

* tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Correct __secure_computing() stub
  entry: Correct 'noinstr' attributes
  entry: Provide infrastructure for work before transitioning to guest mode
  entry: Provide generic interrupt entry/exit code
  entry: Provide generic syscall exit function
  entry: Provide generic syscall entry functionality
  seccomp: Provide stub for __secure_computing()
</content>
</entry>
<entry>
<title>entry: Correct __secure_computing() stub</title>
<updated>2020-07-26T16:22:27Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2020-07-26T16:14:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3135f5b73592988af0eb1b11ccbb72a8667be201'/>
<id>urn:sha1:3135f5b73592988af0eb1b11ccbb72a8667be201</id>
<content type='text'>
The original version of that used secure_computing() which has no
arguments. Review requested to switch to __secure_computing() which has
one. The function name was correct, but no argument added and of course
compiling without SECCOMP was deemed overrated.

Add the missing function argument.

Fixes: 6823ecabf030 ("seccomp: Provide stub for __secure_computing()")
Reported-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>seccomp: Provide stub for __secure_computing()</title>
<updated>2020-07-24T12:59:03Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2020-07-22T21:59:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6823ecabf03031d610a6c5afe7ed4b4fd659a99f'/>
<id>urn:sha1:6823ecabf03031d610a6c5afe7ed4b4fd659a99f</id>
<content type='text'>
To avoid #ifdeffery in the upcoming generic syscall entry work code provide
a stub for __secure_computing() as this is preferred over
secure_computing() because the TIF flag is already evaluated.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lkml.kernel.org/r/20200722220519.404974280@linutronix.de


</content>
</entry>
<entry>
<title>seccomp: Introduce addfd ioctl to seccomp user notifier</title>
<updated>2020-07-14T23:29:42Z</updated>
<author>
<name>Sargun Dhillon</name>
<email>sargun@sargun.me</email>
</author>
<published>2020-06-03T01:10:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7cf97b12545503992020796c74bd84078eb39299'/>
<id>urn:sha1:7cf97b12545503992020796c74bd84078eb39299</id>
<content type='text'>
The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over
an fd. It is often used in settings where a supervising task emulates
syscalls on behalf of a supervised task in userspace, either to further
restrict the supervisee's syscall abilities or to circumvent kernel
enforced restrictions the supervisor deems safe to lift (e.g. actually
performing a mount(2) for an unprivileged container).

While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall,
only a certain subset of syscalls could be correctly emulated. Over the
last few development cycles, the set of syscalls which can't be emulated
has been reduced due to the addition of pidfd_getfd(2). With this we are
now able to, for example, intercept syscalls that require the supervisor
to operate on file descriptors of the supervisee such as connect(2).

However, syscalls that cause new file descriptors to be installed can not
currently be correctly emulated since there is no way for the supervisor
to inject file descriptors into the supervisee. This patch adds a
new addfd ioctl to remove this restriction by allowing the supervisor to
install file descriptors into the intercepted task. By implementing this
feature via seccomp the supervisor effectively instructs the supervisee
to install a set of file descriptors into its own file descriptor table
during the intercepted syscall. This way it is possible to intercept
syscalls such as open() or accept(), and install (or replace, like
dup2(2)) the supervisor's resulting fd into the supervisee. One
replacement use-case would be to redirect the stdout and stderr of a
supervisee into log file descriptors opened by the supervisor.

The ioctl handling is based on the discussions[1] of how Extensible
Arguments should interact with ioctls. Instead of building size into
the addfd structure, make it a function of the ioctl command (which
is how sizes are normally passed to ioctls). To support forward and
backward compatibility, just mask out the direction and size, and match
everything. The size (and any future direction) checks are done along
with copy_struct_from_user() logic.

As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
alignment without requiring packing as there have been packing issues
with uapi highlighted before[2][3]. Although we could overload the
newfd field and use -1 to indicate that it is not to be used, doing
so requires changing the size of the fd field, and introduces struct
packing complexity.

[1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/
[2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/
[3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal

Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Tycho Andersen &lt;tycho@tycho.ws&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Robert Sesek &lt;rsesek@google.com&gt;
Cc: Chris Palmer &lt;palmer@google.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Suggested-by: Matt Denton &lt;mpdenton@google.com&gt;
Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me
Signed-off-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Reviewed-by: Will Drewry &lt;wad@chromium.org&gt;
Co-developed-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: release filter after task is fully dead</title>
<updated>2020-07-10T23:01:51Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2020-05-31T11:50:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3a15fb6ed92cb32b0a83f406aa4a96f28c9adbc3'/>
<id>urn:sha1:3a15fb6ed92cb32b0a83f406aa4a96f28c9adbc3</id>
<content type='text'>
The seccomp filter used to be released in free_task() which is called
asynchronously via call_rcu() and assorted mechanisms. Since we need
to inform tasks waiting on the seccomp notifier when a filter goes empty
we will notify them as soon as a task has been marked fully dead in
release_task(). To not split seccomp cleanup into two parts, move
filter release out of free_task() and into release_task() after we've
unhashed struct task from struct pid, exited signals, and unlinked it
from the threadgroups' thread list. We'll put the empty filter
notification infrastructure into it in a follow up patch.

This also renames put_seccomp_filter() to seccomp_filter_release() which
is a more descriptive name of what we're doing here especially once
we've added the empty filter notification mechanism in there.

We're also NULL-ing the task's filter tree entrypoint which seems
cleaner than leaving a dangling pointer in there. Note that this shouldn't
need any memory barriers since we're calling this when the task is in
release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
filters anymore. You can also see this from the point where we're calling
seccomp_filter_release(). It's after __exit_signal() and at this point,
tsk-&gt;sighand will already have been NULLed which is required for
thread-sync and filter installation alike.

Cc: Tycho Andersen &lt;tycho@tycho.ws&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Matt Denton &lt;mpdenton@google.com&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Chris Palmer &lt;palmer@google.com&gt;
Cc: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Cc: Robert Sesek &lt;rsesek@google.com&gt;
Cc: Jeffrey Vander Stoep &lt;jeffv@google.com&gt;
Cc: Linux Containers &lt;containers@lists.linux-foundation.org&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: Report number of loaded filters in /proc/$pid/status</title>
<updated>2020-07-10T23:01:51Z</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2020-05-13T21:11:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c818c03b661cd769e035e41673d5543ba2ebda64'/>
<id>urn:sha1:c818c03b661cd769e035e41673d5543ba2ebda64</id>
<content type='text'>
A common question asked when debugging seccomp filters is "how many
filters are attached to your process?" Provide a way to easily answer
this question through /proc/$pid/status with a "Seccomp_filters" line.

Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: allow TSYNC and USER_NOTIF together</title>
<updated>2020-03-04T22:48:54Z</updated>
<author>
<name>Tycho Andersen</name>
<email>tycho@tycho.ws</email>
</author>
<published>2020-03-04T18:05:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=51891498f2da78ee64dfad88fa53c9e85fb50abf'/>
<id>urn:sha1:51891498f2da78ee64dfad88fa53c9e85fb50abf</id>
<content type='text'>
The restriction introduced in 7a0df7fbc145 ("seccomp: Make NEW_LISTENER and
TSYNC flags exclusive") is mostly artificial: there is enough information
in a seccomp user notification to tell which thread triggered a
notification. The reason it was introduced is because TSYNC makes the
syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and
there's no way to distinguish between these two cases (well, I suppose the
caller could check all fds it has, then do the syscall, and if the return
value was an fd that already existed, then it must be a thread id, but
bleh).

Matthew would like to use these two flags together in the Chrome sandbox
which wants to use TSYNC for video drivers and NEW_LISTENER to proxy
syscalls.

So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which
tells the kernel to just return -ESRCH on a TSYNC error. This way,
NEW_LISTENER (and any subsequent seccomp() commands that want to return
positive values) don't conflict with each other.

Suggested-by: Matthew Denton &lt;mpdenton@google.com&gt;
Signed-off-by: Tycho Andersen &lt;tycho@tycho.ws&gt;
Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: simplify secure_computing()</title>
<updated>2019-10-10T21:55:24Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2019-09-24T06:44:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fefad9ef58ffc228f7b78b667c2aea8267503350'/>
<id>urn:sha1:fefad9ef58ffc228f7b78b667c2aea8267503350</id>
<content type='text'>
Afaict, the struct seccomp_data argument to secure_computing() is unused
by all current callers. So let's remove it.
The argument was added in [1]. It was added because having the arch
supply the syscall arguments used to be faster than having it done by
secure_computing() (cf. Andy's comment in [2]). This is not true anymore
though.

/* References */
[1]: 2f275de5d1ed ("seccomp: Add a seccomp_data parameter secure_computing()")
[2]: https://lore.kernel.org/r/CALCETrU_fs_At-hTpr231kpaAd0z7xJN4ku-DvzhRU6cvcJA_w@mail.gmail.com

Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Will Drewry &lt;wad@chromium.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: x86@kernel.org
Acked-by: Borislav Petkov &lt;bp@suse.de&gt;
Acked-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Link: https://lore.kernel.org/r/20190924064420.6353-1-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: add a return code to trap to userspace</title>
<updated>2018-12-12T00:28:41Z</updated>
<author>
<name>Tycho Andersen</name>
<email>tycho@tycho.ws</email>
</author>
<published>2018-12-09T18:24:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6a21cc50f0c7f87dae5259f6cfefe024412313f6'/>
<id>urn:sha1:6a21cc50f0c7f87dae5259f6cfefe024412313f6</id>
<content type='text'>
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mount() in general since various
filesystems assume a trusted image. However, if an orchestrator knows that
e.g. a particular block device has not been exposed to a container for
writing, it want to allow the container to mount that block device (that
is, handle the mount for it).

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to monitor services while starting.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

Signed-off-by: Tycho Andersen &lt;tycho@tycho.ws&gt;
CC: Kees Cook &lt;keescook@chromium.org&gt;
CC: Andy Lutomirski &lt;luto@amacapital.net&gt;
CC: Oleg Nesterov &lt;oleg@redhat.com&gt;
CC: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
CC: "Serge E. Hallyn" &lt;serge@hallyn.com&gt;
Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
CC: Christian Brauner &lt;christian@brauner.io&gt;
CC: Tyler Hicks &lt;tyhicks@canonical.com&gt;
CC: Akihiro Suda &lt;suda.akihiro@lab.ntt.co.jp&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>seccomp: switch system call argument type to void *</title>
<updated>2018-12-12T00:28:41Z</updated>
<author>
<name>Tycho Andersen</name>
<email>tycho@tycho.ws</email>
</author>
<published>2018-12-09T18:24:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a5662e4d81c4d4b08140c625d0f3c50b15786252'/>
<id>urn:sha1:a5662e4d81c4d4b08140c625d0f3c50b15786252</id>
<content type='text'>
The const qualifier causes problems for any code that wants to write to the
third argument of the seccomp syscall, as we will do in a future patch in
this series.

The third argument to the seccomp syscall is documented as void *, so
rather than just dropping the const, let's switch everything to use void *
as well.

I believe this is safe because of 1. the documentation above, 2. there's no
real type information exported about syscalls anywhere besides the man
pages.

Signed-off-by: Tycho Andersen &lt;tycho@tycho.ws&gt;
CC: Kees Cook &lt;keescook@chromium.org&gt;
CC: Andy Lutomirski &lt;luto@amacapital.net&gt;
CC: Oleg Nesterov &lt;oleg@redhat.com&gt;
CC: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
CC: "Serge E. Hallyn" &lt;serge@hallyn.com&gt;
Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
CC: Christian Brauner &lt;christian@brauner.io&gt;
CC: Tyler Hicks &lt;tyhicks@canonical.com&gt;
CC: Akihiro Suda &lt;suda.akihiro@lab.ntt.co.jp&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
</feed>
