user/sven/linux.git/include/linux/seccomp.h, branch v5.10.87

Merge tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2020-08-05T04:00:11Z

Pull generic kernel entry/exit code from Thomas Gleixner: "Generic implementation of common syscall, interrupt and exception entry/exit functionality based on the recent X86 effort to ensure correctness of entry/exit vs RCU and instrumentation. As this functionality and the required entry/exit sequences are not architecture specific, sharing them allows other architectures to benefit instead of copying the same code over and over again. This branch was kept standalone to allow others to work on it. The conversion of x86 comes in a seperate pull request which obviously is based on this branch" * tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Correct __secure_computing() stub entry: Correct 'noinstr' attributes entry: Provide infrastructure for work before transitioning to guest mode entry: Provide generic interrupt entry/exit code entry: Provide generic syscall exit function entry: Provide generic syscall entry functionality seccomp: Provide stub for __secure_computing()

entry: Correct __secure_computing() stub

2020-07-26T16:22:27Z

The original version of that used secure_computing() which has no arguments. Review requested to switch to __secure_computing() which has one. The function name was correct, but no argument added and of course compiling without SECCOMP was deemed overrated. Add the missing function argument. Fixes: 6823ecabf030 ("seccomp: Provide stub for __secure_computing()") Reported-by: Ingo Molnar Signed-off-by: Thomas Gleixner

seccomp: Provide stub for __secure_computing()

2020-07-24T12:59:03Z

To avoid #ifdeffery in the upcoming generic syscall entry work code provide a stub for __secure_computing() as this is preferred over secure_computing() because the TIF flag is already evaluated. Signed-off-by: Thomas Gleixner Acked-by: Kees Cook Link: https://lkml.kernel.org/r/20200722220519.404974280@linutronix.de

seccomp: Introduce addfd ioctl to seccomp user notifier

2020-07-14T23:29:42Z

The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over an fd. It is often used in settings where a supervising task emulates syscalls on behalf of a supervised task in userspace, either to further restrict the supervisee's syscall abilities or to circumvent kernel enforced restrictions the supervisor deems safe to lift (e.g. actually performing a mount(2) for an unprivileged container). While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall, only a certain subset of syscalls could be correctly emulated. Over the last few development cycles, the set of syscalls which can't be emulated has been reduced due to the addition of pidfd_getfd(2). With this we are now able to, for example, intercept syscalls that require the supervisor to operate on file descriptors of the supervisee such as connect(2). However, syscalls that cause new file descriptors to be installed can not currently be correctly emulated since there is no way for the supervisor to inject file descriptors into the supervisee. This patch adds a new addfd ioctl to remove this restriction by allowing the supervisor to install file descriptors into the intercepted task. By implementing this feature via seccomp the supervisor effectively instructs the supervisee to install a set of file descriptors into its own file descriptor table during the intercepted syscall. This way it is possible to intercept syscalls such as open() or accept(), and install (or replace, like dup2(2)) the supervisor's resulting fd into the supervisee. One replacement use-case would be to redirect the stdout and stderr of a supervisee into log file descriptors opened by the supervisor. The ioctl handling is based on the discussions[1] of how Extensible Arguments should interact with ioctls. Instead of building size into the addfd structure, make it a function of the ioctl command (which is how sizes are normally passed to ioctls). To support forward and backward compatibility, just mask out the direction and size, and match everything. The size (and any future direction) checks are done along with copy_struct_from_user() logic. As a note, the seccomp_notif_addfd structure is laid out based on 8-byte alignment without requiring packing as there have been packing issues with uapi highlighted before[2][3]. Although we could overload the newfd field and use -1 to indicate that it is not to be used, doing so requires changing the size of the fd field, and introduces struct packing complexity. [1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/ [2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/ [3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal Cc: Christoph Hellwig Cc: Christian Brauner Cc: Tycho Andersen Cc: Jann Horn Cc: Robert Sesek Cc: Chris Palmer Cc: Al Viro Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-api@vger.kernel.org Suggested-by: Matt Denton Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me Signed-off-by: Sargun Dhillon Reviewed-by: Will Drewry Co-developed-by: Kees Cook Signed-off-by: Kees Cook

seccomp: release filter after task is fully dead

2020-07-10T23:01:51Z

The seccomp filter used to be released in free_task() which is called asynchronously via call_rcu() and assorted mechanisms. Since we need to inform tasks waiting on the seccomp notifier when a filter goes empty we will notify them as soon as a task has been marked fully dead in release_task(). To not split seccomp cleanup into two parts, move filter release out of free_task() and into release_task() after we've unhashed struct task from struct pid, exited signals, and unlinked it from the threadgroups' thread list. We'll put the empty filter notification infrastructure into it in a follow up patch. This also renames put_seccomp_filter() to seccomp_filter_release() which is a more descriptive name of what we're doing here especially once we've added the empty filter notification mechanism in there. We're also NULL-ing the task's filter tree entrypoint which seems cleaner than leaving a dangling pointer in there. Note that this shouldn't need any memory barriers since we're calling this when the task is in release_task() which means it's EXIT_DEAD. So it can't modify its seccomp filters anymore. You can also see this from the point where we're calling seccomp_filter_release(). It's after __exit_signal() and at this point, tsk->sighand will already have been NULLed which is required for thread-sync and filter installation alike. Cc: Tycho Andersen Cc: Kees Cook Cc: Matt Denton Cc: Sargun Dhillon Cc: Jann Horn Cc: Chris Palmer Cc: Aleksa Sarai Cc: Robert Sesek Cc: Jeffrey Vander Stoep Cc: Linux Containers Signed-off-by: Christian Brauner Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com Signed-off-by: Kees Cook

seccomp: Report number of loaded filters in /proc/$pid/status

2020-07-10T23:01:51Z

A common question asked when debugging seccomp filters is "how many filters are attached to your process?" Provide a way to easily answer this question through /proc/$pid/status with a "Seccomp_filters" line. Signed-off-by: Kees Cook

seccomp: allow TSYNC and USER_NOTIF together

2020-03-04T22:48:54Z

The restriction introduced in 7a0df7fbc145 ("seccomp: Make NEW_LISTENER and TSYNC flags exclusive") is mostly artificial: there is enough information in a seccomp user notification to tell which thread triggered a notification. The reason it was introduced is because TSYNC makes the syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and there's no way to distinguish between these two cases (well, I suppose the caller could check all fds it has, then do the syscall, and if the return value was an fd that already existed, then it must be a thread id, but bleh). Matthew would like to use these two flags together in the Chrome sandbox which wants to use TSYNC for video drivers and NEW_LISTENER to proxy syscalls. So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which tells the kernel to just return -ESRCH on a TSYNC error. This way, NEW_LISTENER (and any subsequent seccomp() commands that want to return positive values) don't conflict with each other. Suggested-by: Matthew Denton Signed-off-by: Tycho Andersen Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws Signed-off-by: Kees Cook

seccomp: simplify secure_computing()

2019-10-10T21:55:24Z

Afaict, the struct seccomp_data argument to secure_computing() is unused by all current callers. So let's remove it. The argument was added in [1]. It was added because having the arch supply the syscall arguments used to be faster than having it done by secure_computing() (cf. Andy's comment in [2]). This is not true anymore though. /* References */ [1]: 2f275de5d1ed ("seccomp: Add a seccomp_data parameter secure_computing()") [2]: https://lore.kernel.org/r/CALCETrU_fs_At-hTpr231kpaAd0z7xJN4ku-DvzhRU6cvcJA_w@mail.gmail.com Signed-off-by: Christian Brauner Cc: Andy Lutomirski Cc: Thomas Gleixner Cc: Will Drewry Cc: Oleg Nesterov Cc: linux-arm-kernel@lists.infradead.org Cc: linux-parisc@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: x86@kernel.org Acked-by: Borislav Petkov Acked-by: Andy Lutomirski Link: https://lore.kernel.org/r/20190924064420.6353-1-christian.brauner@ubuntu.com Signed-off-by: Kees Cook

seccomp: add a return code to trap to userspace

2018-12-12T00:28:41Z

This patch introduces a means for syscalls matched in seccomp to notify some other task that a particular filter has been triggered. The motivation for this is primarily for use with containers. For example, if a container does an init_module(), we obviously don't want to load this untrusted code, which may be compiled for the wrong version of the kernel anyway. Instead, we could parse the module image, figure out which module the container is trying to load and load it on the host. As another example, containers cannot mount() in general since various filesystems assume a trusted image. However, if an orchestrator knows that e.g. a particular block device has not been exposed to a container for writing, it want to allow the container to mount that block device (that is, handle the mount for it). This patch adds functionality that is already possible via at least two other means that I know about, both of which involve ptrace(): first, one could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. Unfortunately this is slow, so a faster version would be to install a filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. Since ptrace allows only one tracer, if the container runtime is that tracer, users inside the container (or outside) trying to debug it will not be able to use ptrace, which is annoying. It also means that older distributions based on Upstart cannot boot inside containers using ptrace, since upstart itself uses ptrace to monitor services while starting. The actual implementation of this is fairly small, although getting the synchronization right was/is slightly complex. Finally, it's worth noting that the classic seccomp TOCTOU of reading memory data from the task still applies here, but can be avoided with careful design of the userspace handler: if the userspace handler reads all of the task memory that is necessary before applying its security policy, the tracee's subsequent memory edits will not be read by the tracer. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" Acked-by: Serge Hallyn CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda Signed-off-by: Kees Cook

seccomp: switch system call argument type to void *

2018-12-12T00:28:41Z

The const qualifier causes problems for any code that wants to write to the third argument of the seccomp syscall, as we will do in a future patch in this series. The third argument to the seccomp syscall is documented as void *, so rather than just dropping the const, let's switch everything to use void * as well. I believe this is safe because of 1. the documentation above, 2. there's no real type information exported about syscalls anywhere besides the man pages. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" Acked-by: Serge Hallyn CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda Signed-off-by: Kees Cook