<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/pid.c, branch v5.9.8</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.9.8</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.9.8'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-08-04T22:02:07Z</updated>
<entry>
<title>Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux</title>
<updated>2020-08-04T22:02:07Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-08-04T22:02:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=74858abbb1032222f922487fd1a24513bbed80f9'/>
<id>urn:sha1:74858abbb1032222f922487fd1a24513bbed80f9</id>
<content type='text'>
Pull checkpoint-restore updates from Christian Brauner:
 "This enables unprivileged checkpoint/restore of processes.

  Given that this work has been going on for quite some time the first
  sentence in this summary is hopefully more exciting than the actual
  final code changes required. Unprivileged checkpoint/restore has seen
  a frequent increase in interest over the last two years and has thus
  been one of the main topics for the combined containers &amp;
  checkpoint/restore microconference since at least 2018 (cf. [1]).

  Here are just the three most frequent use-cases that were brought forward:

   - The JVM developers are integrating checkpoint/restore into a Java
     VM to significantly decrease the startup time.

   - In high-performance computing environment a resource manager will
     typically be distributing jobs where users are always running as
     non-root. Long-running and "large" processes with significant
     startup times are supposed to be checkpointed and restored with
     CRIU.

   - Container migration as a non-root user.

  In all of these scenarios it is either desirable or required to run
  without CAP_SYS_ADMIN. The userspace implementation of
  checkpoint/restore CRIU already has the pull request for supporting
  unprivileged checkpoint/restore up (cf. [2]).

  To enable unprivileged checkpoint/restore a new dedicated capability
  CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
  discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
  "Update on Task Migration at Google Using CRIU") with Adrian and
  Nicolas providing the implementation now over the last months. In
  essence, this allows the CRIU binary to be installed with the
  CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
  unprivileged users to restore processes.

  To make this possible the following permissions are altered:

   - Selecting a specific PID via clone3() set_tid relaxed from userns
     CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
     from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Accessing /proc/pid/map_files relaxed from init userns
     CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

   - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
     CAP_CHECKPOINT_RESTORE.

  Of these four changes the /proc/self/exe change deserves a few words
  because the reasoning behind even restricting /proc/self/exe changes
  in the first place is just full of historical quirks and tracking this
  down was a questionable version of fun that I'd like to spare others.

  In short, it is trivial to change /proc/self/exe as an unprivileged
  user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
  or by simply intercepting the elf loader in userspace during exec.
  Nicolas was nice enough to even provide a POC for the latter (cf. [3])
  to illustrate this fact.

  The original patchset which introduced PR_SET_MM_MAP had no
  permissions around changing the exe link. They too argued that it is
  trivial to spoof the exe link already which is true. The argument
  brought up against this was that the Tomoyo LSM uses the exe link in
  tomoyo_manager() to detect whether the calling process is a policy
  manager. This caused changing the exe links to be guarded by userns
  CAP_SYS_ADMIN.

  All in all this rather seems like a "better guard it with something
  rather than nothing" argument which imho doesn't qualify as a great
  security policy. Again, because spoofing the exe link is possible for
  the calling process so even if this were security relevant it was
  broken back then and would be broken today. So technically, dropping
  all permissions around changing the exe link would probably be
  possible and would send a clearer message to any userspace that relies
  on /proc/self/exe for security reasons that they should stop doing
  this but for now we're only relaxing the exe link permissions from
  userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

  There's a final uapi change in here. Changing the exe link used to
  accidently return EINVAL when the caller lacked the necessary
  permissions instead of the more correct EPERM. This pr contains a
  commit fixing this. I assume that userspace won't notice or care and
  if they do I will revert this commit. But since we are changing the
  permissions anyway it seems like a good opportunity to try this fix.

  With these changes merged unprivileged checkpoint/restore will be
  possible and has already been tested by various users"

[1] LPC 2018
     1. "Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&amp;t=12095
     2. "Securely Migrating Untrusted Workloads with CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&amp;t=14400
     LPC 2019
     1. "CRIU and the PID dance"
         https://www.youtube.com/watch?v=LN2CUgp8deo&amp;list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&amp;index=9&amp;t=2m48s
     2. "Update on Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=LN2CUgp8deo&amp;list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&amp;index=9&amp;t=1h2m8s

[2] https://github.com/checkpoint-restore/criu/pull/1155

[3] https://github.com/nviennot/run_as_exe

* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  prctl: exe link permission error changed from -EINVAL to -EPERM
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  pid: use checkpoint_restore_ns_capable() for set_tid
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
</content>
</entry>
<entry>
<title>pid: use checkpoint_restore_ns_capable() for set_tid</title>
<updated>2020-07-19T18:14:42Z</updated>
<author>
<name>Adrian Reber</name>
<email>areber@redhat.com</email>
</author>
<published>2020-07-19T10:04:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1caef81da05a84a40dbf02110e967ce6d1135ff6'/>
<id>urn:sha1:1caef81da05a84a40dbf02110e967ce6d1135ff6</id>
<content type='text'>
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.

Signed-off-by: Adrian Reber &lt;areber@redhat.com&gt;
Signed-off-by: Nicolas Viennot &lt;Nicolas.Viennot@twosigma.com&gt;
Reviewed-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Link: https://lore.kernel.org/r/20200719100418.2112740-3-areber@redhat.com
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>pidfd: Replace open-coded receive_fd()</title>
<updated>2020-07-13T18:03:45Z</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2020-06-09T23:21:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=910d2f16ac90463a1f5b03d53246c443e2b354b9'/>
<id>urn:sha1:910d2f16ac90463a1f5b03d53246c443e2b354b9</id>
<content type='text'>
Replace the open-coded version of receive_fd() with a call to the
new helper.

Thanks to Vamshi K Sthambamkadi &lt;vamshi.k.sthambamkadi@gmail.com&gt; for
catching a missed fput() in an earlier version of this patch.

Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Jakub Kicinski &lt;kuba@kernel.org&gt;
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Sargun Dhillon &lt;sargun@sargun.me&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>pidfd: Add missing sock updates for pidfd_getfd()</title>
<updated>2020-07-13T18:03:44Z</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2020-06-09T23:21:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4969f8a073977123504609d7310b42a588297aa4'/>
<id>urn:sha1:4969f8a073977123504609d7310b42a588297aa4</id>
<content type='text'>
The sock counting (sock_update_netprioidx() and sock_update_classid())
was missing from pidfd's implementation of received fd installation. Add
a call to the new __receive_sock() helper.

Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Jakub Kicinski &lt;kuba@kernel.org&gt;
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
</content>
</entry>
<entry>
<title>remove the no longer needed pid_alive() check in __task_pid_nr_ns()</title>
<updated>2020-04-30T11:40:14Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2020-04-21T10:19:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1dd694a1b72f69a3add938f4b5bfb4cf9914b133'/>
<id>urn:sha1:1dd694a1b72f69a3add938f4b5bfb4cf9914b133</id>
<content type='text'>
Starting from 2c4704756cab ("pids: Move the pgrp and session pid pointers
from task_struct to signal_struct") __task_pid_nr_ns() doesn't dereference
task-&gt;group_leader, we can remove the pid_alive() check.

pid_nr_ns() has to check pid != NULL anyway, pid_alive() just adds the
unnecessary confusion.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>proc: Ensure we see the exit of each process tid exactly once</title>
<updated>2020-04-28T21:13:58Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-04-19T11:35:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6b03d1304a32dc8450c7516000a0fe07bef7c446'/>
<id>urn:sha1:6b03d1304a32dc8450c7516000a0fe07bef7c446</id>
<content type='text'>
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task-&gt;pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
   has_group_leader_pid is called, and reording some of the assignments
   should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
&gt;	/*
&gt;	 * This can only happen if the caller is de_thread().
&gt;	 * FIXME: this is the temporary hack, we should teach
&gt;	 * posix-cpu-timers to handle this case correctly.
&gt;	 */
&gt;	if (unlikely(has_group_leader_pid(tsk)))
&gt;		posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2020-04-10T19:59:56Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-04-10T19:59:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=87ad46e601340394cd75c1c79b19ca906f82c543'/>
<id>urn:sha1:87ad46e601340394cd75c1c79b19ca906f82c543</id>
<content type='text'>
Pull proc fix from Eric Biederman:
 "A brown paper bag slipped through my proc changes, and syzcaller
  caught it when the code ended up in your tree.

  I have opted to fix it the simplest cleanest way I know how, so there
  is no reasonable chance for the bug to repeat"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Use a dedicated lock in struct pid
</content>
</entry>
<entry>
<title>proc: Use a dedicated lock in struct pid</title>
<updated>2020-04-09T17:15:35Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-04-07T14:43:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=63f818f46af9f8b3f17b9695501e8d08959feb60'/>
<id>urn:sha1:63f818f46af9f8b3f17b9695501e8d08959feb60</id>
<content type='text'>
syzbot wrote:
&gt; ========================================================
&gt; WARNING: possible irq lock inversion dependency detected
&gt; 5.6.0-syzkaller #0 Not tainted
&gt; --------------------------------------------------------
&gt; swapper/1/0 just changed the state of lock:
&gt; ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
&gt; but this lock took another, SOFTIRQ-unsafe lock in the past:
&gt;  (&amp;pid-&gt;wait_pidfd){+.+.}-{2:2}
&gt;
&gt;
&gt; and interrupts could create inverse lock ordering between them.
&gt;
&gt;
&gt; other info that might help us debug this:
&gt;  Possible interrupt unsafe locking scenario:
&gt;
&gt;        CPU0                    CPU1
&gt;        ----                    ----
&gt;   lock(&amp;pid-&gt;wait_pidfd);
&gt;                                local_irq_disable();
&gt;                                lock(tasklist_lock);
&gt;                                lock(&amp;pid-&gt;wait_pidfd);
&gt;   &lt;Interrupt&gt;
&gt;     lock(tasklist_lock);
&gt;
&gt;  *** DEADLOCK ***
&gt;
&gt; 4 locks held by swapper/1/0:

The problem is that because wait_pidfd.lock is taken under the tasklist
lock.  It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid-&gt;lock and lockdep does not complain.

It is my current day dream that one day pid-&gt;lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid-&gt;lock with irqs disabled.

Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2020-04-02T18:22:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-04-02T18:22:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d987ca1c6b7e22fbd30664111e85cec7aa66000d'/>
<id>urn:sha1:d987ca1c6b7e22fbd30664111e85cec7aa66000d</id>
<content type='text'>
Pull exec/proc updates from Eric Biederman:
 "This contains two significant pieces of work: the work to sort out
  proc_flush_task, and the work to solve a deadlock between strace and
  exec.

  Fixing proc_flush_task so that it no longer requires a persistent
  mount makes improvements to proc possible. The removal of the
  persistent mount solves an old regression that that caused the hidepid
  mount option to only work on remount not on mount. The regression was
  found and reported by the Android folks. This further allows Alexey
  Gladkov's work making proc mount options specific to an individual
  mount of proc to move forward.

  The work on exec starts solving a long standing issue with exec that
  it takes mutexes of blocking userspace applications, which makes exec
  extremely deadlock prone. For the moment this adds a second mutex with
  a narrower scope that handles all of the easy cases. Which makes the
  tricky cases easy to spot. With a little luck the code to solve those
  deadlocks will be ready by next merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
  signal: Extend exec_id to 64bits
  pidfd: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  kernel: doc: remove outdated comment cred.c
  mm: docs: Fix a comment in process_vm_rw_core
  selftests/ptrace: add test cases for dead-locks
  exec: Fix a deadlock in strace
  exec: Add exec_update_mutex to replace cred_guard_mutex
  exec: Move exec_mmap right after de_thread in flush_old_exec
  exec: Move cleanup of posix timers on exec out of de_thread
  exec: Factor unshare_sighand out of de_thread and call it separately
  exec: Only compute current once in flush_old_exec
  pid: Improve the comment about waiting in zap_pid_ns_processes
  proc: Remove the now unnecessary internal mount of proc
  uml: Create a private mount of proc for mconsole
  uml: Don't consult current to find the proc_mnt in mconsole_proc
  proc: Use a list of inodes to flush from proc
  ...
</content>
</entry>
<entry>
<title>pidfd: Use new infrastructure to fix deadlocks in execve</title>
<updated>2020-03-25T15:04:01Z</updated>
<author>
<name>Bernd Edlinger</name>
<email>bernd.edlinger@hotmail.de</email>
</author>
<published>2020-03-21T02:46:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=501f9328bf5c6b5e4863da4b50e0e86792de3aa9'/>
<id>urn:sha1:501f9328bf5c6b5e4863da4b50e0e86792de3aa9</id>
<content type='text'>
This changes __pidfd_fget to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials do not change
before exec_update_mutex is locked.  Therefore whatever
file access is possible with holding the cred_guard_mutex
here is also possbile with the exec_update_mutex.

Signed-off-by: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
</feed>
