<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/pid_namespace.c, branch v4.9.329</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.329</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.329'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2021-09-22T09:43:08Z</updated>
<entry>
<title>memcg: enable accounting for pids in nested pid namespaces</title>
<updated>2021-09-22T09:43:08Z</updated>
<author>
<name>Vasily Averin</name>
<email>vvs@virtuozzo.com</email>
</author>
<published>2021-09-02T21:54:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=91acb9615115f256a68564a6ec0ac01f9deea298'/>
<id>urn:sha1:91acb9615115f256a68564a6ec0ac01f9deea298</id>
<content type='text'>
commit fab827dbee8c2e06ca4ba000fa6c48bcf9054aba upstream.

Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces.  As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.

Pid was one the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any
new pids in the system are memcg-accounted.

Though recently I've noticed that it is wrong.  nested pid namespaces
creates own slab caches for pid objects, nested pids have increased size
because contain id both for all parent and for own pid namespaces.  The
problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a
result any pids allocated in nested pid namespaces are not
memcg-accounted.

Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000
such objects gives us up to ~50Mb unaccounted memory, this allow container
to exceed assigned memcg limits.

Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin &lt;vvs@virtuozzo.com&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Acked-by: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig</title>
<updated>2019-08-04T07:33:16Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2019-05-15T17:29:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9ac2e1c3f650dd13c3e13e8bbf3a9ad8e72e33c3'/>
<id>urn:sha1:9ac2e1c3f650dd13c3e13e8bbf3a9ad8e72e33c3</id>
<content type='text'>
[ Upstream commit f9070dc94542093fd516ae4ccea17ef46a4362c5 ]

The locking in force_sig_info is not prepared to deal with a task that
exits or execs (as sighand may change).  The is not a locking problem
in force_sig as force_sig is only built to handle synchronous
exceptions.

Further the function force_sig_info changes the signal state if the
signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
delivery of the signal.  The signal SIGKILL can not be ignored and can
not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
delivered.

So using force_sig rather than send_sig for SIGKILL is confusing
and pointless.

Because it won't impact the sending of the signal and and because
using force_sig is wrong, replace force_sig with send_sig.

Cc: Daniel Lezcano &lt;daniel.lezcano@free.fr&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Fixes: cf3f89214ef6 ("pidns: add reboot_pid_ns() to handle the reboot syscall")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes</title>
<updated>2017-05-25T13:44:38Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2017-05-11T23:21:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6dc6a2700b6a0dd755027654b85bd614d8d3d52b'/>
<id>urn:sha1:6dc6a2700b6a0dd755027654b85bd614d8d3d52b</id>
<content type='text'>
commit b9a985db98961ae1ba0be169f19df1c567e4ffe0 upstream.

The code can potentially sleep for an indefinite amount of time in
zap_pid_ns_processes triggering the hung task timeout, and increasing
the system average.  This is undesirable.  Sleep with a task state of
TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
undesirable side effects.

Apparently under heavy load this has been allowing Chrome to trigger
the hung time task timeout error and cause ChromeOS to reboot.

Reported-by: Vovo Yang &lt;vovoy@google.com&gt;
Reported-by: Guenter Roeck &lt;linux@roeck-us.net&gt;
Tested-by: Guenter Roeck &lt;linux@roeck-us.net&gt;
Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>pid: fix lockdep deadlock warning due to ucount_lock</title>
<updated>2017-01-19T19:18:03Z</updated>
<author>
<name>Andrei Vagin</name>
<email>avagin@openvz.org</email>
</author>
<published>2017-01-05T03:28:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=52fd0ab07676b295abc54eeb1cef7a06d49c9847'/>
<id>urn:sha1:52fd0ab07676b295abc54eeb1cef7a06d49c9847</id>
<content type='text'>
commit add7c65ca426b7a37184dd3d2172394e23d585d6 upstream.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
---------------------------------------------------------
swapper/1/0 just changed the state of lock:
 (&amp;(&amp;sighand-&gt;siglock)-&gt;rlock){-.....}, at: [&lt;ffffffffbd0a1bc6&gt;] __lock_task_sighand+0xb6/0x2c0
but this lock took another, HARDIRQ-unsafe lock in the past:
 (ucounts_lock){+.+...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:                 &amp;(&amp;sighand-&gt;siglock)-&gt;rlock --&gt; &amp;(&amp;tty-&gt;ctrl_lock)-&gt;rlock --&gt; ucounts_lock
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(ucounts_lock);
                               local_irq_disable();
                               lock(&amp;(&amp;sighand-&gt;siglock)-&gt;rlock);
                               lock(&amp;(&amp;tty-&gt;ctrl_lock)-&gt;rlock);
  &lt;Interrupt&gt;
    lock(&amp;(&amp;sighand-&gt;siglock)-&gt;rlock);

 *** DEADLOCK ***

This patch removes a dependency between rlock and ucount_lock.

Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
Signed-off-by: Andrei Vagin &lt;avagin@openvz.org&gt;
Acked-by: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Merge branch 'nsfs-ioctls' into HEAD</title>
<updated>2016-09-23T01:00:36Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2016-09-23T01:00:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=78725596644be0181c46f55c52aadfb8c70bcdb7'/>
<id>urn:sha1:78725596644be0181c46f55c52aadfb8c70bcdb7</id>
<content type='text'>
From: Andrey Vagin &lt;avagin@openvz.org&gt;

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system.  Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
&gt; Grumble, Grumble.  I think this may actually a case for creating ioctls
&gt; for these two cases.  Now that random nsfs file descriptors are bind
&gt; mountable the original reason for using proc files is not as pressing.
&gt;
&gt; One ioctl for the user namespace that owns a file descriptor.
&gt; One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since  Linux  4.X,  the  following  ioctl(2)  calls are supported for
namespace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user names‐
      pace.

NS_GET_PARENT
      Returns  a  file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid  and  user  namespaces.  For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following  specific  ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is outside of the current namespace
      scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
  outside of the init namespace, so we can return EPERM in this case too.
  &gt; The fewer special cases the easier the code is to get
  &gt; correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns-&gt;get_owner() to ns-&gt;owner(). get_* usually means that it
  grabs a reference.

Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: James Bottomley &lt;James.Bottomley@HansenPartnership.com&gt;
Cc: "Michael Kerrisk (man-pages)" &lt;mtk.manpages@gmail.com&gt;
Cc: "W. Trevor King" &lt;wking@tremily.us&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Serge Hallyn &lt;serge.hallyn@canonical.com&gt;
</content>
</entry>
<entry>
<title>nsfs: add ioctl to get a parent namespace</title>
<updated>2016-09-23T00:59:41Z</updated>
<author>
<name>Andrey Vagin</name>
<email>avagin@openvz.org</email>
</author>
<published>2016-09-06T07:47:15Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a7306ed8d94af729ecef8b6e37506a1c6fc14788'/>
<id>urn:sha1:a7306ed8d94af729ecef8b6e37506a1c6fc14788</id>
<content type='text'>
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships.

In a future we will use this interface to dump and restore nested
namespaces.

Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Signed-off-by: Andrei Vagin &lt;avagin@openvz.org&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>kernel: add a helper to get an owning user namespace for a namespace</title>
<updated>2016-09-23T00:59:39Z</updated>
<author>
<name>Andrey Vagin</name>
<email>avagin@openvz.org</email>
</author>
<published>2016-09-06T07:47:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bcac25a58bfc6bd79191ac5d7afb49bea96da8c9'/>
<id>urn:sha1:bcac25a58bfc6bd79191ac5d7afb49bea96da8c9</id>
<content type='text'>
Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns-&gt;get_owner() to ns-&gt;owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Signed-off-by: Andrei Vagin &lt;avagin@openvz.org&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>userns: When the per user per user namespace limit is reached return ENOSPC</title>
<updated>2016-09-22T18:25:56Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2016-09-22T18:08:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=df75e7748bae1c7098bfa358485389b897f71305'/>
<id>urn:sha1:df75e7748bae1c7098bfa358485389b897f71305</id>
<content type='text'>
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong.  I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC.  It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>pidns: Add a limit on the number of pid namespaces</title>
<updated>2016-08-08T19:42:01Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2016-08-08T19:08:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f333c700c6100b53050980986be922bb21466e29'/>
<id>urn:sha1:f333c700c6100b53050980986be922bb21466e29</id>
<content type='text'>
Acked-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2014-12-16T23:53:03Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-12-16T23:53:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=603ba7e41bf5d405aba22294af5d075d8898176d'/>
<id>urn:sha1:603ba7e41bf5d405aba22294af5d075d8898176d</id>
<content type='text'>
Pull vfs pile #2 from Al Viro:
 "Next pile (and there'll be one or two more).

  The large piece in this one is getting rid of /proc/*/ns/* weirdness;
  among other things, it allows to (finally) make nameidata completely
  opaque outside of fs/namei.c, making for easier further cleanups in
  there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_venus_readdir(): use file_inode()
  fs/namei.c: fold link_path_walk() call into path_init()
  path_init(): don't bother with LOOKUP_PARENT in argument
  fs/namei.c: new helper (path_cleanup())
  path_init(): store the "base" pointer to file in nameidata itself
  make default -&gt;i_fop have -&gt;open() fail with ENXIO
  make nameidata completely opaque outside of fs/namei.c
  kill proc_ns completely
  take the targets of /proc/*/ns/* symlinks to separate fs
  bury struct proc_ns in fs/proc
  copy address of proc_ns_ops into ns_common
  new helpers: ns_alloc_inum/ns_free_inum
  make proc_ns_operations work with struct ns_common * instead of void *
  switch the rest of proc_ns_operations to working with &amp;...-&gt;ns
  netns: switch -&gt;get()/-&gt;put()/-&gt;install()/-&gt;inum() to working with &amp;net-&gt;ns
  make mntns -&gt;get()/-&gt;put()/-&gt;install()/-&gt;inum() work with &amp;mnt_ns-&gt;ns
  common object embedded into various struct ....ns
</content>
</entry>
</feed>
