<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/xattr.h, branch v6.2.7</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.2.7</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.2.7'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2022-12-13T18:08:36Z</updated>
<entry>
<title>Merge tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping</title>
<updated>2022-12-13T18:08:36Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-12-13T18:08:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=02bf43c7b7f7a19aa59a75f5244f0a3408bace1a'/>
<id>urn:sha1:02bf43c7b7f7a19aa59a75f5244f0a3408bace1a</id>
<content type='text'>
Pull simple-xattr updates from Christian Brauner:
 "This ports the simple xattr infrastucture to rely on a simple rbtree
  protected by a read-write lock instead of a linked list protected by a
  spinlock.

  A while ago we received reports about scaling issues for filesystems
  using the simple xattr infrastructure that also support setting a
  larger number of xattrs. Specifically, cgroups and tmpfs.

  Both cgroupfs and tmpfs can be mounted by unprivileged users in
  unprivileged containers and root in an unprivileged container can set
  an unrestricted number of security.* xattrs and privileged users can
  also set unlimited trusted.* xattrs. A few more words on further that
  below. Other xattrs such as user.* are restricted for kernfs-based
  instances to a fairly limited number.

  As there are apparently users that have a fairly large number of
  xattrs we should scale a bit better. Using a simple linked list
  protected by a spinlock used for set, get, and list operations doesn't
  scale well if users use a lot of xattrs even if it's not a crazy
  number.

  Let's switch to a simple rbtree protected by a rwlock. It scales way
  better and gets rid of the perf issues some people reported. We
  originally had fancier solutions even using an rcu+seqlock protected
  rbtree but we had concerns about being to clever and also that
  deletion from an rbtree with rcu+seqlock isn't entirely safe.

  The rbtree plus rwlock is perfectly fine. By far the most common
  operation is getting an xattr. While setting an xattr is not and
  should be comparatively rare. And listxattr() often only happens when
  copying xattrs between files or together with the contents to a new
  file.

  Holding a lock across listxattr() is unproblematic because it doesn't
  list the values of xattrs. It can only be used to list the names of
  all xattrs set on a file. And the number of xattr names that can be
  listed with listxattr() is limited to XATTR_LIST_MAX aka 65536 bytes.
  If a larger buffer is passed then vfs_listxattr() caps it to
  XATTR_LIST_MAX and if more xattr names are found it will return
  -E2BIG. In short, the maximum amount of memory that can be retrieved
  via listxattr() is limited and thus listxattr() bounded.

  Of course, the API is broken as documented on xattr(7) already. While
  I have no idea how the xattr api ended up in this state we should
  probably try to come up with something here at some point. An iterator
  pattern similar to readdir() as an alternative to listxattr() or
  something else.

  Right now it is extremly strange that users can set millions of xattrs
  but then can't use listxattr() to know which xattrs are actually set.
  And it's really trivial to do:

	for i in {1..1000000}; do setfattr -n security.$i -v $i ./file1; done

  And around 5000 xattrs it's impossible to use listxattr() to figure
  out which xattrs are actually set. So I have suggested that we try to
  limit the number of xattrs for simple xattrs at least. But that's a
  future patch and I don't consider it very urgent.

  A bonus of this port to rbtree+rwlock is that we shrink the memory
  consumption for users of the simple xattr infrastructure.

  This also adds kernel documentation to all the functions"

* tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  xattr: use rbtree for simple_xattrs
</content>
</entry>
<entry>
<title>Merge tag 'lsm-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm</title>
<updated>2022-12-13T17:47:48Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-12-13T17:47:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c76ff350bd57682ae12bea6383dd8baf4824ac96'/>
<id>urn:sha1:c76ff350bd57682ae12bea6383dd8baf4824ac96</id>
<content type='text'>
Pull lsm updates from Paul Moore:

 - Improve the error handling in the device cgroup such that memory
   allocation failures when updating the access policy do not
   potentially alter the policy.

 - Some minor fixes to reiserfs to ensure that it properly releases
   LSM-related xattr values.

 - Update the security_socket_getpeersec_stream() LSM hook to take
   sockptr_t values.

   Previously the net/BPF folks updated the getsockopt code in the
   network stack to leverage the sockptr_t type to make it easier to
   pass both kernel and __user pointers, but unfortunately when they did
   so they didn't convert the LSM hook.

   While there was/is no immediate risk by not converting the LSM hook,
   it seems like this is a mistake waiting to happen so this patch
   proactively does the LSM hook conversion.

 - Convert vfs_getxattr_alloc() to return an int instead of a ssize_t
   and cleanup the callers. Internally the function was never going to
   return anything larger than an int and the callers were doing some
   very odd things casting the return value; this patch fixes all that
   and helps bring a bit of sanity to vfs_getxattr_alloc() and its
   callers.

 - More verbose, and helpful, LSM debug output when the system is booted
   with "lsm.debug" on the command line. There are examples in the
   commit description, but the quick summary is that this patch provides
   better information about which LSMs are enabled and the ordering in
   which they are processed.

 - General comment and kernel-doc fixes and cleanups.

* tag 'lsm-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lsm: Fix description of fs_context_parse_param
  lsm: Add/fix return values in lsm_hooks.h and fix formatting
  lsm: Clarify documentation of vm_enough_memory hook
  reiserfs: Add missing calls to reiserfs_security_free()
  lsm,fs: fix vfs_getxattr_alloc() return type and caller error paths
  device_cgroup: Roll back to original exceptions after copy failure
  LSM: Better reporting of actual LSMs at boot
  lsm: make security_socket_getpeersec_stream() sockptr_t safe
  audit: Fix some kernel-doc warnings
  lsm: remove obsoleted comments for security hooks
  fs: edit a comment made in bad taste
</content>
</entry>
<entry>
<title>lsm,fs: fix vfs_getxattr_alloc() return type and caller error paths</title>
<updated>2022-11-18T22:07:03Z</updated>
<author>
<name>Paul Moore</name>
<email>paul@paul-moore.com</email>
</author>
<published>2022-11-09T19:14:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f6fbd8cbf3ed1915a7b957f2801f7c306a686c08'/>
<id>urn:sha1:f6fbd8cbf3ed1915a7b957f2801f7c306a686c08</id>
<content type='text'>
The vfs_getxattr_alloc() function currently returns a ssize_t value
despite the fact that it only uses int values internally for return
values.  Fix this by converting vfs_getxattr_alloc() to return an
int type and adjust the callers as necessary.  As part of these
caller modifications, some of the callers are fixed to properly free
the xattr value buffer on both success and failure to ensure that
memory is not leaked in the failure case.

Reviewed-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Reviewed-by: Mimi Zohar &lt;zohar@linux.ibm.com&gt;
Signed-off-by: Paul Moore &lt;paul@paul-moore.com&gt;
</content>
</entry>
<entry>
<title>xattr: use rbtree for simple_xattrs</title>
<updated>2022-11-12T09:49:26Z</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2022-11-04T12:52:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3b4c7bc01727e3a465759236eeac03d0dd686da3'/>
<id>urn:sha1:3b4c7bc01727e3a465759236eeac03d0dd686da3</id>
<content type='text'>
A while ago Vasily reported that it is possible to set a large number of
xattrs on inodes of filesystems that make use of the simple xattr
infrastructure. This includes all kernfs-based filesystems that support
xattrs (e.g., cgroupfs and tmpfs). Both cgroupfs and tmpfs can be
mounted by unprivileged users in unprivileged containers and root in an
unprivileged container can set an unrestricted number of security.*
xattrs and privileged users can also set unlimited trusted.* xattrs. As
there are apparently users that have a fairly large number of xattrs we
should scale a bit better. Other xattrs such as user.* are restricted
for kernfs-based instances to a fairly limited number.

Using a simple linked list protected by a spinlock used for set, get,
and list operations doesn't scale well if users use a lot of xattrs even
if it's not a crazy number. There's no need to bring in the big guns
like rhashtables or rw semaphores for this. An rbtree with a rwlock, or
limited rcu semanics and seqlock is enough.

It scales within the constraints we are working in. By far the most
common operation is getting an xattr. Setting xattrs should be a
moderately rare operation. And listxattr() often only happens when
copying xattrs between files or together with the contents to a new
file. Holding a lock across listxattr() is unproblematic because it
doesn't list the values of xattrs. It can only be used to list the names
of all xattrs set on a file. And the number of xattr names that can be
listed with listxattr() is limited to XATTR_LIST_MAX aka 65536 bytes. If
a larger buffer is passed then vfs_listxattr() caps it to XATTR_LIST_MAX
and if more xattr names are found it will return -E2BIG. In short, the
maximum amount of memory that can be retrieved via listxattr() is
limited.

Of course, the API is broken as documented on xattr(7) already. In the
future we might want to address this but for now this is the world we
live in and have lived for a long time. But it does indeed mean that
once an application goes over XATTR_LIST_MAX limit of xattrs set on an
inode it isn't possible to copy the file and include its xattrs in the
copy unless the caller knows all xattrs or limits the copy of the xattrs
to important ones it knows by name (At least for tmpfs, and kernfs-based
filesystems. Other filesystems might provide ways of achieving this.).

Bonus of this port to rbtree+rwlock is that we shrink the memory
consumption for users of the simple xattr infrastructure.

Also add proper kernel documentation to all the functions.
A big thanks to Paul for his comments.

Cc: Vasily Averin &lt;vvs@openvz.org&gt;
Cc: "Paul E. McKenney" &lt;paulmck@kernel.org&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>ovl: use posix acl api</title>
<updated>2022-10-20T08:13:31Z</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2022-09-22T15:17:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=31acceb97500dd6e9105526301d76488cd6ca21c'/>
<id>urn:sha1:31acceb97500dd6e9105526301d76488cd6ca21c</id>
<content type='text'>
Now that posix acls have a proper api us it to copy them.

All filesystems that can serve as lower or upper layers for overlayfs
have gained support for the new posix acl api in previous patches.
So switch all internal overlayfs codepaths for copying posix acls to the
new posix acl api.

Acked-by: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>xattr: constify value argument in vfs_setxattr()</title>
<updated>2022-08-31T14:38:07Z</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2022-08-29T12:38:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6344e66970c619a1623f457910e78819076e9104'/>
<id>urn:sha1:6344e66970c619a1623f457910e78819076e9104</id>
<content type='text'>
Now that we don't perform translations directly in vfs_setxattr()
anymore we can constify the @value argument in vfs_setxattr(). This also
allows us to remove the hack to cast from a const in ovl_do_setxattr().

Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
Reviewed-by: Seth Forshee (DigitalOcean) &lt;sforshee@kernel.org&gt;
</content>
</entry>
<entry>
<title>acl: move idmapped mount fixup into vfs_{g,s}etxattr()</title>
<updated>2022-07-15T20:08:59Z</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2022-07-06T16:30:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0c5fd887d2bb47aa37aa9fb1eb1d1d2abac62972'/>
<id>urn:sha1:0c5fd887d2bb47aa37aa9fb1eb1d1d2abac62972</id>
<content type='text'>
This cycle we added support for mounting overlayfs on top of idmapped mounts.
Recently I've started looking into potential corner cases when trying to add
additional tests and I noticed that reporting for POSIX ACLs is currently wrong
when using idmapped layers with overlayfs mounted on top of it.

I'm going to give a rather detailed explanation to both the origin of the
problem and the solution.

Let's assume the user creates the following directory layout and they have a
rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would
expect files on your host system to be owned. For example, ~/.bashrc for your
regular user would be owned by 1000:1000 and /root/.bashrc would be owned by
0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs
filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the user
with uid 4 read, write, and execute permissions for their .bashrc file:

        setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped mount. So
they first create:

        mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
        mkdir -pv /vol/contpool/ctrover/{over,work}
        chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

        mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                      /var/lib/lxc/c2/rootfs \
                                      /vol/contpool/lowermap

This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

       mount -t overlay overlay                      \
             -o lowerdir=/vol/contpool/lowermap,     \
                upperdir=/vol/contpool/overmap/over, \
                workdir=/vol/contpool/overmap/work   \
             /vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
    container.
(2) Mount overlayfs on top of the idmapped mounts inside of the container's
    user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the host
and then changes into a container which uses the idmapping 0:10000000:65536
which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

        getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

        # file: vol/contpool/merge/home/ubuntu/.bashrc
        # owner: 1000
        # group: 1000
        user::rw-
        user:4294967295:rwx
        group::r--
        mask::rwx
        other::r--

indicating the the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect the
callchain in this example:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs-&gt;creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -&gt; path_getxattr()
           -&gt; getxattr()
              -&gt; do_getxattr()
                  |&gt; vfs_getxattr()
                  |  -&gt; __vfs_getxattr()
                  |     -&gt; handler-&gt;get == ovl_posix_acl_xattr_get()
                  |        -&gt; ovl_xattr_get()
                  |           -&gt; vfs_getxattr()
                  |              -&gt; __vfs_getxattr()
                  |                 -&gt; handler-&gt;get() /* lower filesystem callback */
                  |&gt; posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&amp;init_user_ns, 4);
                              4 = mapped_kuid_fs(&amp;init_user_ns /* no idmapped mount */, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

If the user chooses to use option (2) and mounts overlayfs on top of idmapped
mounts inside the container things don't look that much better:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs-&gt;creator_cred): 0:10000000:65536

        sys_getxattr()
        -&gt; path_getxattr()
           -&gt; getxattr()
              -&gt; do_getxattr()
                  |&gt; vfs_getxattr()
                  |  -&gt; __vfs_getxattr()
                  |     -&gt; handler-&gt;get == ovl_posix_acl_xattr_get()
                  |        -&gt; ovl_xattr_get()
                  |           -&gt; vfs_getxattr()
                  |              -&gt; __vfs_getxattr()
                  |                 -&gt; handler-&gt;get() /* lower filesystem callback */
                  |&gt; posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&amp;init_user_ns, 4);
                              4 = mapped_kuid_fs(&amp;init_user_ns, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

As is easily seen the problem arises because the idmapping of the lower mount
isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus cannot
possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as part of
vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain:

        setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability as
we are able to map the uid associated with the fscap to 0 within our container.
IOW, we want to see 0 as the result of the idmapping translations.

If the user chooses option (1) we get the following callchain for fscaps:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs-&gt;creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -&gt; path_getxattr()
           -&gt; getxattr()
              -&gt; do_getxattr()
                   -&gt; vfs_getxattr()
                      -&gt; xattr_getsecurity()
                         -&gt; security_inode_getsecurity()                                       ________________________________
                            -&gt; cap_inode_getsecurity()                                         |                              |
                               {                                                               V                              |
                                        10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                        10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                               /* Expected result is 0 and thus that we own the fscap. */                     |
                                               0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                               }                                                                                              |
                               -&gt; vfs_getxattr_alloc()                                                                        |
                                  -&gt; handler-&gt;get == ovl_other_xattr_get()                                                    |
                                     -&gt; vfs_getxattr()                                                                        |
                                        -&gt; xattr_getsecurity()                                                                |
                                           -&gt; security_inode_getsecurity()                                                    |
                                              -&gt; cap_inode_getsecurity()                                                      |
                                                 {                                                                            |
                                                                0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                         10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                         10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                         |____________________________________________________________________|
                                                 }
                                                 -&gt; vfs_getxattr_alloc()
                                                    -&gt; handler-&gt;get == /* lower filesystem callback */

And if the user chooses option (2) we get:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs-&gt;creator_cred): 0:10000000:65536

        sys_getxattr()
        -&gt; path_getxattr()
           -&gt; getxattr()
              -&gt; do_getxattr()
                   -&gt; vfs_getxattr()
                      -&gt; xattr_getsecurity()
                         -&gt; security_inode_getsecurity()                                                _______________________________
                            -&gt; cap_inode_getsecurity()                                                  |                             |
                               {                                                                        V                             |
                                       10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                       10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                               /* Expected result is 0 and thus that we own the fscap. */                             |
                                              0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                               }                                                                                                      |
                               -&gt; vfs_getxattr_alloc()                                                                                |
                                  -&gt; handler-&gt;get == ovl_other_xattr_get()                                                            |
                                    |-&gt; vfs_getxattr()                                                                                |
                                        -&gt; xattr_getsecurity()                                                                        |
                                           -&gt; security_inode_getsecurity()                                                            |
                                              -&gt; cap_inode_getsecurity()                                                              |
                                                 {                                                                                    |
                                                                 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                          10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                 |____________________________________________________________________|
                                                 }
                                                 -&gt; vfs_getxattr_alloc()
                                                    -&gt; handler-&gt;get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to fscaps
we cannot apply the fix directly to the kernel internal posix acl data
structure as this would alter the cached values and would also require a rework
of how we currently deal with POSIX ACLs in general which almost never take the
filesystem idmapping into account (the noteable exception being FUSE but even
there the implementation is special) and instead retrieve the raw values based
on the initial idmapping.

The correct values are then generated right before returning to userspace. The
fix for this is to move taking the mount's idmapping into account directly in
vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user().

To this end we split out two small and unexported helpers
posix_acl_getxattr_idmapped_mnt() and posix_acl_setxattr_idmapped_mnt(). The
former to be called in vfs_getxattr() and the latter to be called in
vfs_setxattr().

Let's go back to the original example. Assume the user chose option (1) and
mounted overlayfs on top of idmapped mounts on the host:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs-&gt;creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -&gt; path_getxattr()
           -&gt; getxattr()
              -&gt; do_getxattr()
                  |&gt; vfs_getxattr()
                  |  |&gt; __vfs_getxattr()
                  |  |  -&gt; handler-&gt;get == ovl_posix_acl_xattr_get()
                  |  |     -&gt; ovl_xattr_get()
                  |  |        -&gt; vfs_getxattr()
                  |  |           |&gt; __vfs_getxattr()
                  |  |           |  -&gt; handler-&gt;get() /* lower filesystem callback */
                  |  |           |&gt; posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&amp;init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&amp;init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |&gt; posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           |
                  |                                                 V
                  |             10000004 = make_kuid(&amp;init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&amp;init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(&amp;init_user_ns, 10000004);
                  |     }       |_________________________________________________
                  |                                                              |
                  |                                                              |
                  |&gt; posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                     }

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs-&gt;creator_cred): 0:10000000:65536

        sys_getxattr()
        -&gt; path_getxattr()
           -&gt; getxattr()
              -&gt; do_getxattr()
                  |&gt; vfs_getxattr()
                  |  |&gt; __vfs_getxattr()
                  |  |  -&gt; handler-&gt;get == ovl_posix_acl_xattr_get()
                  |  |     -&gt; ovl_xattr_get()
                  |  |        -&gt; vfs_getxattr()
                  |  |           |&gt; __vfs_getxattr()
                  |  |           |  -&gt; handler-&gt;get() /* lower filesystem callback */
                  |  |           |&gt; posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&amp;init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&amp;init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |&gt; posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           V
                  |             10000004 = make_kuid(&amp;init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&amp;init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(0(&amp;init_user_ns, 10000004);
                  |             |_________________________________________________
                  |     }                                                        |
                  |                                                              |
                  |&gt; posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                     }

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

        ovl_permission()
        -&gt; generic_permission()
           -&gt; acl_permission_check()
              -&gt; check_acl()
                 -&gt; get_acl()
                    -&gt; inode-&gt;i_op-&gt;get_acl() == ovl_get_acl()
                        &gt; get_acl() /* on the underlying filesystem)
                          -&gt;inode-&gt;i_op-&gt;get_acl() == /*lower filesystem callback */
                 -&gt; posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the idmapping
of the underlying mount into account as this would mean altering the cached
values for the lower filesystem. So we block using ACLs for now until we
decided on a nice way to fix this. Note this limitation both in the
documentation and in the code.

The most straightforward solution would be to have ovl_get_acl() simply
duplicate the ACLs, update the values according to the idmapped mount and
return it to acl_permission_check() so it can be used in posix_acl_permission()
forgetting them afterwards. This is a bit heavy handed but fairly
straightforward otherwise.

Link: https://github.com/brauner/mount-idmapped/issues/9
Link: https://lore.kernel.org/r/20220708090134.385160-2-brauner@kernel.org
Cc: Seth Forshee &lt;sforshee@digitalocean.com&gt;
Cc: Amir Goldstein &lt;amir73il@gmail.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Cc: Miklos Szeredi &lt;mszeredi@redhat.com&gt;
Cc: linux-unionfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee &lt;sforshee@digitalocean.com&gt;
Signed-off-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>xattr: handle idmapped mounts</title>
<updated>2021-01-24T13:27:17Z</updated>
<author>
<name>Tycho Andersen</name>
<email>tycho@tycho.pizza</email>
</author>
<published>2021-01-21T13:19:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c7c7a1a18af4c3bb7749d33e3df3acdf0a95bbb5'/>
<id>urn:sha1:c7c7a1a18af4c3bb7749d33e3df3acdf0a95bbb5</id>
<content type='text'>
When interacting with extended attributes the vfs verifies that the
caller is privileged over the inode with which the extended attribute is
associated. For posix access and posix default extended attributes a uid
or gid can be stored on-disk. Let the functions handle posix extended
attributes on idmapped mounts. If the inode is accessed through an
idmapped mount we need to map it according to the mount's user
namespace. Afterwards the checks are identical to non-idmapped mounts.
This has no effect for e.g. security xattrs since they don't store uids
or gids and don't perform permission checks on them like posix acls do.

Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: James Morris &lt;jamorris@linux.microsoft.com&gt;
Signed-off-by: Tycho Andersen &lt;tycho@tycho.pizza&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>acl: handle idmapped mounts</title>
<updated>2021-01-24T13:27:17Z</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2021-01-21T13:19:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e65ce2a50cf6af216bea6fd80d771fcbb4c0aaa1'/>
<id>urn:sha1:e65ce2a50cf6af216bea6fd80d771fcbb4c0aaa1</id>
<content type='text'>
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the -&gt;set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6</title>
<updated>2020-08-09T20:58:04Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-08-09T20:58:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7a6b60441f02f6e22e7c0936ef16fa3f51832a48'/>
<id>urn:sha1:7a6b60441f02f6e22e7c0936ef16fa3f51832a48</id>
<content type='text'>
Pull NFS server updates from Chuck Lever:
 "Highlights:
   - Support for user extended attributes on NFS (RFC 8276)
   - Further reduce unnecessary NFSv4 delegation recalls

  Notable fixes:
   - Fix recent krb5p regression
   - Address a few resource leaks and a rare NULL dereference

  Other:
   - De-duplicate RPC/RDMA error handling and other utility functions
   - Replace storage and display of kernel memory addresses by tracepoints"

* tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits)
  svcrdma: CM event handler clean up
  svcrdma: Remove transport reference counting
  svcrdma: Fix another Receive buffer leak
  SUNRPC: Refresh the show_rqstp_flags() macro
  nfsd: netns.h: delete a duplicated word
  SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
  nfsd: avoid a NULL dereference in __cld_pipe_upcall()
  nfsd4: a client's own opens needn't prevent delegations
  nfsd: Use seq_putc() in two functions
  svcrdma: Display chunk completion ID when posting a rw_ctxt
  svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
  svcrdma: Introduce Send completion IDs
  svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
  svcrdma: Introduce Receive completion IDs
  svcrdma: Introduce infrastructure to support completion IDs
  svcrdma: Add common XDR encoders for RDMA and Read segments
  svcrdma: Add common XDR decoders for RDMA and Read segments
  SUNRPC: Add helpers for decoding list discriminators symbolically
  svcrdma: Remove declarations for functions long removed
  svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
  ...
</content>
</entry>
</feed>
