user/sven/linux.git/kernel/cgroup.c, branch v3.16.52

cgroup: make sure a parent css isn't offlined before its children

2016-02-25T00:18:35Z

commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 upstream. There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. This patch updates offline path so that a parent css is never offlined before its children. Each css keeps online_cnt which reaches zero iff itself and all its children are offline and offline_css() is invoked only after online_cnt reaches zero. This fixes the memory controller bug and allows the fix for cpu controller. Signed-off-by: Tejun Heo Reported-and-tested-by: Christian Borntraeger Reported-by: Brian Christiansen Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com Cc: Heiko Carstens Cc: Peter Zijlstra [ luis: backported to 3.16: - file rename: cgroup-defs.h -> cgroup.h - adjusted context ] Signed-off-by: Luis Henriques

fs: create and use seq_show_option for escaping

2015-09-29T15:44:47Z

commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream. Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook Acked-by: Serge Hallyn Acked-by: Jan Kara Acked-by: Paul Moore Cc: J. R. Okajima Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds [ luis: backported to 3.16: - dropped changes to fs/overlayfs/super.c, net/ceph/ceph_common.c and to cgroup_show_options() in kernel/cgroup.c - adjusted context ] Signed-off-by: Luis Henriques

sysfs: Create mountpoints with sysfs_create_mount_point

2015-08-20T17:24:03Z

commit f9bb48825a6b5d02f4cabcc78967c75db903dcdc upstream. This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written. The mount points converted and their filesystems are: /sys/hypervisor/s390/ s390_hypfs /sys/kernel/config/ configfs /sys/kernel/debug/ debugfs /sys/firmware/efi/efivars/ efivarfs /sys/fs/fuse/connections/ fusectl /sys/fs/pstore/ pstore /sys/kernel/tracing/ tracefs /sys/fs/cgroup/ cgroup /sys/kernel/security/ securityfs /sys/fs/selinux/ selinuxfs /sys/fs/smackfs/ smackfs Acked-by: Greg Kroah-Hartman Signed-off-by: "Eric W. Biederman" Signed-off-by: Seth Forshee Signed-off-by: Luis Henriques

cgroup: fix unbalanced locking

2014-10-05T20:41:03Z

commit eb4aec84d6bdf98d00cedb41c18000f7a31e648a upstream. cgroup_pidlist_start() holds cgrp->pidlist_mutex and then calls pidlist_array_load(), and cgroup_pidlist_stop() releases the mutex. It is wrong that we release the mutex in the failure path in pidlist_array_load(), because cgroup_pidlist_stop() will be called no matter if cgroup_pidlist_start() returns errno or not. Fixes: 4bac00d16a8760eae7205e41d2c246477d42a210 Signed-off-by: Zefan Li Signed-off-by: Tejun Heo Acked-by: Cong Wang Signed-off-by: Greg Kroah-Hartman

cgroup: delay the clearing of cgrp->kn->priv

2014-10-05T20:41:03Z

commit a4189487da1b4f8260c6006b9dc47c3c4107a5ae upstream. Run these two scripts concurrently: for ((; ;)) { mkdir /cgroup/sub rmdir /cgroup/sub } for ((; ;)) { echo $$ > /cgroup/sub/cgroup.procs echo $$ > /cgroup/cgroup.procs } A kernel bug will be triggered: BUG: unable to handle kernel NULL pointer dereference at 00000038 IP: [] cgroup_put+0x9/0x80 ... Call Trace: [] cgroup_kn_unlock+0x39/0x50 [] cgroup_kn_lock_live+0x61/0x70 [] __cgroup_procs_write.isra.26+0x51/0x230 [] cgroup_tasks_write+0x12/0x20 [] cgroup_file_write+0x40/0x130 [] kernfs_fop_write+0xd1/0x160 [] vfs_write+0x98/0x1e0 [] SyS_write+0x4d/0xa0 [] sysenter_do_call+0x12/0x12 We clear cgrp->kn->priv in the end of cgroup_rmdir(), but another concurrent thread can access kn->priv after the clearing. We should move the clearing to css_release_work_fn(). At that time no one is holding reference to the cgroup and no one can gain a new reference to access it. v2: - move RCU_INIT_POINTER() into the else block. (Tejun) - remove the cgroup_parent() check. (Tejun) - update the comment in css_tryget_online_from_dir(). Reported-by: Toralf Förster Signed-off-by: Zefan Li Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroup: reject cgroup names with '\n'

2014-10-05T20:41:03Z

commit 71b1fb5c4473a5b1e601d41b109bdfe001ec82e0 upstream. /proc//cgroup contains one cgroup path on each line. If cgroup names are allowed to contain "\n", applications cannot parse /proc//cgroup safely. Signed-off-by: Alban Crequy Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroup: check cgroup liveliness before unbreaking kernfs

2014-10-05T20:41:00Z

commit aa32362f011c6e863132b16c1761487166a4bad2 upstream. When cgroup_kn_lock_live() is called through some kernfs operation and another thread is calling cgroup_rmdir(), we'll trigger the warning in cgroup_get(). ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1228 at kernel/cgroup.c:1034 cgroup_get+0x89/0xa0() ... Call Trace: [] dump_stack+0x41/0x52 [] warn_slowpath_common+0x7f/0xa0 [] warn_slowpath_null+0x1d/0x20 [] cgroup_get+0x89/0xa0 [] cgroup_kn_lock_live+0x28/0x70 [] __cgroup_procs_write.isra.26+0x51/0x230 [] cgroup_tasks_write+0x12/0x20 [] cgroup_file_write+0x40/0x130 [] kernfs_fop_write+0xd1/0x160 [] vfs_write+0x98/0x1e0 [] SyS_write+0x4d/0xa0 [] sysenter_do_call+0x12/0x12 ---[ end trace 6f2e0c38c2108a74 ]--- Fix this by calling css_tryget() instead of cgroup_get(). v2: - move cgroup_tryget() right below cgroup_get() definition. (Tejun) Reported-by: Toralf Förster Signed-off-by: Zefan Li Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()

2014-06-30T14:16:26Z

We've converted cgroup to kernfs so cgroup won't be intertwined with vfs objects and locking, but there are dark areas. Run two instances of this script concurrently: for ((; ;)) { mount -t cgroup -o cpuacct xxx /cgroup umount /cgroup } After a while, I saw two mount processes were stuck at retrying, because they were waiting for a subsystem to become free, but the root associated with this subsystem never got freed. This can happen, if thread A is in the process of killing superblock but hasn't called percpu_ref_kill(), and at this time thread B is mounting the same cgroup root and finds the root in the root list and performs percpu_ref_try_get(). To fix this, we try to increase both the refcnt of the superblock and the percpu refcnt of cgroup root. v2: - we should try to get both the superblock refcnt and cgroup_root refcnt, because cgroup_root may have no superblock assosiated with it. - adjust/add comments. tj: Updated comments. Renamed @sb to @pinned_sb. Cc: # 3.15 Signed-off-by: Li Zefan Signed-off-by: Tejun Heo

cgroup: fix mount failure in a corner case

2014-06-30T14:16:25Z

# cat test.sh #! /bin/bash mount -t cgroup -o cpu xxx /cgroup umount /cgroup mount -t cgroup -o cpu,cpuacct xxx /cgroup umount /cgroup # ./test.sh mount: xxx already mounted or /cgroup busy mount: according to mtab, xxx is already mounted on /cgroup It's because the cgroupfs_root of the first mount was under destruction asynchronously. Fix this by delaying and then retrying mount for this case. v3: - put the refcnt immediately after getting it. (Tejun) v2: - use percpu_ref_tryget_live() rather that introducing percpu_ref_alive(). (Tejun) - adjust comment. tj: Updated the comment a bit. Cc: # 3.15 Signed-off-by: Li Zefan Signed-off-by: Tejun Heo

cgroup: fix broken css_has_online_children()

2014-06-17T22:52:53Z

After running: # mount -t cgroup cpu xxx /cgroup && mkdir /cgroup/sub && \ rmdir /cgroup/sub && umount /cgroup I found the cgroup root still existed: # cat /proc/cgroups #subsys_name hierarchy num_cgroups enabled cpuset 0 1 1 cpu 1 1 1 ... It turned out css_has_online_children() is broken. Signed-off-by: Li Zefan Sigend-off-by: Tejun Heo