<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/cgroup.h, branch v4.19.285</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.19.285</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.19.285'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2022-12-14T10:28:27Z</updated>
<entry>
<title>memcg: fix possible use-after-free in memcg_write_event_control()</title>
<updated>2022-12-14T10:28:27Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2022-12-08T02:53:15Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e1ae97624ecf400ea56c238bff23e5cd139df0b8'/>
<id>urn:sha1:e1ae97624ecf400ea56c238bff23e5cd139df0b8</id>
<content type='text'>
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream.

memcg_write_event_control() accesses the dentry-&gt;d_name of the specified
control fd to route the write call.  As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file.  Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.

Prior to 347c4a874710 ("memcg: remove cgroup_event-&gt;cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses.  The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently dropped
the file type check with it allowing any file to slip through.  With the
invarients broken, the d_name and parent accesses can now race against
renames and removals of arbitrary files and cause use-after-free's.

Fix the bug by resurrecting the file type check in __file_cft().  Now that
cgroupfs is implemented through kernfs, checking the file operations needs
to go through a layer of indirection.  Instead, let's check the superblock
and dentry type.

Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
Fixes: 347c4a874710 ("memcg: remove cgroup_event-&gt;cft")
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.14+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: fix cgroup_sk_alloc() for sk_clone_lock()</title>
<updated>2020-07-22T07:32:00Z</updated>
<author>
<name>Cong Wang</name>
<email>xiyou.wangcong@gmail.com</email>
</author>
<published>2020-07-02T18:52:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0505cc4c908f46a4d22c4994dbbe7bc489d0c52d'/>
<id>urn:sha1:0505cc4c908f46a4d22c4994dbbe7bc489d0c52d</id>
<content type='text'>
[ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]

When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd-&gt;val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd-&gt;val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.

This bug seems to be introduced since the beginning, commit
d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d671 ("sock, cgroup: add sock-&gt;sk_cgroup")
Reported-by: Cameron Berkenpas &lt;cam@neo-zeon.de&gt;
Reported-by: Peter Geis &lt;pgwipeout@gmail.com&gt;
Reported-by: Lu Fengqi &lt;lufq.fnst@cn.fujitsu.com&gt;
Reported-by: Daniël Sonck &lt;dsonck92@gmail.com&gt;
Reported-by: Zhang Qiang &lt;qiang.zhang@windriver.com&gt;
Tested-by: Cameron Berkenpas &lt;cam@neo-zeon.de&gt;
Tested-by: Peter Geis &lt;pgwipeout@gmail.com&gt;
Tested-by: Thomas Lamprecht &lt;t.lamprecht@proxmox.com&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Zefan Li &lt;lizefan@huawei.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Iterate tasks that did not finish do_exit()</title>
<updated>2020-03-18T06:14:19Z</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2020-01-24T11:40:15Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ab3e3b23d8d53c542cdfef19f5dbb2e13cc2b957'/>
<id>urn:sha1:ab3e3b23d8d53c542cdfef19f5dbb2e13cc2b957</id>
<content type='text'>
commit 9c974c77246460fa6a92c18554c3311c8c83c160 upstream.

PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.

Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.

Reported-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>cgroup: Include dying leaders with live threads in PROCS iterations</title>
<updated>2019-08-09T15:52:34Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-05-31T17:38:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4340d175b89896d069c1e875f5b98c80a408f680'/>
<id>urn:sha1:4340d175b89896d069c1e875f5b98c80a408f680</id>
<content type='text'>
commit c03cd7738a83b13739f00546166969342c8ff014 upstream.

CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.

Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-and-tested-by: Topi Miettinen &lt;toiwoton@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>cgroup: Implement css_task_iter_skip()</title>
<updated>2019-08-09T15:52:34Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-05-31T17:38:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=370b9e6399da09fe10005fe455878b356de7b85f'/>
<id>urn:sha1:370b9e6399da09fe10005fe455878b356de7b85f</id>
<content type='text'>
commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.

When a task is moved out of a cset, task iterators pointing to the
task are advanced using the normal css_task_iter_advance() call.  This
is fine but we'll be tracking dying tasks on csets and thus moving
tasks from cset-&gt;tasks to (to be added) cset-&gt;dying_tasks.  When we
remove a task from cset-&gt;tasks, if we advance the iterators, they may
move over to the next cset before we had the chance to add the task
back on the dying list, which can allow the task to escape iteration.

This patch separates out skipping from advancing.  Skipping only moves
the affected iterators to the next pointer rather than fully advancing
it and the following advancing will recognize that the cursor has
already been moved forward and do the rest of advancing.  This ensures
that when a task moves from one list to another in its cset, as long
as it moves in the right direction, it's always visible to iteration.

This doesn't cause any visible behavior changes.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()</title>
<updated>2019-06-19T06:18:00Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2019-05-29T20:46:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c3b85bda41f0a8c00b7492cbc9042b612dad2162'/>
<id>urn:sha1:c3b85bda41f0a8c00b7492cbc9042b612dad2162</id>
<content type='text'>
commit 18fa84a2db0e15b02baa5d94bdb5bd509175d2f6 upstream.

A PF_EXITING task can stay associated with an offline css.  If such
task calls task_get_css(), it can get stuck indefinitely.  This can be
triggered by BSD process accounting which writes to a file with
PF_EXITING set when racing against memcg disable as in the backtrace
at the end.

After this change, task_get_css() may return a css which was already
offline when the function was called.  None of the existing users are
affected by this change.

  INFO: rcu_sched self-detected stall on CPU
  INFO: rcu_sched detected stalls on CPUs/tasks:
  ...
  NMI backtrace for cpu 0
  ...
  Call Trace:
   &lt;IRQ&gt;
   dump_stack+0x46/0x68
   nmi_cpu_backtrace.cold.2+0x13/0x57
   nmi_trigger_cpumask_backtrace+0xba/0xca
   rcu_dump_cpu_stacks+0x9e/0xce
   rcu_check_callbacks.cold.74+0x2af/0x433
   update_process_times+0x28/0x60
   tick_sched_timer+0x34/0x70
   __hrtimer_run_queues+0xee/0x250
   hrtimer_interrupt+0xf4/0x210
   smp_apic_timer_interrupt+0x56/0x110
   apic_timer_interrupt+0xf/0x20
   &lt;/IRQ&gt;
  RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
  ...
   btrfs_file_write_iter+0x31b/0x563
   __vfs_write+0xfa/0x140
   __kernel_write+0x4f/0x100
   do_acct_process+0x495/0x580
   acct_process+0xb9/0xdb
   do_exit+0x748/0xa00
   do_group_exit+0x3a/0xa0
   get_signal+0x254/0x560
   do_signal+0x23/0x5c0
   exit_to_usermode_loop+0x5d/0xa0
   prepare_exit_to_usermode+0x53/0x80
   retint_user+0x8/0x8

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: stable@vger.kernel.org # v4.2+
Fixes: ec438699a9ae ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>cgroup/pids: turn cgroup_subsys-&gt;free() into cgroup_subsys-&gt;release() to fix the accounting</title>
<updated>2019-04-05T20:33:13Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2019-01-28T16:00:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d0bc74c5632f8e5080bcb4bb1843be5f3ed112cd'/>
<id>urn:sha1:d0bc74c5632f8e5080bcb4bb1843be5f3ed112cd</id>
<content type='text'>
[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]

The only user of cgroup_subsys-&gt;free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.

However, -&gt;free() is called from __put_task_struct()-&gt;cgroup_free() and this
is too late. Even the trivial program which does

	for (;;) {
		int pid = fork();
		assert(pid &gt;= 0);
		if (pid)
			wait(NULL);
		else
			exit(0);
	}

can run out of limits because release_task()-&gt;call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().

Test-case:

	mkdir -p /tmp/CG
	mount -t cgroup2 none /tmp/CG
	echo '+pids' &gt; /tmp/CG/cgroup.subtree_control

	mkdir /tmp/CG/PID
	echo 2 &gt; /tmp/CG/PID/pids.max

	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &amp;
	echo $! &gt; /tmp/CG/PID/cgroup.procs

Without this patch the forking process fails soon after migration.

Rename cgroup_subsys-&gt;free() to cgroup_subsys-&gt;release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).

Reported-by: Herton R. Krzesinski &lt;hkrzesin@redhat.com&gt;
Reported-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Introduce bpf_skb_ancestor_cgroup_id helper</title>
<updated>2018-08-12T23:02:39Z</updated>
<author>
<name>Andrey Ignatov</name>
<email>rdna@fb.com</email>
</author>
<published>2018-08-12T17:49:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7723628101aaeb1d723786747529b4ea65c5b5c5'/>
<id>urn:sha1:7723628101aaeb1d723786747529b4ea65c5b5c5</id>
<content type='text'>
== Problem description ==

It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.

Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.

It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).

At the same time it may be useful to apply a policy to container as a
whole.

If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:

* bpf_skb_cgroup_id will return id of immediate cgroup associated with
  skb, i.e. if it's a sub-cgroup inside container, it can't be used to
  identify container's cgroup;

* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
  i.e. if there are N containers on a host and a policy has to be
  applied to M of them (0 &lt;= M &lt;= N), it'd require M calls to
  bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &amp;
  load new BPF program.

== Solution ==

The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.

That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.

E.g. if there is a cgroup hierarchy:
  root/
  root/container1/
  root/container1/app11/
  root/container1/app11/sub-app-a/
  root/container1/app12/
  root/container2/
  root/container2/app21/
  root/container2/app22/
  root/container2/app22/sub-app-b/

, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.

Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.

Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.

== Implementation details ==

The helper gets ancestor by walking parents up to specified level.

Another option would be to get different kind of "id" from
cgroup-&gt;ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).

Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.

Signed-off-by: Andrey Ignatov &lt;rdna@fb.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>cgroup: Replace cgroup_rstat_mutex with a spinlock</title>
<updated>2018-04-26T21:29:05Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2018-04-26T21:29:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0fa294fb1985c06c4e3325e30e759d4ca580f59a'/>
<id>urn:sha1:0fa294fb1985c06c4e3325e30e759d4ca580f59a</id>
<content type='text'>
Currently, rstat flush path is protected with a mutex which is fine as
all the existing users are from interface file show path.  However,
rstat is being generalized for use by controllers and flushing from
atomic contexts will be necessary.

This patch replaces cgroup_rstat_mutex with a spinlock and adds a
irq-safe flush function - cgroup_rstat_flush_irqsafe().  Explicit
yield handling is added to the flush path so that other flush
functions can yield to other threads and flushers.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Factor out and expose cgroup_rstat_*() interface functions</title>
<updated>2018-04-26T21:29:05Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2018-04-26T21:29:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6162cef0f741c70eb0c7ac7e6142f85808d8abc4'/>
<id>urn:sha1:6162cef0f741c70eb0c7ac7e6142f85808d8abc4</id>
<content type='text'>
cgroup_rstat is being generalized so that controllers can use it too.
This patch factors out and exposes the following interface functions.

* cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
  consistency.

* cgroup_rstat_flush_hold/release(): Factored out from base stat
  implementation.

* cgroup_rstat_flush(): Verbatim expose.

While at it, drop assert on cgroup_rstat_mutex in
cgroup_base_stat_flush() as it crosses layers and make a minor comment
update.

v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
