<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/cgroup/cpuset.c, branch v5.2.7</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.2.7</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.2.7'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-06-12T18:00:08Z</updated>
<entry>
<title>cpuset: restore sanity to cpuset_cpus_allowed_fallback()</title>
<updated>2019-06-12T18:00:08Z</updated>
<author>
<name>Joel Savitz</name>
<email>jsavitz@redhat.com</email>
</author>
<published>2019-06-12T15:50:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d477f8c202d1f0d4791ab1263ca7657bbe5cf79e'/>
<id>urn:sha1:d477f8c202d1f0d4791ab1263ca7657bbe5cf79e</id>
<content type='text'>
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk-&gt;cpus_allowed to
the current value of task_cs(tsk)-&gt;effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,fffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off &gt; /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on &gt; /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off &gt; /sys/devices/system/cpu/smt/control
	# echo on &gt; /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)-&gt;cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long &lt;longman@redhat.com&gt;
Suggested-by: Phil Auld &lt;pauld@redhat.com&gt;
Signed-off-by: Joel Savitz &lt;jsavitz@redhat.com&gt;
Acked-by: Phil Auld &lt;pauld@redhat.com&gt;
Acked-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup/cpuset: Update stale generate_sched_domains() comments</title>
<updated>2019-04-19T17:44:14Z</updated>
<author>
<name>Juri Lelli</name>
<email>juri.lelli@redhat.com</email>
</author>
<published>2018-12-19T13:34:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b6fbbf31d15b5072250ec6ed79e415a1160e5621'/>
<id>urn:sha1:b6fbbf31d15b5072250ec6ed79e415a1160e5621</id>
<content type='text'>
Commit:

  fc560a26acce ("cpuset: replace cpuset-&gt;stack_list with cpuset_for_each_descendant_pre()")

removed the local list (q) that was used to perform a top-down scan
of all cpusets; however, comments mentioning it were not updated.

Update comments to reflect current implementation.

Signed-off-by: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: cgroups@vger.kernel.org
Cc: lizefan@huawei.com
Link: http://lkml.kernel.org/r/20181219133445.31982-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2019-03-12T21:08:19Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-03-12T21:08:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7b47a9e7c8f672b6fb0b77fca11a63a8a77f5a91'/>
<id>urn:sha1:7b47a9e7c8f672b6fb0b77fca11a63a8a77f5a91</id>
<content type='text'>
Pull vfs mount infrastructure updates from Al Viro:
 "The rest of core infrastructure; no new syscalls in that pile, but the
  old parts are switched to new infrastructure. At that point
  conversions of individual filesystems can happen independently; some
  are done here (afs, cgroup, procfs, etc.), there's also a large series
  outside of that pile dealing with NFS (quite a bit of option-parsing
  stuff is getting used there - it's one of the most convoluted
  filesystems in terms of mount-related logics), but NFS bits are the
  next cycle fodder.

  It got seriously simplified since the last cycle; documentation is
  probably the weakest bit at the moment - I considered dropping the
  commit introducing Documentation/filesystems/mount_api.txt (cutting
  the size increase by quarter ;-), but decided that it would be better
  to fix it up after -rc1 instead.

  That pile allows to do followup work in independent branches, which
  should make life much easier for the next cycle. fs/super.c size
  increase is unpleasant; there's a followup series that allows to
  shrink it considerably, but I decided to leave that until the next
  cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
  afs: Use fs_context to pass parameters over automount
  afs: Add fs_context support
  vfs: Add some logging to the core users of the fs_context log
  vfs: Implement logging through fs_context
  vfs: Provide documentation for new mount API
  vfs: Remove kern_mount_data()
  hugetlbfs: Convert to fs_context
  cpuset: Use fs_context
  kernfs, sysfs, cgroup, intel_rdt: Support fs_context
  cgroup: store a reference to cgroup_ns into cgroup_fs_context
  cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
  cgroup_do_mount(): massage calling conventions
  cgroup: stash cgroup_root reference into cgroup_fs_context
  cgroup2: switch to option-by-option parsing
  cgroup1: switch to option-by-option parsing
  cgroup: take options parsing into -&gt;parse_monolithic()
  cgroup: fold cgroup1_mount() into cgroup1_get_tree()
  cgroup: start switching to fs_context
  ipc: Convert mqueue fs to fs_context
  proc: Add fs_context support to procfs
  ...
</content>
</entry>
<entry>
<title>cpuset: Use fs_context</title>
<updated>2019-02-28T08:29:35Z</updated>
<author>
<name>David Howells</name>
<email>dhowells@redhat.com</email>
</author>
<published>2018-11-01T23:07:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a18753747385b8b98577a18adc8ec99fda679044'/>
<id>urn:sha1:a18753747385b8b98577a18adc8ec99fda679044</id>
<content type='text'>
Make the cpuset filesystem use the filesystem context.  This is potentially
tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
with some special parameters.

This can, however, be handled by setting up an appropriate cgroup
filesystem and returning the root directory of that as the root dir of this
one.

Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>cpuset: remove unused task_has_mempolicy()</title>
<updated>2019-02-19T14:26:02Z</updated>
<author>
<name>Masahiro Yamada</name>
<email>yamada.masahiro@socionext.com</email>
</author>
<published>2019-02-18T06:28:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6a613d24effcb875271b8a1c510172e2d6eaaee8'/>
<id>urn:sha1:6a613d24effcb875271b8a1c510172e2d6eaaee8</id>
<content type='text'>
This is a remnant of commit 5f155f27cb7f ("mm, cpuset: always use
seqlock when changing task's nodemask").

Signed-off-by: Masahiro Yamada &lt;yamada.masahiro@socionext.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2018-12-29T18:57:20Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2018-12-29T18:57:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6f9d71c9c759b1e7d31189a4de228983192c7dc7'/>
<id>urn:sha1:6f9d71c9c759b1e7d31189a4de228983192c7dc7</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:

 - Waiman's cgroup2 cpuset support has been finally merged closing one
   of the last remaining feature gaps.

 - cgroup.procs could show non-leader threads when cgroup2 threaded mode
   was used in certain ways. I forgot to push the fix during the last
   cycle.

 - A patch to fix mount option parsing when all mount options have been
   consumed by someone else (LSM).

 - cgroup_no_v1 boot param can now block named cgroup1 hierarchies too.

* 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param
  cgroup: fix parsing empty mount option string
  cpuset: Remove set but not used variable 'cs'
  cgroup: fix CSS_TASK_ITER_PROCS
  cgroup: Add .__DEBUG__. prefix to debug file names
  cpuset: Minor cgroup2 interface updates
  cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
  cpuset: Add documentation about the new "cpuset.sched.partition" flag
  cpuset: Use descriptive text when reading/writing cpuset.sched.partition
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Make generate_sched_domains() work with partition
  cpuset: Make CPU hotplug work with partition
  cpuset: Track cpusets that use parent's effective_cpus
  cpuset: Add an error state to cpuset.sched.partition
  cpuset: Add new v2 cpuset.sched.partition flag
  cpuset: Simply allocation and freeing of cpumasks
  cpuset: Define data structures to support scheduling partition
  cpuset: Enable cpuset controller in default hierarchy
  cgroup: remove unnecessary unlikely()
</content>
</entry>
<entry>
<title>mm, oom: reorganize the oom report in dump_header</title>
<updated>2018-12-28T20:11:48Z</updated>
<author>
<name>yuzhoujian</name>
<email>yuzhoujian@didichuxing.com</email>
</author>
<published>2018-12-28T08:36:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ef8444ea01d7442652f8e1b8a8b94278cb57eafd'/>
<id>urn:sha1:ef8444ea01d7442652f8e1b8a8b94278cb57eafd</id>
<content type='text'>
OOM report contains several sections.  The first one is the allocation
context that has triggered the OOM.  Then we have cpuset context followed
by the stack trace of the OOM path.  The tird one is the OOM memory
information.  Followed by the current memory state of all system tasks.
At last, we will show oom eligible tasks and the information about the
chosen oom victim.

One thing that makes parsing more awkward than necessary is that we do not
have a single and easily parsable line about the oom context.  This patch
is reorganizing the oom report to

1) who invoked oom and what was the allocation request

[  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

2) OOM stack trace

[  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
[  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
[  515.906821] Call Trace:
[  515.908062]  dump_stack+0x5a/0x73
[  515.909311]  dump_header+0x55/0x28c
[  515.914260]  oom_kill_process+0x2d8/0x300
[  515.916708]  out_of_memory+0x145/0x4a0
[  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
[  515.919157]  __alloc_pages_nodemask+0x277/0x290
[  515.920367]  filemap_fault+0x3d0/0x6c0
[  515.921529]  ? filemap_map_pages+0x2b8/0x420
[  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
[  515.923884]  __do_fault+0x20/0x80
[  515.925032]  __handle_mm_fault+0xbc0/0xe80
[  515.926195]  handle_mm_fault+0xfa/0x210
[  515.927357]  __do_page_fault+0x233/0x4c0
[  515.928506]  do_page_fault+0x32/0x140
[  515.929646]  ? page_fault+0x8/0x30
[  515.930770]  page_fault+0x1e/0x30

3) OOM memory information

[  515.958093] Mem-Info:
[  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
 active_file:4402672 inactive_file:483963 isolated_file:1344
 unevictable:0 dirty:4886753 writeback:0 unstable:0
 slab_reclaimable:148442 slab_unreclaimable:18741
 mapped:1347 shmem:1347 pagetables:58669 bounce:0
 free:88663 free_pcp:0 free_cma:0
...

4) current memory state of all system tasks

[  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
[  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
[  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
[  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
[  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
[  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
[  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
[  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
[  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
[  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
[  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
[  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
[  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
[  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic

5) oom context (contrains and the chosen victim).

oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

An admin can easily get the full oom context at a single line which
makes parsing much easier.

Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
Signed-off-by: yuzhoujian &lt;yuzhoujian@didichuxing.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: "Kirill A . Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Cc: Yang Shi &lt;yang.s@alibaba-inc.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cpuset: Remove set but not used variable 'cs'</title>
<updated>2018-12-03T16:23:22Z</updated>
<author>
<name>YueHaibing</name>
<email>yuehaibing@huawei.com</email>
</author>
<published>2018-12-01T03:12:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1e7eacaf1db2f0f5f62fceda4e6c5a8869f00c13'/>
<id>urn:sha1:1e7eacaf1db2f0f5f62fceda4e6c5a8869f00c13</id>
<content type='text'>
Fixes gcc '-Wunused-but-set-variable' warning:

kernel/cgroup/cpuset.c: In function 'cpuset_cancel_attach':
kernel/cgroup/cpuset.c:2167:17: warning:
 variable 'cs' set but not used [-Wunused-but-set-variable]

It never used since introduction in commit 1f7dd3e5a6e4 ("cgroup: fix handling
of multi-destination migration from subtree_control enabling")

Signed-off-by: YueHaibing &lt;yuehaibing@huawei.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cpuset: Minor cgroup2 interface updates</title>
<updated>2018-11-13T20:09:48Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2018-11-13T20:03:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b1e3aeb11c5e86ee0988a038c4e7682d6beaa977'/>
<id>urn:sha1:b1e3aeb11c5e86ee0988a038c4e7682d6beaa977</id>
<content type='text'>
* Rename the partition file from "cpuset.sched.partition" to
  "cpuset.cpus.partition".

* When writing to the partition file, drop "0" and "1" and only accept
  "member" and "root".

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
</content>
</entry>
<entry>
<title>cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug</title>
<updated>2018-11-08T20:27:32Z</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2018-11-08T15:08:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5cf8114d6e90b3822be5eb6a2faedf99d1c08f77'/>
<id>urn:sha1:5cf8114d6e90b3822be5eb6a2faedf99d1c08f77</id>
<content type='text'>
For debugging purpose, it will be useful to expose the content of the
subparts_cpus as a read-only file to see if the code work correctly.
However, subparts_cpus will not be used at all in most use cases. So
adding a new cpuset file that clutters the cgroup directory may not be
desirable.  This is now being done by using the hidden "cgroup_debug"
kernel command line option to expose a new "cpuset.cpus.subpartitions"
file.

That option was originally used by the debug controller to expose
itself when configured into the kernel. This is now extended to set an
internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
can now be used to specify that a cgroup file should only be created
when the "cgroup_debug" option is specified.

Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
