<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel/fork.c, branch v5.14.9</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.14.9</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.14.9'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2021-09-18T11:44:02Z</updated>
<entry>
<title>mm/hugetlb: initialize hugetlb_usage in mm_init</title>
<updated>2021-09-18T11:44:02Z</updated>
<author>
<name>Liu Zixian</name>
<email>liuzixian4@huawei.com</email>
</author>
<published>2021-09-09T01:10:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1b902d30d84d5d35db947872f9e90c0c935bb314'/>
<id>urn:sha1:1b902d30d84d5d35db947872f9e90c0c935bb314</id>
<content type='text'>
commit 13db8c50477d83ad3e3b9b0ae247e5cd833a7ae4 upstream.

After fork, the child process will get incorrect (2x) hugetlb_usage.  If
a process uses 5 2MB hugetlb pages in an anonymous mapping,

	HugetlbPages:	   10240 kB

and then forks, the child will show,

	HugetlbPages:	   20480 kB

The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child.  Child will have 2x actual usage.

Fix this by adding hugetlb_count_init in mm_init.

Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian &lt;liuzixian4@huawei.com&gt;
Reviewed-by: Naoya Horiguchi &lt;naoya.horiguchi@nec.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ucounts: Fix regression preventing increasing of rlimits in init_user_ns</title>
<updated>2021-08-23T21:10:42Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2021-08-23T16:12:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5ddf994fa22f78ae3742d72520a8c3e8521d96cd'/>
<id>urn:sha1:5ddf994fa22f78ae3742d72520a8c3e8521d96cd</id>
<content type='text'>
"Ma, XinjianX" &lt;xinjianx.ma@intel.com&gt; reported:

&gt; When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
&gt; in kselftest failed with following message.
&gt;
&gt; # selftests: mqueue: mq_perf_tests
&gt; #
&gt; # Initial system state:
&gt; #       Using queue path:                       /mq_perf_tests
&gt; #       RLIMIT_MSGQUEUE(soft):                  819200
&gt; #       RLIMIT_MSGQUEUE(hard):                  819200
&gt; #       Maximum Message Size:                   8192
&gt; #       Maximum Queue Size:                     10
&gt; #       Nice value:                             0
&gt; #
&gt; # Adjusted system state for testing:
&gt; #       RLIMIT_MSGQUEUE(soft):                  (unlimited)
&gt; #       RLIMIT_MSGQUEUE(hard):                  (unlimited)
&gt; #       Maximum Message Size:                   16777216
&gt; #       Maximum Queue Size:                     65530
&gt; #       Nice value:                             -20
&gt; #       Continuous mode:                        (disabled)
&gt; #       CPUs to pin:                            3
&gt; # ./mq_perf_tests: mq_open() at 296: Too many open files
&gt; not ok 2 selftests: mqueue: mq_perf_tests # exit=1
&gt; ```
&gt;
&gt; Test env:
&gt; rootfs: debian-10
&gt; gcc version: 9

After investigation the problem turned out to be that ucount_max for
the rlimits in init_user_ns was being set to the initial rlimit value.
The practical problem is that ucount_max provides a limit that
applications inside the user namespace can not exceed.  Which means in
practice that rlimits that have been converted to use the ucount
infrastructure were not able to exceend their initial rlimits.

Solve this by setting the relevant values of ucount_max to
RLIM_INIFINITY.  A limit in init_user_ns is pointless so the code
should allow the values to grow as large as possible without riscking
an underflow or an overflow.

As the ltp test case was a bit of a pain I have reproduced the rlimit failure
and tested the fix with the following little C program:
&gt; #include &lt;stdio.h&gt;
&gt; #include &lt;fcntl.h&gt;
&gt; #include &lt;sys/stat.h&gt;
&gt; #include &lt;mqueue.h&gt;
&gt; #include &lt;sys/time.h&gt;
&gt; #include &lt;sys/resource.h&gt;
&gt; #include &lt;errno.h&gt;
&gt; #include &lt;string.h&gt;
&gt; #include &lt;stdlib.h&gt;
&gt; #include &lt;limits.h&gt;
&gt; #include &lt;unistd.h&gt;
&gt;
&gt; int main(int argc, char **argv)
&gt; {
&gt; 	struct mq_attr mq_attr;
&gt; 	struct rlimit rlim;
&gt; 	mqd_t mqd;
&gt; 	int ret;
&gt;
&gt; 	ret = getrlimit(RLIMIT_MSGQUEUE, &amp;rlim);
&gt; 	if (ret != 0) {
&gt; 		fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
&gt; 		exit(EXIT_FAILURE);
&gt; 	}
&gt; 	printf("RLIMIT_MSGQUEUE %lu %lu\n",
&gt; 	       rlim.rlim_cur, rlim.rlim_max);
&gt; 	rlim.rlim_cur = RLIM_INFINITY;
&gt; 	rlim.rlim_max = RLIM_INFINITY;
&gt; 	ret = setrlimit(RLIMIT_MSGQUEUE, &amp;rlim);
&gt; 	if (ret != 0) {
&gt; 		fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
&gt; 		exit(EXIT_FAILURE);
&gt; 	}
&gt;
&gt; 	memset(&amp;mq_attr, 0, sizeof(struct mq_attr));
&gt; 	mq_attr.mq_maxmsg = 65536 - 1;
&gt; 	mq_attr.mq_msgsize = 16*1024*1024 - 1;
&gt;
&gt; 	mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &amp;mq_attr);
&gt; 	if (mqd == (mqd_t)-1) {
&gt; 		fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
&gt; 		exit(EXIT_FAILURE);
&gt; 	}
&gt; 	ret = mq_close(mqd);
&gt; 	if (ret) {
&gt; 		fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
&gt; 		exit(EXIT_FAILURE);
&gt; 	}
&gt;
&gt; 	return EXIT_SUCCESS;
&gt; }

Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Reported-by: kernel test robot lkp@intel.com
Acked-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'akpm' (patches from Andrew)</title>
<updated>2021-06-30T00:29:11Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-06-30T00:29:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=65090f30ab791810a3dc840317e57df05018559c'/>
<id>urn:sha1:65090f30ab791810a3dc840317e57df05018559c</id>
<content type='text'>
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp-&gt;high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
</content>
</entry>
<entry>
<title>mm: gup: pack has_pinned in MMF_HAS_PINNED</title>
<updated>2021-06-29T17:53:48Z</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2021-06-29T02:36:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a458b76a4171f893efa7657dc079924580a8746a'/>
<id>urn:sha1:a458b76a4171f893efa7657dc079924580a8746a</id>
<content type='text'>
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.

Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.

set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.

will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.

will-it-scale "pin_fast" retains the &gt; 4000% SMP scalability performance
improvement against upstream as expected.

This is a noop as far as overall performance and SMP scalability are
concerned.

[peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
  Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[akpm@linux-foundation.org: coding style fixes]
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]

Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: Kirill Shutemov &lt;kirill@shutemov.name&gt;
Cc: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2021-06-29T03:39:26Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-06-29T03:39:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c54b245d011855ea91c5beff07f1db74143ce614'/>
<id>urn:sha1:c54b245d011855ea91c5beff07f1db74143ce614</id>
<content type='text'>
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
</content>
</entry>
<entry>
<title>Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2021-06-28T19:14:19Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-06-28T19:14:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=54a728dc5e4feb0a9278ad62b19f34ad21ed0ee4'/>
<id>urn:sha1:54a728dc5e4feb0a9278ad62b19f34ad21ed0ee4</id>
<content type='text'>
Pull scheduler udpates from Ingo Molnar:

 - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
      coordinated scheduling across SMT siblings. This is a much
      requested feature for cloud computing platforms, to allow the
      flexible utilization of SMT siblings, without exposing untrusted
      domains to information leaks &amp; side channels, plus to ensure more
      deterministic computing performance on SMT systems used by
      heterogenous workloads.

      There are new prctls to set core scheduling groups, which allows
      more flexible management of workloads that can share siblings.

    - Fix task-&gt;state access anti-patterns that may result in missed
      wakeups and rename it to -&gt;__state in the process to catch new
      abuses.

 - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
      workloads.

    - "Age" (decay) average idle time, to better track &amp; improve
      workloads such as 'tbench'.

    - Fix &amp; improve energy-aware (EAS) balancing logic &amp; metrics.

    - Fix &amp; improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
      bursty CPU-bound workloads to borrow a bit against their future
      quota to improve overall latencies &amp; batching. Can be tweaked via
      /sys/fs/cgroup/cpu/&lt;X&gt;/cpu.cfs_burst_us.

    - Rework assymetric topology/capacity detection &amp; handling.

 - Scheduler statistics &amp; tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
      runtime if tooling needs it. Use static keys and other
      optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

 - Misc cleanups and fixes.

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/doc: Update the CPU capacity asymmetry bits
  sched/topology: Rework CPU capacity asymmetry detection
  sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
  psi: Fix race between psi_trigger_create/destroy
  sched/fair: Introduce the burstable CFS controller
  sched/uclamp: Fix uclamp_tg_restrict()
  sched/rt: Fix Deadline utilization tracking during policy change
  sched/rt: Fix RT utilization tracking during policy change
  sched: Change task_struct::state
  sched,arch: Remove unused TASK_STATE offsets
  sched,timer: Use __set_current_state()
  sched: Add get_current_state()
  sched,perf,kvm: Fix preemption condition
  sched: Introduce task_is_running()
  sched: Unbreak wakeups
  sched/fair: Age the average idle time
  sched/cpufreq: Consider reduced CPU capacity in energy calculation
  sched/fair: Take thermal pressure into account while estimating energy
  thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  sched/fair: Return early from update_tg_cfs_load() if delta == 0
  ...
</content>
</entry>
<entry>
<title>Revert "signal: Allow tasks to cache one sigqueue struct"</title>
<updated>2021-06-27T20:32:54Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-06-27T20:32:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b4b27b9eed8ebdbf9f3046197d29d733c8c944f3'/>
<id>urn:sha1:b4b27b9eed8ebdbf9f3046197d29d733c8c944f3</id>
<content type='text'>
This reverts commits 4bad58ebc8bc4f20d89cff95417c9b4674769709 (and
399f8dd9a866e107639eabd3c1979cd526ca3a98, which tried to fix it).

I do not believe these are correct, and I'm about to release 5.13, so am
reverting them out of an abundance of caution.

The locking is odd, and appears broken.

On the allocation side (in __sigqueue_alloc()), the locking is somewhat
straightforward: it depends on sighand-&gt;siglock.  Since one caller
doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
the case with no locks held.

On the freeing side (in sigqueue_cache_or_free()), there is no locking
at all, and the logic instead depends on 'current' being a single
thread, and not able to race with itself.

To make things more exciting, there's also the data race between freeing
a signal and allocating one, which is handled by using WRITE_ONCE() and
READ_ONCE(), and being mutually exclusive wrt the initial state (ie
freeing will only free if the old state was NULL, while allocating will
obviously only use the value if it was non-NULL, so only one or the
other will actually act on the value).

However, while the free-&gt;alloc paths do seem mutually exclusive thanks
to just the data value dependency, it's not clear what the memory
ordering constraints are on it.  Could writes from the previous
allocation possibly be delayed and seen by the new allocation later,
causing logical inconsistencies?

So it's all very exciting and unusual.

And in particular, it seems that the freeing side is incorrect in
depending on "current" being single-threaded.  Yes, 'current' is a
single thread, but in the presense of asynchronous events even a single
thread can have data races.

And such asynchronous events can and do happen, with interrupts causing
signals to be flushed and thus free'd (for example - sending a
SIGCONT/SIGSTOP can happen from interrupt context, and can flush
previously queued process control signals).

So regardless of all the other questions about the memory ordering and
locking for this new cached allocation, the sigqueue_cache_or_free()
assumptions seem to be fundamentally incorrect.

It may be that people will show me the errors of my ways, and tell me
why this is all safe after all.  We can reinstate it if so.  But my
current belief is that the WRITE_ONCE() that sets the cached entry needs
to be a smp_store_release(), and the READ_ONCE() that finds a cached
entry needs to be a smp_load_acquire() to handle memory ordering
correctly.

And the sequence in sigqueue_cache_or_free() would need to either use a
lock or at least be interrupt-safe some way (perhaps by using something
like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
percpu operations it needs to be interrupt-safe).

Fixes: 399f8dd9a866 ("signal: Prevent sigqueue caching after task got released")
Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched: Change task_struct::state</title>
<updated>2021-06-18T09:43:09Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2021-06-11T08:28:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2f064a59a11ff9bc22e52e9678bc601404c7cb34'/>
<id>urn:sha1:2f064a59a11ff9bc22e52e9678bc601404c7cb34</id>
<content type='text'>
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Daniel Bristot de Oliveira &lt;bristot@redhat.com&gt;
Acked-by: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Daniel Thompson &lt;daniel.thompson@linaro.org&gt;
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
</content>
</entry>
<entry>
<title>sched: Stop PF_NO_SETAFFINITY from being inherited by various init system threads</title>
<updated>2021-05-26T06:58:53Z</updated>
<author>
<name>Frederic Weisbecker</name>
<email>frederic@kernel.org</email>
</author>
<published>2021-05-25T23:58:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a8ea6fc9b089156d9230bfeef964dd9be101a4a9'/>
<id>urn:sha1:a8ea6fc9b089156d9230bfeef964dd9be101a4a9</id>
<content type='text'>
Commit:

  00b89fe0197f ("sched: Make the idle task quack like a per-CPU kthread")

... added PF_KTHREAD | PF_NO_SETAFFINITY to the idle kernel threads.

Unfortunately these properties are inherited to the init/0 children
through kernel_thread() calls: init/1 and kthreadd. There are several
side effects to that:

1) kthreadd affinity can not be reset anymore from userspace. Also
   PF_NO_SETAFFINITY propagates to all kthreadd children, including
   the unbound kthreads Therefore it's not possible anymore to overwrite
   the affinity of any of them. Here is an example of warning reported
   by rcutorture:

		WARNING: CPU: 0 PID: 116 at kernel/rcu/tree_nocb.h:1306 rcu_bind_current_to_nocb+0x31/0x40
		Call Trace:
		 rcu_torture_fwd_prog+0x62/0x730
		 kthread+0x122/0x140
		 ret_from_fork+0x22/0x30

2) init/1 does an exec() in the end which clears both
   PF_KTHREAD and PF_NO_SETAFFINITY so we are fine once kernel_init()
   escapes to userspace. But until then, no initcall or init code can
   successfully call sched_setaffinity() to init/1.

   Also PF_KTHREAD looks legit on init/1 before it calls exec() but
   we better be careful with unknown introduced side effects.

One way to solve the PF_NO_SETAFFINITY issue is to not inherit this flag
on copy_process() at all. The cases where it matters are:

* fork_idle(): explicitly set the flag already.
* fork() syscalls: userspace tasks that shouldn't be concerned by that.
* create_io_thread(): the callers explicitly attribute the flag to the
                      newly created tasks.
* kernel_thread():
	- Fix the issues on init/1 and kthreadd
	- Fix the issues on kthreadd children.
	- Usermode helper created by an unbound workqueue. This shouldn't
	  matter. In the worst case it gives more control to userspace
	  on setting affinity to these short living tasks although this can
	  be tuned with inherited unbound workqueues affinity already.

Fixes: 00b89fe0197f ("sched: Make the idle task quack like a per-CPU kthread")
Reported-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Link: https://lore.kernel.org/r/20210525235849.441842-1-frederic@kernel.org
</content>
</entry>
<entry>
<title>sched/core: Initialize the idle task with preemption disabled</title>
<updated>2021-05-12T11:01:45Z</updated>
<author>
<name>Valentin Schneider</name>
<email>valentin.schneider@arm.com</email>
</author>
<published>2021-05-12T09:46:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f1a0a376ca0c4ef1fc3d24e3e502acbb5b795674'/>
<id>urn:sha1:f1a0a376ca0c4ef1fc3d24e3e502acbb5b795674</id>
<content type='text'>
As pointed out by commit

  de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.

As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -&gt; idle_init().

Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

  @begone@
  @@

  -preempt_disable();
  ...
  cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

Signed-off-by: Valentin Schneider &lt;valentin.schneider@arm.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
</content>
</entry>
</feed>
