<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/kernel, branch v6.6.54</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.6.54</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.6.54'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2024-10-04T14:30:03Z</updated>
<entry>
<title>bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()</title>
<updated>2024-10-04T14:30:03Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2024-09-19T13:28:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=790c630ab0e7d7aba6d186581d4627c09fce60f3'/>
<id>urn:sha1:790c630ab0e7d7aba6d186581d4627c09fce60f3</id>
<content type='text'>
commit 5fe6e308abaea082c20fbf2aa5df8e14495622cf upstream.

If bpf_link_prime() fails, bpf_uprobe_multi_link_attach() goes to the
error_free label and frees the array of bpf_uprobe's without calling
bpf_uprobe_unregister().

This leaks bpf_uprobe-&gt;uprobe and worse, this frees bpf_uprobe-&gt;consumer
without removing it from the uprobe-&gt;consumers list.

Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link")
Closes: https://lore.kernel.org/all/000000000000382d39061f59f2dd@google.com/
Reported-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Acked-by: Jiri Olsa &lt;jolsa@kernel.org&gt;
Tested-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240813152524.GA7292@redhat.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>module: Fix KCOV-ignored file name</title>
<updated>2024-10-04T14:30:03Z</updated>
<author>
<name>Dmitry Vyukov</name>
<email>dvyukov@google.com</email>
</author>
<published>2024-06-11T07:50:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6ec62dba4a19998f88252ff522c07f87c6cd2cb4'/>
<id>urn:sha1:6ec62dba4a19998f88252ff522c07f87c6cd2cb4</id>
<content type='text'>
commit f34d086fb7102fec895fd58b9e816b981b284c17 upstream.

module.c was renamed to main.c, but the Makefile directive was copy-pasted
verbatim with the old file name.  Fix up the file name.

Fixes: cfc1d277891e ("module: Move all into module/")
Signed-off-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Alexander Potapenko &lt;glider@google.com&gt;
Reviewed-by: Marco Elver &lt;elver@google.com&gt;
Reviewed-by: Andrey Konovalov &lt;andreyknvl@gmail.com&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/bc0cf790b4839c5e38e2fafc64271f620568a39e.1718092070.git.dvyukov@google.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>lockdep: fix deadlock issue between lockdep and rcu</title>
<updated>2024-10-04T14:30:02Z</updated>
<author>
<name>Zhiguo Niu</name>
<email>zhiguo.niu@unisoc.com</email>
</author>
<published>2024-06-20T22:54:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e1e734c1a085c263f26ecb420221587f80172088'/>
<id>urn:sha1:e1e734c1a085c263f26ecb420221587f80172088</id>
<content type='text'>
commit a6f88ac32c6e63e69c595bfae220d8641704c9b7 upstream.

There is a deadlock scenario between lockdep and rcu when
rcu nocb feature is enabled, just as following call stack:

     rcuop/x
-000|queued_spin_lock_slowpath(lock = 0xFFFFFF817F2A8A80, val = ?)
-001|queued_spin_lock(inline) // try to hold nocb_gp_lock
-001|do_raw_spin_lock(lock = 0xFFFFFF817F2A8A80)
-002|__raw_spin_lock_irqsave(inline)
-002|_raw_spin_lock_irqsave(lock = 0xFFFFFF817F2A8A80)
-003|wake_nocb_gp_defer(inline)
-003|__call_rcu_nocb_wake(rdp = 0xFFFFFF817F30B680)
-004|__call_rcu_common(inline)
-004|call_rcu(head = 0xFFFFFFC082EECC28, func = ?)
-005|call_rcu_zapped(inline)
-005|free_zapped_rcu(ch = ?)// hold graph lock
-006|rcu_do_batch(rdp = 0xFFFFFF817F245680)
-007|nocb_cb_wait(inline)
-007|rcu_nocb_cb_kthread(arg = 0xFFFFFF817F245680)
-008|kthread(_create = 0xFFFFFF80803122C0)
-009|ret_from_fork(asm)

     rcuop/y
-000|queued_spin_lock_slowpath(lock = 0xFFFFFFC08291BBC8, val = 0)
-001|queued_spin_lock()
-001|lockdep_lock()
-001|graph_lock() // try to hold graph lock
-002|lookup_chain_cache_add()
-002|validate_chain()
-003|lock_acquire
-004|_raw_spin_lock_irqsave(lock = 0xFFFFFF817F211D80)
-005|lock_timer_base(inline)
-006|mod_timer(inline)
-006|wake_nocb_gp_defer(inline)// hold nocb_gp_lock
-006|__call_rcu_nocb_wake(rdp = 0xFFFFFF817F2A8680)
-007|__call_rcu_common(inline)
-007|call_rcu(head = 0xFFFFFFC0822E0B58, func = ?)
-008|call_rcu_hurry(inline)
-008|rcu_sync_call(inline)
-008|rcu_sync_func(rhp = 0xFFFFFFC0822E0B58)
-009|rcu_do_batch(rdp = 0xFFFFFF817F266680)
-010|nocb_cb_wait(inline)
-010|rcu_nocb_cb_kthread(arg = 0xFFFFFF817F266680)
-011|kthread(_create = 0xFFFFFF8080363740)
-012|ret_from_fork(asm)

rcuop/x and rcuop/y are rcu nocb threads with the same nocb gp thread.
This patch release the graph lock before lockdep call_rcu.

Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use")
Cc: stable@vger.kernel.org
Cc: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: Carlos Llamas &lt;cmllamas@google.com&gt;
Cc: Bart Van Assche &lt;bvanassche@acm.org&gt;
Signed-off-by: Zhiguo Niu &lt;zhiguo.niu@unisoc.com&gt;
Signed-off-by: Xuewen Yan &lt;xuewen.yan@unisoc.com&gt;
Reviewed-by: Waiman Long &lt;longman@redhat.com&gt;
Reviewed-by: Carlos Llamas &lt;cmllamas@google.com&gt;
Reviewed-by: Bart Van Assche &lt;bvanassche@acm.org&gt;
Signed-off-by: Carlos Llamas &lt;cmllamas@google.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Link: https://lore.kernel.org/r/20240620225436.3127927-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>padata: use integer wrap around to prevent deadlock on seq_nr overflow</title>
<updated>2024-10-04T14:29:56Z</updated>
<author>
<name>VanGiang Nguyen</name>
<email>vangiang.nguyen@rohde-schwarz.com</email>
</author>
<published>2024-08-09T06:21:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1b8cf11b3ca593a8802a51802cd0c28c38501428'/>
<id>urn:sha1:1b8cf11b3ca593a8802a51802cd0c28c38501428</id>
<content type='text'>
commit 9a22b2812393d93d84358a760c347c21939029a6 upstream.

When submitting more than 2^32 padata objects to padata_do_serial, the
current sorting implementation incorrectly sorts padata objects with
overflowed seq_nr, causing them to be placed before existing objects in
the reorder list. This leads to a deadlock in the serialization process
as padata_find_next cannot match padata-&gt;seq_nr and pd-&gt;processed
because the padata instance with overflowed seq_nr will be selected
next.

To fix this, we use an unsigned integer wrap around to correctly sort
padata objects in scenarios with integer overflow.

Fixes: bfde23ce200e ("padata: unbind parallel jobs from specific CPUs")
Cc: &lt;stable@vger.kernel.org&gt;
Co-developed-by: Christian Gafert &lt;christian.gafert@rohde-schwarz.com&gt;
Signed-off-by: Christian Gafert &lt;christian.gafert@rohde-schwarz.com&gt;
Co-developed-by: Max Ferger &lt;max.ferger@rohde-schwarz.com&gt;
Signed-off-by: Max Ferger &lt;max.ferger@rohde-schwarz.com&gt;
Signed-off-by: Van Giang Nguyen &lt;vangiang.nguyen@rohde-schwarz.com&gt;
Acked-by: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error</title>
<updated>2024-10-04T14:29:23Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2024-09-13T19:17:50Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a634fa8e480ac2423f86311a602f6295df2c8ed0'/>
<id>urn:sha1:a634fa8e480ac2423f86311a602f6295df2c8ed0</id>
<content type='text'>
[ Upstream commit 4b3786a6c5397dc220b1483d8e2f4867743e966f ]

For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input
arguments, zero the value for the case of an error as otherwise it could leak
memory. For tracing, it is not needed given CAP_PERFMON can already read all
kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped
in here.

Also, the MTU helpers mtu_len pointer value is being written but also read.
Technically, the MEM_UNINIT should not be there in order to always force init.
Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now
implies two things actually: i) write into memory, ii) memory does not have
to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory,
ii) memory must be initialized. This means that for bpf_*_check_mtu() we're
readding the issue we're trying to fix, that is, it would then be able to
write back into things like .rodata BPF maps. Follow-up work will rework the
MEM_UNINIT semantics such that the intent can be better expressed. For now
just clear the *mtu_len on error path which can be lifted later again.

Fixes: 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types</title>
<updated>2024-10-04T14:29:23Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2024-09-13T19:17:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=abf7559b4ff697714a6ca2600c16b02cfd75a2cb'/>
<id>urn:sha1:abf7559b4ff697714a6ca2600c16b02cfd75a2cb</id>
<content type='text'>
[ Upstream commit 18752d73c1898fd001569195ba4b0b8c43255f4a ]

When checking malformed helper function signatures, also take other argument
types into account aside from just ARG_PTR_TO_UNINIT_MEM.

This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can
be passed there, too.

The func proto sanity check goes back to commit 435faee1aae9 ("bpf, verifier:
add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos
which had more than just one MEM_UNINIT-tagged type as arguments.

The reason more than one is currently not supported is as we mark stack slots with
STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to
allow uninitialized stack memory to be passed to helpers when they just write into
the buffer.

Probing for base type as well as MEM_UNINIT tagging ensures that other types do not
get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}).

Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Shung-Hsi Yu &lt;shung-hsi.yu@suse.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Acked-by: Shung-Hsi Yu &lt;shung-hsi.yu@suse.com&gt;
Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Fix helper writes to read-only maps</title>
<updated>2024-10-04T14:29:23Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2024-09-13T19:17:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a2c8dc7e21803257e762b0bf067fd13e9c995da0'/>
<id>urn:sha1:a2c8dc7e21803257e762b0bf067fd13e9c995da0</id>
<content type='text'>
[ Upstream commit 32556ce93bc45c730829083cb60f95a2728ea48b ]

Lonial found an issue that despite user- and BPF-side frozen BPF map
(like in case of .rodata), it was still possible to write into it from
a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT}
as arguments.

In check_func_arg() when the argument is as mentioned, the meta-&gt;raw_mode
is never set. Later, check_helper_mem_access(), under the case of
PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the
subsequent call to check_map_access_type() and given the BPF map is
read-only it succeeds.

The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT
when results are written into them as opposed to read out of them. The
latter indicates that it's okay to pass a pointer to uninitialized memory
as the memory is written to anyway.

However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM
just with additional alignment requirement. So it is better to just get
rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the
fixed size memory types. For this, add MEM_ALIGNED to additionally ensure
alignment given these helpers write directly into the args via *&lt;ptr&gt; = val.
The .arg*_size has been initialized reflecting the actual sizeof(*&lt;ptr&gt;).

MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated
argument types, since in !MEM_FIXED_SIZE cases the verifier does not know
the buffer size a priori and therefore cannot blindly write *&lt;ptr&gt; = val.

Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Lonial Con &lt;kongln9170@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Acked-by: Shung-Hsi Yu &lt;shung-hsi.yu@suse.com&gt;
Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit</title>
<updated>2024-10-04T14:29:22Z</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2024-09-13T19:17:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=81197a9b45109522d1eb2755eb65662c3b23a080'/>
<id>urn:sha1:81197a9b45109522d1eb2755eb65662c3b23a080</id>
<content type='text'>
[ Upstream commit cfe69c50b05510b24e26ccb427c7cc70beafd6c1 ]

The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit:

The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long"
and therefore always considered fixed 64bit no matter if 64 or 32bit underlying
architecture.

This contract breaks in case of the two mentioned helpers since their BPF_CALL
definition for the helpers was added with {unsigned,}long *res. Meaning, the
transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper)
breaks here.

Both helpers call __bpf_strtoll() with "long long" correctly, but later assigning
the result into 32-bit "*(long *)" on 32bit architectures. From a BPF program
point of view, this means upper bits will be seen as uninitialised.

Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation.

Now, changing also uapi/bpf.h helper documentation which generates bpf_helper_defs.h
for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger
compiler warnings (incompatible pointer types passing 'long *' to parameter of type
'__s64 *' (aka 'long long *')) for existing BPF programs.

Leaving the signatures as-is would be fine as from BPF program point of view it is
still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying
architectures.

Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue.

Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Reported-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/numa: Fix the vma scan starving issue</title>
<updated>2024-10-04T14:29:22Z</updated>
<author>
<name>Yujie Liu</name>
<email>yujie.liu@intel.com</email>
</author>
<published>2024-08-27T11:29:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=66f3fc741177848f0b5dcb51cc93328fbafc43f0'/>
<id>urn:sha1:66f3fc741177848f0b5dcb51cc93328fbafc43f0</id>
<content type='text'>
[ Upstream commit f22cde4371f3c624e947a35b075c06c771442a43 ]

Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
Numa vma scan overhead has been reduced a lot.  Meanwhile, the reducing of
the vma scan might create less Numa page fault information.  The
insufficient information makes it harder for the Numa balancer to make
decision.  Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
partial VMAs regardless of PID activity") and commit 84db47ca7146d7
("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to
bring back part of the performance.

Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a
long duration of remote Numa node read was observed by PMU events: A few
cores having ~500MB/s remote memory access for ~20 seconds.  It causes
high core-to-core variance and performance penalty.  After the
investigation, it is found that many vmas are skipped due to the active
PID check.  According to the trace events, in most cases,
vma_is_accessed() returns false because the history access info stored in
pids_active array has been cleared.

Proposal:
The main idea is to adjust vma_is_accessed() to let it return true easier.
Thus compare the diff between mm-&gt;numa_scan_seq and
vma-&gt;numab_state-&gt;prev_scan_seq.  If the diff has exceeded the threshold,
scan the vma.

This patch especially helps the cases where there are small number of
threads, like the process-based SPECcpu.  Without this patch, if the
SPECcpu process access the vma at the beginning, then sleeps for a long
time, the pid_active array will be cleared.  A a result, if this process
is woken up again, it never has a chance to set prot_none anymore.
Because only the first 2 times of access is granted for vma scan:
(current-&gt;mm-&gt;numa_scan_seq) - vma-&gt;numab_state-&gt;start_scan_seq) &lt; 2 to be
worse, no other threads within the task can help set the prot_none.  This
causes information lost.

Raghavendra helped test current patch and got the positive result
on the AMD platform:

autonumabench NUMA01
                            base                  patched
Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*

Duration User      380345.36   368252.04
Duration System      1358.89     1156.23
Duration Elapsed     2277.45     2213.25

autonumabench NUMA02

Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*

Duration User        1513.23     1575.48
Duration System         8.33        8.13
Duration Elapsed       28.59       29.71

kernbench

Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*

Duration User       68816.41    67615.74
Duration System     21873.94    22848.08
Duration Elapsed      506.66      504.55

Intel 256 CPUs/2 Sockets:
autonuma benchmark also shows improvements:

                                               v6.10-rc5              v6.10-rc5
                                                                         +patch
Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*

                   v6.10-rc5   v6.10-rc5
                                  +patch
Duration User     1064010.16  1075416.23
Duration System      3307.64     3104.66
Duration Elapsed     4537.54     4604.73

The SPECcpu remote node access issue disappears with the patch applied.

Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
Signed-off-by: Chen Yu &lt;yu.c.chen@intel.com&gt;
Co-developed-by: Chen Yu &lt;yu.c.chen@intel.com&gt;
Signed-off-by: Yujie Liu &lt;yujie.liu@intel.com&gt;
Reported-by: Xiaoping Zhou &lt;xiaoping.zhou@intel.com&gt;
Reviewed-and-tested-by: Raghavendra K T &lt;raghavendra.kt@amd.com&gt;
Acked-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: "Chen, Tim C" &lt;tim.c.chen@intel.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Raghavendra K T &lt;raghavendra.kt@amd.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/numa: Complete scanning of inactive VMAs when there is no alternative</title>
<updated>2024-10-04T14:29:22Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2023-10-10T08:31:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e3a2d3f6c40e749ad431a1cb7fd5dd78a02ab592'/>
<id>urn:sha1:e3a2d3f6c40e749ad431a1cb7fd5dd78a02ab592</id>
<content type='text'>
[ Upstream commit f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa ]

VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	    649 reason=scan_delay
	  9,094 reason=unsuitable
	 48,915 reason=shared_ro
	143,919 reason=inaccessible
	193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=============================
	    146 reason=seq_completed
	    622 reason=ignore_pid_inactive

	    624 reason=scan_delay
	  6,570 reason=unsuitable
	 16,101 reason=shared_ro
	 27,608 reason=inaccessible
	 41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

                                                 6.6.0-rc2              6.6.0-rc2
                                       sched-numabtrace-v1 sched-numabselective-v1
  Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
  Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
  Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
  CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
  Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                     6.6.0-rc2   6.6.0-rc2
                  sched-numabtrace-v1 sched-numabselective-v1 /
  Duration User       91201.80    63506.64
  Duration System      2015.53     1819.78
  Duration Elapsed     1234.77      868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

                                        6.6.0-rc2      6.6.0-rc2
                              sched-numabtrace-v1 sched-numabselective-v1
  Ops NUMA base-page range updates       64272.00    26374386.00
  Ops NUMA PTE updates                   36624.00       55538.00
  Ops NUMA PMD updates                      54.00       51404.00
  Ops NUMA hint faults                   15504.00       75786.00
  Ops NUMA hint local faults %           14860.00       56763.00
  Ops NUMA hint local percent               95.85          74.90
  Ops NUMA pages migrated                 1629.00     6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Tested-by: Raghavendra K T &lt;raghavendra.kt@amd.com&gt;
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
