<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm, branch v6.11</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.11</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.11'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2024-09-12T19:10:00Z</updated>
<entry>
<title>mm: avoid leaving partial pfn mappings around in error case</title>
<updated>2024-09-12T19:10:00Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-12T00:11:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=79a61cc3fc0466ad2b7b89618a6157785f0293b3'/>
<id>urn:sha1:79a61cc3fc0466ad2b7b89618a6157785f0293b3</id>
<content type='text'>
As Jann points out, PFN mappings are special, because unlike normal
memory mappings, there is no lifetime information associated with the
mapping - it is just a raw mapping of PFNs with no reference counting of
a 'struct page'.

That's all very much intentional, but it does mean that it's easy to
mess up the cleanup in case of errors.  Yes, a failed mmap() will always
eventually clean up any partial mappings, but without any explicit
lifetime in the page table mapping itself, it's very easy to do the
error handling in the wrong order.

In particular, it's easy to mistakenly free the physical backing store
before the page tables are actually cleaned up and (temporarily) have
stale dangling PTE entries.

To make this situation less error-prone, just make sure that any partial
pfn mapping is torn down early, before any other error handling.

Reported-and-tested-by: Jann Horn &lt;jannh@google.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Jason Gunthorpe &lt;jgg@ziepe.ca&gt;
Cc: Simona Vetter &lt;simona.vetter@ffwll.ch&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-6.11-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2024-09-04T16:33:57Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-04T16:33:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4356ab331c8f0dbed0f683abde345cd5503db1e4'/>
<id>urn:sha1:4356ab331c8f0dbed0f683abde345cd5503db1e4</id>
<content type='text'>
Pull vfs fixes from Christian Brauner:
 "Two netfs fixes for this merge window:

   - Ensure that fscache_cookie_lru_time is deleted when the fscache
     module is removed to prevent UAF

   - Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()

     Before it used truncate_inode_pages_partial() which causes
     copy_file_range() to fail on cifs"

* tag 'vfs-6.11-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fscache: delete fscache_cookie_lru_timer when fscache exits to avoid UAF
  mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()
</content>
</entry>
<entry>
<title>Merge tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-09-04T15:37:33Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-04T15:37:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=76c0f27d063079ce2c6e7fa971a185a38905d844'/>
<id>urn:sha1:76c0f27d063079ce2c6e7fa971a185a38905d844</id>
<content type='text'>
Pull misc fixes from Andrew Morton:
 "17 hotfixes, 15 of which are cc:stable.

  Mostly MM, no identifiable theme.  And a few nilfs2 fixups"

* tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n
  mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area
  mailmap: update entry for Jan Kuliga
  codetag: debug: mark codetags for poisoned page as empty
  mm/memcontrol: respect zswap.writeback setting from parent cg too
  scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum
  Revert "mm: skip CMA pages when they are not available"
  maple_tree: remove rcu_read_lock() from mt_validate()
  kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y
  mm/slub: add check for s-&gt;flags in the alloc_tagging_slab_free_hook
  nilfs2: fix state management in error path of log writing function
  nilfs2: fix missing cleanup on rollforward recovery error
  nilfs2: protect references to superblock parameters exposed in sysfs
  userfaultfd: don't BUG_ON() if khugepaged yanks our page table
  userfaultfd: fix checks for huge PMDs
  mm: vmalloc: ensure vmap_block is initialised before adding to queue
  selftests: mm: fix build errors on armhf
</content>
</entry>
<entry>
<title>mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area</title>
<updated>2024-09-02T00:59:02Z</updated>
<author>
<name>Adrian Huang</name>
<email>ahuang12@lenovo.com</email>
</author>
<published>2024-08-29T13:06:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=409faf8c97d5abb0597ea43e99c8b3dd8dbe99e3'/>
<id>urn:sha1:409faf8c97d5abb0597ea43e99c8b3dd8dbe99e3</id>
<content type='text'>
When running the vmalloc stress on a 448-core system, observe the average
latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
'funclatency.py' tool [1].

  # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node &amp; pid1=$! &amp;&amp; sleep 8 &amp;&amp; modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

     usecs             : count    distribution
        0 -&gt; 1         : 0       |                                        |
        2 -&gt; 3         : 29      |                                        |
        4 -&gt; 7         : 19      |                                        |
        8 -&gt; 15        : 56      |                                        |
       16 -&gt; 31        : 483     |****                                    |
       32 -&gt; 63        : 1548    |************                            |
       64 -&gt; 127       : 2634    |*********************                   |
      128 -&gt; 255       : 2535    |*********************                   |
      256 -&gt; 511       : 1776    |**************                          |
      512 -&gt; 1023      : 1015    |********                                |
     1024 -&gt; 2047      : 573     |****                                    |
     2048 -&gt; 4095      : 488     |****                                    |
     4096 -&gt; 8191      : 1091    |*********                               |
     8192 -&gt; 16383     : 3078    |*************************               |
    16384 -&gt; 32767     : 4821    |****************************************|
    32768 -&gt; 65535     : 3318    |***************************             |
    65536 -&gt; 131071    : 1718    |**************                          |
   131072 -&gt; 262143    : 2220    |******************                      |
   262144 -&gt; 524287    : 1147    |*********                               |
   524288 -&gt; 1048575   : 1179    |*********                               |
  1048576 -&gt; 2097151   : 822     |******                                  |
  2097152 -&gt; 4194303   : 906     |*******                                 |
  4194304 -&gt; 8388607   : 2148    |*****************                       |
  8388608 -&gt; 16777215  : 4497    |*************************************   |
 16777216 -&gt; 33554431  : 289     |**                                      |

  avg = 2041714 usecs, total: 78381401772 usecs, count: 38390

  The worst case is over 16-33 seconds, so soft lockup is triggered [2].

[Root Cause]
1) Each purge_list has the long list. The following shows the number of
   vmap_area is purged.

   crash&gt; p vmap_nodes
   vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
   crash&gt; vmap_node 0xff2de5a900100000 128 | grep nr_purged
     nr_purged = 663070
     ...
     nr_purged = 821670
     nr_purged = 692214
     nr_purged = 726808
     ...

2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
   operation when purging each vmap_area. However, the iteration is over
   600000 vmap_area (See 'nr_purged' above).

   Here is objdump output:

     $ objdump -D vmlinux
     ffffffff813e8c80 &lt;purge_vmap_node&gt;:
     ...
     ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
     ...

   Quote from "Instruction tables" pdf file [3]:
     Instructions with a LOCK prefix have a long latency that depends on
     cache organization and possibly RAM speed. If there are multiple
     processors or cores or direct memory access (DMA) devices, then all
     locked instructions will lock a cache line for exclusive access,
     which may involve RAM access. A LOCK prefix typically costs more
     than a hundred clock cycles, even on single-processor systems.

   That's why the latency of purge_vmap_node() dramatically increases
   on a many-core system: One core is busy on purging each vmap_area of
   the *long* purge_list and executing atomic_long_sub() for each
   vmap_area, while other cores free vmalloc allocations and execute
   atomic_long_add_return() in free_vmap_area_noflush().

[Solution]
Employ a local variable to record the total purged pages, and execute
atomic_long_sub() after the traversal of the purge_list is done. The
experiment result shows the latency improvement is 99%.

[Experiment Result]
1) System Configuration: Three servers (with HT-enabled) are tested.
     * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
     * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
     * 448-core server: AMD Zen 4 Processor*2

2) Kernel Config
     * CONFIG_KASAN is disabled

3) The data in column "w/o patch" and "w/ patch"
     * Unit: micro seconds (us)
     * Each data is the average of 3-time measurements

         System        w/o patch (us)   w/ patch (us)    Improvement (%)
     ---------------   --------------   -------------    -------------
     72-core server          2194              14            99.36%
     192-core server       143799            1139            99.21%
     448-core server      1992122            6883            99.65%

[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
[2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
[3] https://www.agner.org/optimize/instruction_tables.pdf

Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.com
Signed-off-by: Adrian Huang &lt;ahuang12@lenovo.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>codetag: debug: mark codetags for poisoned page as empty</title>
<updated>2024-09-02T00:59:02Z</updated>
<author>
<name>Hao Ge</name>
<email>gehao@kylinos.cn</email>
</author>
<published>2024-08-25T16:36:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5e9784e997620af7c1399029282f5d6964b41942'/>
<id>urn:sha1:5e9784e997620af7c1399029282f5d6964b41942</id>
<content type='text'>
When PG_hwpoison pages are freed they are treated differently in
free_pages_prepare() and instead of being released they are isolated.

Page allocation tag counters are decremented at this point since the page
is considered not in use.  Later on when such pages are released by
unpoison_memory(), the allocation tag counters will be decremented again
and the following warning gets reported:

[  113.930443][ T3282] ------------[ cut here ]------------
[  113.931105][ T3282] alloc_tag was not set
[  113.931576][ T3282] WARNING: CPU: 2 PID: 3282 at ./include/linux/alloc_tag.h:130 pgalloc_tag_sub.part.66+0x154/0x164
[  113.932866][ T3282] Modules linked in: hwpoison_inject fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_man4
[  113.941638][ T3282] CPU: 2 UID: 0 PID: 3282 Comm: madvise11 Kdump: loaded Tainted: G        W          6.11.0-rc4-dirty #18
[  113.943003][ T3282] Tainted: [W]=WARN
[  113.943453][ T3282] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
[  113.944378][ T3282] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  113.945319][ T3282] pc : pgalloc_tag_sub.part.66+0x154/0x164
[  113.946016][ T3282] lr : pgalloc_tag_sub.part.66+0x154/0x164
[  113.946706][ T3282] sp : ffff800087093a10
[  113.947197][ T3282] x29: ffff800087093a10 x28: ffff0000d7a9d400 x27: ffff80008249f0a0
[  113.948165][ T3282] x26: 0000000000000000 x25: ffff80008249f2b0 x24: 0000000000000000
[  113.949134][ T3282] x23: 0000000000000001 x22: 0000000000000001 x21: 0000000000000000
[  113.950597][ T3282] x20: ffff0000c08fcad8 x19: ffff80008251e000 x18: ffffffffffffffff
[  113.952207][ T3282] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800081746210
[  113.953161][ T3282] x14: 0000000000000000 x13: 205d323832335420 x12: 5b5d353031313339
[  113.954120][ T3282] x11: ffff800087093500 x10: 000000000000005d x9 : 00000000ffffffd0
[  113.955078][ T3282] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008236ba90 x6 : c0000000ffff7fff
[  113.956036][ T3282] x5 : ffff000b34bf4dc8 x4 : ffff8000820aba90 x3 : 0000000000000001
[  113.956994][ T3282] x2 : ffff800ab320f000 x1 : 841d1e35ac932e00 x0 : 0000000000000000
[  113.957962][ T3282] Call trace:
[  113.958350][ T3282]  pgalloc_tag_sub.part.66+0x154/0x164
[  113.959000][ T3282]  pgalloc_tag_sub+0x14/0x1c
[  113.959539][ T3282]  free_unref_page+0xf4/0x4b8
[  113.960096][ T3282]  __folio_put+0xd4/0x120
[  113.960614][ T3282]  folio_put+0x24/0x50
[  113.961103][ T3282]  unpoison_memory+0x4f0/0x5b0
[  113.961678][ T3282]  hwpoison_unpoison+0x30/0x48 [hwpoison_inject]
[  113.962436][ T3282]  simple_attr_write_xsigned.isra.34+0xec/0x1cc
[  113.963183][ T3282]  simple_attr_write+0x38/0x48
[  113.963750][ T3282]  debugfs_attr_write+0x54/0x80
[  113.964330][ T3282]  full_proxy_write+0x68/0x98
[  113.964880][ T3282]  vfs_write+0xdc/0x4d0
[  113.965372][ T3282]  ksys_write+0x78/0x100
[  113.965875][ T3282]  __arm64_sys_write+0x24/0x30
[  113.966440][ T3282]  invoke_syscall+0x7c/0x104
[  113.966984][ T3282]  el0_svc_common.constprop.1+0x88/0x104
[  113.967652][ T3282]  do_el0_svc+0x2c/0x38
[  113.968893][ T3282]  el0_svc+0x3c/0x1b8
[  113.969379][ T3282]  el0t_64_sync_handler+0x98/0xbc
[  113.969980][ T3282]  el0t_64_sync+0x19c/0x1a0
[  113.970511][ T3282] ---[ end trace 0000000000000000 ]---

To fix this, clear the page tag reference after the page got isolated
and accounted for.

Link: https://lkml.kernel.org/r/20240825163649.33294-1-hao.ge@linux.dev
Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty")
Signed-off-by: Hao Ge &lt;gehao@kylinos.cn&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Acked-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hao Ge &lt;gehao@kylinos.cn&gt;
Cc: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[6.10+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memcontrol: respect zswap.writeback setting from parent cg too</title>
<updated>2024-09-02T00:59:02Z</updated>
<author>
<name>Mike Yuan</name>
<email>me@yhndnzj.com</email>
</author>
<published>2024-08-23T16:27:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e399257349098bf7c84343f99efb2bc9c22eb9fd'/>
<id>urn:sha1:e399257349098bf7c84343f99efb2bc9c22eb9fd</id>
<content type='text'>
Currently, the behavior of zswap.writeback wrt.  the cgroup hierarchy
seems a bit odd.  Unlike zswap.max, it doesn't honor the value from parent
cgroups.  This surfaced when people tried to globally disable zswap
writeback, i.e.  reserve physical swap space only for hibernation [1] -
disabling zswap.writeback only for the root cgroup results in subcgroups
with zswap.writeback=1 still performing writeback.

The inconsistency became more noticeable after I introduced the
MemoryZSwapWriteback= systemd unit setting [2] for controlling the knob.
The patch assumed that the kernel would enforce the value of parent
cgroups.  It could probably be workarounded from systemd's side, by going
up the slice unit tree and inheriting the value.  Yet I think it's more
sensible to make it behave consistently with zswap.max and friends.

[1] https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate#Disable_zswap_writeback_to_use_the_swap_space_only_for_hibernation
[2] https://github.com/systemd/systemd/pull/31734

Link: https://lkml.kernel.org/r/20240823162506.12117-1-me@yhndnzj.com
Fixes: 501a06fe8e4c ("zswap: memcontrol: implement zswap writeback disabling")
Signed-off-by: Mike Yuan &lt;me@yhndnzj.com&gt;
Reviewed-by: Nhat Pham &lt;nphamcs@gmail.com&gt;
Acked-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Michal Koutný &lt;mkoutny@suse.com&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Revert "mm: skip CMA pages when they are not available"</title>
<updated>2024-09-02T00:59:01Z</updated>
<author>
<name>Usama Arif</name>
<email>usamaarif642@gmail.com</email>
</author>
<published>2024-08-21T19:26:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bfe0857c20c663fcc1592fa4e3a61ca12b07dac9'/>
<id>urn:sha1:bfe0857c20c663fcc1592fa4e3a61ca12b07dac9</id>
<content type='text'>
This reverts commit 5da226dbfce3 ("mm: skip CMA pages when they are not
available") and b7108d66318a ("Multi-gen LRU: skip CMA pages when they are
not eligible").

lruvec-&gt;lru_lock is highly contended and is held when calling
isolate_lru_folios.  If the lru has a large number of CMA folios
consecutively, while the allocation type requested is not MIGRATE_MOVABLE,
isolate_lru_folios can hold the lock for a very long time while it skips
those.  For FIO workload, ~150million order=0 folios were skipped to
isolate a few ZONE_DMA folios [1].  This can cause lockups [1] and high
memory pressure for extended periods of time [2].

Remove skipping CMA for MGLRU as well, as it was introduced in sort_folio
for the same resaon as 5da226dbfce3a2f44978c2c7cf88166e69a6788b.

[1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com/
[2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/

[usamaarif642@gmail.com: also revert b7108d66318a, per Johannes]
  Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
  Link: https://lkml.kernel.org/r/357ac325-4c61-497a-92a3-bdbd230d5ec9@gmail.com
Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
Fixes: 5da226dbfce3 ("mm: skip CMA pages when they are not available")
Signed-off-by: Usama Arif &lt;usamaarif642@gmail.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Bharata B Rao &lt;bharata@amd.com&gt;
Cc: Breno Leitao &lt;leitao@debian.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Zhaoyang Huang &lt;huangzhaoyang@gmail.com&gt;
Cc: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/slub: add check for s-&gt;flags in the alloc_tagging_slab_free_hook</title>
<updated>2024-09-02T00:59:01Z</updated>
<author>
<name>Hao Ge</name>
<email>gehao@kylinos.cn</email>
</author>
<published>2024-08-16T01:33:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ab7ca09520e9c41c219a4427fe0dae24024bfe7f'/>
<id>urn:sha1:ab7ca09520e9c41c219a4427fe0dae24024bfe7f</id>
<content type='text'>
When enable CONFIG_MEMCG &amp; CONFIG_KFENCE &amp; CONFIG_KMEMLEAK, the following
warning always occurs,This is because the following call stack occurred:
mem_pool_alloc
    kmem_cache_alloc_noprof
        slab_alloc_node
            kfence_alloc

Once the kfence allocation is successful,slab-&gt;obj_exts will not be empty,
because it has already been assigned a value in kfence_init_pool.

Since in the prepare_slab_obj_exts_hook function,we perform a check for
s-&gt;flags &amp; (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE),the alloc_tag_add function
will not be called as a result.Therefore,ref-&gt;ct remains NULL.

However,when we call mem_pool_free,since obj_ext is not empty, it
eventually leads to the alloc_tag_sub scenario being invoked.  This is
where the warning occurs.

So we should add corresponding checks in the alloc_tagging_slab_free_hook.
For __GFP_NO_OBJ_EXT case,I didn't see the specific case where it's using
kfence,so I won't add the corresponding check in
alloc_tagging_slab_free_hook for now.

[    3.734349] ------------[ cut here ]------------
[    3.734807] alloc_tag was not set
[    3.735129] WARNING: CPU: 4 PID: 40 at ./include/linux/alloc_tag.h:130 kmem_cache_free+0x444/0x574
[    3.735866] Modules linked in: autofs4
[    3.736211] CPU: 4 UID: 0 PID: 40 Comm: ksoftirqd/4 Tainted: G        W          6.11.0-rc3-dirty #1
[    3.736969] Tainted: [W]=WARN
[    3.737258] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
[    3.737875] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    3.738501] pc : kmem_cache_free+0x444/0x574
[    3.738951] lr : kmem_cache_free+0x444/0x574
[    3.739361] sp : ffff80008357bb60
[    3.739693] x29: ffff80008357bb70 x28: 0000000000000000 x27: 0000000000000000
[    3.740338] x26: ffff80008207f000 x25: ffff000b2eb2fd60 x24: ffff0000c0005700
[    3.740982] x23: ffff8000804229e4 x22: ffff800082080000 x21: ffff800081756000
[    3.741630] x20: fffffd7ff8253360 x19: 00000000000000a8 x18: ffffffffffffffff
[    3.742274] x17: ffff800ab327f000 x16: ffff800083398000 x15: ffff800081756df0
[    3.742919] x14: 0000000000000000 x13: 205d344320202020 x12: 5b5d373038343337
[    3.743560] x11: ffff80008357b650 x10: 000000000000005d x9 : 00000000ffffffd0
[    3.744231] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008237bad0 x6 : c0000000ffff7fff
[    3.744907] x5 : ffff80008237ba78 x4 : ffff8000820bbad0 x3 : 0000000000000001
[    3.745580] x2 : 68d66547c09f7800 x1 : 68d66547c09f7800 x0 : 0000000000000000
[    3.746255] Call trace:
[    3.746530]  kmem_cache_free+0x444/0x574
[    3.746931]  mem_pool_free+0x44/0xf4
[    3.747306]  free_object_rcu+0xc8/0xdc
[    3.747693]  rcu_do_batch+0x234/0x8a4
[    3.748075]  rcu_core+0x230/0x3e4
[    3.748424]  rcu_core_si+0x14/0x1c
[    3.748780]  handle_softirqs+0x134/0x378
[    3.749189]  run_ksoftirqd+0x70/0x9c
[    3.749560]  smpboot_thread_fn+0x148/0x22c
[    3.749978]  kthread+0x10c/0x118
[    3.750323]  ret_from_fork+0x10/0x20
[    3.750696] ---[ end trace 0000000000000000 ]---

Link: https://lkml.kernel.org/r/20240816013336.17505-1-hao.ge@linux.dev
Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Hao Ge &lt;gehao@kylinos.cn&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Hyeonggon Yoo &lt;42.hyeyoo@gmail.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Kees Cook &lt;kees@kernel.org&gt;
Cc: Kent Overstreet &lt;kent.overstreet@linux.dev&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>userfaultfd: don't BUG_ON() if khugepaged yanks our page table</title>
<updated>2024-09-02T00:59:00Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2024-08-13T20:25:22Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4828d207dc5161dc7ddf9a4f6dcfd80c7dd7d20a'/>
<id>urn:sha1:4828d207dc5161dc7ddf9a4f6dcfd80c7dd7d20a</id>
<content type='text'>
Since khugepaged was changed to allow retracting page tables in file
mappings without holding the mmap lock, these BUG_ON()s are wrong - get
rid of them.

We could also remove the preceding "if (unlikely(...))" block, but then we
could reach pte_offset_map_lock() with transhuge pages not just for file
mappings but also for anonymous mappings - which would probably be fine
but I think is not necessarily expected.

Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-2-5efa61078a41@google.com
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Reviewed-by: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>userfaultfd: fix checks for huge PMDs</title>
<updated>2024-09-02T00:58:59Z</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2024-08-13T20:25:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8'/>
<id>urn:sha1:71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8</id>
<content type='text'>
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.

The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:

1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
   the right two race windows) - I've tested this in a kernel build with
   some extra mdelay() calls. See the commit message for a description
   of the race scenario.
   On older kernels (before 6.5), I think the same bug can even
   theoretically lead to accessing transhuge page contents as a page table
   if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
   detecting PMDs that don't point to page tables.
   On older kernels (before 6.5), you'd just have to win a single fairly
   wide race to hit this.
   I've tested this on 6.1 stable by racing migration (with a mdelay()
   patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
   VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (&gt;=6.5), for shmem mappings, khugepaged is allowed
   to yank page tables out from under us (though I haven't tested that),
   so I think the BUG_ON() checks in mfill_atomic() are just wrong.

I decided to write two separate fixes for these (one fix for bugs 1+2, one
fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.


This patch (of 2):

This fixes two issues.

I discovered that the following race can occur:

  mfill_atomic                other thread
  ============                ============
                              &lt;zap PMD&gt;
  pmdp_get_lockless() [reads none pmd]
  &lt;bail if trans_huge&gt;
  &lt;if none:&gt;
                              &lt;pagefault creates transhuge zeropage&gt;
    __pte_alloc [no-op]
                              &lt;zap PMD&gt;
  &lt;bail if pmd_trans_huge(*dst_pmd)&gt;
  BUG_ON(pmd_none(*dst_pmd))

I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.

On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(&lt;=6.4), I think we could probably theoretically race past the two
BUG_ON() checks and end up treating a hugepage as a page table.

The second issue is that, as Qi Zheng pointed out, there are other types
of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).

On &lt;=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.

Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).

If that didn't crash, the kernel would next try to write a PTE into what
it wrongly thinks is a page table.

As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.

Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.

Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@google.com
Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Pavel Emelyanov &lt;xemul@virtuozzo.com&gt;
Cc: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
