<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/swap.h, branch v6.11.3</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.11.3</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.11.3'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2024-07-06T18:44:41Z</updated>
<entry>
<title>Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix</title>
<updated>2024-07-06T18:44:41Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2024-07-06T18:44:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8ef6fd0e9ea83a792ba53882ddc6e0d38ce0d636'/>
<id>urn:sha1:8ef6fd0e9ea83a792ba53882ddc6e0d38ce0d636</id>
<content type='text'>
crashes from deferred split racing folio migration", needed by "mm:
migrate: split folio_migrate_mapping()".
</content>
</entry>
<entry>
<title>mm: add swappiness= arg to memory.reclaim</title>
<updated>2024-07-05T01:05:55Z</updated>
<author>
<name>Dan Schatzberg</name>
<email>schatzberg.dan@gmail.com</email>
</author>
<published>2024-01-03T16:48:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=68cd9050d871e4db5433420b5ceb32f5512d18bc'/>
<id>urn:sha1:68cd9050d871e4db5433420b5ceb32f5512d18bc</id>
<content type='text'>
Allow proactive reclaimers to submit an additional swappiness=&lt;val&gt;
argument to memory.reclaim.  This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" &gt; /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.
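
For illustration only, the same request can be issued from C; this is a
minimal sketch (the path and amount mirror the example above and are not
part of this patch):

#include &lt;fcntl.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int request_reclaim(void)
{
	const char *req = "2M swappiness=0";
	int fd = open("/sys/fs/cgroup/memory.reclaim", O_WRONLY);

	if (fd &lt; 0)
		return -1;
	if (write(fd, req, strlen(req)) &lt; 0) {
		/* e.g. EAGAIN when the full amount could not be reclaimed */
		close(fd);
		return -1;
	}
	return close(fd);
}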

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim.  The memory.reclaim interface does not allow for any way to
affect the balance of file vs anon during proactive reclaim.  The only
approach is to adjust the vm.swappiness setting.  However, there are a few
reasons we look to control the balance of file vs anon during proactive
reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance.  In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs.  As these
  are typically rare events, they have relatively little impact on write
  endurance.  However, proactive reclaim runs continuously and so its
  impact on SSD write endurance is more significant.  Therefore it is
  desirable to control swap-out for proactive reclaim separately from
  reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
  swap exhaustion.  This makes sense if the swap exhaustion is triggered
  due to reactive reclaim but less so if it is triggered due to proactive
  reclaim (e.g.  one could see OOMs when free memory is ample but anon is
  just particularly cold).  Therefore, it's desirable to have proactive
  reclaim reduce or stop swap-out before the threshold at which OOM
  killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
before writes to memory.reclaim[2].  This has been in production for
nearly two years and has addressed our needs to control proactive vs
reactive reclaim behavior but is still not ideal for a number of reasons:

* vm.swappiness is a global setting; adjusting it can race/interfere
  with other system administration that wishes to control vm.swappiness.
  In our case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
  a misconfigured setting.  This requires some additional management to
  record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Suggested-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Yue Zhao &lt;findns94@gmail.com&gt;
Cc: Zefan Li &lt;lizefan.x@bytedance.com&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: add defines for min/max swappiness</title>
<updated>2024-07-05T01:05:55Z</updated>
<author>
<name>Dan Schatzberg</name>
<email>schatzberg.dan@gmail.com</email>
</author>
<published>2024-01-03T16:48:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=410abb20acaea9679fc1b2276d47e70fba331451'/>
<id>urn:sha1:410abb20acaea9679fc1b2276d47e70fba331451</id>
<content type='text'>
Patch series "Add swappiness argument to memory.reclaim", v6.

This patch proposes augmenting the memory.reclaim interface with a
swappiness=&lt;val&gt; argument that overrides the swappiness value for that
instance of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim.  The memory.reclaim interface does not allow for any way to
affect the balance of file vs anon during proactive reclaim.  The only
approach is to adjust the vm.swappiness setting.  However, there are a few
reasons we look to control the balance of file vs anon during proactive
reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance.  In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs.  As these
  are typically rare events, they have relatively little impact on write
  endurance.  However, proactive reclaim runs continuously and so its
  impact on SSD write endurance is more significant.  Therefore it is
  desirable to control swap-out for proactive reclaim separately from
  reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
  swap exhaustion.  This makes sense if the swap exhaustion is triggered
  due to reactive reclaim but less so if it is triggered due to proactive
  reclaim (e.g.  one could see OOMs when free memory is ample but anon is
  just particularly cold).  Therefore, it's desirable to have proactive
  reclaim reduce or stop swap-out before the threshold at which OOM
  killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
before writes to memory.reclaim[2].  This has been in production for
nearly two years and has addressed our needs to control proactive vs
reactive reclaim behavior but is still not ideal for a number of reasons:

* vm.swappiness is a global setting; adjusting it can race/interfere
  with other system administration that wishes to control vm.swappiness.
  In our case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
  a misconfigured setting.  This requires some additional management to
  record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

Previously, this exact interface addition was proposed by Yosry[3].  In
response, Roman proposed instead an interface to specify precise
file/anon/slab reclaim amounts[4].  More recently, Huan proposed this as
well[5], and others similarly questioned whether this was the proper
interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive
reclaim with different settings to control file vs anon reclaim (e.g.  to
only reclaim anon from some applications).  Responses argued that
adjusting swappiness is a poor interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to
implement custom reclaim algorithms but rather to bias the balance of anon
vs file due to differences of proactive vs reactive reclaim.  In this
context, swappiness is the existing interface for controlling this balance
and this patch simply allows for it to be configured differently for
proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels
inappropriate for this purpose.  Proactive reclaimers are unaware of the
relative age of file vs anon for a cgroup, which makes it difficult to
manage proactive reclaim of different memory pools.  A proactive reclaimer
would need some amount of anon reclaim attempts separate from the amount
of file reclaim attempts, which seems brittle given that it's difficult to
observe the impact.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/


This patch (of 2):

We use the constants 0 and 200 in a few places in the mm code when
referring to the min and max swappiness.  This patch adds MIN_SWAPPINESS
and MAX_SWAPPINESS #defines to improve clarity.  There are no functional
changes.

Link: https://lkml.kernel.org/r/20240103164841.2800183-1-schatzberg.dan@gmail.com
Link: https://lkml.kernel.org/r/20240103164841.2800183-2-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Yue Zhao &lt;findns94@gmail.com&gt;
Cc: Zefan Li &lt;lizefan.x@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cachestat: do not flush stats in recency check</title>
<updated>2024-07-04T05:40:37Z</updated>
<author>
<name>Nhat Pham</name>
<email>nphamcs@gmail.com</email>
</author>
<published>2024-06-27T20:17:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5a4d8944d6b1e1aaaa83ea42c116b520b4ed0394'/>
<id>urn:sha1:5a4d8944d6b1e1aaaa83ea42c116b520b4ed0394</id>
<content type='text'>
syzbot detects that cachestat() is flushing stats, which can sleep, in its
RCU read section (see [1]).  This is done in the workingset_test_recent()
step (which checks if the folio's eviction is recent).

Move the stat flushing step to before the RCU read section of cachestat,
and skip stat flushing during the recency check.
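
Conceptually, the ordering after this patch is (an illustrative sketch,
not the literal diff; the flush helper shown stands in for whatever the
cachestat path actually calls):

/* flush, which may sleep, before entering the RCU read section */
mem_cgroup_flush_stats_ratelimited(memcg);

rcu_read_lock();
/* ... walk the mapping; workingset_test_recent() is told to skip its
 * own stats flush while we hold the read lock ... */
rcu_read_unlock();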

[1]: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/

Link: https://lkml.kernel.org/r/20240627201737.3506959-1-nphamcs@gmail.com
Fixes: b00684722262 ("mm: workingset: move the stats flush into workingset_test_recent()")
Signed-off-by: Nhat Pham &lt;nphamcs@gmail.com&gt;
Reported-by: syzbot+b7f13b2d0cc156edf61a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/
Debugged-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Suggested-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[6.8+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove the implementation of swap_free() and always use swap_free_nr()</title>
<updated>2024-07-04T02:30:01Z</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2024-05-29T08:28:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=54f7a49c20ebb5189980c53e6e66709d22bee572'/>
<id>urn:sha1:54f7a49c20ebb5189980c53e6e66709d22bee572</id>
<content type='text'>
To streamline maintenance efforts, we propose removing the implementation
of swap_free().  Instead, we can simply invoke swap_free_nr() with nr set
to 1.  swap_free_nr() is designed with a bitmap consisting of only one
long, resulting in overhead that can be ignored for cases where nr equals
1.
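
In effect, swap_free() becomes a trivial wrapper along these lines (a
sketch of the idea rather than the exact hunk):

/* keep the old name, delegate to the batched helper */
static inline void swap_free(swp_entry_t entry)
{
	swap_free_nr(entry, 1);
}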

A prime candidate for leveraging swap_free_nr() lies within
kernel/power/swap.c.  Implementing this change facilitates the adoption of
batch processing for hibernation.

Link: https://lkml.kernel.org/r/20240529082824.150954-3-21cnbao@gmail.com
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Suggested-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Reviewed-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: Len Brown &lt;len.brown@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Andreas Larsson &lt;andreas@gaisler.com&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Khalid Aziz &lt;khalid.aziz@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swap: introduce swap_free_nr() for batched swap_free()</title>
<updated>2024-07-04T02:30:01Z</updated>
<author>
<name>Chuanhua Han</name>
<email>hanchuanhua@oppo.com</email>
</author>
<published>2024-05-29T08:28:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ebfba0045176cb013f49cb3e5bd9f0b16eba203c'/>
<id>urn:sha1:ebfba0045176cb013f49cb3e5bd9f0b16eba203c</id>
<content type='text'>
Patch series "large folios swap-in: handle refault cases first", v5.

This patchset is extracted from the large folio swapin series[1],
primarily addressing the handling of scenarios involving large folios in
the swap cache.  Currently, it is particularly focused on addressing the
refaulting of mTHP, which is still undergoing reclamation.  This approach
aims to streamline code review and expedite the integration of this
segment into the MM tree.

It relies on Ryan's swap-out series[2], leveraging the helper function
swap_pte_batch() introduced by that series.

Presently, do_swap_page only encounters a large folio in the swap cache
before the large folio is released by vmscan.  However, the code should
remain equally useful once we support large folio swap-in via
swapin_readahead().  This approach can effectively reduce page faults and
eliminate most redundant checks and early exits for MTE restoration in
the recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead() will
be split into separate patch sets and sent at a later time.

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20240322114136.61386-1-21cnbao@gmail.com/ 


This patch (of 6):

While swapping in a large folio, we need to free swaps related to the
whole folio.  To avoid frequently acquiring and releasing swap locks, it
is better to introduce an API for batched free.  Furthermore, this new
function, swap_free_nr(), is designed to efficiently handle various
scenarios for releasing a specified number, nr, of swap entries.
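
The helper has the following shape; the usage line is an illustrative
example rather than code from this patch:

void swap_free_nr(swp_entry_t entry, int nr);

/* e.g. free all entries backing one large folio in a single call */
swap_free_nr(entry, folio_nr_pages(folio));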

Link: https://lkml.kernel.org/r/20240529082824.150954-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240529082824.150954-2-21cnbao@gmail.com
Signed-off-by: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Co-developed-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Reviewed-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: Andreas Larsson &lt;andreas@gaisler.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Khalid Aziz &lt;khalid.aziz@oracle.com&gt;
Cc: Len Brown &lt;len.brown@intel.com&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2024-05-21T15:34:51Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-05-21T15:34:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5ad8b6ad9a08abdbc8c57a51a5faaf2ef1afc547'/>
<id>urn:sha1:5ad8b6ad9a08abdbc8c57a51a5faaf2ef1afc547</id>
<content type='text'>
Pull vfs blocksize updates from Al Viro:
 "This gets rid of bogus set_blocksize() uses, switches it over
  to be based on a 'struct file *' and verifies that the caller
  has the device opened exclusively"

* tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make set_blocksize() fail unless block device is opened exclusive
  set_blocksize(): switch to passing struct file *
  btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens
  swsusp: don't bother with setting block size
  zram: don't bother with reopening - just use O_EXCL for open
  swapon(2): open swap with O_EXCL
  swapon(2)/swapoff(2): don't bother with block size
  pktcdvd: sort set_blocksize() calls out
  bcache_register(): don't bother with set_blocksize()
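
After this series the helper is keyed by the opening file rather than a
bare block_device, roughly (sketch of the resulting prototype):

int set_blocksize(struct file *file, int size);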
</content>
</entry>
<entry>
<title>swapon(2): open swap with O_EXCL</title>
<updated>2024-05-02T21:23:30Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-04-17T22:33:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=51d908b3db0e588aeb2d06df37e4df3fb1754bb5'/>
<id>urn:sha1:51d908b3db0e588aeb2d06df37e4df3fb1754bb5</id>
<content type='text'>
... eliminating the need to reopen block devices so they could be
exclusively held.

Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>swapon(2)/swapoff(2): don't bother with block size</title>
<updated>2024-05-02T21:23:30Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-04-17T22:26:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=798cb7f9aec35460c383eab57b9fa474d999a2eb'/>
<id>urn:sha1:798cb7f9aec35460c383eab57b9fa474d999a2eb</id>
<content type='text'>
once upon a time that used to matter; these days we do swap IO for
swap devices at the level that doesn't give a damn about block size,
buffer_head or anything of that sort - just attach the page to
bio, set the location and size (the latter to PAGE_SIZE) and feed
into queue.
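
Put differently, the per-page swap write path boils down to something
like the following (a heavily simplified sketch; helper names only
approximate what mm/page_io.c really uses):

struct bio *bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_SWAP, GFP_KERNEL);

bio-&gt;bi_iter.bi_sector = swap_page_sector(page);
__bio_add_page(bio, page, PAGE_SIZE, 0);
submit_bio(bio);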

Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>mm: swap: allow storage of all mTHP orders</title>
<updated>2024-04-26T03:56:37Z</updated>
<author>
<name>Ryan Roberts</name>
<email>ryan.roberts@arm.com</email>
</author>
<published>2024-04-08T18:39:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=845982eb264bc64b0c3242ace217fb574f56a299'/>
<id>urn:sha1:845982eb264bc64b0c3242ace217fb574f56a299</id>
<content type='text'>
Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory.  However, I've observed that on an
arm64 system running a parallel workload (e.g.  kernel compilation) across
many cores, under high memory pressure, the speed regresses.  This is due
to bottlenecking on the increased number of TLBIs added due to all the
extra folio splitting when the large folios are swapped out.

Therefore, solve this regression by adding support for swapping out mTHP
without needing to split the folio, just like is already done for
PMD-sized THP.  This change only applies when CONFIG_THP_SWAP is enabled,
and when the swap backing store is a non-rotating block device.  These are
the same constraints as for the existing PMD-sized THP swap-out support.

Note that no attempt is made to swap in (m)THP here - this is still done
page-by-page, like for PMD-sized THP.  But swapping-out mTHP is a
prerequisite for swapping-in mTHP.

The main change here is to improve the swap entry allocator so that it can
allocate any power-of-2 number of contiguous entries between [1, (1 &lt;&lt;
PMD_ORDER)].  This is done by allocating a cluster for each distinct order
and allocating sequentially from it until the cluster is full.  This
ensures that we don't need to search the map and we get no fragmentation
due to alignment padding for different orders in the cluster.  If there is
no current cluster for a given order, we attempt to allocate a free
cluster from the list.  If there are no free clusters, we fail the
allocation and the caller can fall back to splitting the folio and
allocating individual entries (as per the existing PMD-sized THP fallback).
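
In pseudocode, the allocation path described above looks roughly like
this (illustrative only; none of these helper names match the actual
swapfile.c functions):

/* one partially-filled cluster per order; hand out runs of 1 &lt;&lt; order
 * entries from it, refilling from the free-cluster list as needed */
unsigned long alloc_order_entries(struct swap_info_struct *si, int order)
{
	struct swap_cluster *ci = current_cluster_of(si, order);

	if (!ci || cluster_is_full(ci)) {
		ci = take_free_cluster(si);
		if (!ci)
			return 0;	/* caller falls back to splitting the folio */
		set_current_cluster(si, order, ci);
	}
	return cluster_alloc_range(ci, 1 &lt;&lt; order);
}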

The per-order current clusters are maintained per-cpu using the existing
infrastructure.  This is done to avoid interleaving pages from different
tasks, which would prevent IO from being batched.  This is already done for the
order-0 allocations so we follow the same pattern.

As is done for order-0 per-cpu clusters, the scanner can now steal order-0
entries from any per-cpu-per-order reserved cluster.  This ensures that
when the swap file is getting full, space doesn't get tied up in the
per-cpu reserves.

This change only modifies swap to be able to accept any order mTHP.  It
doesn't change the callers to elide doing the actual split.  That will be
done in separate changes.

Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Reviewed-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Barry Song &lt;21cnbao@gmail.com&gt;
Cc: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
