From 69e0a3b490030d8d9e628ea69bdb3ff0010a0bb7 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Wed, 3 Sep 2025 16:54:24 +0800 Subject: mm: shmem: fix the strategy for the tmpfs 'huge=' options After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"), we have extended tmpfs to allow any sized large folios, rather than just PMD-sized large folios. The strategy discussed previously was: : Considering that tmpfs already has the 'huge=' option to control the : PMD-sized large folios allocation, we can extend the 'huge=' option to : allow any sized large folios. The semantics of the 'huge=' mount option : are: : : huge=never: no any sized large folios : huge=always: any sized large folios : huge=within_size: like 'always' but respect the i_size : huge=advise: like 'always' if requested with madvise() : : Note: for tmpfs mmap() faults, due to the lack of a write size hint, still : allocate the PMD-sized huge folios if huge=always/within_size/advise is : set. : : Moreover, the 'deny' and 'force' testing options controlled by : '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same : semantics. The 'deny' can disable any sized large folios for tmpfs, while : the 'force' can enable PMD sized large folios for tmpfs. This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size', tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths. It will then try each allowable large order, rather than continually attempting to allocate PMD-sized large folios as before. However, this might break some user scenarios for those who want to use PMD-sized large folios, such as the i915 driver which did not supply a write size hint when allocating shmem [1]. Moreover, Hugh also complained that this will cause a regression in userspace with 'huge=always' or 'huge=within_size'. So, let's revisit the strategy for tmpfs large page allocation. A simple fix would be to always try PMD-sized large folios first, and if that fails, fall back to smaller large folios. This approach differs from the strategy for large folio allocation used by other file systems, however, tmpfs is somewhat different from other file systems, as quoted from David's opinion: : There were opinions in the past that tmpfs should just behave like any : other fs, and I think that's what we tried to satisfy here: use the write : size as an indication. : : I assume there will be workloads where either approach will be beneficial. : I also assume that workloads that use ordinary fs'es could benefit from : the same strategy (start with PMD), while others will clearly not. Link: https://lkml.kernel.org/r/10e7ac6cebe6535c137c064d5c5a235643eebb4a.1756888965.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/ [1] Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang Cc: Barry Song Cc: David Hildenbrand Cc: Dev Jain Cc: Hugh Dickins Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mariano Pache Cc: Matthew Wilcox (Oracle) Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index a16a04841b96..1654211cc6cf 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -419,6 +419,8 @@ option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; + Always try PMD-sized huge pages first, and fall back to smaller-sized + huge pages if the PMD-sized huge page allocation fails; never Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)`` @@ -426,7 +428,9 @@ never is specified everywhere; within_size - Only allocate huge page if it will be fully within i_size. + Only allocate huge page if it will be fully within i_size; + Always try PMD-sized huge pages first, and fall back to smaller-sized + huge pages if the PMD-sized huge page allocation fails; Also respect madvise() hints; advise -- cgit v1.2.3