From 8e38607aa4aa8ee7ad4058d183465d248d04dca4 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 6 Jan 2026 23:20:02 -0800 Subject: treewide: provide a generic clear_user_page() variant Patch series "mm: folio_zero_user: clear page ranges", v11. This series adds clearing of contiguous page ranges for hugepages. The series improves on the current discontiguous clearing approach in two ways: - clear pages in a contiguous fashion. - use batched clearing via clear_pages() wherever exposed. The first is useful because it allows us to make much better use of hardware prefetchers. The second, enables advertising the real extent to the processor. Where specific instructions support it (ex. string instructions on x86; "mops" on arm64 etc), a processor can optimize based on this because, instead of seeing a sequence of 8-byte stores, or a sequence of 4KB pages, it sees a larger unit being operated on. For instance, AMD Zen uarchs (for extents larger than LLC-size) switch to a mode where they start eliding cacheline allocation. This is helpful not just because it results in higher bandwidth, but also because now the cache is not evicting useful cachelines and replacing them with zeroes. Demand faulting a 64GB region shows performance improvement: $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 baseline +series (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 11.76 +- 1.10% 25.34 +- 1.18% [*] +115.47% preempt=* pg-sz=1GB 24.85 +- 2.41% 39.22 +- 2.32% + 57.82% preempt=none|voluntary pg-sz=1GB (similar) 52.73 +- 0.20% [#] +112.19% preempt=full|lazy [*] This improvement is because switching to sequential clearing allows the hardware prefetchers to do a much better job. [#] For pg-sz=1GB a large part of the improvement is because of the cacheline elision mentioned above. preempt=full|lazy improves upon that because, not needing explicit invocations of cond_resched() to ensure reasonable preemption latency, it can clear the full extent as a single unit. In comparison the maximum extent used for preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (32MB). When provided the full extent the processor forgoes allocating cachelines on this path almost entirely. (The hope is that eventually, in the fullness of time, the lazy preemption model will be able to do the same job that none or voluntary models are used for, allowing us to do away with cond_resched().) Raghavendra also tested previous version of the series on AMD Genoa and sees similar improvement [1] with preempt=lazy. $ perf bench mem map -p $page-size -f populate -s 64GB -l 10 base patched change pg-sz=2MB 12.731939 GB/sec 26.304263 GB/sec 106.6% pg-sz=1GB 26.232423 GB/sec 61.174836 GB/sec 133.2% This patch (of 8): Let's drop all variants that effectively map to clear_page() and provide it in a generic variant instead. We'll use the macro clear_user_page to indicate whether an architecture provides it's own variant. Also, clear_user_page() is only called from the generic variant of clear_user_highpage(), so define it only if the architecture does not provide a clear_user_highpage(). And, for simplicity define it in linux/highmem.h. Note that for parisc, clear_page() and clear_user_page() map to clear_page_asm(), so we can just get rid of the custom clear_user_page() implementation. There is a clear_user_page_asm() function on parisc, that seems to be unused. Not sure what's up with that. Link: https://lkml.kernel.org/r/20260107072009.1615991-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260107072009.1615991-2-ankur.a.arora@oracle.com Signed-off-by: David Hildenbrand Co-developed-by: Ankur Arora Signed-off-by: Ankur Arora Cc: Andy Lutomirski Cc: Ankur Arora Cc: "Borislav Petkov (AMD)" Cc: Boris Ostrovsky Cc: David Hildenbrand Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Konrad Rzessutek Wilk Cc: Lance Yang Cc: "Liam R. Howlett" Cc: Li Zhe Cc: Lorenzo Stoakes Cc: Mateusz Guzik Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Suren Baghdasaryan Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/alpha/include/asm/page.h | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/alpha/include/asm') diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h index d2c6667d73e9..59d01f9b77f6 100644 --- a/arch/alpha/include/asm/page.h +++ b/arch/alpha/include/asm/page.h @@ -11,7 +11,6 @@ #define STRICT_MM_TYPECHECKS extern void clear_page(void *page); -#define clear_user_page(page, vaddr, pg) clear_page(page) #define vma_alloc_zeroed_movable_folio(vma, vaddr) \ vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr) -- cgit v1.2.3 From 44b079583f7d44ff7c7d88479b4398fe284834bf Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Tue, 27 Jan 2026 20:12:55 +0800 Subject: alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Link: https://lkml.kernel.org/r/3380f40a89b73c488202c85f9a8abf99fb08543b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng Acked-by: Magnus Lindholm [alpha] Cc: Richard Henderson Cc: Matt Turner Cc: Andreas Larsson Cc: "Aneesh Kumar K.V" Cc: Anton Ivanov Cc: Borislav Petkov Cc: Dave Hansen Cc: David Hildenbrand (Red Hat) Cc: Dev Jain Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: "James E.J. Bottomley" Cc: Johannes Berg Cc: Lance Yang Cc: "Liam R. Howlett" Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Richard Weinberger Cc: Suren Baghdasaryan Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: WANG Xuerui Cc: Wei Yang Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/alpha/Kconfig | 1 + arch/alpha/include/asm/tlb.h | 6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) (limited to 'arch/alpha/include/asm') diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 80367f2cf821..6c7dbf0adad6 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -38,6 +38,7 @@ config ALPHA select OLD_SIGSUSPEND select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67 select MMU_GATHER_NO_RANGE + select MMU_GATHER_RCU_TABLE_FREE select SPARSEMEM_EXTREME if SPARSEMEM select ZONE_DMA help diff --git a/arch/alpha/include/asm/tlb.h b/arch/alpha/include/asm/tlb.h index 4f79e331af5e..ad586b898fd6 100644 --- a/arch/alpha/include/asm/tlb.h +++ b/arch/alpha/include/asm/tlb.h @@ -4,7 +4,7 @@ #include -#define __pte_free_tlb(tlb, pte, address) pte_free((tlb)->mm, pte) -#define __pmd_free_tlb(tlb, pmd, address) pmd_free((tlb)->mm, pmd) - +#define __pte_free_tlb(tlb, pte, address) tlb_remove_ptdesc((tlb), page_ptdesc(pte)) +#define __pmd_free_tlb(tlb, pmd, address) tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) + #endif -- cgit v1.2.3