<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/memblock.h, branch v6.16.3</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.16.3</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.16.3'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2025-05-13T06:50:39Z</updated>
<entry>
<title>memblock: introduce memmap_init_kho_scratch()</title>
<updated>2025-05-13T06:50:39Z</updated>
<author>
<name>Mike Rapoport (Microsoft)</name>
<email>rppt@kernel.org</email>
</author>
<published>2025-05-09T07:46:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b8a8f96a6dce527ad316184ff1e20f238ed413d8'/>
<id>urn:sha1:b8a8f96a6dce527ad316184ff1e20f238ed413d8</id>
<content type='text'>
With deferred initialization of struct page it will be necessary to
initialize memory map for KHO scratch regions early.

Add memmap_init_kho_scratch() method that will allow such initialization
in upcoming patches.

Link: https://lkml.kernel.org/r/20250509074635.3187114-4-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Signed-off-by: Changyuan Lyu &lt;changyuanl@google.com&gt;
Cc: Alexander Graf &lt;graf@amazon.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Anthony Yznaga &lt;anthony.yznaga@oracle.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Ashish Kalra &lt;ashish.kalra@amd.com&gt;
Cc: Ben Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Borislav Betkov &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: David Woodhouse &lt;dwmw2@infradead.org&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: James Gowans &lt;jgowans@amazon.com&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Krzysztof Kozlowski &lt;krzk@kernel.org&gt;
Cc: Marc Rutland &lt;mark.rutland@arm.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Pratyush Yadav &lt;ptyadav@amazon.de&gt;
Cc: Rob Herring &lt;robh@kernel.org&gt;
Cc: Saravana Kannan &lt;saravanak@google.com&gt;
Cc: Stanislav Kinsburskii &lt;skinsburskii@linux.microsoft.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleinxer &lt;tglx@linutronix.de&gt;
Cc: Thomas Lendacky &lt;thomas.lendacky@amd.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memblock: add support for scratch memory</title>
<updated>2025-05-13T06:50:39Z</updated>
<author>
<name>Alexander Graf</name>
<email>graf@amazon.com</email>
</author>
<published>2025-05-09T07:46:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d59f43b5748092557d34244e29a618221a250501'/>
<id>urn:sha1:d59f43b5748092557d34244e29a618221a250501</id>
<content type='text'>
With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over.  But to know where those are, we need to include
them in the memblock.reserved array which may not be big enough to hold
all ranges that need to be persisted across kexec.  To resize the array,
we need to allocate memory.  That brings us into a catch 22 situation.

The solution to that is limit memblock allocations to the scratch regions:
safe regions to operate in the case when there is memory that should
remain intact across kexec.

KHO provides several "scratch regions" as part of its metadata.  These
scratch regions are contiguous memory blocks that known not to contain any
memory that should be persisted across kexec.  These regions should be
large enough to accommodate all memblock allocations done by the kexeced
kernel.

We introduce a new memblock_set_scratch_only() function that allows KHO to
indicate that any memblock allocation must happen from the scratch
regions.

Later, we may want to perform another KHO kexec.  For that, we reuse the
same scratch regions.  To ensure that no eventually handed over data gets
allocated inside a scratch region, we flip the semantics of the scratch
region with memblock_clear_scratch_only(): After that call, no allocations
may happen from scratch memblock regions.  We will lift that restriction
in the next patch.

Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com
Signed-off-by: Alexander Graf &lt;graf@amazon.com&gt;
Co-developed-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Signed-off-by: Changyuan Lyu &lt;changyuanl@google.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Anthony Yznaga &lt;anthony.yznaga@oracle.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Ashish Kalra &lt;ashish.kalra@amd.com&gt;
Cc: Ben Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Borislav Betkov &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: David Woodhouse &lt;dwmw2@infradead.org&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: James Gowans &lt;jgowans@amazon.com&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Krzysztof Kozlowski &lt;krzk@kernel.org&gt;
Cc: Marc Rutland &lt;mark.rutland@arm.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Pratyush Yadav &lt;ptyadav@amazon.de&gt;
Cc: Rob Herring &lt;robh@kernel.org&gt;
Cc: Saravana Kannan &lt;saravanak@google.com&gt;
Cc: Stanislav Kinsburskii &lt;skinsburskii@linux.microsoft.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleinxer &lt;tglx@linutronix.de&gt;
Cc: Thomas Lendacky &lt;thomas.lendacky@amd.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memblock: add MEMBLOCK_RSRV_KERN flag</title>
<updated>2025-05-13T06:50:38Z</updated>
<author>
<name>Mike Rapoport (Microsoft)</name>
<email>rppt@kernel.org</email>
</author>
<published>2025-05-09T07:46:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4c78cc596bb8d39532f059e0198eeabf370c50f5'/>
<id>urn:sha1:4c78cc596bb8d39532f059e0198eeabf370c50f5</id>
<content type='text'>
Patch series "kexec: introduce Kexec HandOver (KHO)", v8.

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want.  In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched. 
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched.  If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem.  If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory.  See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers Conference
2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO).  As a really simple example target, we
use memblock's reserve_mem.

With this patchset applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation.  We will determine later whether we
still need any of the approaches above for fast bulk memory handover of
for example IOMMU page tables.  But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. 
How they pass it is architecture specific.  The file's format is a
Flattened Device Tree (fdt) which has a generator and parser already
included in Linux.  KHO is enabled in the kernel command line by `kho=on`.
When the root user enables KHO through
/sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every
KHO users to register preserved memory regions, which contain drivers'
states.

When the actual kexec happens, the fdt is part of the image set that we
boot into.  In addition, we keep "scratch regions" available for kexec:
physically contiguous memory regions that are guaranteed to not have any
memory that KHO would preserve.  The new kernel bootstraps itself using
the scratch regions and sets all handed over memory as in use.  When
drivers initialize that support KHO, they introspect the fdt, restore
preserved memory regions, and retrieve their states stored in the
preserved memory.

== Limitations ==

Currently KHO is only implemented for file based kexec.  The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is still not implemented it yet inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter.  KHO will automatically create scratch regions.  If you want to
set the scratch size explicitly you can use "kho_scratch=" command line
parameter.  For instance, "kho_scratch=16M,512M,256M" will reserve a 16
MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB
per NUMA node scratch regions on boot.

Make sure to have a reserved memory range requested with reserv_mem
command line option, for example, "reserve_mem=64m:4k:n1".

Then before you invoke file based "kexec -l", finalize KHO FDT:

  # echo 1 &gt; /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`,

  # dtc /sys/kernel/debug/kho/out/fdt
  # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on ubuntu by `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel,

  # kexec -l Image --initrd=initrd -s
  # kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.

You can also review the FDT passed from the old kernel,

  # dtc /sys/kernel/debug/kho/in/fdt
  # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock


This patch (of 17):

To denote areas that were reserved for kernel use either directly with
memblock_reserve_kern() or via memblock allocations.

Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/
Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/
Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/
Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/
Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com
Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Co-developed-by: Changyuan Lyu &lt;changyuanl@google.com&gt;
Signed-off-by: Changyuan Lyu &lt;changyuanl@google.com&gt;
Cc: Alexander Graf &lt;graf@amazon.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Anthony Yznaga &lt;anthony.yznaga@oracle.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Ashish Kalra &lt;ashish.kalra@amd.com&gt;
Cc: Ben Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Borislav Betkov &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: David Woodhouse &lt;dwmw2@infradead.org&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: James Gowans &lt;jgowans@amazon.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Krzysztof Kozlowski &lt;krzk@kernel.org&gt;
Cc: Marc Rutland &lt;mark.rutland@arm.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Pratyush Yadav &lt;ptyadav@amazon.de&gt;
Cc: Rob Herring &lt;robh@kernel.org&gt;
Cc: Saravana Kannan &lt;saravanak@google.com&gt;
Cc: Stanislav Kinsburskii &lt;skinsburskii@linux.microsoft.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleinxer &lt;tglx@linutronix.de&gt;
Cc: Thomas Lendacky &lt;thomas.lendacky@amd.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>arch, mm: make releasing of memory to page allocator more explicit</title>
<updated>2025-03-18T05:06:53Z</updated>
<author>
<name>Mike Rapoport (Microsoft)</name>
<email>rppt@kernel.org</email>
</author>
<published>2025-03-13T13:50:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8afa901c147a41f92e83943cddf154bbb7995ee6'/>
<id>urn:sha1:8afa901c147a41f92e83943cddf154bbb7995ee6</id>
<content type='text'>
The point where the memory is released from memblock to the buddy
allocator is hidden inside arch-specific mem_init()s and the call to
memblock_free_all() is needlessly duplicated in every artiste cure and
after introduction of arch_mm_preinit() hook, mem_init() implementation on
many architecture only contains the call to memblock_free_all().

Pull memblock_free_all() call into mm_core_init() and drop mem_init() on
relevant architectures to make it more explicit where the free memory is
released from memblock to the buddy allocator and to reduce code
duplication in architecture specific code.

Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Acked-by: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;	[x86]
Acked-by: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;	[m68k]
Tested-by: Mark Brown &lt;broonie@kernel.org&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Andreas Larsson &lt;andreas@gaisler.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Borislav Betkov &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Dinh Nguyen &lt;dinguyen@kernel.org&gt;
Cc: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Cc: Guo Ren (csky) &lt;guoren@kernel.org&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Helge Deller &lt;deller@gmx.de&gt;
Cc: Huacai Chen &lt;chenhuacai@kernel.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jiaxun Yang &lt;jiaxun.yang@flygoat.com&gt;
Cc: Johannes Berg &lt;johannes@sipsolutions.net&gt;
Cc: John Paul Adrian Glaubitz &lt;glaubitz@physik.fu-berlin.de&gt;
Cc: Madhavan Srinivasan &lt;maddy@linux.ibm.com&gt;
Cc: Matt Turner &lt;mattst88@gmail.com&gt;
Cc: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Michal Simek &lt;monstr@monstr.eu&gt;
Cc: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Cc: Richard Weinberger &lt;richard@nod.at&gt;
Cc: Russel King &lt;linux@armlinux.org.uk&gt;
Cc: Stafford Horne &lt;shorne@gmail.com&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Cc: Thomas Gleinxer &lt;tglx@linutronix.de&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: Vineet Gupta &lt;vgupta@kernel.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memblock: add memblock_alloc_or_panic interface</title>
<updated>2025-01-26T04:22:38Z</updated>
<author>
<name>Guo Weikang</name>
<email>guoweikang.kernel@gmail.com</email>
</author>
<published>2025-01-02T07:25:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c6f239796b55dbc4225a6fca9f96232092b9df83'/>
<id>urn:sha1:c6f239796b55dbc4225a6fca9f96232092b9df83</id>
<content type='text'>
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory.  In most cases, when memory allocation fails, an
immediate panic is required.  To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`.  This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.

[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
  Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
  Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang &lt;guoweikang.kernel@gmail.com&gt;
Acked-by: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;	[m68k]
Reviewed-by: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;	[s390]
Acked-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memmap: prevent double scanning of memmap by kmemleak</title>
<updated>2025-01-26T04:22:30Z</updated>
<author>
<name>Guo Weikang</name>
<email>guoweikang.kernel@gmail.com</email>
</author>
<published>2025-01-06T02:11:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b2aad24b53333f1904a55d97e3fde2246ef05bb6'/>
<id>urn:sha1:b2aad24b53333f1904a55d97e3fde2246ef05bb6</id>
<content type='text'>
kmemleak explicitly scans the mem_map through the valid struct page
objects.  However, memmap_alloc() was also adding this memory to the gray
object list, causing it to be scanned twice.  Remove memmap_alloc() from
the scan list and add a comment to clarify the behavior.

Link: https://lore.kernel.org/lkml/CAOm6qn=FVeTpH54wGDFMHuCOeYtvoTx30ktnv9-w3Nh8RMofEA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250106021126.1678334-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang &lt;guoweikang.kernel@gmail.com&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()</title>
<updated>2024-08-11T16:18:52Z</updated>
<author>
<name>Wei Yang</name>
<email>richard.weiyang@gmail.com</email>
</author>
<published>2024-08-08T00:14:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d0f8a8973f265f6a276f99d091af99edfb2b87de'/>
<id>urn:sha1:d0f8a8973f265f6a276f99d091af99edfb2b87de</id>
<content type='text'>
During bootup, system may need the number of free pages in the whole system
to do some calculation before all pages are freed to buddy system. Usually
this number is get from totalram_pages(). Since we plan to move the free
pages accounting in __free_pages_core(), this value may not represent
total free pages at the early stage, especially when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

Instead of using raw memblock api, let's introduce a new helper for user
to get the estimated number of free pages from memblock point of view.

Signed-off-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
CC: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Link: https://lore.kernel.org/r/20240808001415.6298-1-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-07-22T00:15:46Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-07-22T00:15:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c'/>
<id>urn:sha1:fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c</id>
<content type='text'>
Pull MM updates from Andrew Morton:

 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.

 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that. This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches. My
   bad.

 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"

 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability
   of cgroup writeback"

 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache
   index".

 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of
   the zeropage in MAP_SHARED mappings. I don't see any runtime effects
   here - more a cleanup/understandability/maintainablity thing.

 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
   of higher addresses, for aarch64. The (poorly named) series is
   "Restructure va_high_addr_switch".

 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".

 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
   the series "Enhance soft hwpoison handling and injection".

 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything. Some landed in this pull.

 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
   has simplified migration's use of hardware-offload memory copying.

 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".

 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code. This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.

 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.

 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP. By default this
   is a no-op - users must opt in vis sysfs controls. Dramatic
   improvements in pagefault latency are realized.

 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".

 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".

 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".

 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".

 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances. A 10x speedup is seen in a microbenchmark.

   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.

 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".

 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.

 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.

 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".

 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.

 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".

 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.

 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.

 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large
   folio userspace copying.

 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers. From SeongJae Park.

 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.

 - David Hildenbrand sends along more cleanups, this time against the
   migration code. The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".

 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code. He addresses this in the series "mm: Fix various
   readahead quirks".

 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's
   self testing code.

 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code. The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this. The series is marked cc:stable.

 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.

 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion. The series (which also makes the memcg-v1 code
   Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
   under config option" and "mm: memcg: put cgroup v1-specific memcg
   data under CONFIG_MEMCG_V1"

 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.

 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of
   excessive correctable memory errors. In order to permit userspace to
   monitor and handle this situation.

 - Kefeng Wang's series "mm: migrate: support poison recover from
   migrate folio" teaches the kernel to appropriately handle migration
   from poisoned source folios rather than simply panicing.

 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.

 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory
   utilization.

 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than
   bare refcount increments. So these paes can first be moved aside if
   they reside in the movable zone or a CMA block.

 - Andrii Nakryiko has added a binary ioctl()-based API to
   /proc/pid/maps for much faster reading of vma information. The series
   is "query VMAs from /proc/&lt;pid&gt;/maps".

 - In the series "mm: introduce per-order mTHP split counters" Lance
   Yang improves the kernel's presentation of developer information
   related to multisize THP splitting.

 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)". This permits
   userspace to use all available huge page sizes.

 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and
   not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
</content>
</entry>
<entry>
<title>mm/mm_init: use node's number of cpus in deferred_page_init_max_threads</title>
<updated>2024-07-04T02:29:58Z</updated>
<author>
<name>Eric Chanudet</name>
<email>echanude@redhat.com</email>
</author>
<published>2024-05-22T20:38:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=188f87f2648b13f5de17d5e068f18d317e0c1f98'/>
<id>urn:sha1:188f87f2648b13f5de17d5e068f18d317e0c1f98</id>
<content type='text'>
x86_64 is already using the node's cpu as maximum threads.  Make that the
default for all archs setting DEFERRED_STRUCT_PAGE_INIT.

This returns to the behavior prior making the function arch-specific with
commit ecd096506922 ("mm: make deferred init's max threads
arch-specific").

Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms
shows faster deferred_init_memmap completions:

|         | x13s        | SA8775p-ride | Ampere R137-P31 | Ampere HR330 |
|         | Metal, 32GB | VM, 36GB     | VM, 58GB        | Metal, 128GB |
|         | 8cpus       | 8cpus        | 8cpus           | 32cpus       |
|---------|-------------|--------------|-----------------|--------------|
| threads |  ms     (%) | ms       (%) |  ms         (%) |  ms      (%) |
|---------|-------------|--------------|-----------------|--------------|
| 1       | 108    (0%) | 72      (0%) | 224        (0%) | 324     (0%) |
| cpus    |  24  (-77%) | 36    (-50%) |  40      (-82%) |  56   (-82%) |

Michael Ellerman reported:

: On a machine here (1TB, 40 cores, 4KB pages) the existing code gives:
: 
:   [    0.500124] node 2 deferred pages initialised in 210ms
:   [    0.515790] node 3 deferred pages initialised in 230ms
:   [    0.516061] node 0 deferred pages initialised in 230ms
:   [    0.516522] node 7 deferred pages initialised in 230ms
:   [    0.516672] node 4 deferred pages initialised in 230ms
:   [    0.516798] node 6 deferred pages initialised in 230ms
:   [    0.517051] node 5 deferred pages initialised in 230ms
:   [    0.523887] node 1 deferred pages initialised in 240ms
: 
: vs with the patch:
: 
:   [    0.379613] node 0 deferred pages initialised in 90ms
:   [    0.380388] node 1 deferred pages initialised in 90ms
:   [    0.380540] node 4 deferred pages initialised in 100ms
:   [    0.390239] node 6 deferred pages initialised in 100ms
:   [    0.390249] node 2 deferred pages initialised in 100ms
:   [    0.390786] node 3 deferred pages initialised in 110ms
:   [    0.396721] node 5 deferred pages initialised in 110ms
:   [    0.397095] node 7 deferred pages initialised in 110ms
: 
: Which is a nice speedup.

[echanude@redhat.com: v3]
  Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com
Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.com
Signed-off-by: Eric Chanudet &lt;echanude@redhat.com&gt;
Tested-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt; (powerpc)
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Acked-by: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Acked-by: Mike Rapoport (IBM) &lt;rppt@kernel.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Borislav Petkov (AMD) &lt;bp@alien8.de&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mm_init.c: not always search next deferred_init_pfn from very beginning</title>
<updated>2024-06-06T04:55:51Z</updated>
<author>
<name>Wei Yang</name>
<email>richard.weiyang@gmail.com</email>
</author>
<published>2024-06-05T07:13:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f1180fd2a7c039691b64ebf404c746a74e40b7b0'/>
<id>urn:sha1:f1180fd2a7c039691b64ebf404c746a74e40b7b0</id>
<content type='text'>
In function deferred_init_memmap(), we call
deferred_init_mem_pfn_range_in_zone() to get the next deferred_init_pfn.
But we always search it from the very beginning.

Since we save the index in i, we can leverage this to search from i next
time.

[rppt refine the comment]

Signed-off-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Link: https://lore.kernel.org/all/20240605071339.15330-1-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (IBM) &lt;rppt@kernel.org&gt;
</content>
</entry>
</feed>
