<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/mmzone.h, branch v5.3.14</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.3.14</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.3.14'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-08-31T01:00:50Z</updated>
<entry>
<title>mm: memcontrol: flush percpu slab vmstats on kmem offlining</title>
<updated>2019-08-31T01:00:50Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2019-08-30T23:04:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bee07b33db78d4ee7ed6a2fe810b9473d5471fe4'/>
<id>urn:sha1:bee07b33db78d4ee7ed6a2fe810b9473d5471fe4</id>
<content type='text'>
I've noticed that the "slab" value in memory.stat is sometimes 0, even
if some child memory cgroups have a non-zero "slab" value.  The
following investigation showed that this is the result of the kmem_cache
reparenting in combination with the per-cpu batching of slab vmstats.

At offlining, some vmstat values may be left in the percpu cache, not
being propagated up the cgroup hierarchy.  This means that stats on
ancestor levels are lower than the actual values.  Later, when slab
pages are released, the precise number of pages is subtracted on the
parent level, making the value negative.  We don't show negative
values; 0 is printed instead.

To fix this issue, let's flush percpu slab memcg and lruvec stats on
memcg offlining.  This guarantees that numbers on all ancestor levels
are accurate and match the actual number of outstanding slab pages.
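
A minimal sketch of the flush (function and field names are
illustrative; the real code in mm/memcontrol.c differs in detail):

    /*
     * Illustrative only: fold whatever is still cached per-cpu for the
     * slab counters into the atomic counters of this memcg and all of its
     * ancestors, so nothing is lost once the kmem_caches are reparented.
     */
    static void memcg_flush_percpu_slab_vmstats(struct mem_cgroup *memcg)
    {
        int cpu, idx;

        for_each_possible_cpu(cpu) {
            for (idx = NR_SLAB_RECLAIMABLE; idx &lt;= NR_SLAB_UNRECLAIMABLE; idx++) {
                long delta = per_cpu_ptr(memcg-&gt;vmstats_percpu, cpu)-&gt;stat[idx];
                struct mem_cgroup *mi;

                if (!delta)
                    continue;
                /* propagate the cached residue all the way up */
                for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                    atomic_long_add(delta, &amp;mi-&gt;vmstats[idx]);
            }
        }
    }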

Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>libnvdimm/pfn: stop padding pmem namespaces to section alignment</title>
<updated>2019-07-19T00:08:07Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-07-18T22:58:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a3619190d62ed9d66416891be2416f6bea2b3ca4'/>
<id>urn:sha1:a3619190d62ed9d66416891be2416f6bea2b3ca4</id>
<content type='text'>
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time.  The kernel will still honor padding established by older kernels.
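
For reference, the padding being retired amounted to rounding the pmem
data offset up to the memory hotplug section size, roughly as in this
simplified sketch (not the actual libnvdimm code):

    /* Simplified: 128MB sections on x86-64 (SECTION_SIZE_BITS == 27). */
    #define SECTION_SIZE	(1UL &lt;&lt; 27)

    static unsigned long section_align_pad(unsigned long start)
    {
        /* the pad can cost up to SECTION_SIZE - 1 bytes of capacity */
        return ALIGN(start, SECTION_SIZE) - start;
    }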

Link: http://lkml.kernel.org/r/156092356588.979959.6793371748950931916.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reported-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Tested-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;	[ppc64]
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jane Chu &lt;jane.chu@oracle.com&gt;
Cc: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Logan Gunthorpe &lt;logang@deltatee.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Cc: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: kill is_dev_zone() helper</title>
<updated>2019-07-19T00:08:07Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-07-18T22:58:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=46d945aeab4d7dd837bd0724662de2caf712f047'/>
<id>urn:sha1:46d945aeab4d7dd837bd0724662de2caf712f047</id>
<content type='text'>
Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Reviewed-by: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Reviewed-by: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Tested-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;	[ppc64]
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Logan Gunthorpe &lt;logang@deltatee.com&gt;
Cc: Jane Chu &lt;jane.chu@oracle.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/sparsemem: add helpers track active portions of a section at boot</title>
<updated>2019-07-19T00:08:07Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-07-18T22:58:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f46edbd1b1516da1fb34c917775168d5df576f78'/>
<id>urn:sha1:f46edbd1b1516da1fb34c917775168d5df576f78</id>
<content type='text'>
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span
within the architecture's memory hotplug section.

The implication of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask.  The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.

The rationale for using early_section() to short-circuit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data.  So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.
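
In code, the updated check looks roughly like this (simplified from
the CONFIG_SPARSEMEM_VMEMMAP version in include/linux/mmzone.h):

    static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
    {
        int idx = subsection_map_index(pfn);

        return test_bit(idx, ms-&gt;usage-&gt;subsection_map);
    }

    static inline int pfn_valid(unsigned long pfn)
    {
        struct mem_section *ms;

        if (pfn_to_section_nr(pfn) &gt;= NR_MEM_SECTIONS)
            return 0;
        ms = __nr_to_section(pfn_to_section_nr(pfn));
        if (!valid_section(ms))
            return 0;
        /* early sections are fully mapped; otherwise consult the bitmap */
        return early_section(ms) || pfn_section_valid(ms, pfn);
    }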

Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reported-by: Qian Cai &lt;cai@lca.pw&gt;
Tested-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Tested-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;	[ppc64]
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Logan Gunthorpe &lt;logang@deltatee.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Cc: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/sparsemem: introduce a SECTION_IS_EARLY flag</title>
<updated>2019-07-19T00:08:07Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-07-18T22:58:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=326e1b8f83a4318b09033ef754f40c785aed5e68'/>
<id>urn:sha1:326e1b8f83a4318b09033ef754f40c785aed5e68</id>
<content type='text'>
In preparation for sub-section hotplug, track whether a given section
was created during early memory initialization, or later via memory
hotplug.  This distinction is needed to maintain the coarse expectation
that pfn_valid() returns true for any pfn within a given section even if
that section has pages that are reserved from the page allocator.

For example, one of the goals of subsection hotplug is to support
cases where the system physical memory layout collides System RAM and
PMEM within a section.  Several pfn_valid() users expect to just check
if a section is valid, but they are not careful to check if the given
pfn is within a "System RAM" boundary and instead expect pgdat
information to further validate the pfn.

Rather than unwind those paths to make their pfn_valid() queries more
precise, a follow-on patch uses the SECTION_IS_EARLY flag to maintain the
traditional expectation that pfn_valid() returns true for all early
sections.
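
A simplified sketch of the flag and its helper, following the existing
SECTION_* encoding in the low bits of section_mem_map:

    #define SECTION_MARKED_PRESENT	(1UL &lt;&lt; 0)
    #define SECTION_HAS_MEM_MAP		(1UL &lt;&lt; 1)
    #define SECTION_IS_ONLINE		(1UL &lt;&lt; 2)
    #define SECTION_IS_EARLY		(1UL &lt;&lt; 3)	/* created at boot, not hotplugged */

    static inline int early_section(struct mem_section *section)
    {
        return (section &amp;&amp; (section-&gt;section_mem_map &amp; SECTION_IS_EARLY));
    }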

Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/
Link: http://lkml.kernel.org/r/156092350358.979959.5817209875548072819.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reported-by: Qian Cai &lt;cai@lca.pw&gt;
Tested-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;	[ppc64]
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Logan Gunthorpe &lt;logang@deltatee.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Jane Chu &lt;jane.chu@oracle.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Cc: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/sparsemem: introduce struct mem_section_usage</title>
<updated>2019-07-19T00:08:07Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-07-18T22:57:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f1eca35a0dc7cb3cdb00c88c8c5e5138a65face0'/>
<id>urn:sha1:f1eca35a0dc7cb3cdb00c88c8c5e5138a65face0</id>
<content type='text'>
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular.  Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e.  never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the
next.  Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.

It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages().  Here is an analysis
of the design assumptions in the current code and how they are
addressed in the new implementation:

Current design assumptions:

 - Sections that describe boot memory (early sections) are never
   unplugged / removed.

 - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
   valid_section() check

 - __add_pages() and helper routines assume all operations occur in
   PAGES_PER_SECTION units.

 - The memblock sysfs interface only comprehends full sections

New design assumptions:

 - Sections are instrumented with a sub-section bitmask to track (on
   x86) individual 2MB sub-divisions of a 128MB section.

 - Partially populated early sections can be extended with additional
   sub-sections, and those sub-sections can be removed with
   arch_remove_memory(). With this in place we no longer lose usable
   memory capacity to padding.

 - pfn_valid() is updated to look deeper than valid_section() to also
   check the active-sub-section mask. This indication is in the same
   cacheline as the valid_section() so the performance impact is
   expected to be negligible. So far the lkp robot has not reported any
   regressions.

 - Outside of the core vmemmap population routines which are replaced,
   other helper routines like shrink_{zone,pgdat}_span() are updated to
   handle the smaller granularity. Core memory hotplug routines that
   deal with online memory are not touched.

 - The existing memblock sysfs user api guarantees / assumptions are not
   touched since this capability is limited to !online
   !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

SUBSECTION_SHIFT is defined as a global constant instead of a
per-architecture value like SECTION_SIZE_BITS in order to allow
cross-arch compatibility for subsection users.  Specifically, a common
subsection size allows for the possibility that persistent memory
namespace configurations can be made compatible across architectures.
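
A sketch of the new layout and the derived constants (values shown
assume x86-64: 128MB sections, 4KB pages, 2MB subsections, hence 64
subsections per section):

    #define SUBSECTION_SHIFT	21	/* 2MB, the same on every architecture */
    #define PFN_SUBSECTION_SHIFT	(SUBSECTION_SHIFT - PAGE_SHIFT)
    #define PAGES_PER_SUBSECTION	(1UL &lt;&lt; PFN_SUBSECTION_SHIFT)	/* 512 */
    #define SUBSECTIONS_PER_SECTION	(1UL &lt;&lt; (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

    struct mem_section_usage {
        DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
        /* See declaration of similar field in struct zone */
        unsigned long pageblock_flags[0];
    };

    struct mem_section {
        unsigned long section_mem_map;
        struct mem_section_usage *usage;	/* was: unsigned long *pageblock_flags */
    };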

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Reviewed-by: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Tested-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;	[ppc64]
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Logan Gunthorpe &lt;logang@deltatee.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Jane Chu &lt;jane.chu@oracle.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Qian Cai &lt;cai@lca.pw&gt;
Cc: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: section numbers use the type "unsigned long"</title>
<updated>2019-07-19T00:08:06Z</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2019-07-18T22:57:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2491f0a2c0b117b9097e9c9eee0c21f2e5f716d7'/>
<id>urn:sha1:2491f0a2c0b117b9097e9c9eee0c21f2e5f716d7</id>
<content type='text'>
Patch series "mm: Further memory block device cleanups", v1.

Some further cleanups around memory block devices, in particular
cleaning up and simplifying walk_memory_range(), plus some other minor
cleanups.

This patch (of 6):

We are using a mixture of "int" and "unsigned long".  Let's make this
consistent by using "unsigned long" everywhere.  We'll do the same with
memory block ids next.

While at it, turn the "unsigned long i" in removable_show() into an int
- sections_per_block is an int.

[akpm@linux-foundation.org: s/unsigned long i/unsigned long nr/]
[david@redhat.com: v3]
  Link: http://lkml.kernel.org/r/20190620183139.4352-2-david@redhat.com
Link: http://lkml.kernel.org/r/20190614100114.311-2-david@redhat.com
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Arun KS &lt;arunks@codeaurora.org&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: maintain randomization of page free lists</title>
<updated>2019-05-15T02:52:48Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-05-14T22:41:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=97500a4a54876d3d6d2d1b8419223eb4e69b32d8'/>
<id>urn:sha1:97500a4a54876d3d6d2d1b8419223eb4e69b32d8</id>
<content type='text'>
When freeing a page with an order &gt;= shuffle_page_order, randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages, this can
tend to make the page allocator more predictable over time.  Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.
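
Roughly, the freeing path gains a coin flip for large-enough orders
(a sketch; helper names approximate the shuffle code):

    /* Cache a word of entropy and hand out one bit per free operation. */
    static bool shuffle_pick_tail(void)
    {
        static u64 rand;
        static u8 rand_bits;
        bool tail;

        if (!rand_bits) {
            rand_bits = 64;
            rand = get_random_u64();
        }
        tail = rand &amp; 1;
        rand_bits--;
        rand &gt;&gt;= 1;
        return tail;
    }

    /* In the buddy freeing path: */
    if (is_shuffle_order(order) &amp;&amp; shuffle_pick_tail())
        add_to_free_area_tail(page, &amp;zone-&gt;free_area[order], migratetype);
    else
        add_to_free_area(page, &amp;zone-&gt;free_area[order], migratetype);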

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: Robert Elliott &lt;elliott@hpe.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: move buddy list manipulations into helpers</title>
<updated>2019-05-15T02:52:48Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-05-14T22:41:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b03641af680959df57c275a80ff0dc116627c7ae'/>
<id>urn:sha1:b03641af680959df57c275a80ff0dc116627c7ae</id>
<content type='text'>
In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions.  Provide a common control point for injecting
additional behavior when freeing pages.
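
The helpers end up looking roughly like this (simplified; the real
versions live in include/linux/mmzone.h):

    static inline void add_to_free_area(struct page *page, struct free_area *area,
                                        int migratetype)
    {
        list_add(&amp;page-&gt;lru, &amp;area-&gt;free_list[migratetype]);
        area-&gt;nr_free++;
    }

    static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
                                             int migratetype)
    {
        list_add_tail(&amp;page-&gt;lru, &amp;area-&gt;free_list[migratetype]);
        area-&gt;nr_free++;
    }

    static inline void del_page_from_free_area(struct page *page, struct free_area *area)
    {
        list_del(&amp;page-&gt;lru);
        __ClearPageBuddy(page);
        set_page_private(page, 0);
        area-&gt;nr_free--;
    }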

[dan.j.williams@intel.com: fix buddy list helpers]
  Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
  Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Tested-by: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: Robert Elliott &lt;elliott@hpe.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: shuffle initial free memory to improve memory-side-cache utilization</title>
<updated>2019-05-15T02:52:48Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2019-05-14T22:41:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e900a918b0984ec8f2eb150b8477a47b75d17692'/>
<id>urn:sha1:e900a918b0984ec8f2eb150b8477a47b75d17692</id>
<content type='text'>
Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache.  Memory side caching is a platform
capability that Linux has previously been exposed to in HPC
(high-performance computing) environments on specialty platforms.  In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability").  With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes.  A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable.  While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts.  Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived case used the numa_emulation capability to force an
   instance of the benchmark to be run in two of the near-memory sized
   numa nodes.  If both instances were placed on the same emulated node
   they would fit and cause zero conflicts.  While on separate emulated nodes
   without randomization they underutilized the cache and conflicted
   unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.  The cache conflict rate was 8%
   for the first run and degraded to 21% after page allocator aging.  With
   randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of some
   improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the benefit
may be negligible to negative for other workloads.  For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization.  Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects.  The kernel page allocator, especially
early in system boot, has predictable first-in-first-out behavior for
physical pages.  Pages are freed in physical address order when first
onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches, it leaves the vast bulk of memory to be allocated predictably,
in order.  It should be noted, however, that the concrete security
benefits are hard to quantify, and no known CVE is mitigated by this
randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time.  Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).
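
Conceptually, per zone the shuffle walks order-sized runs of pfns and
swaps each run with one chosen at random from the not-yet-visited
remainder, i.e. a Fisher-Yates shuffle.  A rough sketch follows;
run_is_free() and swap_runs() are illustrative helpers, and the real
code must also recheck that both runs are still free buddy pages of
the expected order:

    static void shuffle_zone_sketch(struct zone *z, int order)
    {
        unsigned long nruns = z-&gt;spanned_pages &gt;&gt; order;
        unsigned long i, j;

        for (i = 0; i + 1 &lt; nruns; i++) {
            /* pick j uniformly from the unvisited tail [i, nruns) */
            j = i + get_random_long() % (nruns - i);

            if (!run_is_free(z, i, order) || !run_is_free(z, j, order))
                continue;

            /* re-queue both runs, effectively swapping their list order */
            swap_runs(z, i, j, order);
        }
    }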

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1,
i.e. order 10 (4MB); this trades off randomization granularity for
time spent shuffling.  MAX_ORDER-1 was chosen to be minimally invasive
to the page allocator while still showing memory-side-cache behavior
improvements, with the expectation that the security implications of
finer-granularity randomization are mitigated by
CONFIG_SLAB_FREELIST_RANDOM.  The performance impact of the shuffling
appears to be in the noise compared to other memory initialization
work.

This initial randomization can be undone over time, so a follow-on patch
is introduced to inject entropy on page free decisions.  It is
reasonable to ask whether the page-free entropy alone would be
sufficient; it is not, due to the in-order initial freeing of pages.  At
the start of that process, putting page1 in front of or behind page0
still keeps them close together; page2 is still near page1 and has a
high chance of being adjacent.  As more pages are added, ordering
diversity improves, but there is still high page locality for the
low-address pages, and this leads to no significant impact on the
cache-conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
  Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
  Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Qian Cai &lt;cai@lca.pw&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: Robert Elliott &lt;elliott@hpe.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
