<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/memcontrol.h, branch v6.2.7</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.2.7</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.2.7'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2023-02-01T00:44:10Z</updated>
<entry>
<title>mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()</title>
<updated>2023-02-01T00:44:10Z</updated>
<author>
<name>Kefeng Wang</name>
<email>wangkefeng.wang@huawei.com</email>
</author>
<published>2023-01-29T04:09:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ac86f547ca1002aec2ef66b9e64d03f45bbbfbb9'/>
<id>urn:sha1:ac86f547ca1002aec2ef66b9e64d03f45bbbfbb9</id>
<content type='text'>
As commit 18365225f044 ("hwpoison, memcg: forcibly uncharge LRU pages"),
hwpoison will forcibly uncharg a LRU hwpoisoned page, the folio_memcg
could be NULl, then, mem_cgroup_track_foreign_dirty_slowpath() could
occurs a NULL pointer dereference, let's do not record the foreign
writebacks for folio memcg is null in mem_cgroup_track_foreign_dirty() to
fix it.

Link: https://lkml.kernel.org/r/20230129040945.180629-1-wangkefeng.wang@huawei.com
Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reported-by: Ma Wupeng &lt;mawupeng1@huawei.com&gt;
Tested-by: Miko Larsson &lt;mikoxyzzz@gmail.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Ma Wupeng &lt;mawupeng1@huawei.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@nec.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg: fix stale protection of reclaim target memcg</title>
<updated>2022-12-12T02:12:19Z</updated>
<author>
<name>Yosry Ahmed</name>
<email>yosryahmed@google.com</email>
</author>
<published>2022-12-02T03:15:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=adb8213014b25c7f1d75d5b219becaadcd695efb'/>
<id>urn:sha1:adb8213014b25c7f1d75d5b219becaadcd695efb</id>
<content type='text'>
Patch series "mm: memcg: fix protection of reclaim target memcg", v3.

This series fixes a bug in calculating the protection of the reclaim
target memcg where we end up using stale effective protection values from
the last reclaim operation, instead of completely ignoring the protection
of the reclaim target as intended.  More detailed explanation and examples
in patch 1, which includes the fix.  Patches 2 &amp; 3 introduce a selftest
case that catches the bug.


This patch (of 3):

When we are doing memcg reclaim, the intended behavior is that we
ignore any protection (memory.min, memory.low) of the target memcg (but
not its children).  Ever since the patch pointed to by the "Fixes" tag,
we actually read a stale value for the target memcg protection when
deciding whether to skip the memcg or not because it is protected.  If
the stale value happens to be high enough, we don't reclaim from the
target memcg.

Essentially, in some cases we may falsely skip reclaiming from the
target memcg of reclaim because we read a stale protection value from
last time we reclaimed from it.


During reclaim, mem_cgroup_calculate_protection() is used to determine the
effective protection (emin and elow) values of a memcg.  The protection of
the reclaim target is ignored, but we cannot set their effective
protection to 0 due to a limitation of the current implementation (see
comment in mem_cgroup_protection()).  Instead, we leave their effective
protection values unchaged, and later ignore it in
mem_cgroup_protection().

However, mem_cgroup_protection() is called later in
shrink_lruvec()-&gt;get_scan_count(), which is after the
mem_cgroup_below_{min/low}() checks in shrink_node_memcgs().  As a result,
the stale effective protection values of the target memcg may lead us to
skip reclaiming from the target memcg entirely, before calling
shrink_lruvec().  This can be even worse with recursive protection, where
the stale target memcg protection can be higher than its standalone
protection.  See two examples below (a similar version of example (a) is
added to test_memcontrol in a later patch).

(a) A simple example with proactive reclaim is as follows. Consider the
following hierarchy:
ROOT
 |
 A
 |
 B (memory.min = 10M)

Consider the following scenario:
- B has memory.current = 10M.
- The system undergoes global reclaim (or memcg reclaim in A).
- In shrink_node_memcgs():
  - mem_cgroup_calculate_protection() calculates the effective min (emin)
    of B as 10M.
  - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
- Now if we want to reclaim 5M from B using proactive reclaim
  (memory.reclaim), we should be able to, as the protection of the
  target memcg should be ignored.
- In shrink_node_memcgs():
  - mem_cgroup_calculate_protection() immediately returns for B without
    doing anything, as B is the target memcg, relying on
    mem_cgroup_protection() to ignore B's stale effective min (still 10M).
  - mem_cgroup_below_min() reads the stale effective min for B and we
    skip it instead of ignoring its protection as intended, as we never
    reach mem_cgroup_protection().

(b) An more complex example with recursive protection is as follows.
Consider the following hierarchy with memory_recursiveprot:
ROOT
 |
 A (memory.min = 50M)
 |
 B (memory.min = 10M, memory.high = 40M)

Consider the following scenario:
- B has memory.current = 35M.
- The system undergoes global reclaim (target memcg is NULL).
- B will have an effective min of 50M (all of A's unclaimed protection).
- B will not be reclaimed from.
- Now allocate 10M more memory in B, pushing it above it's high limit.
- The system undergoes memcg reclaim from B (target memcg is B).
- Like example (a), we do nothing in mem_cgroup_calculate_protection(),
  then call mem_cgroup_below_min(), which will read the stale effective
  min for B (50M) and skip it. In this case, it's even worse because we
  are not just considering B's standalone protection (10M), but we are
  reading a much higher stale protection (50M) which will cause us to not
  reclaim from B at all.

This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
e{low,min} state mutations from protection checks") which made
mem_cgroup_calculate_protection() only change the state without returning
any value.  Before that commit, we used to return MEMCG_PROT_NONE for the
target memcg, which would cause us to skip the
mem_cgroup_below_{min/low}() checks.  After that commit we do not return
anything and we end up checking the min &amp; low effective protections for
the target memcg, which are stale.

Update mem_cgroup_supports_protection() to also check if we are reclaiming
from the target, and rename it to mem_cgroup_unprotected() (now returns
true if we should not protect the memcg, much simpler logic).

Link: https://lkml.kernel.org/r/20221202031512.1365483-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20221202031512.1365483-2-yosryahmed@google.com
Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
Signed-off-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Reviewed-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Chris Down &lt;chris@chrisdown.name&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vasily Averin &lt;vasily.averin@linux.dev&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2022-10-11T00:53:04Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-10-11T00:53:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=27bc50fc90647bbf7b734c3fc306a5e61350da53'/>
<id>urn:sha1:27bc50fc90647bbf7b734c3fc306a5e61350da53</id>
<content type='text'>
Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP &amp; KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock-&gt;vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
</content>
</entry>
<entry>
<title>mm: memcontrol: use memcg_kmem_enabled in count_objcg_event</title>
<updated>2022-10-03T21:03:32Z</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2022-09-19T18:06:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2eb989195d9a361d13d66ffb8738847649e080ad'/>
<id>urn:sha1:2eb989195d9a361d13d66ffb8738847649e080ad</id>
<content type='text'>
Patch series "mm: memcontrol: cleanup and optimize for two accounting
params", v2.


This patch (of 2):

There are currently two helpers for checking if cgroup kmem
accounting is enabled:

- mem_cgroup_kmem_disabled
- memcg_kmem_enabled

mem_cgroup_kmem_disabled is a simple helper that returns true
if cgroup.memory=nokmem is specified, otherwise returns false.

memcg_kmem_enabled is a bit different, it returns true if
cgroup.memory=nokmem is not specified and there was at least one
non-root memory control enabled cgroup ever created. This help improve
performance when kmem accounting was not actually activated. And it's
optimized with static branch.

The usage of mem_cgroup_kmem_disabled is for sub-systems that need to
preallocate data for kmem accounting since they could be initialized
before kmem accounting is activated. But count_objcg_event doesn't
need that, so using memcg_kmem_enabled is better here.

Link: https://lkml.kernel.org/r/20220919180634.45958-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20220919180634.45958-2-ryncsn@gmail.com
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Acked-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Acked-by: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: extract memcg_vmstats from struct mem_cgroup</title>
<updated>2022-10-03T21:03:04Z</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-09-07T04:35:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=410f8e82689e1e66044fea51ef852054a09502b7'/>
<id>urn:sha1:410f8e82689e1e66044fea51ef852054a09502b7</id>
<content type='text'>
Patch series "memcg: reduce memory overhead of memory cgroups".

Currently a lot of memory is wasted to maintain the vmevents for memory
cgroups as we have multiple arrays of size NR_VM_EVENT_ITEMS which can be
as large as 110.  However memcg code uses small portion of those entries. 
This patch series eliminate this overhead by removing the unneeded vmevent
entries from memory cgroup data structures.


This patch (of 3):

This is a preparatory patch to reduce the memory overhead of memory
cgroup. The struct memcg_vmstats is the largest object embedded into the
struct mem_cgroup. This patch extracts struct memcg_vmstats from struct
mem_cgroup to ease the following patches in reducing the size of struct
memcg_vmstats.

Link: https://lkml.kernel.org/r/20220907043537.3457014-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20220907043537.3457014-2-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: convert mem_cgroup_swapin_charge_page() to mem_cgroup_swapin_charge_folio()</title>
<updated>2022-10-03T21:02:47Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2022-09-02T19:46:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6599591816f522c1cc8ec4eb5cea75738963756a'/>
<id>urn:sha1:6599591816f522c1cc8ec4eb5cea75738963756a</id>
<content type='text'>
All callers now have a folio, so pass it in here and remove an unnecessary
call to page_folio().

Link: https://lkml.kernel.org/r/20220902194653.1739778-17-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Revert "net: set proper memcg for net_init hooks allocations"</title>
<updated>2022-09-28T16:20:37Z</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-09-26T19:59:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b1cab78ba392162a5ef11173d02db6e182e04edd'/>
<id>urn:sha1:b1cab78ba392162a5ef11173d02db6e182e04edd</id>
<content type='text'>
This reverts commit 1d0403d20f6c281cb3d14c5f1db5317caeec48e9.

Anatoly Pugachev reported that the commit 1d0403d20f6c ("net: set proper
memcg for net_init hooks allocations") is somehow causing the sparc64
VMs failed to boot and the VMs boot fine with that patch reverted. So,
revert the patch for now and later we can debug the issue.

Link: https://lore.kernel.org/all/20220918092849.GA10314@u164.east.ru/
Reported-by: Anatoly Pugachev &lt;matorola@gmail.com&gt;
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Vasily Averin &lt;vvs@openvz.org&gt;
Cc: Jakub Kicinski &lt;kuba@kernel.org&gt;
Cc: Michal Koutný &lt;mkoutny@suse.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: cgroups@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Anatoly Pugachev &lt;matorola@gmail.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Fixes: 1d0403d20f6c ("net: set proper memcg for net_init hooks allocations")
Reviewed-by: Muchun Song &lt;songmuchun@bytedance.com&gt;
Acked-by: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: deduplicate cacheline padding code</title>
<updated>2022-09-27T02:46:29Z</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2022-08-26T23:06:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e6ad640bc404eb298dd1880113131768ddf5c6a8'/>
<id>urn:sha1:e6ad640bc404eb298dd1880113131768ddf5c6a8</id>
<content type='text'>
There are three users (mmzone.h, memcontrol.h, page_counter.h) using
similar code for forcing cacheline padding between fields of different
structures.  Dedup that code.

Link: https://lkml.kernel.org/r/20220826230642.566725-1-shakeelb@google.com
Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Suggested-by: Feng Tang &lt;feng.tang@intel.com&gt;
Reviewed-by: Feng Tang &lt;feng.tang@intel.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: multi-gen LRU: support page table walks</title>
<updated>2022-09-27T02:46:09Z</updated>
<author>
<name>Yu Zhao</name>
<email>yuzhao@google.com</email>
</author>
<published>2022-09-18T08:00:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bd74fdaea146029e4fa12c6de89adbe0779348a9'/>
<id>urn:sha1:bd74fdaea146029e4fa12c6de89adbe0779348a9</id>
<content type='text'>
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages.  A kill switch will be
added in the next patch to disable this behavior.  When disabled, the
aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated.  Given an
lruvec, the aging iterates lruvec_memcg()-&gt;mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.

When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently.  Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim.  Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_folio
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot &lt;lkp@intel.com&gt;

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao &lt;yuzhao@google.com&gt;
Acked-by: Brian Geffon &lt;bgeffon@google.com&gt;
Acked-by: Jan Alexander Steffens (heftig) &lt;heftig@archlinux.org&gt;
Acked-by: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Acked-by: Steven Barrett &lt;steven@liquorix.net&gt;
Acked-by: Suleiman Souhlal &lt;suleiman@google.com&gt;
Tested-by: Daniel Byrne &lt;djbyrne@mtu.edu&gt;
Tested-by: Donald Carr &lt;d@chaos-reins.com&gt;
Tested-by: Holger Hoffstätte &lt;holger@applied-asynchrony.com&gt;
Tested-by: Konstantin Kharlamov &lt;Hi-Angel@yandex.ru&gt;
Tested-by: Shuang Zhai &lt;szhai2@cs.rochester.edu&gt;
Tested-by: Sofia Trinh &lt;sofia.trinh@edi.works&gt;
Tested-by: Vaibhav Jain &lt;vaibhav@linux.ibm.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Barry Song &lt;baohua@kernel.org&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Hillf Danton &lt;hdanton@sina.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Michael Larabel &lt;Michael@MichaelLarabel.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: multi-gen LRU: exploit locality in rmap</title>
<updated>2022-09-27T02:46:09Z</updated>
<author>
<name>Yu Zhao</name>
<email>yuzhao@google.com</email>
</author>
<published>2022-09-18T08:00:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=018ee47f14893d500131dfca2ff9f3ff8ebd4ed2'/>
<id>urn:sha1:018ee47f14893d500131dfca2ff9f3ff8ebd4ed2</id>
<content type='text'>
Searching the rmap for PTEs mapping each page on an LRU list (to test and
clear the accessed bit) can be expensive because pages from different VMAs
(PA space) are not cache friendly to the rmap (VA space).  For workloads
mostly using mapped pages, searching the rmap can incur the highest CPU
cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the rmap. 
When shrink_page_list() walks the rmap and finds a young PTE, a new
function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
PTEs.  On finding another young PTE, it clears the accessed bit and
updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[3, 5]%
                Ops/sec      KB/sec
      patch1-6: 1106168.46   43025.04
      patch1-7: 1147696.57   44640.29

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

  Configurations:
    no change

Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
Signed-off-by: Yu Zhao &lt;yuzhao@google.com&gt;
Acked-by: Barry Song &lt;baohua@kernel.org&gt;
Acked-by: Brian Geffon &lt;bgeffon@google.com&gt;
Acked-by: Jan Alexander Steffens (heftig) &lt;heftig@archlinux.org&gt;
Acked-by: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Acked-by: Steven Barrett &lt;steven@liquorix.net&gt;
Acked-by: Suleiman Souhlal &lt;suleiman@google.com&gt;
Tested-by: Daniel Byrne &lt;djbyrne@mtu.edu&gt;
Tested-by: Donald Carr &lt;d@chaos-reins.com&gt;
Tested-by: Holger Hoffstätte &lt;holger@applied-asynchrony.com&gt;
Tested-by: Konstantin Kharlamov &lt;Hi-Angel@yandex.ru&gt;
Tested-by: Shuang Zhai &lt;szhai2@cs.rochester.edu&gt;
Tested-by: Sofia Trinh &lt;sofia.trinh@edi.works&gt;
Tested-by: Vaibhav Jain &lt;vaibhav@linux.ibm.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Hillf Danton &lt;hdanton@sina.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Michael Larabel &lt;Michael@MichaelLarabel.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
