<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/mmzone.h, branch v5.6.10</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.6.10</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.6.10'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-02-04T03:05:23Z</updated>
<entry>
<title>mm: factor out next_present_section_nr()</title>
<updated>2020-02-04T03:05:23Z</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2020-02-04T01:34:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4c6058814ec4460c25111e29452ef596acdcd61b'/>
<id>urn:sha1:4c6058814ec4460c25111e29452ef596acdcd61b</id>
<content type='text'>
Let's move it to the header and use the shorter variant from
mm/page_alloc.c (the original one also checks
"__highest_present_section_nr + 1", which is not necessary).  While at
it, make the section_nr in next_pfn() const.

In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
exceed __highest_present_section_nr, which doesn't make a difference in
the caller as it is big enough (&gt;= all sane end_pfn).
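
For reference, a minimal sketch of what the factored-out helper looks
like in include/linux/mmzone.h (modulo the exact details of the patch):

static inline unsigned long next_present_section_nr(unsigned long section_nr)
{
        while (++section_nr &lt;= __highest_present_section_nr) {
                if (present_section_nr(section_nr))
                        return section_nr;
        }

        return -1;
}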

Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: "Jin, Zhi" &lt;zhi.jin@intel.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix comments related to node reclaim</title>
<updated>2020-01-31T18:30:39Z</updated>
<author>
<name>Hao Lee</name>
<email>haolee.swjtu@gmail.com</email>
</author>
<published>2020-01-31T06:15:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0a3c57729768e08de690ba872dd52ee9c3c11e8b'/>
<id>urn:sha1:0a3c57729768e08de690ba872dd52ee9c3c11e8b</id>
<content type='text'>
As zone reclaim has been replaced by node reclaim, this patch fixes
related comments.

Link: http://lkml.kernel.org/r/20191126141346.GA22665@haolee.github.io
Signed-off-by: Hao Lee &lt;haolee.swjtu@gmail.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg/slab: fix percpu slab vmstats flushing</title>
<updated>2020-01-14T02:19:02Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2020-01-14T00:29:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4a87e2a25dc27131c3cce5e94421622193305638'/>
<id>urn:sha1:4a87e2a25dc27131c3cce5e94421622193305638</id>
<content type='text'>
Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure.  Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.

The second flushing is required because of how recursive vmstats are
implemented: counters are batched in percpu variables at the local level,
and once a percpu value crosses a predefined threshold, it spills over to
the atomic counters at the local level and at each ancestor level.  This
means that without flushing, some numbers cached in percpu variables
would be dropped on the floor each time a cgroup is destroyed.  And with
uptime the error on upper levels might become noticeable.
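
For illustration, a simplified sketch of this batching scheme (the
function name is illustrative and the member names only approximate the
memcg code of this era; the spill threshold corresponds to the 32 pages
per cpu mentioned below):

static void mod_memcg_state_batched(struct mem_cgroup *memcg, int idx, int val)
{
        long x = val + __this_cpu_read(memcg-&gt;vmstats_percpu-&gt;stat[idx]);

        if (unlikely(abs(x) &gt; MEMCG_CHARGE_BATCH)) {
                struct mem_cgroup *mi;

                /* spill the batched value to this level and each ancestor */
                for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                        atomic_long_add(x, &amp;mi-&gt;vmstats[idx]);
                x = 0;
        }
        __this_cpu_write(memcg-&gt;vmstats_percpu-&gt;stat[idx], x);
}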

The first flushing aims to make counters on ancestor levels more
precise.  Dying cgroups may remain in the dying state for a long time.
After the kmem_cache reparenting that is performed during offlining, the
slab counters of the dying cgroup have no chance of being updated,
because any slab operations will be performed at the parent level.  This
means that the inaccuracy caused by percpu batching will not decrease
until the final destruction of the cgroup.  The original idea was that
flushing slab counters during offlining would minimize the visible
inaccuracy of slab counters at the parent level.

The problem is that percpu counters are not zeroed after the first
flushing, so every cached percpu value is summed twice.  This creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
at the parent cgroup level.  After thousands of child cgroups have been
created and destroyed, the slab counter at the parent level can be way
off the real value.

For now, let's just stop flushing slab counters on memcg offlining.  It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.

With this change, slab counters at the parent level will become
eventually consistent.  Once all dying children are gone, the values are
correct.  And if not, the error is capped at 32 * NR_CPUS pages per dying
cgroup.

It's not perfect, as slabs are reparented, so any updates after the
reparenting will happen at the parent level.  This means that if a slab
page was allocated, a counter at the child level was bumped, and then the
page was reparented and freed, the cancellation of positive and negative
counter values will not happen until the child cgroup is released.  This
makes slab counters different from the others, and we might want to
implement flushing in a correct form again.  But it's also a question of
performance: scheduling work on each cpu isn't free, and it's an open
question whether the benefit of having more accurate counters is worth it.

We might also consider flushing all counters on offlining, not only slab
counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so that at least the error won't grow with uptime (or, more
precisely, with the number of created and destroyed cgroups).  And think
about the accuracy of counters separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix struct member name in function comments</title>
<updated>2019-12-01T20:59:10Z</updated>
<author>
<name>Hao Lee</name>
<email>haolee.swjtu@gmail.com</email>
</author>
<published>2019-12-01T01:58:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=84218b552e0a591ac706a926d5e1e8eaf0d5a03a'/>
<id>urn:sha1:84218b552e0a591ac706a926d5e1e8eaf0d5a03a</id>
<content type='text'>
The member in struct zonelist is _zonerefs instead of zones.
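
For context, the struct in question is roughly:

struct zonelist {
        struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};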

Link: http://lkml.kernel.org/r/20190927144049.GA29622@haolee.github.io
Signed-off-by: Hao Lee &lt;haolee.swjtu@gmail.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: enforce inactive:active ratio at the reclaim root</title>
<updated>2019-12-01T20:59:07Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-12-01T01:56:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b91ac374346ba206cfd568bb0ab830af6b205cfd'/>
<id>urn:sha1:b91ac374346ba206cfd568bb0ab830af6b205cfd</id>
<content type='text'>
We split the LRU lists into an inactive and an active part to maximize
workingset protection while allowing just enough inactive cache space to
facilitate readahead and writeback for one-off file accesses (e.g.  a
linear scan through a file, or logging); or just enough inactive anon to
maintain recent reference information when reclaim needs to swap.

With cgroups and their nested LRU lists, we currently don't do this
correctly.  While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, inactive:active size
decisions are made at the per-cgroup level.  As a result, we'll reclaim a
cgroup's workingset when it doesn't have cold pages, even when one of its
siblings has plenty of them that should be reclaimed first.

For example: workload A has 50M worth of hot cache but doesn't do any
one-off file accesses; meanwhile, parallel workload B scans files and
rarely accesses the same page twice.

If these workloads were to run in an uncgrouped system, A would be
protected from the high rate of cache faults from B.  But if they were put
in parallel cgroups for memory accounting purposes, B's fast cache fault
rate would push out the hot cache pages of A.  This is unexpected and
undesirable - the "scan resistance" of the page cache is broken.

This patch moves inactive:active size balancing decisions to the root of
reclaim - the same level where the LRU order is established.

It does this by looking at the recursive size of the inactive and the
active file sets of the cgroup subtree at the beginning of the reclaim
cycle, and then making a decision - scan or skip active pages - that
applies throughout the entire run and to every cgroup involved.
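
The underlying size check is the long-standing sqrt-based heuristic; a
simplified sketch (illustrative only, not the exact mm/vmscan.c code):

static bool inactive_is_low(unsigned long inactive, unsigned long active)
{
        unsigned long gb = (inactive + active) &gt;&gt; (30 - PAGE_SHIFT);
        unsigned long inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

        /* the inactive list should hold at least active/ratio pages */
        return inactive * inactive_ratio &lt; active;
}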

With that in place, in the test above, the VM will recognize that there
are plenty of inactive pages in the combined cache set of workloads A and
B and prefer the one-off cache in B over the hot pages in A.  The scan
resistance of the cache is restored.

[cai@lca.pw: fix some -Wenum-conversion warnings]
  Link: http://lkml.kernel.org/r/1573848697-29262-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191107205334.158354-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: harmonize writeback congestion tracking for nodes &amp; memcgs</title>
<updated>2019-12-01T20:59:07Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-12-01T01:55:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1b05117df78e035afb5f66ef50bf8750d976ef08'/>
<id>urn:sha1:1b05117df78e035afb5f66ef50bf8750d976ef08</id>
<content type='text'>
The current writeback congestion tracking has separate flags for kswapd
reclaim (node level) and cgroup limit reclaim (memcg-node level).  This is
unnecessarily complicated: the lruvec is an existing abstraction layer for
that node-memcg intersection.

Introduce lruvec-&gt;flags and LRUVEC_CONGESTED.  Then track that at the
reclaim root level, which is either the NUMA node for global reclaim, or
the cgroup-node intersection for cgroup reclaim.
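
Sketched, the new state is a flags word in struct lruvec plus a flag bit,
roughly:

enum lruvec_flags {
        LRUVEC_CONGESTED,       /* many dirty pages backed by a congested BDI */
};

Reclaim then sets, tests and clears that bit on the root lruvec's
lruvec-&gt;flags word instead of the separate node and memcg-node flags.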

Link: http://lkml.kernel.org/r/20191022144803.302233-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Roman Gushchin &lt;guro@fb.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: clean up and clarify lruvec lookup procedure</title>
<updated>2019-12-01T20:59:06Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-12-01T01:55:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=867e5e1de14b2b2bde324cdfeec3f3f83eb21424'/>
<id>urn:sha1:867e5e1de14b2b2bde324cdfeec3f3f83eb21424</id>
<content type='text'>
There is a per-memcg lruvec and a NUMA node lruvec.  Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.

How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs.  When memory cgroups are not compiled
in or disabled at runtime, we use pgdat-&gt;lruvec.

Document that in a comment.

Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find the
right lruvec manually right now.  But to avoid future mistakes, rename the
pgdat-&gt;lruvec member to pgdat-&gt;__lruvec and delete the convenience wrapper
that suggests it's a commonly accessed member.

While in this area, swap the mem_cgroup_lruvec() argument order.  The name
suggests a memcg operation, yet it takes a pgdat first and a memcg second.
I have to do a double take every time I call this.  Fix that.
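
After the swap, the lookup reads roughly like the sketch below
(simplified; the real helper also handles a NULL memcg and keeps the
lruvec's pgdat pointer in sync):

static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
                                               struct pglist_data *pgdat)
{
        if (mem_cgroup_disabled())
                return &amp;pgdat-&gt;__lruvec;

        return &amp;memcg-&gt;nodeinfo[pgdat-&gt;node_id]-&gt;lruvec;
}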

Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>include/linux/mmzone.h: fix comment for ISOLATE_UNMAPPED macro</title>
<updated>2019-12-01T20:59:06Z</updated>
<author>
<name>Hao Lee</name>
<email>haolee.swjtu@gmail.com</email>
</author>
<published>2019-12-01T01:55:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=653e003d7f37716f84c17edcad3c228497888bfc'/>
<id>urn:sha1:653e003d7f37716f84c17edcad3c228497888bfc</id>
<content type='text'>
Both file-backed pages and anonymous pages can be unmapped.
ISOLATE_UNMAPPED is not just for file-backed pages.
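
With the fix, the definition reads roughly:

/* Isolate unmapped pages */
#define ISOLATE_UNMAPPED        ((__force isolate_mode_t)0x2)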

Link: http://lkml.kernel.org/r/20191024151621.GA20400@haolee.github.io
Signed-off-by: Hao Lee &lt;haolee.swjtu@gmail.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: refresh ZONE_DMA and ZONE_DMA32 comments in 'enum zone_type'</title>
<updated>2019-10-14T09:56:29Z</updated>
<author>
<name>Nicolas Saenz Julienne</name>
<email>nsaenzjulienne@suse.de</email>
</author>
<published>2019-09-11T18:25:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=734f9246e791d8da278957b2c326d7709b2a97c0'/>
<id>urn:sha1:734f9246e791d8da278957b2c326d7709b2a97c0</id>
<content type='text'>
The usage of these zones has evolved over time and the comments were
outdated.  This joins the ZONE_DMA and ZONE_DMA32 explanations and gives
up-to-date examples of how they are used on different architectures.

Signed-off-by: Nicolas Saenz Julienne &lt;nsaenzjulienne@suse.de&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>mm: thp: extract split_queue_* into a struct</title>
<updated>2019-09-24T22:54:11Z</updated>
<author>
<name>Yang Shi</name>
<email>yang.shi@linux.alibaba.com</email>
</author>
<published>2019-09-23T22:38:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=364c1eebe453f06f0c1e837eb155a5725c9cd272'/>
<id>urn:sha1:364c1eebe453f06f0c1e837eb155a5725c9cd272</id>
<content type='text'>
Patch series "Make deferred split shrinker memcg aware", v6.

Currently the THP deferred split shrinker is not memcg aware; this may
cause premature OOM with some configurations.  For example, the test
below easily runs into premature OOM:

$ cgcreate -g memory:thp
$ echo 4G &gt; /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM, but there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred
split shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue.  A THP goes on the per-memcg deferred split queue
if it belongs to a memcg, and on the per-node queue otherwise.  When the
page is migrated to another memcg, it is moved to the target memcg's
deferred split queue as well.

Reuse the second tail page's deferred_list for the per-memcg list, since
the same THP can't be on multiple deferred split queues.

Make the deferred split shrinker not depend on memcg kmem, since THPs are
not slab.  It doesn't make sense to skip shrinking THPs just because
memcg kmem is disabled.

With the above change, the test demonstrated above doesn't trigger OOM
even with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.
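
The struct introduced here is roughly:

struct deferred_split {
        spinlock_t split_queue_lock;
        struct list_head split_queue;
        unsigned long split_queue_len;
};

The node-level queue (in pg_data_t) then embeds one instance of it; later
patches in the series add a per-memcg instance as well.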

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi &lt;yang.shi@linux.alibaba.com&gt;
Suggested-by: "Kirill A . Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Acked-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Reviewed-by: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Qian Cai &lt;cai@lca.pw&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
