<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/lib/xarray.c, branch v5.8.5</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.8.5</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.8.5'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-03-12T21:42:08Z</updated>
<entry>
<title>xarray: Fix early termination of xas_for_each_marked</title>
<updated>2020-03-12T21:42:08Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-03-12T21:29:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7e934cf5ace1dceeb804f7493fa28bb697ed3c52'/>
<id>urn:sha1:7e934cf5ace1dceeb804f7493fa28bb697ed3c52</id>
<content type='text'>
xas_for_each_marked() is using entry == NULL as a termination condition
of the iteration. When xas_for_each_marked() is used protected only by
RCU, this can however race with xas_store(xas, NULL) in the following
way:

TASK1                                   TASK2
page_cache_delete()         	        find_get_pages_range_tag()
                                          xas_for_each_marked()
                                            xas_find_marked()
                                              off = xas_find_chunk()

  xas_store(&amp;xas, NULL)
    xas_init_marks(&amp;xas);
    ...
    rcu_assign_pointer(*slot, NULL);
                                              entry = xa_entry(off);

And thus xas_for_each_marked() terminates prematurely possibly leading
to missed entries in the iteration (translating to missing writeback of
some pages or a similar problem).

If we find a NULL entry that has been marked, skip it (unless we're trying
to allocate an entry).

Reported-by: Jan Kara &lt;jack@suse.cz&gt;
CC: stable@vger.kernel.org
Fixes: ef8e5717db01 ("page cache: Convert delete_batch to XArray")
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
</content>
</entry>
<entry>
<title>XArray: Optimise xas_sibling() if !CONFIG_XARRAY_MULTI</title>
<updated>2020-02-27T12:37:40Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-02-27T12:37:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d8e93e3f22d9fd2e6a3ccae3623c3af8789ccfc0'/>
<id>urn:sha1:d8e93e3f22d9fd2e6a3ccae3623c3af8789ccfc0</id>
<content type='text'>
If CONFIG_XARRAY_MULTI is disabled, then xas_sibling() must be false.

Reported-by: JaeJoon Jung &lt;rgbi3307@gmail.com&gt;
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
</content>
</entry>
<entry>
<title>XArray: Fix xas_pause for large multi-index entries</title>
<updated>2020-01-31T20:09:49Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-01-31T11:17:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c36d451ad386b34f452fc3c8621ff14b9eaa31a6'/>
<id>urn:sha1:c36d451ad386b34f452fc3c8621ff14b9eaa31a6</id>
<content type='text'>
Inspired by the recent Coverity report, I looked for other places where
the offset wasn't being converted to an unsigned long before being
shifted, and I found one in xas_pause() when the entry being paused is
of order &gt;32.

Fixes: b803b42823d0 ("xarray: Add XArray iterators")
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>XArray: Fix xa_find_next for large multi-index entries</title>
<updated>2020-01-31T20:09:36Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-01-31T10:07:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bd40b17ca49d7d110adf456e647701ce74de2241'/>
<id>urn:sha1:bd40b17ca49d7d110adf456e647701ce74de2241</id>
<content type='text'>
Coverity pointed out that xas_sibling() was shifting xa_offset without
promoting it to an unsigned long first, so the shift could cause an
overflow and we'd get the wrong answer.  The fix is obvious, and the
new test-case provokes UBSAN to report an error:
runtime error: shift exponent 60 is too large for 32-bit type 'int'

Fixes: 19c30f4dd092 ("XArray: Fix xa_find_after with multi-index entries")
Reported-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Reported-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>XArray: Fix xas_find returning too many entries</title>
<updated>2020-01-18T03:33:33Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-01-18T03:13:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c44aa5e8ab58b5f4cf473970ec784c3333496a2e'/>
<id>urn:sha1:c44aa5e8ab58b5f4cf473970ec784c3333496a2e</id>
<content type='text'>
If you call xas_find() with the initial index &gt; max, it should have
returned NULL but was returning the entry at index.

Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>XArray: Fix xa_find_after with multi-index entries</title>
<updated>2020-01-18T03:33:27Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-01-18T03:00:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=19c30f4dd0923ef191f35c652ee4058e91e89056'/>
<id>urn:sha1:19c30f4dd0923ef191f35c652ee4058e91e89056</id>
<content type='text'>
If the entry is of an order which is a multiple of XA_CHUNK_SIZE,
the current detection of sibling entries does not work.  Factor out
an xas_sibling() function to make xa_find_after() a little more
understandable, and write a new implementation that doesn't suffer from
the same bug.

Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>XArray: Fix infinite loop with entry at ULONG_MAX</title>
<updated>2020-01-18T03:32:24Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2020-01-17T22:45:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=430f24f94c8a174d411a550d7b5529301922e67a'/>
<id>urn:sha1:430f24f94c8a174d411a550d7b5529301922e67a</id>
<content type='text'>
If there is an entry at ULONG_MAX, xa_for_each() will overflow the
'index + 1' in xa_find_after() and wrap around to 0.  Catch this case
and terminate the loop by returning NULL.

Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>XArray: Fix xas_pause at ULONG_MAX</title>
<updated>2019-11-09T04:48:40Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2019-11-08T03:49:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=82a22311b7a68a78709699dc8c098953b70e4fd2'/>
<id>urn:sha1:82a22311b7a68a78709699dc8c098953b70e4fd2</id>
<content type='text'>
If we were unlucky enough to call xas_pause() when the index was at
ULONG_MAX (or a multi-slot entry which ends at ULONG_MAX), we would
wrap the index back around to 0 and restart the iteration from the
beginning.  Use the XAS_BOUNDS state to indicate that we should just
stop the iteration.

Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
</content>
</entry>
<entry>
<title>XArray: Fix xas_next() with a single entry at 0</title>
<updated>2019-07-01T21:11:16Z</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2019-07-01T21:03:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=91abab83839aa2eba073e4a63c729832fdb27ea1'/>
<id>urn:sha1:91abab83839aa2eba073e4a63c729832fdb27ea1</id>
<content type='text'>
If there is only a single entry at 0, the first time we call xas_next(),
we return the entry.  Unfortunately, all subsequent times we call
xas_next(), we also return the entry at 0 instead of noticing that the
xa_index is now greater than zero.  This broke find_get_pages_contig().

Fixes: 64d3e9a9e0cc ("xarray: Step through an XArray")
Reported-by: Kent Overstreet &lt;kent.overstreet@gmail.com&gt;
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
</content>
</entry>
<entry>
<title>mm: fix page cache convergence regression</title>
<updated>2019-05-31T17:52:41Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-05-24T14:12:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7b785645e8f13e17cbce492708cf6e7039d32e46'/>
<id>urn:sha1:7b785645e8f13e17cbce492708cf6e7039d32e46</id>
<content type='text'>
Since a28334862993 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.

On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b

workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.

While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.

Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:

[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope

As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.

Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.

To fix this, allow xarray users to specify per-tree flag that makes
xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
tree annotation to request such cgroup tracking for the cache nodes.

With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b

Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
</content>
</entry>
</feed>
