<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/swap.h, branch v4.14.283</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.14.283</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.14.283'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-12-05T14:37:51Z</updated>
<entry>
<title>vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n</title>
<updated>2019-12-05T14:37:51Z</updated>
<author>
<name>Wei Yang</name>
<email>richard.weiyang@gmail.com</email>
</author>
<published>2018-12-28T08:34:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1a439b577e36e8f6a9452c3731d2a6a088a85c33'/>
<id>urn:sha1:1a439b577e36e8f6a9452c3731d2a6a088a85c33</id>
<content type='text'>
[ Upstream commit 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 ]

Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that
fail zone_reclaim() as full") changed the return value of
node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
after this commit.

While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed.  This will leads to call zone_watermark_ok() again.

This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.

Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>include/linux/swap.h: use offsetof() instead of custom __swapoffset macro</title>
<updated>2019-04-20T07:15:07Z</updated>
<author>
<name>Pi-Hsun Shih</name>
<email>pihsun@chromium.org</email>
</author>
<published>2019-03-13T18:44:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c43e2bdd14941e969dac515efadc3d379f333fdf'/>
<id>urn:sha1:c43e2bdd14941e969dac515efadc3d379f333fdf</id>
<content type='text'>
[ Upstream commit a4046c06be50a4f01d435aa7fe57514818e6cc82 ]

Use offsetof() to calculate offset of a field to take advantage of
compiler built-in version when possible, and avoid UBSAN warning when
compiling with Clang:

  UBSAN: Undefined behaviour in mm/swapfile.c:3010:38
  member access within null pointer of type 'union swap_header'
  CPU: 6 PID: 1833 Comm: swapon Tainted: G S                4.19.23 #43
  Call trace:
   dump_backtrace+0x0/0x194
   show_stack+0x20/0x2c
   __dump_stack+0x20/0x28
   dump_stack+0x70/0x94
   ubsan_epilogue+0x14/0x44
   ubsan_type_mismatch_common+0xf4/0xfc
   __ubsan_handle_type_mismatch_v1+0x34/0x54
   __se_sys_swapon+0x654/0x1084
   __arm64_sys_swapon+0x1c/0x24
   el0_svc_common+0xa8/0x150
   el0_svc_compat_handler+0x2c/0x38
   el0_svc_compat+0x8/0x18

Link: http://lkml.kernel.org/r/20190312081902.223764-1-pihsun@chromium.org
Signed-off-by: Pi-Hsun Shih &lt;pihsun@chromium.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/swap: use nr_node_ids for avail_lists in swap_info_struct</title>
<updated>2019-01-26T08:37:06Z</updated>
<author>
<name>Aaron Lu</name>
<email>aaron.lu@intel.com</email>
</author>
<published>2018-12-28T08:34:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4fb12a087a80dbedf56e4157b755bdea958aae7e'/>
<id>urn:sha1:4fb12a087a80dbedf56e4157b755bdea958aae7e</id>
<content type='text'>
[ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ]

Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
avail_lists field of swap_info_struct is changed to an array with
MAX_NUMNODES elements.  This made swap_info_struct size increased to 40KiB
and needs an order-4 page to hold it.

This is not optimal in that:
1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
  is a waste of memory;
2 It could cause swapon failure if the swap device is swapped on
  after system has been running for a while, due to no order-4
  page is available as pointed out by Vasily Averin.

Solve the above two issues by using nr_node_ids(which is the actual
possible node number the running system has) for avail_lists instead of
MAX_NUMNODES.

nr_node_ids is unknown at compile time so can't be directly used when
declaring this array.  What I did here is to declare avail_lists as zero
element array and allocate space for it when allocating space for
swap_info_struct.  The reason why keep using array but not pointer is
plist_for_each_entry needs the field to be part of the struct, so pointer
will not work.

This patch is on top of Vasily Averin's fix commit.  I think the use of
kvzalloc for swap_info_struct is still needed in case nr_node_ids is
really big on some systems.

Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
Signed-off-by: Aaron Lu &lt;aaron.lu@intel.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vasily Averin &lt;vvs@virtuozzo.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: fix race between swap count continuation operations</title>
<updated>2017-11-03T14:39:19Z</updated>
<author>
<name>Huang Ying</name>
<email>ying.huang@intel.com</email>
</author>
<published>2017-11-02T22:59:50Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2628bd6fc052bd85e9864dae4de494d8a6313391'/>
<id>urn:sha1:2628bd6fc052bd85e9864dae4de494d8a6313391</id>
<content type='text'>
One page may store a set of entries of the sis-&gt;swap_map
(swap_info_struct-&gt;swap_map) in multiple swap clusters.

If some of the entries has sis-&gt;swap_map[offset] &gt; SWAP_MAP_MAX,
multiple pages will be used to store the set of entries of the
sis-&gt;swap_map.  And the pages are linked with page-&gt;lru.  This is called
swap count continuation.  To access the pages which store the set of
entries of the sis-&gt;swap_map simultaneously, previously, sis-&gt;lock is
used.  But to improve the scalability of __swap_duplicate(), swap
cluster lock may be used in swap_count_continued() now.  This may race
with add_swap_count_continuation() which operates on a nearby swap
cluster, in which the sis-&gt;swap_map entries are stored in the same page.

The race can cause wrong swap count in practice, thus cause unfreeable
swap entries or software lockup, etc.

To fix the race, a new spin lock called cont_lock is added to struct
swap_info_struct to protect the swap count continuation page list.  This
is a lock at the swap device level, so the scalability isn't very well.
But it is still much better than the original sis-&gt;lock, because it is
only acquired/released when swap count continuation is used.  Which is
considered rare in practice.  If it turns out that the scalability
becomes an issue for some workloads, we can split the lock into some
more fine grained locks.

Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Aaron Lu &lt;aaron.lu@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.11+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>License cleanup: add SPDX GPL-2.0 license identifier to files with no license</title>
<updated>2017-11-02T10:10:55Z</updated>
<author>
<name>Greg Kroah-Hartman</name>
<email>gregkh@linuxfoundation.org</email>
</author>
<published>2017-11-01T14:07:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b24413180f5600bcb3bb70fbed5cf186b60864bd'/>
<id>urn:sha1:b24413180f5600bcb3bb70fbed5cf186b60864bd</id>
<content type='text'>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode &amp; Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained &gt;5
   lines of source
 - File already had some variant of a license header in it (even if &lt;5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart &lt;kstewart@linuxfoundation.org&gt;
Reviewed-by: Philippe Ombredanne &lt;pombredanne@nexb.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory</title>
<updated>2017-09-09T01:26:46Z</updated>
<author>
<name>Jérôme Glisse</name>
<email>jglisse@redhat.com</email>
</author>
<published>2017-09-08T23:11:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5042db43cc26f51eed51c56192e2c2317e44315f'/>
<id>urn:sha1:5042db43cc26f51eed51c56192e2c2317e44315f</id>
<content type='text'>
HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory.  Reasons for HMM and
migration to device memory is explained with HMM core patch.

This patch deals with device memory that is un-addressable memory (ie CPU
can not access it).  Hence we do not want those struct page to be manage
like regular memory.  That is why we extend ZONE_DEVICE to support
different types of memory.

A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type.  There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type.  All specific code
path are protect with test against the memory type.

Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).

The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.

The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU).  This callback is responsible to migrate the page back to system
main memory.  Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.

If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.

[arnd@arndb.de: fix warning]
  Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Acked-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Cc: Aneesh Kumar &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: David Nellans &lt;dnellans@nvidia.com&gt;
Cc: Evgeny Baskakov &lt;ebaskakov@nvidia.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Mark Hairgrove &lt;mhairgrove@nvidia.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Sherry Cheung &lt;SCheung@nvidia.com&gt;
Cc: Subhash Gutti &lt;sgutti@nvidia.com&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Cc: Bob Liu &lt;liubo95@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>swap: choose swap device according to numa node</title>
<updated>2017-09-07T00:27:30Z</updated>
<author>
<name>Aaron Lu</name>
<email>aaron.lu@intel.com</email>
</author>
<published>2017-09-06T23:24:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a2468cc9bfdff6139f59ca896671e5819ff5f94a'/>
<id>urn:sha1:a2468cc9bfdff6139f59ca896671e5819ff5f94a</id>
<content type='text'>
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin.  This patch changes the previous
single global swap_avail_list into a per-numa-node list, i.e.  for each
numa node, it sees its own priority based list of available swap
devices.  Swap device's priority can be promoted on its matching node's
swap_avail_list.

The current swap device's priority is set as: user can set a &gt;=0 value,
or the system will pick one starting from -1 then downwards.  The
priority value in the swap_avail_list is the negated value of the swap
device's due to plist being sorted from low to high.  The new policy
doesn't change the semantics for priority &gt;=0 cases, the previous
starting from -1 then downwards now becomes starting from -2 then
downwards and -1 is reserved as the promoted value.

Take 4-node EX machine as an example, suppose 4 swap devices are
available, each sit on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3

After they are all swapped on in the sequence of ABCD.

Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their position in the global swap_avail_list will be:
swapA   -&gt; swapB   -&gt; swapC   -&gt; swapD
prio:1     prio:2     prio:3     prio:4

New behaviour:
their priorities will be(note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
swapA   -&gt; swapB   -&gt; swapC   -&gt; swapD
prio:1     prio:3     prio:4     prio:5
swap_avali_lists[1]: /* node 1's available swap device list */
swapB   -&gt; swapA   -&gt; swapC   -&gt; swapD
prio:1     prio:2     prio:4     prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
swapC   -&gt; swapA   -&gt; swapB   -&gt; swapD
prio:1     prio:2     prio:3     prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
swapD   -&gt; swapA   -&gt; swapB   -&gt; swapC
prio:1     prio:2     prio:3     prio:4

To see the effect of the patch, a test that starts N process, each mmap
a region of anonymous memory and then continually write to it at random
position to trigger both swap in and out is used.

On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices with each attached to a different node, the
result is:

runtime=30m/processes=32/total test size=128G/each process mmap region=4G
kernel         throughput
vanilla        13306
auto-binding   15169 +14%

runtime=30m/processes=64/total test size=128G/each process mmap region=2G
kernel         throughput
vanilla        11885
auto-binding   14879 +25%

[aaron.lu@intel.com: v2]
  Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
  Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
[akpm@linux-foundation.org: use kmalloc_array()]
Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
Signed-off-by: Aaron Lu &lt;aaron.lu@intel.com&gt;
Cc: "Chen, Tim C" &lt;tim.c.chen@intel.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: don't use VMA based swap readahead if HDD is used as swap</title>
<updated>2017-09-07T00:27:30Z</updated>
<author>
<name>Huang Ying</name>
<email>ying.huang@intel.com</email>
</author>
<published>2017-09-06T23:24:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=81a0298bdfab0203d360df7c9bf690d1d457f999'/>
<id>urn:sha1:81a0298bdfab0203d360df7c9bf690d1d457f999</id>
<content type='text'>
VMA based swap readahead will readahead the virtual pages that is
continuous in the virtual address space.  While the original swap
readahead will readahead the swap slots that is continuous in the swap
device.  Although VMA based swap readahead is more correct for the swap
slots to be readahead, it will trigger more small random readings, which
may cause the performance of HDD (hard disk) to degrade heavily, and may
finally exceed the benefit.

To avoid the issue, in this patch, if the HDD is used as swap, the VMA
based swap readahead will be disabled, and the original swap readahead
will be used instead.

Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: VMA based swap readahead</title>
<updated>2017-09-07T00:27:29Z</updated>
<author>
<name>Huang Ying</name>
<email>ying.huang@intel.com</email>
</author>
<published>2017-09-06T23:24:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ec560175c0b6fce86994bdf036754d48122c5c87'/>
<id>urn:sha1:ec560175c0b6fce86994bdf036754d48122c5c87</id>
<content type='text'>
The swap readahead is an important mechanism to reduce the swap in
latency.  Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.

In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory.  And the different tasks in the system may have different access
patterns, which makes the global space locality estimation incorrect.

In this patch, when page fault occurs, the virtual pages near the fault
address will be readahead instead of the swap slots near the fault swap
slot in swap device.  This avoid to readahead the unrelated swap slots.
At the same time, the swap readahead is changed to work on per-VMA from
globally.  So that the different access patterns of the different VMAs
could be distinguished, and the different readahead policy could be
applied accordingly.  The original core readahead detection and scaling
algorithm is reused, because it is an effect algorithm to detect the
space locality.

The test and result is as follow,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds.  The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.

At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds.  This will trigger random swap-in
in the background.

This is a combined workload with sequential and random memory accessing
at the same time.  The result (for sequential workload) is as follow,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower.  The VMA-base
readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The score of base and optimized kernel hasn't visible changes.  But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages.  And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.

Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: rename global_page_state to global_zone_page_state</title>
<updated>2017-09-07T00:27:29Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2017-09-06T23:23:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c41f012ade0b95b0a6e25c7150673e0554736165'/>
<id>urn:sha1:c41f012ade0b95b0a6e25c7150673e0554736165</id>
<content type='text'>
global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests.  We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:

$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
      2 NR_BOUNCE
      2 NR_FREE_CMA_PAGES
     11 NR_FREE_PAGES
      1 NR_KERNEL_STACK_KB
      1 NR_MLOCK
      2 NR_PAGETABLE

This patch shouldn't introduce any functional change.

[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Cc: Josef Bacik &lt;josef@toxicpanda.com&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
