<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/swap.h, branch v5.3.2</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.3.2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.3.2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-07-12T18:05:43Z</updated>
<entry>
<title>mm, swap: use rbtree for swap_extent</title>
<updated>2019-07-12T18:05:43Z</updated>
<author>
<name>Aaron Lu</name>
<email>ziqian.lzq@antfin.com</email>
</author>
<published>2019-07-12T03:55:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4efaceb1c5f8136d5fec3f26549d294b8e898bd7'/>
<id>urn:sha1:4efaceb1c5f8136d5fec3f26549d294b8e898bd7</id>
<content type='text'>
swap_extent is used to map swap page offset to backing device's block
offset.  For a continuous block range, one swap_extent is used and all
these swap_extents are managed in a linked list.

These swap_extents are used by map_swap_entry() during swap's read and
write path.  To find out the backing device's block offset for a page
offset, the swap_extent list will be traversed linearly, with
curr_swap_extent being used as a cache to speed up the search.

This works well as long as swap_extents are not huge or when the number
of processes that access swap device are few, but when the swap device
has many extents and there are a number of processes accessing the swap
device concurrently, it can be a problem.  On one of our servers, the
disk's remaining size is tight:

  $df -h
  Filesystem      Size  Used Avail Use% Mounted on
  ... ...
  /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4

When creating a 80G swapfile there, there are as many as 84656 swap
extents.  The end result is, kernel spends abou 30% time in
map_swap_entry() and swap throughput is only 70MB/s.

As a comparison, when I used smaller sized swapfile, like 4G whose
swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
map_swap_entry() is about 3%.

One downside of using rbtree for swap_extent is, 'struct rbtree' takes
24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
for each swap_extent.  For a swapfile that has 80k swap_extents, that
means 625KiB more memory consumed.

Test:

Since it's not possible to reboot that server, I can not test this patch
diretly there.  Instead, I tested it on another server with NVMe disk.

I created a 20G swapfile on an NVMe backed XFS fs.  By default, the
filesystem is quite clean and the created swapfile has only 2 extents.
Testing vanilla and this patch shows no obvious performance difference
when swapfile is not fragmented.

To see the patch's effects, I used some tweaks to manually fragment the
swapfile by breaking the extent at 1M boundary.  This made the swapfile
have 20K extents.

  nr_task=4
  kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
  vanilla  165191           90.77%             171798          90.21%
  patched  858993 +420%      2.16%             715827 +317%     0.77%

  nr_task=8
  kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
  vanilla  306783           92.19%             318145          87.76%
  patched  954437 +211%      2.35%            1073741 +237%     1.57%

swapout: the throughput of swap out, in KB/s, higher is better 1st
map_swap_entry: cpu cycles percent sampled by perf swapin: the
throughput of swap in, in KB/s, higher is better.  2nd map_swap_entry:
cpu cycles percent sampled by perf

nr_task=1 doesn't show any difference, this is due to the curr_swap_extent
can be effectively used to cache the correct swap extent for single task
workload.

[akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
Signed-off-by: Aaron Lu &lt;ziqian.lzq@antfin.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: fix race between swapoff and some swap operations</title>
<updated>2019-07-12T18:05:43Z</updated>
<author>
<name>Huang Ying</name>
<email>ying.huang@intel.com</email>
</author>
<published>2019-07-12T03:55:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=eb085574a7526c4375965c5fbf7e5b0c19cdd336'/>
<id>urn:sha1:eb085574a7526c4375965c5fbf7e5b0c19cdd336</id>
<content type='text'>
When swapin is performed, after getting the swap entry information from
the page table, system will swap in the swap entry, without any lock held
to prevent the swap device from being swapoff.  This may cause the race
like below,

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p-&gt;swap_map = NULL		        __swap_duplicate
					  p-&gt;swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done when system shutdown only, the race may
not hit many people in practice.  But it is still a race need to be fixed.

To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device.  If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.

Because swapoff() is very rare code path, to make the normal path runs as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
reference count is used to implement get/put_swap_device().  &gt;From
get_swap_device() to put_swap_device(), RCU reader side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.

In addition to swap_map, cluster_info, etc.  data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.

Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.

Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapoff when its data
structure is being accessed.  The overhead in hot-path of both methods is
similar.  The advantages of RCU based method are,

1. stop_machine() may disturb the normal execution code path on other
   CPUs.

2. File cache uses RCU to protect its radix tree.  If the similar
   mechanism is used for swap cache too, it is easier to share code
   between them.

3. RCU is used to protect swap cache in total_swapcache_pages() and
   exit_swap_address_space() already.  The two mechanisms can be
   merged to simplify the logic.

Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Reviewed-by: Andrea Parri &lt;andrea.parri@amarulasolutions.com&gt;
Not-nacked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Jérôme Glisse &lt;jglisse@redhat.com&gt;
Cc: Yang Shi &lt;yang.shi@linux.alibaba.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Dave Jiang &lt;dave.jiang@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>include/linux/swap.h: use offsetof() instead of custom __swapoffset macro</title>
<updated>2019-03-14T21:36:20Z</updated>
<author>
<name>Pi-Hsun Shih</name>
<email>pihsun@chromium.org</email>
</author>
<published>2019-03-13T18:44:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a4046c06be50a4f01d435aa7fe57514818e6cc82'/>
<id>urn:sha1:a4046c06be50a4f01d435aa7fe57514818e6cc82</id>
<content type='text'>
Use offsetof() to calculate offset of a field to take advantage of
compiler built-in version when possible, and avoid UBSAN warning when
compiling with Clang:

  UBSAN: Undefined behaviour in mm/swapfile.c:3010:38
  member access within null pointer of type 'union swap_header'
  CPU: 6 PID: 1833 Comm: swapon Tainted: G S                4.19.23 #43
  Call trace:
   dump_backtrace+0x0/0x194
   show_stack+0x20/0x2c
   __dump_stack+0x20/0x28
   dump_stack+0x70/0x94
   ubsan_epilogue+0x14/0x44
   ubsan_type_mismatch_common+0xf4/0xfc
   __ubsan_handle_type_mismatch_v1+0x34/0x54
   __se_sys_swapon+0x654/0x1084
   __arm64_sys_swapon+0x1c/0x24
   el0_svc_common+0xa8/0x150
   el0_svc_compat_handler+0x2c/0x38
   el0_svc_compat+0x8/0x18

Link: http://lkml.kernel.org/r/20190312081902.223764-1-pihsun@chromium.org
Signed-off-by: Pi-Hsun Shih &lt;pihsun@chromium.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/workingset: remove unused @mapping argument in workingset_eviction()</title>
<updated>2019-03-06T05:07:21Z</updated>
<author>
<name>Andrey Ryabinin</name>
<email>aryabinin@virtuozzo.com</email>
</author>
<published>2019-03-05T23:49:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a7ca12f9d905e7437dd3beb9cbb8e85bc2b991f4'/>
<id>urn:sha1:a7ca12f9d905e7437dd3beb9cbb8e85bc2b991f4</id>
<content type='text'>
workingset_eviction() doesn't use and never did use the @mapping
argument.  Remove it.

Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Rik van Riel &lt;riel@surriel.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: William Kucharski &lt;william.kucharski@oracle.com&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swap: use mem_cgroup_is_root() instead of deferencing css-&gt;parent</title>
<updated>2019-03-06T05:07:19Z</updated>
<author>
<name>Yang Shi</name>
<email>yang.shi@linux.alibaba.com</email>
</author>
<published>2019-03-05T23:48:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=59118c42a60b997d277ad04d2309a6ec30682e5e'/>
<id>urn:sha1:59118c42a60b997d277ad04d2309a6ec30682e5e</id>
<content type='text'>
mem_cgroup_is_root() is the preferred API to check if memcg is root or
not.  Use it instead of deferencing css-&gt;parent.

Link: http://lkml.kernel.org/r/1547232913-118148-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi &lt;yang.shi@linux.alibaba.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/swap: use nr_node_ids for avail_lists in swap_info_struct</title>
<updated>2018-12-28T20:11:47Z</updated>
<author>
<name>Aaron Lu</name>
<email>aaron.lu@intel.com</email>
</author>
<published>2018-12-28T08:34:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=66f71da9dd38af17dc17209cdde7987d4679a699'/>
<id>urn:sha1:66f71da9dd38af17dc17209cdde7987d4679a699</id>
<content type='text'>
Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
avail_lists field of swap_info_struct is changed to an array with
MAX_NUMNODES elements.  This made swap_info_struct size increased to 40KiB
and needs an order-4 page to hold it.

This is not optimal in that:
1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
  is a waste of memory;
2 It could cause swapon failure if the swap device is swapped on
  after system has been running for a while, due to no order-4
  page is available as pointed out by Vasily Averin.

Solve the above two issues by using nr_node_ids(which is the actual
possible node number the running system has) for avail_lists instead of
MAX_NUMNODES.

nr_node_ids is unknown at compile time so can't be directly used when
declaring this array.  What I did here is to declare avail_lists as zero
element array and allocate space for it when allocating space for
swap_info_struct.  The reason why keep using array but not pointer is
plist_for_each_entry needs the field to be part of the struct, so pointer
will not work.

This patch is on top of Vasily Averin's fix commit.  I think the use of
kvzalloc for swap_info_struct is still needed in case nr_node_ids is
really big on some systems.

Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
Signed-off-by: Aaron Lu &lt;aaron.lu@intel.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vasily Averin &lt;vvs@virtuozzo.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n</title>
<updated>2018-12-28T20:11:47Z</updated>
<author>
<name>Wei Yang</name>
<email>richard.weiyang@gmail.com</email>
</author>
<published>2018-12-28T08:34:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7'/>
<id>urn:sha1:8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7</id>
<content type='text'>
Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that
fail zone_reclaim() as full") changed the return value of
node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
after this commit.

While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed.  This will leads to call zone_watermark_ok() again.

This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.

Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reviewed-by: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: convert totalram_pages and totalhigh_pages variables to atomic</title>
<updated>2018-12-28T20:11:47Z</updated>
<author>
<name>Arun KS</name>
<email>arunks@codeaurora.org</email>
</author>
<published>2018-12-28T08:34:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ca79b0c211af63fa3276f0e3fd7dd9ada2439839'/>
<id>urn:sha1:ca79b0c211af63fa3276f0e3fd7dd9ada2439839</id>
<content type='text'>
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS &lt;arunks@codeaurora.org&gt;
Suggested-by: Michal Hocko &lt;mhocko@suse.com&gt;
Suggested-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reviewed-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Reviewed-by: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge drm/drm-next into drm-intel-next-queued</title>
<updated>2018-11-20T11:14:08Z</updated>
<author>
<name>Jani Nikula</name>
<email>jani.nikula@intel.com</email>
</author>
<published>2018-11-20T11:14:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2ac5e38ea4203852d6e99edd3cf11f044b0a409f'/>
<id>urn:sha1:2ac5e38ea4203852d6e99edd3cf11f044b0a409f</id>
<content type='text'>
Pull in v4.20-rc3 via drm-next.

Signed-off-by: Jani Nikula &lt;jani.nikula@intel.com&gt;
</content>
</entry>
<entry>
<title>mm, drm/i915: mark pinned shmemfs pages as unevictable</title>
<updated>2018-11-07T15:28:32Z</updated>
<author>
<name>Kuo-Hsin Yang</name>
<email>vovoy@chromium.org</email>
</author>
<published>2018-11-06T13:23:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=64e3d12f769d60eaee6d2e53a9b7f0b3814f32ed'/>
<id>urn:sha1:64e3d12f769d60eaee6d2e53a9b7f0b3814f32ed</id>
<content type='text'>
The i915 driver uses shmemfs to allocate backing storage for gem
objects. These shmemfs pages can be pinned (increased ref count) by
shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan
wastes a lot of time scanning these pinned pages. In some extreme case,
all pages in the inactive anon lru are pinned, and only the inactive
anon lru is scanned due to inactive_ratio, the system cannot swap and
invokes the oom-killer. Mark these pinned pages as unevictable to speed
up vmscan.

Export pagevec API check_move_unevictable_pages().

This patch was inspired by Chris Wilson's change [1].

[1]: https://patchwork.kernel.org/patch/9768741/

Cc: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Joonas Lahtinen &lt;joonas.lahtinen@linux.intel.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Signed-off-by: Kuo-Hsin Yang &lt;vovoy@chromium.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt; # mm part
Reviewed-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Acked-by: Dave Hansen &lt;dave.hansen@intel.com&gt;
Acked-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
</content>
</entry>
</feed>
