<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm, branch v5.4.29</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.29</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.29'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-04-01T09:02:03Z</updated>
<entry>
<title>mm: fork: fix kernel_stack memcg stats for various stack implementations</title>
<updated>2020-04-01T09:02:03Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2020-03-29T02:17:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=159aef18f05cad378e7361a5c4ce544e0a0096d9'/>
<id>urn:sha1:159aef18f05cad378e7361a5c4ce544e0a0096d9</id>
<content type='text'>
commit 8380ce479010f2f779587b462a9b4681934297c3 upstream.

Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().

In the first and second cases the page-&gt;mem_cgroup pointer is set, but
in the third it is not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page-&gt;slab_cache-&gt;memcg_params.memcg.  In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: the
page-&gt;mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.

This can lead to the kernel_stack per-memcg counters permanently showing
0 on some architectures (depending on the configuration).

In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as its first argument, uses
mem_cgroup_from_obj() to get an RCU-protected memcg pointer, and calls
mod_memcg_state().  It allows handling all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c.
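
A minimal sketch of the helper described above (hedged; the upstream
body may differ in detail):

  void mod_memcg_obj_state(void *p, int idx, int val)
  {
	struct mem_cgroup *memcg;

	rcu_read_lock();
	/* mem_cgroup_from_obj() handles slab and non-slab pages alike */
	memcg = mem_cgroup_from_obj(p);
	if (memcg)
		mod_memcg_state(memcg, idx, val);
	rcu_read_unlock();
  }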

Note: This is a special version of the patch created for stable
backports.  It contains code from the following two patches:
  - mm: memcg/slab: introduce mem_cgroup_from_obj()
  - mm: fork: fix kernel_stack memcg stats for various stack implementations

[guro@fb.com: introduce mem_cgroup_from_obj()]
  Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page-&gt;mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Bharata B Rao &lt;bharata@linux.ibm.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/sparse: fix kernel crash with pfn_section_valid check</title>
<updated>2020-04-01T09:02:03Z</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.ibm.com</email>
</author>
<published>2020-03-29T02:17:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=cc5da743a4566a5e3cb49db2346362fcd3db7d00'/>
<id>urn:sha1:cc5da743a4566a5e3cb49db2346362fcd3db7d00</id>
<content type='text'>
commit b943f045a9af9fd02f923e43fe8d7517e9961701 upstream.

Fix the following crash:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
       section_deactivate+0x220/0x240
       __remove_pages+0x118/0x170
       arch_remove_memory+0x3c/0x150
       memunmap_pages+0x1cc/0x2f0
       devm_action_release+0x30/0x50
       release_nodes+0x2f8/0x3e0
       device_release_driver_internal+0x168/0x270
       unbind_store+0x130/0x170
       drv_attr_store+0x44/0x60
       sysfs_kf_write+0x68/0x80
       kernfs_fop_write+0x100/0x290
       __vfs_write+0x3c/0x70
       vfs_write+0xcc/0x240
       ksys_write+0x7c/0x140
       system_call+0x5c/0x68

The crash is due to a NULL pointer dereference at

	test_bit(idx, ms-&gt;usage-&gt;subsection_map);

because ms-&gt;usage is NULL in pfn_section_valid().

With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
depopulate_section_memmap().  This was done so that pfn_to_page() can
work correctly with kernel configs that disable SPARSEMEM_VMEMMAP.  With
such a config pfn_to_page() does

	__section_mem_map_addr(__sec) + __pfn;

where

  static inline struct page *__section_mem_map_addr(struct mem_section *section)
  {
	unsigned long map = section-&gt;section_mem_map;
	map &amp;= SECTION_MAP_MASK;
	return (struct page *)map;
  }

Now with SPARSEMEM_VMEMMAP enabled, mem_section-&gt;usage-&gt;subsection_map
is used to check pfn validity (pfn_valid()).  Since section_deactivate()
releases mem_section-&gt;usage if a section is fully deactivated, a
pfn_valid() check after a subsection_deactivate causes a kernel crash.

  static inline int pfn_valid(unsigned long pfn)
  {
  ...
	return early_section(ms) || pfn_section_valid(ms, pfn);
  }

where

  static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
  {
	int idx = subsection_map_index(pfn);

	return test_bit(idx, ms-&gt;usage-&gt;subsection_map);
  }

Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section-&gt;usage is
freed.  For architectures like ppc64 where large pages are used for
the vmemmap mapping (16MB), a single vmemmap mapping can cover multiple
sections.  Hence before a vmemmap mapping page can be freed, the kernel
needs to make sure there are no valid sections within that mapping.
Clearing the section valid bit before depopulate_section_memmap()
enables this.
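
A hedged sketch of the relevant part of section_deactivate() after this
fix (handling of early/bootmem-allocated usage maps elided):

  if (empty) {
	kfree(ms-&gt;usage);
	ms-&gt;usage = NULL;
	/*
	 * Mark the section invalid so that valid_section(), and with
	 * it pfn_valid(), returns false before the backing memmap is
	 * torn down.
	 */
	ms-&gt;section_mem_map &amp;= ~SECTION_HAS_MEM_MAP;
  }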

[aneesh.kumar@linux.ibm.com: add comment]
  Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Reported-by: Sachin Sant &lt;sachinp@linux.vnet.ibm.com&gt;
Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Tested-by: Sachin Sant &lt;sachinp@linux.vnet.ibm.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Pankaj Gupta &lt;pankaj.gupta.linux@gmail.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/swapfile.c: move inode_lock out of claim_swapfile</title>
<updated>2020-04-01T09:02:02Z</updated>
<author>
<name>Naohiro Aota</name>
<email>naohiro.aota@wdc.com</email>
</author>
<published>2020-03-29T02:17:15Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=da458bbfb6cfcbacc6e3f4984bcaf45af98db24c'/>
<id>urn:sha1:da458bbfb6cfcbacc6e3f4984bcaf45af98db24c</id>
<content type='text'>
commit d795a90e2ba024dbf2f22107ae89c210b98b08b8 upstream.

claim_swapfile() currently keeps the inode locked when it is successful
or when the file is already a swapfile (returning -EBUSY).  On all other
error paths it leaves the inode unlocked.

This inconsistency between the lock state and the return value is quite
confusing and actually causes the bad unlock balance shown below in the
"bad_swap" section of __do_sys_swapon().

This commit fixes the issue by moving the inode_lock() and the
IS_SWAPFILE check out of claim_swapfile().  The inode is unlocked in the
"bad_swap_unlock_inode" section, which guarantees that the inode is
unlocked at "bad_swap".  Error handling code that runs after the locking
therefore jumps to "bad_swap_unlock_inode" instead of "bad_swap".

    =====================================
    WARNING: bad unlock balance detected!
    5.5.0-rc7+ #176 Not tainted
    -------------------------------------
    swapon/4294 is trying to release lock (&amp;sb-&gt;s_type-&gt;i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
    but there are no more locks to release!

    other info that might help us debug this:
    no locks held by swapon/4294.

    stack backtrace:
    CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
    Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
    Call Trace:
     dump_stack+0xa1/0xea
     print_unlock_imbalance_bug.cold+0x114/0x123
     lock_release+0x562/0xed0
     up_write+0x2d/0x490
     __do_sys_swapon+0x94b/0x3550
     __x64_sys_swapon+0x54/0x80
     do_syscall_64+0xa4/0x4b0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f15da0a0dc7
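
A hedged sketch of the resulting structure of __do_sys_swapon() after
the fix (surrounding code elided; labels per the message above):

  error = claim_swapfile(p, inode);	/* no longer takes the inode lock */
  if (unlikely(error))
	goto bad_swap;

  inode_lock(inode);
  if (IS_SWAPFILE(inode)) {
	error = -EBUSY;
	goto bad_swap_unlock_inode;
  }
  /* ... any later failure jumps to bad_swap_unlock_inode ... */

  bad_swap_unlock_inode:
	inode_unlock(inode);
  bad_swap:
	/* ... common cleanup; the inode is never locked here ... */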

Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices")
Signed-off-by: Naohiro Aota &lt;naohiro.aota@wdc.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Tested-by: Qais Yousef &lt;qais.yousef@arm.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>x86/mm: split vmalloc_sync_all()</title>
<updated>2020-03-25T07:25:58Z</updated>
<author>
<name>Joerg Roedel</name>
<email>jroedel@suse.de</email>
</author>
<published>2020-03-22T01:22:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=66f28e110565595800d6253938befef3fa88625b'/>
<id>urn:sha1:66f28e110565595800d6253938befef3fa88625b</id>
<content type='text'>
commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path.  While this change was necessary to maintain
correctness on x86-32-pae kernels, it also costs extra cycles on
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:

	* vmalloc_sync_mappings(), and
	* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized.  The only exception is the new call-site added in the
above mentioned commit.
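
A hedged sketch of how the two halves are meant to be used:

  /* vunmap path (__purge_vmap_area_lazy()): only unmappings need to
   * be propagated before the backing pages are freed. */
  vmalloc_sync_unmappings();

  /* every other former vmalloc_sync_all() call-site only needs newly
   * created mappings to be synchronized: */
  vmalloc_sync_mappings();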

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Reported-by: Shile Zhang &lt;shile.zhang@linux.alibaba.com&gt;
Signed-off-by: Joerg Roedel &lt;jroedel@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Tested-by: Borislav Petkov &lt;bp@suse.de&gt;
Acked-by: Rafael J. Wysocki &lt;rafael.j.wysocki@intel.com&gt;	[GHES]
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm, slub: prevent kmalloc_node crashes and memory leaks</title>
<updated>2020-03-25T07:25:58Z</updated>
<author>
<name>Vlastimil Babka</name>
<email>vbabka@suse.cz</email>
</author>
<published>2020-03-22T01:22:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=32991c960d0bb72ec144b6dfe43ed498911c8a8d'/>
<id>urn:sha1:32991c960d0bb72ec144b6dfe43ed498911c8a8d</id>
<content type='text'>
commit 0715e6c516f106ed553828a671d30ad9a3431536 upstream.

Sachin reports [1] a crash in SLUB __slab_alloc():

  BUG: Kernel NULL pointer dereference on read at 0x000073b0
  Faulting instruction address: 0xc0000000003d55f4
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
  NIP:  c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
  REGS: c0000008b37836d0 TRAP: 0300   Not tainted  (5.6.0-rc2-next-20200218-autotest)
  MSR:  8000000000009033 &lt;SF,EE,ME,IR,DR,RI,LE&gt;  CR: 24004844  XER: 00000000
  CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
  GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
  GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
  GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
  GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
  GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
  GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
  GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
  GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
  NIP ___slab_alloc+0x1f4/0x760
  LR __slab_alloc+0x34/0x60
  Call Trace:
    ___slab_alloc+0x334/0x760 (unreliable)
    __slab_alloc+0x34/0x60
    __kmalloc_node+0x110/0x490
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x108/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2ec/0x4d0
    cgroup_mkdir+0x228/0x5f0
    kernfs_iop_mkdir+0x90/0xf0
    vfs_mkdir+0x110/0x230
    do_mkdirat+0xb0/0x1a0
    system_call+0x5c/0x68

This is a PowerPC platform with the following NUMA topology:

  available: 2 nodes (0-1)
  node 0 cpus:
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  node 1 size: 35247 MB
  node 1 free: 30907 MB
  node distances:
  node   0   1
    0:  10  40
    1:  40  10

  possible numa nodes: 0-31

This only happens with an mmotm patch "mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node" [2] which effectively calls
kmalloc_node() for each possible node.  SLUB however only allocates
kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
node_to_mem_node() to return such a valid node for other nodes since
commit a561ce00b09e ("slub: fall back to node_to_mem_node() node if
allocating on memoryless node").  This is however not true in this
configuration, where the _node_numa_mem_ array is not initialized for
nodes 0 and 2-31: it thus contains zeroes, and get_partial() ends up
accessing a non-allocated kmem_cache_node.

A related issue was reported by Bharata (originally by Ramachandran) [3]
where a similar PowerPC configuration, but with a mainline kernel
without patch [2], ends up allocating large numbers of pages in the
kmalloc-1k and kmalloc-512 caches.  This seems to have the same
underlying issue with node_to_mem_node() not behaving as expected, and
might also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].

This patch should fix both issues by not relying on node_to_mem_node()
anymore and instead simply falling back to NUMA_NO_NODE when
kmalloc_node(node) is attempted for a node that's not online or has no
usable memory.  The "usable memory" condition is also changed from
node_present_pages() to the N_NORMAL_MEMORY node state, as that is
exactly the condition SLUB uses when allocating kmem_cache_node
structures.  The check in get_partial() is removed completely, as the
checks in ___slab_alloc() are now sufficient to prevent get_partial()
being reached with an invalid node.
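
A hedged sketch of the new fallback in ___slab_alloc() (placement
relative to the surrounding code simplified):

  if (node != NUMA_NO_NODE &amp;&amp; !node_state(node, N_NORMAL_MEMORY)) {
	/*
	 * The requested node is offline or has no usable memory:
	 * fall back instead of touching a kmem_cache_node that was
	 * never allocated.
	 */
	node = NUMA_NO_NODE;
	goto redo;
  }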

[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/

Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Reported-by: Sachin Sant &lt;sachinp@linux.vnet.ibm.com&gt;
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN &lt;puvichakravarthy@in.ibm.com&gt;
Signed-off-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Tested-by: Sachin Sant &lt;sachinp@linux.vnet.ibm.com&gt;
Tested-by: Bharata B Rao &lt;bharata@linux.ibm.com&gt;
Reviewed-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Christopher Lameter &lt;cl@linux.com&gt;
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Nathan Lynch &lt;nathanl@linux.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
Debugged-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: slub: be more careful about the double cmpxchg of freelist</title>
<updated>2020-03-25T07:25:57Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-03-17T18:04:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=623515739282e419e3f38d82728b8f174fbfb79d'/>
<id>urn:sha1:623515739282e419e3f38d82728b8f174fbfb79d</id>
<content type='text'>
commit 5076190daded2197f62fe92cf69674488be44175 upstream.

This is just a cleanup addition to Jann's fix to properly update the
transaction ID for the slub slowpath in commit fd4d9c7d0c71 ("mm: slub:
add missing TID bump..").

The transaction ID is what protects us against any concurrent accesses,
but we should really also make sure that the 'freelist' comparison
itself always uses the same freelist value that the new next-free
pointer was derived from.

Jann points out that if we do all of this carefully, we could skip the
transaction ID update for all the paths that only remove entries from
the lists, and only update the TID when adding entries (to avoid the ABA
issue with cmpxchg and list handling re-adding a previously seen value).

But this patch just does the "make sure to cmpxchg the same value we
used" part, rather than trying to be clever.

Acked-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: do not allow MADV_PAGEOUT for CoW pages</title>
<updated>2020-03-25T07:25:57Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2020-03-22T01:22:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=69f434a05fb4ed71684f1bd06391bf1e70a65288'/>
<id>urn:sha1:69f434a05fb4ed71684f1bd06391bf1e70a65288</id>
<content type='text'>
commit 12e967fd8e4e6c3d275b4c69c890adc838891300 upstream.

Jann has brought up a very interesting point [1].  While shared pages
are normally excluded from MADV_PAGEOUT, CoW pages can easily be
reclaimed that way.  This can lead to all sorts of hard to debug
problems, e.g. the performance problems outlined by Daniel [2].

There are runtime environments where a substantial amount of memory is
shared among security domains via CoW, and an easy way to reclaim that
memory, which MADV_{COLD,PAGEOUT} offers, can lead either to performance
degradation for the parent process, which might be more privileged, or
even open up side channel attacks.

The feasibility of the latter is not really clear to me TBH, but there
is no real reason for the exposure at this stage.  There also seems to
be no real use case that depends on reclaiming CoW memory via madvise,
so it is much easier to simply disallow it, and that is what this patch
does.  Put simply, MADV_{PAGEOUT,COLD} can operate only on exclusively
owned memory, which is a straightforward semantic.
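
A hedged sketch of the check added in the MADV_{COLD,PAGEOUT} page-table
walk:

  /*
   * Do not touch pages mapped by more than one process: CoW sharing
   * is counted in the mapcount, so this restricts the operation to
   * exclusively owned pages.
   */
  if (page_mapcount(page) != 1)
	continue;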

[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Daniel Colascione &lt;dancol@google.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: "Joel Fernandes (Google)" &lt;joel@joelfernandes.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case</title>
<updated>2020-03-25T07:25:57Z</updated>
<author>
<name>Baoquan He</name>
<email>bhe@redhat.com</email>
</author>
<published>2020-03-22T01:22:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c3f54f0a68bfc7c285890b1b2e0d78abe29d3f3a'/>
<id>urn:sha1:c3f54f0a68bfc7c285890b1b2e0d78abe29d3f3a</id>
<content type='text'>
commit d41e2f3bd54699f85b3d6f45abd09fa24a222cb9 upstream.

In section_deactivate(), pfn_to_page() doesn't work any more after
ms-&gt;section_mem_map has been reset to NULL in the SPARSEMEM|!VMEMMAP
case.  This causes a hot remove failure:

  kernel BUG at mm/page_alloc.c:4806!
  invalid opcode: 0000 [#1] SMP PTI
  CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G        W         5.5.0-next-20200205+ #340
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
  Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  RIP: 0010:free_pages+0x85/0xa0
  Call Trace:
   __remove_pages+0x99/0xc0
   arch_remove_memory+0x23/0x4d
   try_remove_memory+0xc8/0x130
   __remove_memory+0xa/0x11
   acpi_memory_device_remove+0x72/0x100
   acpi_bus_trim+0x55/0x90
   acpi_device_hotplug+0x2eb/0x3d0
   acpi_hotplug_work_fn+0x1a/0x30
   process_one_work+0x1a7/0x370
   worker_thread+0x30/0x380
   kthread+0x112/0x130
   ret_from_fork+0x35/0x40

Let's fix it by moving the -&gt;section_mem_map reset to after
depopulate_section_memmap().
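
A hedged sketch of the resulting ordering in section_deactivate():

  if (section_is_early &amp;&amp; memmap)
	free_map_bootmem(memmap);
  else
	depopulate_section_memmap(pfn, nr_pages, altmap);

  /* Reset only after the memmap is gone, so that pfn_to_page()
   * keeps working while it is torn down. */
  if (empty)
	ms-&gt;section_mem_map = (unsigned long)NULL;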

[akpm@linux-foundation.org: remove unneeded initialization, per David]
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Baoquan He &lt;bhe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Pankaj Gupta &lt;pankaj.gupta.linux@gmail.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Wei Yang &lt;richardw.yang@linux.intel.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm, memcg: throttle allocators based on ancestral memory.high</title>
<updated>2020-03-25T07:25:57Z</updated>
<author>
<name>Chris Down</name>
<email>chris@chrisdown.name</email>
</author>
<published>2020-03-22T01:22:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=61cfbcce9e09b72992a698397b1d4be8fd2cba9f'/>
<id>urn:sha1:61cfbcce9e09b72992a698397b1d4be8fd2cba9f</id>
<content type='text'>
commit e26733e0d0ec6798eca93daa300bc3f43616127f upstream.

Prior to this commit, we only directly check the affected cgroup's
memory.high against its usage.  However, it's possible that we are being
reclaimed as a result of hitting an ancestor's memory.high and should be
penalised based on that instead.

This patch changes memory.high overage throttling to use the largest
overage in its ancestors when considering how many penalty jiffies to
charge.  This makes sure that we penalise poorly behaving cgroups in the
same way regardless of at what level of the hierarchy memory.high was
breached.
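
A hedged sketch of the hierarchy walk (names per mainline
mm/memcontrol.c; the exact factoring in the upstream commit differs):

  u64 overage, max_overage = 0;

  do {
	unsigned long usage = page_counter_read(&amp;memcg-&gt;memory);
	unsigned long high = max(READ_ONCE(memcg-&gt;high), 1UL);

	if (usage &lt;= high)
		continue;	/* no overage at this level */
	overage = div64_u64((u64)(usage - high)
			    &lt;&lt; MEMCG_DELAY_PRECISION_SHIFT, high);
	max_overage = max(max_overage, overage);
  } while ((memcg = parent_mem_cgroup(memcg)) &amp;&amp;
	   !mem_cgroup_is_root(memcg));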

Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Chris Down &lt;chris@chrisdown.name&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Nathan Chancellor &lt;natechancellor@gmail.com&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.4.x+]
Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm, memcg: fix corruption on 64-bit divisor in memory.high throttling</title>
<updated>2020-03-25T07:25:57Z</updated>
<author>
<name>Chris Down</name>
<email>chris@chrisdown.name</email>
</author>
<published>2020-03-22T01:22:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=77c4bc4bf612dbb293b94655599fe494174aedea'/>
<id>urn:sha1:77c4bc4bf612dbb293b94655599fe494174aedea</id>
<content type='text'>
commit d397a45fc741c80c32a14e2de008441e9976f50c upstream.

Commit 0e4b01df8659 had a bunch of fixups to use the right division
method.  However, it seems that after all that it still wasn't right:
div_u64() takes a 32-bit divisor.

The headroom is still large (2^32 pages), so on mundane systems you
won't hit this, but it should definitely be fixed.
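
A hedged sketch of the corrected computation in
mem_cgroup_handle_over_high() (variable names per commit 0e4b01df8659):

  overage = usage - high;
  overage &lt;&lt;= MEMCG_DELAY_PRECISION_SHIFT;
  /*
   * clamped_high is an unsigned long; div_u64() would silently
   * truncate it to 32 bits, so use the full 64-bit division:
   */
  overage = div64_u64(overage, clamped_high);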

Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Chris Down &lt;chris@chrisdown.name&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Nathan Chancellor &lt;natechancellor@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.4.x+]
Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
