<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/mm.h, branch v5.15.16</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.15.16</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.15.16'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2021-09-07T18:03:45Z</updated>
<entry>
<title>Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"</title>
<updated>2021-09-07T18:03:45Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-09-07T18:03:45Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=cd1adf1b63a112d762832e9c64b0a886fbb840d6'/>
<id>urn:sha1:cd1adf1b63a112d762832e9c64b0a886fbb840d6</id>
<content type='text'>
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.

That commit was completely broken, and I should have caught on to it
earlier.  But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

 "try_get_page() has fallen a little behind in terms of maintenance,
  try_get_compound_head() handles speculative page references more
  thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()".  It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux</title>
<updated>2021-09-04T18:35:47Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-09-04T18:35:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=49624efa65ac9889f4e7c7b2452b2e6ce42ba37d'/>
<id>urn:sha1:49624efa65ac9889f4e7c7b2452b2e6ce42ba37d</id>
<content type='text'>
Pull MAP_DENYWRITE removal from David Hildenbrand:
 "Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
  VM_DENYWRITE.

  There are some (minor) user-visible changes:

   - We no longer deny write access to shared libaries loaded via legacy
     uselib(); this behavior matches modern user space e.g. dlopen().

   - We no longer deny write access to the elf interpreter after exec
     completed, treating it just like shared libraries (which it often
     is).

   - We always deny write access to the file linked via /proc/pid/exe:
     sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
     file cannot be denied, and write access to the file will remain
     denied until the link is effectivel gone (exec, termination,
     sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.

  Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
  s390x, ...) and verified via ltp that especially the relevant tests
  (i.e., creat07 and execve04) continue working as expected"

* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
  fs: update documentation of get_write_access() and friends
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  mm: remove VM_DENYWRITE
  binfmt: remove in-tree usage of MAP_DENYWRITE
  kernel/fork: always deny write access to current MM exe_file
  kernel/fork: factor out replacing the current MM exe_file
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
</content>
</entry>
<entry>
<title>Merge branch 'akpm' (patches from Andrew)</title>
<updated>2021-09-03T17:08:28Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-09-03T17:08:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=14726903c835101cd8d0a703b609305094350d61'/>
<id>urn:sha1:14726903c835101cd8d0a703b609305094350d61</id>
<content type='text'>
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
</content>
</entry>
<entry>
<title>mm: hwpoison: don't drop slab caches for offlining non-LRU page</title>
<updated>2021-09-03T16:58:15Z</updated>
<author>
<name>Yang Shi</name>
<email>shy828301@gmail.com</email>
</author>
<published>2021-09-02T21:58:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d0505e9f7dcec85da6634ec66da2b17656ee177b'/>
<id>urn:sha1:d0505e9f7dcec85da6634ec66da2b17656ee177b</id>
<content type='text'>
In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline.  But
if the page is not slab page all the effort is wasted in vain.  Even
though it is a slab page, it is not guaranteed the page could be freed
at all.

However the side effect and cost is quite high.  It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches.  It could make the most
workingset gone in order to just offline a page.  And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.

Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.

Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:

    BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
      in-flight: 409271:memory_failure_work_func
      pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
    workqueue mm_percpu_wq: flags=0x8
     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      pending: vmstat_update
    workqueue cgroup_destroy: flags=0x0
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
        pending: css_release_work_fn

There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
    Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
    CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
    Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
    Workqueue: events memory_failure_work_func
    RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
    Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 &lt;8b&gt; 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
    RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
    RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
    RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
    R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
    R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
    FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
    Call Trace:
     __queue_work+0xd6/0x3c0
     queue_work_on+0x1c/0x30
     uncharge_batch+0x10e/0x110
     mem_cgroup_uncharge_list+0x6d/0x80
     release_pages+0x37f/0x3f0
     __pagevec_release+0x1c/0x50
     __invalidate_mapping_pages+0x348/0x380
     inode_lru_isolate+0x10a/0x160
     __list_lru_walk_one+0x7b/0x170
     list_lru_walk_one+0x4a/0x60
     prune_icache_sb+0x37/0x50
     super_cache_scan+0x123/0x1a0
     do_shrink_slab+0x10c/0x2c0
     shrink_slab+0x1f1/0x290
     drop_slab_node+0x4d/0x70
     soft_offline_page+0x1ac/0x5b0
     memory_failure_work_func+0x6a/0x90
     process_one_work+0x19e/0x340
     worker_thread+0x30/0x360
     kthread+0x116/0x130

The lockup made the machine is quite unusable.  And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.

But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:

    soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.

Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
Signed-off-by: Yang Shi &lt;shy828301@gmail.com&gt;
Reported-by: David Mackey &lt;tdmackey@twitter.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Naoya Horiguchi &lt;naoya.horiguchi@nec.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: delete unused get_kernel_page()</title>
<updated>2021-09-03T16:58:11Z</updated>
<author>
<name>John Hubbard</name>
<email>jhubbard@nvidia.com</email>
</author>
<published>2021-09-02T21:54:00Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3969b1a654fb09b7915efc1aa4ad45daf932e12f'/>
<id>urn:sha1:3969b1a654fb09b7915efc1aa4ad45daf932e12f</id>
<content type='text'>
get_kernel_page() was added in 2012 by [1].  It was used for a while for
NFS, but then in 2014, a refactoring [2] removed all callers, and it has
apparently not been used since.

Remove get_kernel_page() because it has no callers.

[1] commit 18022c5d8627 ("mm: add get_kernel_page[s] for pinning of
    kernel addresses for I/O")
[2] commit 91f79c43d1b5 ("new helper: iov_iter_get_pages_alloc()")

Link: https://lkml.kernel.org/r/20210729221847.1165665-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Paris &lt;eparis@redhat.com&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Cc: Xiaotian Feng &lt;dfeng@redhat.com&gt;
Cc: Mark Salter &lt;msalter@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/gup: remove try_get_page(), call try_get_compound_head() directly</title>
<updated>2021-09-03T16:58:11Z</updated>
<author>
<name>John Hubbard</name>
<email>jhubbard@nvidia.com</email>
</author>
<published>2021-09-02T21:53:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9857a17f206ff374aea78bccfb687f145368be2e'/>
<id>urn:sha1:9857a17f206ff374aea78bccfb687f145368be2e</id>
<content type='text'>
try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.

There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.

Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.

Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Christian Borntraeger &lt;borntraeger@de.ibm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/gup: small refactoring: simplify try_grab_page()</title>
<updated>2021-09-03T16:58:11Z</updated>
<author>
<name>John Hubbard</name>
<email>jhubbard@nvidia.com</email>
</author>
<published>2021-09-02T21:53:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=54d516b1d62ff8f17cee2da06e5e4706a0d00b8a'/>
<id>urn:sha1:54d516b1d62ff8f17cee2da06e5e4706a0d00b8a</id>
<content type='text'>
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1,
...), just with a different API.  So there is a lot of code duplication
there.

Change try_grab_page() to call try_grab_compound_head(), while keeping the
API contract identical for callers.

Also, now that try_grab_compound_head() always has a caller, remove the
__maybe_unused annotation.

Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Christian Borntraeger &lt;borntraeger@de.ibm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove VM_DENYWRITE</title>
<updated>2021-09-03T16:42:01Z</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2021-04-22T10:08:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8d0920bde5eb8ec7e567939b85e65a0596c8580d'/>
<id>urn:sha1:8d0920bde5eb8ec7e567939b85e65a0596c8580d</id>
<content type='text'>
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.

Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
</content>
</entry>
<entry>
<title>kernel/fork: always deny write access to current MM exe_file</title>
<updated>2021-09-03T16:42:01Z</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2021-04-23T08:29:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fe69d560b5bd9ec77b5d5749bd7027344daef47e'/>
<id>urn:sha1:fe69d560b5bd9ec77b5d5749bd7027344daef47e</id>
<content type='text'>
We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().

Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.

Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.

This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.

Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
</content>
</entry>
<entry>
<title>kernel/fork: factor out replacing the current MM exe_file</title>
<updated>2021-09-03T16:42:01Z</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2021-04-23T08:20:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=35d7bdc86031a2c1ae05ac27dfa93b2acdcbaecc'/>
<id>urn:sha1:35d7bdc86031a2c1ae05ac27dfa93b2acdcbaecc</id>
<content type='text'>
Let's factor the main logic out into replace_mm_exe_file(), such that
all mm-&gt;exe_file logic is contained in kernel/fork.c.

While at it, perform some simple cleanups that are possible now that
we're simplifying the individual functions.

Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
</content>
</entry>
</feed>
