<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm/backing-dev.c, branch v4.9.280</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.280</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.280'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2021-02-23T12:59:15Z</updated>
<entry>
<title>memcg: fix a crash in wb_workfn when a device disappears</title>
<updated>2021-02-23T12:59:15Z</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2020-01-31T06:11:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aff82146369808334fc93948ed41ab0d05ebb9a7'/>
<id>urn:sha1:aff82146369808334fc93948ed41ab0d05ebb9a7</id>
<content type='text'>
[ Upstream commit 68f23b89067fdf187763e75a56087550624fdbee ]

Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures.  In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.

With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi.  There is a refcount which prevents
the bdi object from being released (and hence, unregistered).  So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).

Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly.  It does
this without informing the file system, or the memcg code, or anything
else.  This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown.  So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb-&gt;bdi-&gt;dev to fetch the device name, but
unfortunately bdi-&gt;dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk().  As a result, *boom*.

Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi-&gt;dev and bdi-&gt;owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi-&gt;dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.

The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on.  It was triggering
several times a day in a heavily loaded production environment.

Google Bug Id: 145475544

Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o &lt;tytso@mit.edu&gt;
Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>writeback: synchronize sync(2) against cgroup writeback membership switches</title>
<updated>2019-05-21T16:49:01Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2017-12-12T16:38:30Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1cfaba5b49a5e9133b28da6d1a459240c1f61993'/>
<id>urn:sha1:1cfaba5b49a5e9133b28da6d1a459240c1f61993</id>
<content type='text'>
commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 upstream.

sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes.  For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.

This patch adds backing_dev_info-&gt;wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.

v2: Fixed misplaced rwsem init.  Spotted by Jiufei.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Jiufei Xue &lt;xuejiufei@gmail.com&gt;
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>block: fix double-free in the failure path of cgwb_bdi_init()</title>
<updated>2017-02-26T10:10:52Z</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2017-02-08T20:19:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1cb3de83ab740c17dafe9148c2a5b5ac41a736cf'/>
<id>urn:sha1:1cb3de83ab740c17dafe9148c2a5b5ac41a736cf</id>
<content type='text'>
commit 5f478e4ea5c5560b4e40eb136991a09f9389f331 upstream.

When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested
at bdi-&gt;wb_congested.  cgwb_bdi_init() allocates it with kzalloc() and
doesn't do further initialization.  This usually works fine as the
reference count gets bumped to 1 by wb_init() and the put from
wb_exit() releases it.

However, when wb_init() fails, it puts the wb base ref automatically
freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
ends up trying to free the same pointer the second time causing a
double-free.

Fix it by explicitly initilizing the refcnt to 1 and putting the base
ref from cgwb_bdi_destroy().

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>block: fix bdi vs gendisk lifetime mismatch</title>
<updated>2016-08-04T20:19:16Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2016-07-31T18:15:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=df08c32ce3be5be138c1dbfcba203314a3a7cd6f'/>
<id>urn:sha1:df08c32ce3be5be138c1dbfcba203314a3a7cd6f</id>
<content type='text'>
The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt while a
bdi with the same name is still live.  Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the following:

 WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

 Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
  0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
  ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
  0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
 Call Trace:
  [&lt;ffffffff8134caec&gt;] dump_stack+0x63/0x87
  [&lt;ffffffff8108c351&gt;] __warn+0xd1/0xf0
  [&lt;ffffffff8108c3cf&gt;] warn_slowpath_fmt+0x5f/0x80
  [&lt;ffffffff812a0d34&gt;] sysfs_warn_dup+0x64/0x80
  [&lt;ffffffff812a0e1e&gt;] sysfs_create_dir_ns+0x7e/0x90
  [&lt;ffffffff8134faaa&gt;] kobject_add_internal+0xaa/0x320
  [&lt;ffffffff81358d4e&gt;] ? vsnprintf+0x34e/0x4d0
  [&lt;ffffffff8134ff55&gt;] kobject_add+0x75/0xd0
  [&lt;ffffffff816e66b2&gt;] ? mutex_lock+0x12/0x2f
  [&lt;ffffffff8148b0a5&gt;] device_add+0x125/0x610
  [&lt;ffffffff8148b788&gt;] device_create_groups_vargs+0xd8/0x100
  [&lt;ffffffff8148b7cc&gt;] device_create_vargs+0x1c/0x20
  [&lt;ffffffff811b775c&gt;] bdi_register+0x8c/0x180
  [&lt;ffffffff811b7877&gt;] bdi_register_dev+0x27/0x30
  [&lt;ffffffff813317f5&gt;] add_disk+0x175/0x4a0

Cc: &lt;stable@vger.kernel.org&gt;
Reported-by: Yi Zhang &lt;yizhan@redhat.com&gt;
Tested-by: Yi Zhang &lt;yizhan@redhat.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;

Fixed up missing 0 return in bdi_register_owner().

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>mm, vmscan: move LRU lists to node</title>
<updated>2016-07-28T23:07:41Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2016-07-28T22:45:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=599d0c954f91d0689c9bb421b5bc04ea02437a41'/>
<id>urn:sha1:599d0c954f91d0689c9bb421b5bc04ea02437a41</id>
<content type='text'>
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic.  Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes.  It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Hillf Danton &lt;hillf.zj@alibaba-inc.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: throttle on IO only when there are too many dirty and writeback pages</title>
<updated>2016-05-21T00:58:30Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2016-05-20T23:57:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ede37713737834d98ec72ed299a305d53e909f73'/>
<id>urn:sha1:ede37713737834d98ec72ed299a305d53e909f73</id>
<content type='text'>
wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress.

We used to do congestion_wait before commit 0e093d99763e ("writeback: do
not sleep on the congestion queue if there are no congested BDIs or if
significant congestion is not being encountered in the current zone")
but that led to undesirable stalls and sleeping for the full timeout
even when the BDI wasn't congested.  Hence wait_iff_congested was used
instead.

But it seems that even wait_iff_congested doesn't work as expected.  We
might have a small file LRU list with all pages dirty/writeback and yet
the bdi is not congested so this is just a cond_resched in the end and
can end up triggering pre mature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation.  This shouldn't reintroduce stalls
fixed by 0e093d99763e because congestion_wait is called only when we are
getting hopeless when sleeping is a better choice than OOM with many
pages under IO.

We have to preserve logic introduced by commit 373ccbe59270 ("mm,
vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
progress") into the __alloc_pages_slowpath now that wait_iff_congested
is not used anymore.  As the only remaining user of wait_iff_congested
is shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.

[mhocko@suse.com: high_zoneidx-&gt;ac_classzone_idx to evaluate memory reserves properly]
 Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Hillf Danton &lt;hillf.zj@alibaba-inc.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Joonsoo Kim &lt;js1304@gmail.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: fix the wrong congested state variable definition</title>
<updated>2016-03-31T18:26:25Z</updated>
<author>
<name>Kaixu Xia</name>
<email>xiakaixu@huawei.com</email>
</author>
<published>2016-03-31T13:19:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c877ef8ae7b8edaedccad0fc8c23d4d1de7e2480'/>
<id>urn:sha1:c877ef8ae7b8edaedccad0fc8c23d4d1de7e2480</id>
<content type='text'>
The right variable definition should be wb_congested_state that
include WB_async_congested and WB_sync_congested. So fix it.

Signed-off-by: Kaixu Xia &lt;xiakaixu@huawei.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>mm: convert printk(KERN_&lt;LEVEL&gt; to pr_&lt;level&gt;</title>
<updated>2016-03-17T22:09:34Z</updated>
<author>
<name>Joe Perches</name>
<email>joe@perches.com</email>
</author>
<published>2016-03-17T21:19:50Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1170532bb49f9468aedabdc1d5a560e2521a2bcc'/>
<id>urn:sha1:1170532bb49f9468aedabdc1d5a560e2521a2bcc</id>
<content type='text'>
Most of the mm subsystem uses pr_&lt;level&gt; so make it consistent.

Miscellanea:

 - Realign arguments
 - Add missing newline to format
 - kmemleak-test.c has a "kmemleak: " prefix added to the
   "Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;	[percpu]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/backing-dev.c: fix error path in wb_init()</title>
<updated>2016-02-12T02:35:48Z</updated>
<author>
<name>Rasmus Villemoes</name>
<email>linux@rasmusvillemoes.dk</email>
</author>
<published>2016-02-12T00:13:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=078c6c3a5e7dc53a9a23408cc32c83954abb5d0d'/>
<id>urn:sha1:078c6c3a5e7dc53a9a23408cc32c83954abb5d0d</id>
<content type='text'>
We need to use post-decrement to get percpu_counter_destroy() called on
&amp;wb-&gt;stat[0].  Moreover, the pre-decremebt would cause infinite
out-of-bounds accesses if the setup code failed at i==0.

Signed-off-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress</title>
<updated>2016-02-06T02:10:40Z</updated>
<author>
<name>Tetsuo Handa</name>
<email>penguin-kernel@i-love.sakura.ne.jp</email>
</author>
<published>2016-02-05T23:36:30Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=564e81a57f9788b1475127012e0fd44e9049e342'/>
<id>urn:sha1:564e81a57f9788b1475127012e0fd44e9049e342</id>
<content type='text'>
Jan Stancek has reported that system occasionally hanging after "oom01"
testcase from LTP triggers OOM.  Guessing from a result that there is a
kworker thread doing memory allocation and the values between "Node 0
Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not
up-to-date for some reason.

According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
discover memory reclaim doesn't make any progress"), it meant to force
the kworker thread to take a short sleep, but it by error used
schedule_timeout(1).  We missed that schedule_timeout() in state
TASK_RUNNING doesn't do anything.

Fix it by using schedule_timeout_uninterruptible(1) which forces the
kworker thread to take a short sleep in order to make sure that vmstat
is up-to-date.

Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Signed-off-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Reported-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Cristopher Lameter &lt;clameter@sgi.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Arkadiusz Miskiewicz &lt;arekm@maven.pl&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
