<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/drivers/md, branch v4.19.75</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.19.75</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.19.75'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2019-09-16T06:22:23Z</updated>
<entry>
<title>bcache: fix race in btree_flush_write()</title>
<updated>2019-09-16T06:22:23Z</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2019-06-28T11:59:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b113f98432aed624fd9b80af818bd87e4db83537'/>
<id>urn:sha1:b113f98432aed624fd9b80af818bd87e4db83537</id>
<content type='text'>
[ Upstream commit 50a260e859964002dab162513a10f91ae9d3bcd3 ]

There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
  other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b-&gt;write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b-&gt;write_lock)
will trigger NULL pointer deference panic.

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.

Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b-&gt;c, &amp;n2-&gt;key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b-&gt;c, &amp;n1-&gt;key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.

Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reported-and-tested-by: kbuild test robot &lt;lkp@intel.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: add comments for mutex_lock(&amp;b-&gt;write_lock)</title>
<updated>2019-09-16T06:22:23Z</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2019-06-28T11:59:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f73c35d9297698cb9ce03dc84eaae19e2e1cd7a7'/>
<id>urn:sha1:f73c35d9297698cb9ce03dc84eaae19e2e1cd7a7</id>
<content type='text'>
[ Upstream commit 41508bb7d46b74dba631017e5a702a86caf1db8c ]

When accessing or modifying BTREE_NODE_dirty bit, it is not always
necessary to acquire b-&gt;write_lock. In bch_btree_cache_free() and
mca_reap() acquiring b-&gt;write_lock is necessary, and this patch adds
comments to explain why mutex_lock(&amp;b-&gt;write_lock) is necessary for
checking or clearing BTREE_NODE_dirty bit there.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: only clear BTREE_NODE_dirty bit when it is set</title>
<updated>2019-09-16T06:22:23Z</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2019-06-28T11:59:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=7989a5026fd12c7208448b66c51402a65a8a7f16'/>
<id>urn:sha1:7989a5026fd12c7208448b66c51402a65a8a7f16</id>
<content type='text'>
[ Upstream commit e5ec5f4765ada9c75fb3eee93a6e72f0e50599d5 ]

In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
always set no matter btree node is dirty or not. The code looks like
this,
	if (btree_node_dirty(b))
		btree_complete_write(b, btree_current_write(b));
	clear_bit(BTREE_NODE_dirty, &amp;b-&gt;flags);

Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
bit is cleared, then it is unnecessary to clear the bit again.

This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
true (the bit is set), to save a few CPU cycles.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>dm thin metadata: check if in fail_io mode when setting needs_check</title>
<updated>2019-09-16T06:22:21Z</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2019-07-02T19:50:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ecf99cdea02dcc792c27a52d1cf3e1c532551479'/>
<id>urn:sha1:ecf99cdea02dcc792c27a52d1cf3e1c532551479</id>
<content type='text'>
[ Upstream commit 54fa16ee532705985e6c946da455856f18f63ee1 ]

Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
crash in dm_bm_write_lock() while accessing the block manager object
that was previously destroyed as part of a failed
dm_pool_abort_metadata() that ultimately set fail_io to begin with.

Also, update DMERR() message to more accurately describe
superblock_lock() failure.

Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac &lt;zkabelac@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>dm crypt: move detailed message into debug level</title>
<updated>2019-09-16T06:22:14Z</updated>
<author>
<name>Milan Broz</name>
<email>gmazyland@gmail.com</email>
</author>
<published>2019-05-15T14:23:43Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fcb2f1e2ea687b3507b11c8e74c30dd3d967f1b0'/>
<id>urn:sha1:fcb2f1e2ea687b3507b11c8e74c30dd3d967f1b0</id>
<content type='text'>
[ Upstream commit 7a1cd7238fde6ab367384a4a2998cba48330c398 ]

The information about tag size should not be printed without debug info
set. Also print device major:minor in the error message to identify the
device instance.

Also use rate limiting and debug level for info about used crypto API
implementaton.  This is important because during online reencryption
the existing message saturates syslog (because we are moving hotzone
across the whole device).

Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz &lt;gmazyland@gmail.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>dm mpath: fix missing call of path selector type-&gt;end_io</title>
<updated>2019-09-16T06:22:12Z</updated>
<author>
<name>Yufen Yu</name>
<email>yuyufen@huawei.com</email>
</author>
<published>2019-04-24T15:19:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=69409854ba08d3aeb28a3989703381857842e2ab'/>
<id>urn:sha1:69409854ba08d3aeb28a3989703381857842e2ab</id>
<content type='text'>
[ Upstream commit 5de719e3d01b4abe0de0d7b857148a880ff2a90b ]

After commit 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.

Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE.  That means
type-&gt;start_io may be called multiple times, while type-&gt;end_io is only
called when IO complete.

In fact, even without commit 396eaf21ee17, setup_clone() failure can
also cause tio requeue and associated missed call to type-&gt;end_io.

The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice.  In addition,
queue-length path selector will also be affected.

To fix the problem, call type-&gt;end_io in -&gt;release_clone_rq before tio
requeue.  map_info is passed to -&gt;release_clone_rq() for map_request()
error path that result in requeue.

Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu &lt;yuyufen@huawei.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: treat stale &amp;&amp; dirty keys as bad keys</title>
<updated>2019-09-16T06:22:05Z</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui.linux@gmail.com</email>
</author>
<published>2019-02-09T04:52:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=687e470e9123a72a25ba56e9dec5929619edf4b1'/>
<id>urn:sha1:687e470e9123a72a25ba56e9dec5929619edf4b1</id>
<content type='text'>
[ Upstream commit 58ac323084ebf44f8470eeb8b82660f9d0ee3689 ]

Stale &amp;&amp; dirty keys can be produced in the follow way:
After writeback in write_dirty_finish(), dirty keys k1 will
replace by clean keys k2
==&gt;ret = bch_btree_insert(dc-&gt;disk.c, &amp;keys, NULL, &amp;w-&gt;key);
==&gt;btree_insert_fn(struct btree_op *b_op, struct btree *b)
==&gt;static int bch_btree_insert_node(struct btree *b,
       struct btree_op *op,
       struct keylist *insert_keys,
       atomic_t *journal_ref,
Then two steps:
A) update k1 to k2 in btree node memory;
   bch_btree_insert_keys(b, op, insert_keys, replace_key)
B) Write the bset(contains k2) to cache disk by a 30s delay work
   bch_btree_leaf_dirty(b, journal_ref).
But before the 30s delay work write the bset to cache device,
these things happened:
A) GC works, and reclaim the bucket k2 point to;
B) Allocator works, and invalidate the bucket k2 point to,
   and increase the gen of the bucket, and place it into free_inc
   fifo;
C) Until now, the 30s delay work still does not finish work,
   so in the disk, the key still is k1, it is dirty and stale
   (its gen is smaller than the gen of the bucket). and then the
   machine power off suddenly happens;
D) When the machine power on again, after the btree reconstruction,
   the stale dirty key appear.

In bch_extent_bad(), when expensive_debug_checks is off, it would
treat the dirty key as good even it is stale keys, and it would
cause bellow probelms:
A) In read_dirty() it would cause machine crash:
   BUG_ON(ptr_stale(dc-&gt;disk.c, &amp;w-&gt;key, 0));
B) It could be worse when reads hits stale dirty keys, it would
   read old incorrect data.

This patch tolerate the existence of these stale &amp;&amp; dirty keys,
and treat them as bad key in bch_extent_bad().

(Coly Li: fix indent which was modified by sender's email client)

Signed-off-by: Tang Junhui &lt;tang.junhui.linux@gmail.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: replace hard coded number with BUCKET_GC_GEN_MAX</title>
<updated>2019-09-16T06:22:05Z</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2018-10-08T12:41:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d1cec665de2c30e4fcad23b871173ad51c2946b7'/>
<id>urn:sha1:d1cec665de2c30e4fcad23b871173ad51c2946b7</id>
<content type='text'>
[ Upstream commit 149d0efada7777ad5a5242b095692af142f533d8 ]

In extents.c:bch_extent_bad(), number 96 is used as parameter to call
btree_bug_on(). The purpose is to check whether stale gen value exceeds
BUCKET_GC_GEN_MAX, so it is better to use macro BUCKET_GC_GEN_MAX to
make the code more understandable.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>dm zoned: fix potential NULL dereference in dmz_do_reclaim()</title>
<updated>2019-08-29T06:28:59Z</updated>
<author>
<name>Dan Carpenter</name>
<email>dan.carpenter@oracle.com</email>
</author>
<published>2019-08-19T09:58:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0d5e34c1e2633e6256826b8ae2f7fe0d6b3b45d1'/>
<id>urn:sha1:0d5e34c1e2633e6256826b8ae2f7fe0d6b3b45d1</id>
<content type='text'>
[ Upstream commit e0702d90b79d430b0ccc276ead4f88440bb51352 ]

This function is supposed to return error pointers so it matches the
dmz_get_rnd_zone_for_reclaim() function.  The current code could lead to
a NULL dereference in dmz_do_reclaim()

Fixes: b234c6d7a703 ("dm zoned: improve error handling in reclaim")
Signed-off-by: Dan Carpenter &lt;dan.carpenter@oracle.com&gt;
Reviewed-by: Dmitry Fomichev &lt;dmitry.fomichev@wdc.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>dm zoned: properly handle backing device failure</title>
<updated>2019-08-29T06:28:56Z</updated>
<author>
<name>Dmitry Fomichev</name>
<email>dmitry.fomichev@wdc.com</email>
</author>
<published>2019-08-10T21:43:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c14fe4e8fd011c702c8867c8dc685d396fb5f538'/>
<id>urn:sha1:c14fe4e8fd011c702c8867c8dc685d396fb5f538</id>
<content type='text'>
commit 75d66ffb48efb30f2dd42f041ba8b39c5b2bd115 upstream.

dm-zoned is observed to lock up or livelock in case of hardware
failure or some misconfiguration of the backing zoned device.

This patch adds a new dm-zoned target function that checks the status of
the backing device. If the request queue of the backing device is found
to be in dying state or the SCSI backing device enters offline state,
the health check code sets a dm-zoned target flag prompting all further
incoming I/O to be rejected. In order to detect backing device failures
timely, this new function is called in the request mapping path, at the
beginning of every reclaim run and before performing any metadata I/O.

The proper way out of this situation is to do

dmsetup remove &lt;dm-zoned target&gt;

and recreate the target when the problem with the backing device
is resolved.

Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev &lt;dmitry.fomichev@wdc.com&gt;
Reviewed-by: Damien Le Moal &lt;damien.lemoal@wdc.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
