<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/drivers/md, branch v4.10.3</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.10.3</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.10.3'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2017-03-12T05:44:20Z</updated>
<entry>
<title>md linear: fix a race between linear_add() and linear_congested()</title>
<updated>2017-03-12T05:44:20Z</updated>
<author>
<name>colyli@suse.de</name>
<email>colyli@suse.de</email>
</author>
<published>2017-01-28T13:11:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fe83da6961f829932d4bca07d42a1d319c688379'/>
<id>urn:sha1:fe83da6961f829932d4bca07d42a1d319c688379</id>
<content type='text'>
commit 03a9e24ef2aaa5f1f9837356aed79c860521407a upstream.

Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
disk to a md linear device causes kernel crash at linear_congested(). From
the crash image analysis, I find in linear_congested(), mddev-&gt;raid_disks
contains value N, but conf-&gt;disks[] only has N-1 pointers available. Then
a NULL pointer deference crashes the kernel.

There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend().  After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.

Here I explain how the race still exists in current code.  For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.

Now I use a possible code execution time sequence to demo how the possible
race happens,

seq    linear_add()                linear_congested()
 0                                 conf=mddev-&gt;private
 1   oldconf=mddev-&gt;private
 2   mddev-&gt;raid_disks++
 3                              for (i=0; i&lt;mddev-&gt;raid_disks;i++)
 4                                bdev_get_queue(conf-&gt;disks[i].rdev-&gt;bdev)
 5   mddev-&gt;private=newconf

In linear_add() mddev-&gt;raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf-&gt;disks[i] by
the increased mddev-&gt;raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf-&gt;disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.

To fix this race, there are 2 parts of modification in the patch,
 1) Add 'int raid_disks' in struct linear_conf, as a copy of
    mddev-&gt;raid_disks. It is initialized in linear_conf(), always being
    consistent with pointers number of 'struct dev_info disks[]'. When
    iterating conf-&gt;disks[] in linear_congested(), use conf-&gt;raid_disks to
    replace mddev-&gt;raid_disks in the for-loop, then NULL pointer deference
    will not happen again.
 2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
    free oldconf memory. Because oldconf may be referenced as mddev-&gt;private
    in linear_congested(), kfree_rcu() makes sure that its memory will not
    be released until no one uses it any more.
Also some code comments are added in this patch, to make this modification
to be easier understandable.

This patch can be applied for kernels since v4.0 after commit:
3be260cc18f8 ("md/linear: remove rcu protections in favour of
suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
people who maintain kernels before Linux v4.0, they need to do some back
back port to this patch.

Changelog:
 - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
       replace rcu_call() in linear_add().
 - v2: add RCU stuffs by suggestion from Shaohua and Neil.
 - v1: initial effort.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Neil Brown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm raid: fix data corruption on reshape request</title>
<updated>2017-03-12T05:44:11Z</updated>
<author>
<name>Heinz Mauelshagen</name>
<email>heinzm@redhat.com</email>
</author>
<published>2017-02-28T18:17:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4517ad77e7e6bb08b8ebb0487ef508b127b0c8f2'/>
<id>urn:sha1:4517ad77e7e6bb08b8ebb0487ef508b127b0c8f2</id>
<content type='text'>
commit d36a19541fe8f392778ac137d60f9be8dfdd8f9d upstream.

The lvm2 sequence to manage dm-raid constructor flags that trigger a
rebuild or a reshape is defined as:

1) load table with flags (e.g. rebuild/delta_disks/data_offset)
2) clear out the flags in lvm2 metadata
3) store the lvm2 metadata, reload the table to reset the flags
   previously established during the initial load (1) -- in order to
   prevent repeatedly requesting a rebuild or a reshape on activation

Currently, loading an inactive table with rebuild/reshape flags
specified will cause dm-raid to rebuild/reshape on resume and thus start
updating the raid metadata (about the progress).  When the second table
reload, to reset the flags, occurs the constructor accesses the volatile
progress state kept in the raid superblocks.  Because the active mapping
is still processing the rebuild/reshape, that position will be stale by
the time the device is resumed.

In the reshape case, this causes data corruption by processing already
reshaped stripes again.  In the rebuild case, it does _not_ cause data
corruption but instead involves superfluous rebuilds.

Fix by keeping the raid set frozen during the first resume and then
allow the rebuild/reshape during the second resume.

Fixes: 9dbd1aa3a ("dm raid: add reshaping support to the target")
Signed-off-by: Heinz Mauelshagen &lt;heinzm@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm round robin: revert "use percpu 'repeat_count' and 'current_path'"</title>
<updated>2017-03-12T05:44:11Z</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2017-02-17T04:57:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=37ce3ec1e70b68c82e64a747e4892b5e15ff620f'/>
<id>urn:sha1:37ce3ec1e70b68c82e64a747e4892b5e15ff620f</id>
<content type='text'>
commit 37a098e9d10db6e2efc05fe61e3a6ff2e9802c53 upstream.

The sloppy nature of lockless access to percpu pointers
(s-&gt;current_path) in rr_select_path(), from multiple threads, is
causing some paths to used more than others -- which results in less
IO performance being observed.

Revert these upstream commits to restore truly symmetric round-robin
IO submission in DM multipath:

b0b477c dm round robin: use percpu 'repeat_count' and 'current_path'
802934b dm round robin: do not use this_cpu_ptr() without having preemption disabled

There is no benefit to all this complexity if repeat_count = 1 (which is
the recommended default).

Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm stats: fix a leaked s-&gt;histogram_boundaries array</title>
<updated>2017-03-12T05:44:11Z</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2017-02-15T17:06:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=72ea8179bc800548984f82c43e4c1b6fea796392'/>
<id>urn:sha1:72ea8179bc800548984f82c43e4c1b6fea796392</id>
<content type='text'>
commit 6085831883c25860264721df15f05bbded45e2a2 upstream.

Fixes: dfcfac3e4cd9 ("dm stats: collect and report histogram of IO latencies")
Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm cache: fix corruption seen when using cache &gt; 2TB</title>
<updated>2017-03-12T05:44:11Z</updated>
<author>
<name>Joe Thornber</name>
<email>ejt@redhat.com</email>
</author>
<published>2017-02-09T16:46:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d18f5797ecf3d7ee59bbd2be2f729cf4acfff915'/>
<id>urn:sha1:d18f5797ecf3d7ee59bbd2be2f729cf4acfff915</id>
<content type='text'>
commit ca763d0a53b264a650342cee206512bc92ac7050 upstream.

A rounding bug due to compiler generated temporary being 32bit was found
in remap_to_cache().  A localized cast in remap_to_cache() fixes the
corruption but this preferred fix (changing from uint32_t to sector_t)
eliminates potential for future rounding errors elsewhere.

Signed-off-by: Joe Thornber &lt;ejt@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm crypt: replace RCU read-side section with rwsem</title>
<updated>2017-02-03T15:26:14Z</updated>
<author>
<name>Ondrej Kozina</name>
<email>okozina@redhat.com</email>
</author>
<published>2017-01-31T14:47:11Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f5b0cba8f23915e92932f11eb063e37d70556a89'/>
<id>urn:sha1:f5b0cba8f23915e92932f11eb063e37d70556a89</id>
<content type='text'>
The lockdep splat below hints at a bug in RCU usage in dm-crypt that
was introduced with commit c538f6ec9f56 ("dm crypt: add ability to use
keys from the kernel key retention service").  The kernel keyring
function user_key_payload() is in fact a wrapper for
rcu_dereference_protected() which must not be called with only
rcu_read_lock() section mark.

Unfortunately the kernel keyring subsystem doesn't currently provide
an interface that allows the use of an RCU read-side section.  So for
now we must drop RCU in favour of rwsem until a proper function is
made available in the kernel keyring subsystem.

===============================
[ INFO: suspicious RCU usage. ]
4.10.0-rc5 #2 Not tainted
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by cryptsetup/6464:
 #0:  (&amp;md-&gt;type_lock){+.+.+.}, at: [&lt;ffffffffa02472a2&gt;] dm_lock_md_type+0x12/0x20 [dm_mod]
 #1:  (rcu_read_lock){......}, at: [&lt;ffffffffa02822f8&gt;] crypt_set_key+0x1d8/0x4b0 [dm_crypt]
stack backtrace:
CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Call Trace:
 dump_stack+0x67/0x92
 lockdep_rcu_suspicious+0xc5/0x100
 crypt_set_key+0x351/0x4b0 [dm_crypt]
 ? crypt_set_key+0x1d8/0x4b0 [dm_crypt]
 crypt_ctr+0x341/0xa53 [dm_crypt]
 dm_table_add_target+0x147/0x330 [dm_mod]
 table_load+0x111/0x350 [dm_mod]
 ? retrieve_status+0x1c0/0x1c0 [dm_mod]
 ctl_ioctl+0x1f5/0x510 [dm_mod]
 dm_ctl_ioctl+0xe/0x20 [dm_mod]
 do_vfs_ioctl+0x8e/0x690
 ? ____fput+0x9/0x10
 ? task_work_run+0x7e/0xa0
 ? trace_hardirqs_on_caller+0x122/0x1b0
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7f392c9a4ec7
RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7
RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005
RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0
R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08
R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b

Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service")
Reported-by: Milan Broz &lt;mbroz@redhat.com&gt;
Signed-off-by: Ondrej Kozina &lt;okozina@redhat.com&gt;
Reviewed-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm rq: cope with DM device destruction while in dm_old_request_fn()</title>
<updated>2017-02-03T15:18:43Z</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2017-01-25T15:24:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4087a1fffe38106e10646606a27f10d40451862d'/>
<id>urn:sha1:4087a1fffe38106e10646606a27f10d40451862d</id>
<content type='text'>
Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
being passed from dm_old_request_fn() that races with DM device
destruction.

Reported-by: artem@flashgrid.io
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>dm mpath: cleanup -Wbool-operation warning in choose_pgpath()</title>
<updated>2017-02-03T15:18:37Z</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2017-01-06T20:33:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d19a55ccad15a486ffe03030570744e5d5bd9f8e'/>
<id>urn:sha1:d19a55ccad15a486ffe03030570744e5d5bd9f8e</id>
<content type='text'>
Reported-by: David Binderman &lt;dcb314@hotmail.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>md/r5cache: disable write back for degraded array</title>
<updated>2017-01-24T19:26:06Z</updated>
<author>
<name>Song Liu</name>
<email>songliubraving@fb.com</email>
</author>
<published>2017-01-24T18:45:30Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2e38a37f23c98d7fad87ff022670060b8a0e2bf5'/>
<id>urn:sha1:2e38a37f23c98d7fad87ff022670060b8a0e2bf5</id>
<content type='text'>
write-back cache in degraded mode introduces corner cases to the array.
Although we try to cover all these corner cases, it is safer to just
disable write-back cache when the array is in degraded mode.

In this patch, we disable writeback cache for degraded mode:
1. On device failure, if the array enters degraded mode, raid5_error()
   will submit async job r5c_disable_writeback_async to disable
   writeback;
2. In r5c_journal_mode_store(), it is invalid to enable writeback in
   degraded mode;
3. In r5c_try_caching_write(), stripes with s-&gt;failed&gt;0 will be handled
   in write-through mode.

Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/r5cache: shift complex rmw from read path to write path</title>
<updated>2017-01-24T19:20:15Z</updated>
<author>
<name>Song Liu</name>
<email>songliubraving@fb.com</email>
</author>
<published>2017-01-24T01:12:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=07e83364845e1e1c7e189a01206a9d7d33831568'/>
<id>urn:sha1:07e83364845e1e1c7e189a01206a9d7d33831568</id>
<content type='text'>
Write back cache requires a complex RMW mechanism, where old data is
read into dev-&gt;orig_page for prexor, and then xor is done with
dev-&gt;page. This logic is already implemented in the write path.

However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.

To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.

Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.

There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.

Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
</feed>
