<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/drivers/md, branch v4.12.5</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.12.5</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.12.5'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2017-08-06T16:21:09Z</updated>
<entry>
<title>md/raid5: add thread_group worker async_tx_issue_pending_all</title>
<updated>2017-08-06T16:21:09Z</updated>
<author>
<name>Ofer Heifetz</name>
<email>oferh@marvell.com</email>
</author>
<published>2017-07-24T06:17:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9425c1fdc48d3a05ea332b19454cc690faec3ee8'/>
<id>urn:sha1:9425c1fdc48d3a05ea332b19454cc690faec3ee8</id>
<content type='text'>
commit 7e96d559634b73a8158ee99a7abece2eacec2668 upstream.

Since the thread_group worker and the raid5d kthread are not in sync, if
the worker writes a stripe before raid5d does, the requests will be left
waiting for issue_pending.

The issue was observed when building ext4 on raid5: in some build runs
jbd2 would hang, with requests sitting in the HW engine waiting to be
issued.

Fix this by adding a call to async_tx_issue_pending_all() in
raid5_do_work().
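
A pseudocode-style sketch of where the call lands, based on the
description above (not the verbatim patch; the stripe-handling loop is
elided):

	static void raid5_do_work(struct work_struct *work)
	{
		...
		/* handle batched stripes ... */

		/* Flush DMA descriptors queued by this worker instead of
		 * relying on the raid5d kthread to issue them. */
		async_tx_issue_pending_all();
		...
	}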

Signed-off-by: Ofer Heifetz &lt;oferh@marvell.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md/raid1: fix writebehind bio clone</title>
<updated>2017-08-06T16:21:09Z</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-07-17T21:33:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=270c1bc38fc531662bddd64a835f4afd83a9562d'/>
<id>urn:sha1:270c1bc38fc531662bddd64a835f4afd83a9562d</id>
<content type='text'>
commit 16d56e2fcc1fc15b981369653c3b41d7ff0b443d upstream.

After a bio is submitted, we should not clone it, as its bi_iter might be
invalidated by the driver. This is the case for behind_master_bio. In
certain situations, we could dispatch behind_master_bio immediately for
the first disk and then clone it for the other disks.
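
A pseudocode-style sketch of the reordering described above
(illustrative, not the verbatim patch):

	Before: submit behind_master_bio for the first disk,
	        then clone it for the other disks    (bi_iter may be invalid)

	After:  clone behind_master_bio for every disk that needs a copy,
	        then submit                          (clone only before submit)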

https://bugzilla.kernel.org/show_bug.cgi?id=196383

Reported-and-tested-by: Markus &lt;m4rkusxxl@web.de&gt;
Reviewed-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Fixes: 841c1316c7da ("md: raid1: improve write behind")
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md: remove 'idx' from 'struct resync_pages'</title>
<updated>2017-08-06T16:21:09Z</updated>
<author>
<name>Ming Lei</name>
<email>ming.lei@redhat.com</email>
</author>
<published>2017-07-14T08:14:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b70f86cedc1b5378815083db1d8f31fb9073ced0'/>
<id>urn:sha1:b70f86cedc1b5378815083db1d8f31fb9073ced0</id>
<content type='text'>
commit 022e510fcbda79183fd2cdc01abb01b4be80d03f upstream.

bio_add_page() won't fail for a resync bio, and the page index is the
same for each bio, so remove it.

More importantly, 'idx' of 'struct resync_pages' is currently initialized
in the mempool allocator function. That is wrong: a mempool is only
responsible for allocation (it may hand back a previously freed element
without calling the allocator), so it can't be used for initialization.

Suggested-by: NeilBrown &lt;neilb@suse.com&gt;
Reported-by: NeilBrown &lt;neilb@suse.com&gt;
Reported-and-tested-by: Patrick &lt;dto@gmx.net&gt;
Fixes: f0250618361d ("md: raid10: don't use bio's vec table to manage resync pages")
Fixes: 98d30c5812c3 ("md: raid1: don't use bio's vec table to manage resync pages")
Signed-off-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm integrity: test for corrupted disk format during table load</title>
<updated>2017-08-06T16:21:09Z</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2017-07-21T15:58:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=bce721912a18fcffda3a19947f5ec4d99996667e'/>
<id>urn:sha1:bce721912a18fcffda3a19947f5ec4d99996667e</id>
<content type='text'>
commit bc86a41e96c5b6f07453c405e036d95acc673389 upstream.

If the dm-integrity superblock was corrupted in such a way that the
journal_sections field was zero, the integrity target would deadlock
because it would wait forever for free space in the journal.

Detect this situation and refuse to activate the device.
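
A pseudocode-style sketch of the check described above (field and
error-string names are assumed, not the verbatim patch):

	if (!le32_to_cpu(ic-&gt;sb-&gt;journal_sections)) {
		ti-&gt;error = "Corrupted superblock, journal_sections is 0";
		return -EINVAL;
	}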

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Fixes: 7eada909bfd7 ("dm: add integrity target")
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm integrity: fix inefficient allocation of journal space</title>
<updated>2017-08-06T16:21:09Z</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2017-07-19T15:23:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d2df849cda98d053cc29021898c4bcda6159de4f'/>
<id>urn:sha1:d2df849cda98d053cc29021898c4bcda6159de4f</id>
<content type='text'>
commit 9dd59727dbc21d33a9add4c5b308a5775cd5a6ef upstream.

When using a block size greater than 512 bytes, the dm-integrity target
allocates journal space inefficiently.  It allocates one journal entry
for each 512-byte chunk of data, fills an entry for each block of data
and leaves the remaining entries unused.

This issue doesn't cause data corruption, but all the unused journal
entries degrade performance severely.

For example, with 4k blocks and an 8k bio, it would allocate 16 journal
entries but only use 2 entries.  The remaining 14 entries were left
unused.

Fix this by adding the missing 'log2_sectors_per_block' shifts that are
required to have each journal entry map to a full block.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Fixes: 7eada909bfd7 ("dm: add integrity target")
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Raid5 should update rdev-&gt;sectors after reshape</title>
<updated>2017-07-27T22:10:13Z</updated>
<author>
<name>Xiao Ni</name>
<email>xni@redhat.com</email>
</author>
<published>2017-07-05T09:34:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=25b43a867f56eba19c23cae8e296f8dd3b7fe951'/>
<id>urn:sha1:25b43a867f56eba19c23cae8e296f8dd3b7fe951</id>
<content type='text'>
commit b5d27718f38843a74552e9a93d32e2391fd3999f upstream.

A raid5 md device can be created from disks of which only part of the capacity is used.
For example, each device is 5G but only 3G of each is used to create the raid5 array.
If the chunk size is then changed and the reshape is allowed to finish, stopping the
array and assembling it again fails:
mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
mdadm /dev/md0 --grow --chunk=64
(wait for the reshape to finish)
mdadm -S /dev/md0
mdadm -As
The error messages:
[197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
[197519.821686] md: md_import_device returned -22

The reshape changes the data offset (the backwards direction is selected in this
condition). In super_1_load the available space of the underlying device is compared
with sb-&gt;data_size, and because the new data offset is bigger after the reshape,
super_1_load returns -EINVAL. rdev-&gt;sectors is updated in md_finish_reshape, and
sb-&gt;data_size is then set in super_1_sync based on rdev-&gt;sectors. So call
md_finish_reshape in end_reshape.

Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Acked-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm raid: stop using BUG() in __rdev_sectors()</title>
<updated>2017-07-27T22:10:12Z</updated>
<author>
<name>Heinz Mauelshagen</name>
<email>heinzm@redhat.com</email>
</author>
<published>2017-06-30T13:45:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=47f1b42a07b1e26c2d61a0785559d0f8434110f0'/>
<id>urn:sha1:47f1b42a07b1e26c2d61a0785559d0f8434110f0</id>
<content type='text'>
commit 4d49f1b4a1fcab16b6dd1c79ef14f2b6531d50a6 upstream.

Return 0 rather than BUG() if __rdev_sectors() fails and catch invalid
rdev size in the constructor.

Reported-by: Hannes Reinecke &lt;hare@suse.de&gt;
Signed-off-by: Heinz Mauelshagen &lt;heinzm@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md: fix deadlock between mddev_suspend() and md_write_start()</title>
<updated>2017-07-27T22:10:11Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-06-05T06:49:39Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a2bfc67530653eca8484b393d71b7042efc26e1c'/>
<id>urn:sha1:a2bfc67530653eca8484b393d71b7042efc26e1c</id>
<content type='text'>
commit cc27b0c78c79680d128dbac79de0d40556d041bb upstream.

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the -&gt;make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen when
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and -&gt;make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the -&gt;active_io
count, and wait for mddev_suspend().

Reported-by: Nix &lt;nix@esperi.org.uk&gt;
Fixes: 68866e425be2 ("MD: no sync IO while suspended")
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md: don't use flush_signals in userspace processes</title>
<updated>2017-07-27T22:10:11Z</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2017-06-07T23:05:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8d73fe66b5a640b793fdf13491b5cac12882be7e'/>
<id>urn:sha1:8d73fe66b5a640b793fdf13491b5cac12882be7e</id>
<content type='text'>
commit f9c79bc05a2a91f4fba8bfd653579e066714b1ec upstream.

The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However, using this function for userspace
processes is incorrect - clearing signals without the program expecting
it can cause misbehavior.

The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Acked-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>dm thin: do not queue freed thin mapping for next stage processing</title>
<updated>2017-06-27T19:14:34Z</updated>
<author>
<name>Vallish Vaidyeshwara</name>
<email>vallish@amazon.com</email>
</author>
<published>2017-06-23T18:53:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=00a0ea33b495ee6149bf5a77ac5807ce87323abb'/>
<id>urn:sha1:00a0ea33b495ee6149bf5a77ac5807ce87323abb</id>
<content type='text'>
process_prepared_discard_passdown_pt1() should clean up the
dm_thin_new_mapping in cases of error.

dm_pool_inc_data_range() can fail trying to get a block reference:

metadata operation 'dm_pool_inc_data_range' failed: error = -61

When dm_pool_inc_data_range() fails, dm thin aborts the current metadata
transaction and marks the pool as PM_READ_ONLY. The memory for the thin
mapping is released as well. However, the current thin mapping is still
queued onto the next stage as part of queue_passdown_pt2() or
passdown_endio(). When this dangling thin mapping is processed and
accessed in the next stage, device mapper crashes.

Code flow without fix:
-&gt; process_prepared_discard_passdown_pt1(m)
   -&gt; dm_thin_remove_range()
   -&gt; discard passdown
      --&gt; passdown_endio(m) queues m onto next stage
   -&gt; dm_pool_inc_data_range() fails, frees memory m
            but does not remove it from next stage queue

-&gt; process_prepared_discard_passdown_pt2(m)
   -&gt; processes freed memory m and crashes

One such stack:

Call Trace:
[&lt;ffffffffa037a46f&gt;] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
[&lt;ffffffffa039b6dc&gt;] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
[&lt;ffffffffa039b88b&gt;] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
[&lt;ffffffffa0399611&gt;] process_prepared+0x81/0xa0 [dm_thin_pool]
[&lt;ffffffffa039e735&gt;] do_worker+0xc5/0x820 [dm_thin_pool]
[&lt;ffffffff8152bf54&gt;] ? __schedule+0x244/0x680
[&lt;ffffffff81087e72&gt;] ? pwq_activate_delayed_work+0x42/0xb0
[&lt;ffffffff81089f53&gt;] process_one_work+0x153/0x3f0
[&lt;ffffffff8108a71b&gt;] worker_thread+0x12b/0x4b0
[&lt;ffffffff8108a5f0&gt;] ? rescuer_thread+0x350/0x350
[&lt;ffffffff8108fd6a&gt;] kthread+0xca/0xe0
[&lt;ffffffff8108fca0&gt;] ? kthread_park+0x60/0x60
[&lt;ffffffff81530b45&gt;] ret_from_fork+0x25/0x30

The fix is to first take the block ref count for the discarded block and
then do a passdown discard of that block. If taking the block ref count
fails, bail out: abort the current metadata transaction, mark the pool as
PM_READ_ONLY and free the current thin mapping memory (existing error
handling code) without queueing this thin mapping onto the next stage of
processing. If taking the block ref count succeeds, the discard of the
block is passed down, and the discard callback passdown_endio() queues
this thin mapping onto the next stage of processing.

Code flow with fix:
-&gt; process_prepared_discard_passdown_pt1(m)
   -&gt; dm_thin_remove_range()
   -&gt; dm_pool_inc_data_range()
      --&gt; if fails, free memory m and bail out
   -&gt; discard passdown
      --&gt; passdown_endio(m) queues m onto next stage

Cc: stable &lt;stable@vger.kernel.org&gt; # v4.9+
Reviewed-by: Eduardo Valentin &lt;eduval@amazon.com&gt;
Reviewed-by: Cristian Gafton &lt;gafton@amazon.com&gt;
Reviewed-by: Anchal Agarwal &lt;anchalag@amazon.com&gt;
Signed-off-by: Vallish Vaidyeshwara &lt;vallish@amazon.com&gt;
Reviewed-by: Joe Thornber &lt;ejt@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
</feed>
