user/sven/linux.git/fs, branch v3.18.28

fs-writeback: unplug before cond_resched in writeback_sb_inodes

2016-03-04T15:19:42Z

[ Upstream commit 590dca3a71875461e8fea3013af74386945191b2 ] Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes, which increases the merge rate when relatively contiguous small files are written by the filesystem. It helps both on flash and spindles. For an fs_mark workload creating 4K files in parallel across 8 drives, this commit improves performance ~9% more by unplugging before calling cond_resched(). cond_resched() doesn't trigger an implicit unplug, so explicitly getting the IO down to the device before scheduling reduces latencies for anyone waiting on clean pages. It also cuts down on how often we use kblockd to unplug, which means less work bouncing from one workqueue to another. Many more details about how we got here: https://lkml.org/lkml/2015/9/11/570 Signed-off-by: Chris Mason Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin

ext4: fix crashes in dioread_nolock mode

2016-03-04T15:18:43Z

[ Upstream commit 74dae4278546b897eb81784fdfcce872ddd8b2b8 ] Competing overwrite DIO in dioread_nolock mode will just overwrite pointer to io_end in the inode. This may result in data corruption or extent conversion happening from IO completion interrupt because we don't properly set buffer_defer_completion() when unlocked DIO races with locked DIO to unwritten extent. Since unlocked DIO doesn't need io_end for anything, just avoid allocating it and corrupting pointer from inode for locked DIO. A cleaner fix would be to avoid these games with io_end pointer from the inode but that requires more intrusive changes so we leave that for later. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara Signed-off-by: Theodore Ts'o Signed-off-by: Sasha Levin

ext4: don't read blocks from disk after extents being swapped

2016-03-04T15:18:41Z

[ Upstream commit bcff24887d00bce102e0857d7b0a8c44a40f53d1 ] I notice ext4/307 fails occasionally on ppc64 host, reporting md5 checksum mismatch after moving data from original file to donor file. The reason is that move_extent_per_page() calls __block_write_begin() and block_commit_write() to write saved data from original inode blocks to donor inode blocks, but __block_write_begin() not only maps buffer heads but also reads block content from disk if the size is not block size aligned. At this time the physical block number in mapped buffer head is pointing to the donor file not the original file, and that results in reading wrong data to page, which get written to disk in following block_commit_write call. This also can be reproduced by the following script on 1k block size ext4 on x86_64 host: mnt=/mnt/ext4 donorfile=$mnt/donor testfile=$mnt/testfile e4compact=~/xfstests/src/e4compact rm -f $donorfile $testfile # reserve space for donor file, written by 0xaa and sync to disk to # avoid EBUSY on EXT4_IOC_MOVE_EXT xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile # create test file written by 0xbb xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile # compute initial md5sum md5sum $testfile | tee md5sum.txt # drop cache, force e4compact to read data from disk echo 3 > /proc/sys/vm/drop_caches # test defrag echo "$testfile" | $e4compact -i -v -f $donorfile # check md5sum md5sum -c md5sum.txt Fix it by creating & mapping buffer heads only but not reading blocks from disk, because all the data in page is guaranteed to be up-to-date in mext_page_mkuptodate(). Cc: stable@vger.kernel.org Signed-off-by: Eryu Guan Signed-off-by: Theodore Ts'o Signed-off-by: Sasha Levin

ext4: move_extent improve bh vanishing success factor

2016-03-04T15:18:41Z

[ Upstream commit 88c6b61ff1cfb4013a3523227d91ad11b2892388 ] Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302 Unfortunetly there is nothing we can do if some other task holds BH's refenrence. So we must return EBUSY in this case. But we can try kicking the journal to see if the other task releases the bh reference after the commit is complete. Also decrease false positives by properly checking for ENOSPC and retrying the allocation after kicking the journal --- which is done by ext4_should_retry_alloc(). [ Modified by tytso to properly check for ENOSPC. ] Signed-off-by: Dmitry Monakhov Signed-off-by: Theodore Ts'o Signed-off-by: Sasha Levin

ext4: fix potential integer overflow

2016-03-04T15:18:41Z

[ Upstream commit 46901760b46064964b41015d00c140c83aa05bcf ] Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data), integer overflow could be happened. Therefore, need to fix integer overflow sanitization. Cc: stable@vger.kernel.org Signed-off-by: Insu Yun Signed-off-by: Theodore Ts'o Signed-off-by: Sasha Levin

btrfs: properly set the termination value of ctx->pos in readdir

2016-03-04T15:18:40Z

[ Upstream commit bc4ef7592f657ae81b017207a1098817126ad4cb ] The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [] dump_stack+0x4c/0x7f [] report_size_overflow+0x36/0x40 [] btrfs_real_readdir+0x69c/0x6d0 [] iterate_dir+0xa8/0x150 [] ? __fget_light+0x2d/0x70 [] SyS_getdents+0xba/0x1c0 Overflow: 1a [] ? iterate_dir+0x150/0x150 [] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor CC: stable@vger.kernel.org Signed-off-by: David Sterba Tested-by: Holger Hoffstätte Signed-off-by: Chris Mason Signed-off-by: Sasha Levin

cifs: fix erroneous return value

2016-03-04T15:18:40Z

[ Upstream commit 4b550af519854421dfec9f7732cdddeb057134b2 ] The setup_ntlmv2_rsp() function may return positive value ENOMEM instead of -ENOMEM in case of kmalloc failure. Signed-off-by: Anton Protopopov CC: Stable Signed-off-by: Steve French Signed-off-by: Sasha Levin

pty: make sure super_block is still valid in final /dev/tty close

2016-03-02T20:19:16Z

[ Upstream commit 1f55c718c290616889c04946864a13ef30f64929 ] Considering current pty code and multiple devpts instances, it's possible to umount a devpts file system while a program still has /dev/tty opened pointing to a previosuly closed pty pair in that instance. In the case all ptmx and pts/N files are closed, umount can be done. If the program closes /dev/tty after umount is done, devpts_kill_index will use now an invalid super_block, which was already destroyed in the umount operation after running ->kill_sb. This is another "use after free" type of issue, but now related to the allocated super_block instance. To avoid the problem (warning at ida_remove and potential crashes) for this specific case, I added two functions in devpts which grabs additional references to the super_block, which pty code now uses so it makes sure the super block structure is still valid until pty shutdown is done. I also moved the additional inode references to the same functions, which also covered similar case with inode being freed before /dev/tty final close/shutdown. Signed-off-by: Herton R. Krzesinski Cc: stable@vger.kernel.org # 2.6.29+ Reviewed-by: Peter Hurley Signed-off-by: Greg Kroah-Hartman Signed-off-by: Sasha Levin

Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl

2016-03-02T20:18:57Z

[ Upstream commit 0c0fe3b0fa45082cd752553fdb3a4b42503a118e ] While doing some tests I ran into an hang on an extent buffer's rwlock that produced the following trace: [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166] [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165] [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800016] irq event stamp: 0 [39389.800016] hardirqs last enabled at (0): [< (null)>] (null) [39389.800016] hardirqs last disabled at (0): [] copy_process+0x638/0x1a35 [39389.800016] softirqs last enabled at (0): [] copy_process+0x638/0x1a35 [39389.800016] softirqs last disabled at (0): [< (null)>] (null) [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000 [39389.800016] RIP: 0010:[] [] queued_spin_lock_slowpath+0x57/0x158 [39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202 [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101 [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001 [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000 [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98 [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40 [39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000 [39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800016] Stack: [39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0 [39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895 [39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c [39389.800016] Call Trace: [39389.800016] [] queued_read_lock_slowpath+0x46/0x60 [39389.800016] [] do_raw_read_lock+0x3e/0x41 [39389.800016] [] _raw_read_lock+0x3d/0x44 [39389.800016] [] ? btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [] btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [] ? btrfs_find_item+0xa7/0xd2 [btrfs] [39389.800016] [] btrfs_ref_to_path+0xd6/0x174 [btrfs] [39389.800016] [] inode_to_path+0x53/0xa2 [btrfs] [39389.800016] [] paths_from_inode+0x117/0x2ec [btrfs] [39389.800016] [] btrfs_ioctl+0xd5b/0x2793 [btrfs] [39389.800016] [] ? arch_local_irq_save+0x9/0xc [39389.800016] [] ? __this_cpu_preempt_check+0x13/0x15 [39389.800016] [] ? arch_local_irq_save+0x9/0xc [39389.800016] [] ? rcu_read_unlock+0x3e/0x5d [39389.800016] [] do_vfs_ioctl+0x42b/0x4ea [39389.800016] [] ? __fget_light+0x62/0x71 [39389.800016] [] SyS_ioctl+0x57/0x79 [39389.800016] [] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8 [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800012] irq event stamp: 0 [39389.800012] hardirqs last enabled at (0): [< (null)>] (null) [39389.800012] hardirqs last disabled at (0): [] copy_process+0x638/0x1a35 [39389.800012] softirqs last enabled at (0): [] copy_process+0x638/0x1a35 [39389.800012] softirqs last disabled at (0): [< (null)>] (null) [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1 [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000 [39389.800012] RIP: 0010:[] [] queued_write_lock_slowpath+0x62/0x72 [39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206 [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000 [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000 [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98 [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00 [39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000 [39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800012] Stack: [39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98 [39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00 [39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58 [39389.800012] Call Trace: [39389.800012] [] do_raw_write_lock+0x72/0x8c [39389.800012] [] _raw_write_lock+0x3a/0x41 [39389.800012] [] ? btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [] btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [] ? rcu_read_unlock+0x5b/0x5d [btrfs] [39389.800012] [] ? btrfs_root_node+0xda/0xe6 [btrfs] [39389.800012] [] btrfs_lock_root_node+0x22/0x42 [btrfs] [39389.800012] [] btrfs_search_slot+0x1b8/0x758 [btrfs] [39389.800012] [] ? time_hardirqs_on+0x15/0x28 [39389.800012] [] btrfs_lookup_inode+0x31/0x95 [btrfs] [39389.800012] [] ? trace_hardirqs_on+0xd/0xf [39389.800012] [] ? mutex_lock_nested+0x397/0x3bc [39389.800012] [] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs] [39389.800012] [] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs] [39389.800012] [] ? _raw_spin_unlock+0x31/0x44 [39389.800012] [] __btrfs_run_delayed_items+0xa4/0x15c [btrfs] [39389.800012] [] btrfs_run_delayed_items+0x11/0x13 [btrfs] [39389.800012] [] btrfs_commit_transaction+0x234/0x96e [btrfs] [39389.800012] [] btrfs_sync_fs+0x145/0x1ad [btrfs] [39389.800012] [] btrfs_ioctl+0x11d2/0x2793 [btrfs] [39389.800012] [] ? arch_local_irq_save+0x9/0xc [39389.800012] [] ? __might_fault+0x4c/0xa7 [39389.800012] [] ? __might_fault+0x4c/0xa7 [39389.800012] [] ? arch_local_irq_save+0x9/0xc [39389.800012] [] ? rcu_read_unlock+0x3e/0x5d [39389.800012] [] do_vfs_ioctl+0x42b/0x4ea [39389.800012] [] ? __fget_light+0x62/0x71 [39389.800012] [] SyS_ioctl+0x57/0x79 [39389.800012] [] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00 This happens because in the code path executed by the inode_paths ioctl we end up nesting two calls to read lock a leaf's rwlock when after the first call to read_lock() and before the second call to read_lock(), another task (running the delayed items as part of a transaction commit) has already called write_lock() against the leaf's rwlock. This situation is illustrated by the following diagram: Task A Task B btrfs_ref_to_path() btrfs_commit_transaction() read_lock(&eb->lock); btrfs_run_delayed_items() __btrfs_commit_inode_delayed_items() __btrfs_update_delayed_inode() btrfs_lookup_inode() write_lock(&eb->lock); --> task waits for lock read_lock(&eb->lock); --> makes this task hang forever (and task B too of course) So fix this by avoiding doing the nested read lock, which is easily avoidable. This issue does not happen if task B calls write_lock() after task A does the second call to read_lock(), however there does not seem to exist anything in the documentation that mentions what is the expected behaviour for recursive locking of rwlocks (leaving the idea that doing so is not a good usage of rwlocks). Also, as a side effect necessary for this fix, make sure we do not needlessly read lock extent buffers when the input path has skip_locking set (used when called from send). Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana Signed-off-by: Sasha Levin

ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup

2016-02-15T20:42:41Z

[ Upstream commit c95a51807b730e4681e2ecbdfd669ca52601959e ] When recovery master down, dlm_do_local_recovery_cleanup() only remove the $RECOVERY lock owned by dead node, but do not clear the refmap bit. Which will make umount thread falling in dead loop migrating $RECOVERY to the dead node. Signed-off-by: xuejiufei Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin