user/sven/linux.git/fs, branch v4.1.22

ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list

2016-04-18T12:51:09Z

[ Upstream commit be12b299a83fc807bbaccd2bcb8ec50cbb0cb55c ] When master handles convert request, it queues ast first and then returns status. This may happen that the ast is sent before the request status because the above two messages are sent by two threads. And right after the ast is sent, if master down, it may trigger BUG in dlm_move_lockres_to_recovery_list in the requested node because ast handler moves it to grant list without clear lock->convert_pending. So remove BUG_ON statement and check if the ast is processed in dlmconvert_remote. Signed-off-by: Joseph Qi Reported-by: Yiwen Jiang Cc: Junxiao Bi Cc: Mark Fasheh Cc: Joel Becker Cc: Tariq Saeed Cc: Junxiao Bi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin

ocfs2/dlm: fix race between convert and recovery

2016-04-18T12:51:08Z

[ Upstream commit ac7cf246dfdbec3d8fed296c7bf30e16f5099dac ] There is a race window between dlmconvert_remote and dlm_move_lockres_to_recovery_list, which will cause a lock with OCFS2_LOCK_BUSY in grant list, thus system hangs. dlmconvert_remote { spin_lock(&res->spinlock); list_move_tail(&lock->list, &res->converting); lock->convert_pending = 1; spin_unlock(&res->spinlock); status = dlm_send_remote_convert_request(); >>>>>> race window, master has queued ast and return DLM_NORMAL, and then down before sending ast. this node detects master down and calls dlm_move_lockres_to_recovery_list, which will revert the lock to grant list. Then OCFS2_LOCK_BUSY won't be cleared as new master won't send ast any more because it thinks already be authorized. spin_lock(&res->spinlock); lock->convert_pending = 0; if (status != DLM_NORMAL) dlm_revert_pending_convert(res, lock); spin_unlock(&res->spinlock); } In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res is still in recovering) or res master changed (new master has finished recovery), reset the status to DLM_RECOVERING, then it will retry convert. Signed-off-by: Joseph Qi Reported-by: Yiwen Jiang Reviewed-by: Junxiao Bi Cc: Mark Fasheh Cc: Joel Becker Cc: Tariq Saeed Cc: Junxiao Bi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin

fs/coredump: prevent fsuid=0 dumps into user-controlled directories

2016-04-18T12:51:07Z

[ Upstream commit 378c6520e7d29280f400ef2ceaf155c86f05a71a ] This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn Acked-by: Kees Cook Cc: Al Viro Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Oleg Nesterov Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin

coredump: Use 64bit time for unix time of coredump

2016-04-18T12:51:07Z

[ Upstream commit 03927c8acb63100046260711c06ba28b6b5936fb ] struct timeval on 32-bit systems will have its tv_sec value overflow in year 2038 and beyond. Use a 64 bit value to print time of the coredump in seconds. ktime_get_real_seconds is chosen here for efficiency reasons. Suggested by: Arnd Bergmann Signed-off-by: Tina Ruchandani Signed-off-by: Arnd Bergmann Signed-off-by: Al Viro Signed-off-by: Sasha Levin

splice: handle zero nr_pages in splice_to_pipe()

2016-04-18T12:51:05Z

[ Upstream commit d6785d9152147596f60234157da2b02540c3e60f ] Running the following command: busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null with any tracing enabled pretty very quickly leads to various NULL pointer dereferences and VM BUG_ON()s, such as these: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: [] generic_pipe_buf_release+0xc/0x40 Call Trace: [] splice_direct_to_actor+0x143/0x1e0 [] ? generic_pipe_buf_nosteal+0x10/0x10 [] do_splice_direct+0x8f/0xb0 [] do_sendfile+0x199/0x380 [] SyS_sendfile64+0x90/0xa0 [] entry_SYSCALL_64_fastpath+0x12/0x6d page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0) kernel BUG at include/linux/mm.h:367! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC RIP: [] generic_pipe_buf_release+0x3c/0x40 Call Trace: [] splice_direct_to_actor+0x143/0x1e0 [] ? generic_pipe_buf_nosteal+0x10/0x10 [] do_splice_direct+0x8f/0xb0 [] do_sendfile+0x199/0x380 [] SyS_sendfile64+0x90/0xa0 [] tracesys_phase2+0x84/0x89 (busybox's cat uses sendfile(2), unlike the coreutils version) This is because tracing_splice_read_pipe() can call splice_to_pipe() with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and we fill the page pointers and the other fields of the pipe_buffers with garbage. All other callers of splice_to_pipe() avoid calling it when nr_pages == 0, and we could make tracing_splice_read_pipe() do that too, but it seems reasonable to have splice_to_page() handle this condition gracefully. Cc: stable@vger.kernel.org Signed-off-by: Rabin Vincent Reviewed-by: Christoph Hellwig Signed-off-by: Al Viro Signed-off-by: Sasha Levin

vfs: show_vfsstat: do not ignore errors from show_devname method

2016-04-18T12:51:00Z

[ Upstream commit 5f8d498d4364f544fee17125787a47553db02afa ] Explicitly check show_devname method return code and bail out in case of an error. This fixes regression introduced by commit 9d4d65748a5c. Cc: stable@vger.kernel.org Signed-off-by: Dmitry V. Levin Signed-off-by: Al Viro Signed-off-by: Sasha Levin

nfsd: fix deadlock secinfo+readdir compound

2016-04-18T12:51:00Z

[ Upstream commit 2f6fc056e899bd0144a08da5cacaecbe8997cd74 ] nfsd_lookup_dentry exits with the parent filehandle locked. fh_put also unlocks if necessary (nfsd filehandle locking is probably too lenient), so it gets unlocked eventually, but if the following op in the compound needs to lock it again, we can deadlock. A fuzzer ran into this; normal clients don't send a secinfo followed by a readdir in the same compound. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields Signed-off-by: Sasha Levin

fuse: Add reference counting for fuse_io_priv

2016-04-18T12:50:57Z

[ Upstream commit 744742d692e37ad5c20630e57d526c8f2e2fe3c9 ] The 'reqs' member of fuse_io_priv serves two purposes. First is to track the number of oustanding async requests to the server and to signal that the io request is completed. The second is to be a reference count on the structure to know when it can be freed. For sync io requests these purposes can be at odds. fuse_direct_IO() wants to block until the request is done, and since the signal is sent when 'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to use the object after the userspace server has completed processing requests. This leads to some handshaking and special casing that it needlessly complicated and responsible for at least one race condition. It's much cleaner and safer to maintain a separate reference count for the object lifecycle and to let 'reqs' just be a count of outstanding requests to the userspace server. Then we can know for sure when it is safe to free the object without any handshaking or special cases. The catch here is that most of the time these objects are stack allocated and should not be freed. Initializing these objects with a single reference that is never released prevents accidental attempts to free the objects. Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally") Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Seth Forshee Signed-off-by: Miklos Szeredi Signed-off-by: Sasha Levin

fuse: do not use iocb after it may have been freed

2016-04-18T12:50:56Z

[ Upstream commit 7cabc61e01a0a8b663bd2b4c982aa53048218734 ] There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an iocb that could have been freed if async io has already completed. The fix in this case is simple and obvious: cache the result before starting io. It was discovered by KASan: kernel: ================================================================== kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390 Signed-off-by: Robert Doebbelin Signed-off-by: Miklos Szeredi Fixes: bcba24ccdc82 ("fuse: enable asynchronous processing direct IO") Cc: # 3.10+ Signed-off-by: Sasha Levin

jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path

2016-04-18T12:50:52Z

[ Upstream commit c0a2ad9b50dd80eeccd73d9ff962234590d5ec93 ] On umount path, jbd2_journal_destroy() writes latest transaction ID (->j_tail_sequence) to be used at next mount. The bug is that ->j_tail_sequence is not holding latest transaction ID in some cases. So, at next mount, there is chance to conflict with remaining (not overwritten yet) transactions. mount (id=10) write transaction (id=11) write transaction (id=12) umount (id=10) <= the bug doesn't write latest ID mount (id=10) write transaction (id=11) crash mount [recovery process] transaction (id=11) transaction (id=12) <= valid transaction ID, but old commit must not replay Like above, this bug become the cause of recovery failure, or FS corruption. So why ->j_tail_sequence doesn't point latest ID? Because if checkpoint transactions was reclaimed by memory pressure (i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated. (And another case is, __jbd2_journal_clean_checkpoint_list() is called with empty transaction.) So in above cases, ->j_tail_sequence is not pointing latest transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not done too. So, to fix this problem with minimum changes, this patch updates ->j_tail_sequence, and issue REQ_FLUSH. (With more complex changes, some optimizations would be possible to avoid unnecessary REQ_FLUSH for example though.) BTW, journal->j_tail_sequence = ++journal->j_transaction_sequence; Increment of ->j_transaction_sequence seems to be unnecessary, but ext3 does this. Signed-off-by: OGAWA Hirofumi Signed-off-by: Theodore Ts'o Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin