summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-01-27ext4: don't use the orphan list when migrating an inodeTheodore Ts'o
commit 6eeaf88fd586f05aaf1d48cb3a139d2a5c6eb055 upstream. We probably want to remove the indirect block to extents migration feature after a deprecation window, but until then, let's fix a potential data loss problem caused by the fact that we put the tmp_inode on the orphan list. In the unlikely case where we crash and do a journal recovery, the data blocks belonging to the inode being migrated are also represented in the tmp_inode on the orphan list --- and so its data blocks will get marked unallocated, and available for reuse. Instead, stop putting the tmp_inode on the oprhan list. So in the case where we crash while migrating the inode, we'll leak an inode, which is not a disaster. It will be easily fixed the next time we run fsck, and it's better than potentially having blocks getting claimed by two different files, and losing data as a result. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27ext4: Fix BUG_ON in ext4_bread when write quota dataYe Bin
commit 380a0091cab482489e9b19e07f2a166ad2b76d5c upstream. We got issue as follows when run syzkaller: [ 167.936972] EXT4-fs error (device loop0): __ext4_remount:6314: comm rep: Abort forced by user [ 167.938306] EXT4-fs (loop0): Remounting filesystem read-only [ 167.981637] Assertion failure in ext4_getblk() at fs/ext4/inode.c:847: '(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) || handle != NULL || create == 0' [ 167.983601] ------------[ cut here ]------------ [ 167.984245] kernel BUG at fs/ext4/inode.c:847! [ 167.984882] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI [ 167.985624] CPU: 7 PID: 2290 Comm: rep Tainted: G B 5.16.0-rc5-next-20211217+ #123 [ 167.986823] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 167.988590] RIP: 0010:ext4_getblk+0x17e/0x504 [ 167.989189] Code: c6 01 74 28 49 c7 c0 a0 a3 5c 9b b9 4f 03 00 00 48 c7 c2 80 9c 5c 9b 48 c7 c6 40 b6 5c 9b 48 c7 c7 20 a4 5c 9b e8 77 e3 fd ff <0f> 0b 8b 04 244 [ 167.991679] RSP: 0018:ffff8881736f7398 EFLAGS: 00010282 [ 167.992385] RAX: 0000000000000094 RBX: 1ffff1102e6dee75 RCX: 0000000000000000 [ 167.993337] RDX: 0000000000000001 RSI: ffffffff9b6e29e0 RDI: ffffed102e6dee66 [ 167.994292] RBP: ffff88816a076210 R08: 0000000000000094 R09: ffffed107363fa09 [ 167.995252] R10: ffff88839b1fd047 R11: ffffed107363fa08 R12: ffff88816a0761e8 [ 167.996205] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000001 [ 167.997158] FS: 00007f6a1428c740(0000) GS:ffff88839b000000(0000) knlGS:0000000000000000 [ 167.998238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 167.999025] CR2: 00007f6a140716c8 CR3: 0000000133216000 CR4: 00000000000006e0 [ 167.999987] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 168.000944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 168.001899] Call Trace: [ 168.002235] <TASK> [ 168.007167] ext4_bread+0xd/0x53 [ 168.007612] ext4_quota_write+0x20c/0x5c0 [ 168.010457] write_blk+0x100/0x220 [ 168.010944] remove_free_dqentry+0x1c6/0x440 [ 168.011525] free_dqentry.isra.0+0x565/0x830 [ 168.012133] remove_tree+0x318/0x6d0 [ 168.014744] remove_tree+0x1eb/0x6d0 [ 168.017346] remove_tree+0x1eb/0x6d0 [ 168.019969] remove_tree+0x1eb/0x6d0 [ 168.022128] qtree_release_dquot+0x291/0x340 [ 168.023297] v2_release_dquot+0xce/0x120 [ 168.023847] dquot_release+0x197/0x3e0 [ 168.024358] ext4_release_dquot+0x22a/0x2d0 [ 168.024932] dqput.part.0+0x1c9/0x900 [ 168.025430] __dquot_drop+0x120/0x190 [ 168.025942] ext4_clear_inode+0x86/0x220 [ 168.026472] ext4_evict_inode+0x9e8/0xa22 [ 168.028200] evict+0x29e/0x4f0 [ 168.028625] dispose_list+0x102/0x1f0 [ 168.029148] evict_inodes+0x2c1/0x3e0 [ 168.030188] generic_shutdown_super+0xa4/0x3b0 [ 168.030817] kill_block_super+0x95/0xd0 [ 168.031360] deactivate_locked_super+0x85/0xd0 [ 168.031977] cleanup_mnt+0x2bc/0x480 [ 168.033062] task_work_run+0xd1/0x170 [ 168.033565] do_exit+0xa4f/0x2b50 [ 168.037155] do_group_exit+0xef/0x2d0 [ 168.037666] __x64_sys_exit_group+0x3a/0x50 [ 168.038237] do_syscall_64+0x3b/0x90 [ 168.038751] entry_SYSCALL_64_after_hwframe+0x44/0xae In order to reproduce this problem, the following conditions need to be met: 1. Ext4 filesystem with no journal; 2. Filesystem image with incorrect quota data; 3. Abort filesystem forced by user; 4. umount filesystem; As in ext4_quota_write: ... if (EXT4_SB(sb)->s_journal && !handle) { ext4_msg(sb, KERN_WARNING, "Quota write (off=%llu, len=%llu)" " cancelled because transaction is not started", (unsigned long long)off, (unsigned long long)len); return -EIO; } ... We only check handle if NULL when filesystem has journal. There is need check handle if NULL even when filesystem has no journal. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211223015506.297766-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27ext4: set csum seed in tmp inode while migrating to extentsLuís Henriques
commit e81c9302a6c3c008f5c30beb73b38adb0170ff2d upstream. When migrating to extents, the temporary inode will have it's own checksum seed. This means that, when swapping the inodes data, the inode checksums will be incorrect. This can be fixed by recalculating the extents checksums again. Or simply by copying the seed into the temporary inode. Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357 Reported-by: Jeroen van Wolffelaar <jeroen@wolffelaar.nl> Signed-off-by: Luís Henriques <lhenriques@suse.de> Link: https://lore.kernel.org/r/20211214175058.19511-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffersPetr Cvachoucek
commit 3fea4d9d160186617ff40490ae01f4f4f36b28ff upstream. it seems freeing the write buffers in the error path of the ubifs_remount_rw() is wrong. It leads later to a kernel oops like this: [10016.431274] UBIFS (ubi0:0): start fixing up free space [10090.810042] UBIFS (ubi0:0): free space fixup complete [10090.814623] UBIFS error (ubi0:0 pid 512): ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4 [10101.915108] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 517 [10105.275498] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030 [10105.284352] Mem abort info: [10105.287160] ESR = 0x96000006 [10105.290252] EC = 0x25: DABT (current EL), IL = 32 bits [10105.295592] SET = 0, FnV = 0 [10105.298652] EA = 0, S1PTW = 0 [10105.301848] Data abort info: [10105.304723] ISV = 0, ISS = 0x00000006 [10105.308573] CM = 0, WnR = 0 [10105.311564] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f03d1000 [10105.318034] [0000000000000030] pgd=00000000f6cee003, pud=00000000f4884003, pmd=0000000000000000 [10105.326783] Internal error: Oops: 96000006 [#1] PREEMPT SMP [10105.332355] Modules linked in: ath10k_pci ath10k_core ath mac80211 libarc4 cfg80211 nvme nvme_core cryptodev(O) [10105.342468] CPU: 3 PID: 518 Comm: touch Tainted: G O 5.4.3 #1 [10105.349517] Hardware name: HYPEX CPU (DT) [10105.353525] pstate: 40000005 (nZcv daif -PAN -UAO) [10105.358324] pc : atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34 [10105.364596] lr : mutex_lock+0x1c/0x34 [10105.368253] sp : ffff000075633aa0 [10105.371563] x29: ffff000075633aa0 x28: 0000000000000001 [10105.376874] x27: ffff000076fa80c8 x26: 0000000000000004 [10105.382185] x25: 0000000000000030 x24: 0000000000000000 [10105.387495] x23: 0000000000000000 x22: 0000000000000038 [10105.392807] x21: 000000000000000c x20: ffff000076fa80c8 [10105.398119] x19: ffff000076fa8000 x18: 0000000000000000 [10105.403429] x17: 0000000000000000 x16: 0000000000000000 [10105.408741] x15: 0000000000000000 x14: fefefefefefefeff [10105.414052] x13: 0000000000000000 x12: 0000000000000fe0 [10105.419364] x11: 0000000000000fe0 x10: ffff000076709020 [10105.424675] x9 : 0000000000000000 x8 : 00000000000000a0 [10105.429986] x7 : ffff000076fa80f4 x6 : 0000000000000030 [10105.435297] x5 : 0000000000000000 x4 : 0000000000000000 [10105.440609] x3 : 0000000000000000 x2 : ffff00006f276040 [10105.445920] x1 : ffff000075633ab8 x0 : 0000000000000030 [10105.451232] Call trace: [10105.453676] atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34 [10105.459600] ubifs_garbage_collect+0xb4/0x334 [10105.463956] ubifs_budget_space+0x398/0x458 [10105.468139] ubifs_create+0x50/0x180 [10105.471712] path_openat+0x6a0/0x9b0 [10105.475284] do_filp_open+0x34/0x7c [10105.478771] do_sys_open+0x78/0xe4 [10105.482170] __arm64_sys_openat+0x1c/0x24 [10105.486180] el0_svc_handler+0x84/0xc8 [10105.489928] el0_svc+0x8/0xc [10105.492808] Code: 52800013 17fffffb d2800003 f9800011 (c85ffc05) [10105.498903] ---[ end trace 46b721d93267a586 ]--- To reproduce the problem: 1. Filesystem initially mounted read-only, free space fixup flag set. 2. mount -o remount,rw <mountpoint> 3. it takes some time (free space fixup running) ... try to terminate running mount by CTRL-C ... does not respond, only after free space fixup is complete ... then "ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4" 4. mount -o remount,rw <mountpoint> ... now finished instantly (fixup already done). 5. Create file or just unmount the filesystem and we get the oops. Cc: <stable@vger.kernel.org> Fixes: b50b9f408502 ("UBIFS: do not free write-buffers when in R/O mode") Signed-off-by: Petr Cvachoucek <cvachoucek@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27btrfs: remove BUG_ON(!eie) in find_parent_nodesJosef Bacik
[ Upstream commit 9f05c09d6baef789726346397438cca4ec43c3ee ] If we're looking for leafs that point to a data extent we want to record the extent items that point at our bytenr. At this point we have the reference and we know for a fact that this leaf should have a reference to our bytenr. However if there's some sort of corruption we may not find any references to our leaf, and thus could end up with eie == NULL. Replace this BUG_ON() with an ASSERT() and then return -EUCLEAN for the mortals. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27btrfs: remove BUG_ON() in find_parent_nodes()Josef Bacik
[ Upstream commit fcba0120edf88328524a4878d1d6f4ad39f2ec81 ] We search for an extent entry with .offset = -1, which shouldn't be a thing, but corruption happens. Add an ASSERT() for the developers, return -EUCLEAN for mortals. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27fs: dlm: filter user dlm messages for kernel locksAlexander Aring
[ Upstream commit 6c2e3bf68f3e5e5a647aa52be246d5f552d7496d ] This patch fixes the following crash by receiving a invalid message: [ 160.672220] ================================================================== [ 160.676206] BUG: KASAN: user-memory-access in dlm_user_add_ast+0xc3/0x370 [ 160.679659] Read of size 8 at addr 00000000deadbeef by task kworker/u32:13/319 [ 160.681447] [ 160.681824] CPU: 10 PID: 319 Comm: kworker/u32:13 Not tainted 5.14.0-rc2+ #399 [ 160.683472] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.14.0-1.module+el8.6.0+12648+6ede71a5 04/01/2014 [ 160.685574] Workqueue: dlm_recv process_recv_sockets [ 160.686721] Call Trace: [ 160.687310] dump_stack_lvl+0x56/0x6f [ 160.688169] ? dlm_user_add_ast+0xc3/0x370 [ 160.689116] kasan_report.cold.14+0x116/0x11b [ 160.690138] ? dlm_user_add_ast+0xc3/0x370 [ 160.690832] dlm_user_add_ast+0xc3/0x370 [ 160.691502] _receive_unlock_reply+0x103/0x170 [ 160.692241] _receive_message+0x11df/0x1ec0 [ 160.692926] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 160.693700] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 160.694427] ? lock_acquire+0x175/0x400 [ 160.695058] ? do_purge.isra.51+0x200/0x200 [ 160.695744] ? lock_acquired+0x360/0x5d0 [ 160.696400] ? lock_contended+0x6a0/0x6a0 [ 160.697055] ? lock_release+0x21d/0x5e0 [ 160.697686] ? lock_is_held_type+0xe0/0x110 [ 160.698352] ? lock_is_held_type+0xe0/0x110 [ 160.699026] ? ___might_sleep+0x1cc/0x1e0 [ 160.699698] ? dlm_wait_requestqueue+0x94/0x140 [ 160.700451] ? dlm_process_requestqueue+0x240/0x240 [ 160.701249] ? down_write_killable+0x2b0/0x2b0 [ 160.701988] ? do_raw_spin_unlock+0xa2/0x130 [ 160.702690] dlm_receive_buffer+0x1a5/0x210 [ 160.703385] dlm_process_incoming_buffer+0x726/0x9f0 [ 160.704210] receive_from_sock+0x1c0/0x3b0 [ 160.704886] ? dlm_tcp_shutdown+0x30/0x30 [ 160.705561] ? lock_acquire+0x175/0x400 [ 160.706197] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 160.706941] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 160.707681] process_recv_sockets+0x32/0x40 [ 160.708366] process_one_work+0x55e/0xad0 [ 160.709045] ? pwq_dec_nr_in_flight+0x110/0x110 [ 160.709820] worker_thread+0x65/0x5e0 [ 160.710423] ? process_one_work+0xad0/0xad0 [ 160.711087] kthread+0x1ed/0x220 [ 160.711628] ? set_kthread_struct+0x80/0x80 [ 160.712314] ret_from_fork+0x22/0x30 The issue is that we received a DLM message for a user lock but the destination lock is a kernel lock. Note that the address which is trying to derefence is 00000000deadbeef, which is in a kernel lock lkb->lkb_astparam, this field should never be derefenced by the DLM kernel stack. In case of a user lock lkb->lkb_astparam is lkb->lkb_ua (memory is shared by a union field). The struct lkb_ua will be handled by the DLM kernel stack but on a kernel lock it will contain invalid data and ends in most likely crashing the kernel. It can be reproduced with two cluster nodes. node 2: dlm_tool join test echo "862 fooobaar 1 2 1" > /sys/kernel/debug/dlm/test_locks echo "862 3 1" > /sys/kernel/debug/dlm/test_waiters node 1: dlm_tool join test python: foo = DLM(h_cmd=3, o_nextcmd=1, h_nodeid=1, h_lockspace=0x77222027, \ m_type=7, m_flags=0x1, m_remid=0x862, m_result=0xFFFEFFFE) newFile = open("/sys/kernel/debug/dlm/comms/2/rawmsg", "wb") newFile.write(bytes(foo)) Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27ext4: avoid trim error on fs with small groupsJan Kara
[ Upstream commit 173b6e383d2a204c9921ffc1eca3b87aa2106c33 ] A user reported FITRIM ioctl failing for him on ext4 on some devices without apparent reason. After some debugging we've found out that these devices (being LVM volumes) report rather large discard granularity of 42MB and the filesystem had 1k blocksize and thus group size of 8MB. Because ext4 FITRIM implementation puts discard granularity into minlen, ext4_trim_fs() declared the trim request as invalid. However just silently doing nothing seems to be a more appropriate reaction to such combination of parameters since user did not specify anything wrong. CC: Lukas Czerner <lczerner@redhat.com> Fixes: 5c2ed62fd447 ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211112152202.26614-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-11xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocateDarrick J. Wong
commit 983d8e60f50806f90534cc5373d0ce867e5aaf79 upstream. The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at the end of files, just like fallocate and RESVSP. Make the behavior consistent with the other ioctls. Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-22nfsd: fix use-after-free due to delegation raceJ. Bruce Fields
commit 548ec0805c399c65ed66c6641be467f717833ab5 upstream. A delegation break could arrive as soon as we've called vfs_setlease. A delegation break runs a callback which immediately (in nfsd4_cb_recall_prepare) adds the delegation to del_recall_lru. If we then exit nfs4_set_delegation without hashing the delegation, it will be freed as soon as the callback is done with it, without ever being removed from del_recall_lru. Symptoms show up later as use-after-free or list corruption warnings, usually in the laundromat thread. I suspect aba2072f4523 "nfsd: grant read delegations to clients holding writes" made this bug easier to hit, but I looked as far back as v3.0 and it looks to me it already had the same problem. So I'm not sure where the bug was introduced; it may have been there from the beginning. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> [Salvatore Bonaccorso: Backport for context changes to versions which do not have 20b7d86f29d3 ("nfsd: use boottime for lease expiry calculation")] Signed-off-by: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-14tracefs: Set all files to the same group ownership as the mount optionSteven Rostedt (VMware)
commit 48b27b6b5191e2e1f2798cd80877b6e4ef47c351 upstream. As people have been asking to allow non-root processes to have access to the tracefs directory, it was considered best to only allow groups to have access to the directory, where it is easier to just set the tracefs file system to a specific group (as other would be too dangerous), and that way the admins could pick which processes would have access to tracefs. Unfortunately, this broke tooling on Android that expected the other bit to be set. For some special cases, for non-root tools to trace the system, tracefs would be mounted and change the permissions of the top level directory which gave access to all running tasks permission to the tracing directory. Even though this would be dangerous to do in a production environment, for testing environments this can be useful. Now with the new changes to not allow other (which is still the proper thing to do), it breaks the testing tooling. Now more code needs to be loaded on the system to change ownership of the tracing directory. The real solution is to have tracefs honor the gid=xxx option when mounting. That is, (tracing group tracing has value 1003) mount -t tracefs -o gid=1003 tracefs /sys/kernel/tracing should have it that all files in the tracing directory should be of the given group. Copy the logic from d_walk() from dcache.c and simplify it for the mount case of tracefs if gid is set. All the files in tracefs will be walked and their group will be set to the value passed in. Link: https://lkml.kernel.org/r/20211207171729.2a54e1b3@gandalf.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reported-by: Kalesh Singh <kaleshsingh@google.com> Reported-by: Yabin Cui <yabinc@google.com> Fixes: 49d67e445742 ("tracefs: Have tracefs directories not set OTH permission bits by default") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-14signalfd: use wake_up_pollfree()Eric Biggers
commit 9537bae0da1f8d1e2361ab6d0479e8af7824e160 upstream. wake_up_poll() uses nr_exclusive=1, so it's not guaranteed to wake up all exclusive waiters. Yet, POLLFREE *must* wake up all waiters. epoll and aio poll are fortunately not affected by this, but it's very fragile. Thus, the new function wake_up_pollfree() has been introduced. Convert signalfd to use wake_up_pollfree(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: d80e731ecab4 ("epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211209010455.42744-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-14tracefs: Have new files inherit the ownership of their parentSteven Rostedt (VMware)
commit ee7f3666995d8537dec17b1d35425f28877671a9 upstream. If directories in tracefs have their ownership changed, then any new files and directories that are created under those directories should inherit the ownership of the director they are created in. Link: https://lkml.kernel.org/r/20211208075720.4855d180@gandalf.local.home Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Yabin Cui <yabinc@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: stable@vger.kernel.org Fixes: 4282d60689d4f ("tracefs: Add new tracefs file system") Reported-by: Kalesh Singh <kaleshsingh@google.com> Reported: https://lore.kernel.org/all/CAC_TJve8MMAv+H_NdLSJXZUSoxOEq2zB_pVaJ9p=7H6Bu3X76g@mail.gmail.com/ Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08fget: check that the fd still exists after getting a ref to itLinus Torvalds
commit 054aa8d439b9185d4f5eb9a90282d1ce74772969 upstream. Jann Horn points out that there is another possible race wrt Unix domain socket garbage collection, somewhat reminiscent of the one fixed in commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK"). See the extended comment about the garbage collection requirements added to unix_peek_fds() by that commit for details. The race comes from how we can locklessly look up a file descriptor just as it is in the process of being closed, and with the right artificial timing (Jann added a few strategic 'mdelay(500)' calls to do that), the Unix domain socket garbage collector could see the reference count decrement of the close() happen before fget() took its reference to the file and the file was attached onto a new file descriptor. This is all (intentionally) correct on the 'struct file *' side, with RCU lookups and lockless reference counting very much part of the design. Getting that reference count out of order isn't a problem per se. But the garbage collector can get confused by seeing this situation of having seen a file not having any remaining external references and then seeing it being attached to an fd. In commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK") the fix was to serialize the file descriptor install with the garbage collector by taking and releasing the unix_gc_lock. That's not really an option here, but since this all happens when we are in the process of looking up a file descriptor, we can instead simply just re-check that the file hasn't been closed in the meantime, and just re-do the lookup if we raced with a concurrent close() of the same file descriptor. Reported-and-tested-by: Jann Horn <jannh@google.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08fs: add fget_many() and fput_many()Jens Axboe
commit 091141a42e15fe47ada737f3996b317072afcefb upstream. Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08fuse: release pipe buf after last useMiklos Szeredi
commit 473441720c8616dfaf4451f9c7ea14f0eb5e5d65 upstream. Checking buf->flags should be done before the pipe_buf_release() is called on the pipe buffer, since releasing the buffer might modify the flags. This is exactly what page_cache_pipe_buf_release() does, and which results in the same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was trying to fix. Reported-by: Justin Forbes <jmforbes@linuxtx.org> Fixes: 712a951025c0 ("fuse: fix page stealing") Cc: <stable@vger.kernel.org> # v2.6.35 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08fuse: fix page stealingMiklos Szeredi
commit 712a951025c0667ff00b25afc360f74e639dfabe upstream. It is possible to trigger a crash by splicing anon pipe bufs to the fuse device. The reason for this is that anon_pipe_buf_release() will reuse buf->page if the refcount is 1, but that page might have already been stolen and its flags modified (e.g. PG_lru added). This happens in the unlikely case of fuse_dev_splice_write() getting around to calling pipe_buf_release() after a page has been stolen, added to the page cache and removed from the page cache. Fix by calling pipe_buf_release() right after the page was inserted into the page cache. In this case the page has an elevated refcount so any release function will know that the page isn't reusable. Reported-by: Frank Dinoff <fdinoff@google.com> Link: https://lore.kernel.org/r/CAAmZXrsGg2xsP1CK+cbuEMumtrqdvD-NKnWzhNcvn71RV3c1yw@mail.gmail.com/ Fixes: dd3bb14f44a6 ("fuse: support splice() writing to fuse device") Cc: <stable@vger.kernel.org> # v2.6.35 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08proc/vmcore: fix clearing user buffer by properly using clear_user()David Hildenbrand
commit c1e63117711977cc4295b2ce73de29dd17066c82 upstream. To clear a user buffer we cannot simply use memset, we have to use clear_user(). With a virtio-mem device that registers a vmcore_cb and has some logically unplugged memory inside an added Linux memory block, I can easily trigger a BUG by copying the vmcore via "cp": systemd[1]: Starting Kdump Vmcore Save Service... kdump[420]: Kdump is using the default log level(3). kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/ kdump[465]: saving vmcore-dmesg.txt complete kdump[467]: saving vmcore BUG: unable to handle page fault for address: 00007f2374e01000 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86 Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81 RSP: 0018:ffffc9000073be08 EFLAGS: 00010212 RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000 RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008 RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50 R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000 R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8 FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0 Call Trace: read_vmcore+0x236/0x2c0 proc_reg_read+0x55/0xa0 vfs_read+0x95/0x190 ksys_read+0x4f/0xc0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access Prevention (SMAP)", which is used to detect wrong access from the kernel to user buffers like this: SMAP triggers a permissions violation on wrong access. In the x86-64 variant of clear_user(), SMAP is properly handled via clac()+stac(). To fix, properly use clear_user() when we're dealing with a user buffer. Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08NFSv42: Don't fail clone() unless the OP_CLONE operation failedTrond Myklebust
[ Upstream commit d3c45824ad65aebf765fcf51366d317a29538820 ] The failure to retrieve post-op attributes has no bearing on whether or not the clone operation itself was successful. We must therefore ignore the return value of decode_getfattr() when looking at the success or failure of nfs4_xdr_dec_clone(). Fixes: 36022770de6c ("nfs42: add CLONE xdr functions") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-26btrfs: fix memory ordering between normal and ordered work functionsNikolay Borisov
commit 45da9c1767ac31857df572f0a909fbe88fd5a7e9 upstream. Ordered work functions aren't guaranteed to be handled by the same thread which executed the normal work functions. The only way execution between normal/ordered functions is synchronized is via the WORK_DONE_BIT, unfortunately the used bitops don't guarantee any ordering whatsoever. This manifested as seemingly inexplicable crashes on ARM64, where async_chunk::inode is seen as non-null in async_cow_submit which causes submit_compressed_extents to be called and crash occurs because async_chunk::inode suddenly became NULL. The call trace was similar to: pc : submit_compressed_extents+0x38/0x3d0 lr : async_cow_submit+0x50/0xd0 sp : ffff800015d4bc20 <registers omitted for brevity> Call trace: submit_compressed_extents+0x38/0x3d0 async_cow_submit+0x50/0xd0 run_ordered_work+0xc8/0x280 btrfs_work_helper+0x98/0x250 process_one_work+0x1f0/0x4ac worker_thread+0x188/0x504 kthread+0x110/0x114 ret_from_fork+0x10/0x18 Fix this by adding respective barrier calls which ensure that all accesses preceding setting of WORK_DONE_BIT are strictly ordered before setting the flag. At the same time add a read barrier after reading of WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads would be strictly ordered after reading the bit. This in turn ensures are all accesses before WORK_DONE_BIT are going to be strictly ordered before any access that can occur in ordered_func. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: 08a9ff326418 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue") CC: stable@vger.kernel.org # 4.4+ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26JFS: fix memleak in jfs_mountDongliang Mu
[ Upstream commit c48a14dca2cb57527dde6b960adbe69953935f10 ] In jfs_mount, when diMount(ipaimap2) fails, it goes to errout35. However, the following code does not free ipaimap2 allocated by diReadSpecial. Fix this by refactoring the error handling code of jfs_mount. To be specific, modify the lable name and free ipaimap2 when the above error ocurrs. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-26tracefs: Have tracefs directories not set OTH permission bits by defaultSteven Rostedt (VMware)
[ Upstream commit 49d67e445742bbcb03106b735b2ab39f6e5c56bc ] The tracefs file system is by default mounted such that only root user can access it. But there are legitimate reasons to create a group and allow those added to the group to have access to tracing. By changing the permissions of the tracefs mount point to allow access, it will allow group access to the tracefs directory. There should not be any real reason to allow all access to the tracefs directory as it contains sensitive information. Have the default permission of directories being created not have any OTH (other) bits set, such that an admin that wants to give permission to a group has to first disable all OTH bits in the file system. Link: https://lkml.kernel.org/r/20210818153038.664127804@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-26quota: correct error number in free_dqentry()Zhang Yi
commit d0e36a62bd4c60c09acc40e06ba4831a4d0bc75b upstream. Fix the error path in free_dqentry(), pass out the error number if the block to free is not correct. Fixes: 1ccd14b9c271 ("quota: Split off quota tree handling into a separate file") Link: https://lore.kernel.org/r/20211008093821.1001186-3-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26quota: check block number when reading the block in quota fileZhang Yi
commit 9bf3d20331295b1ecb81f4ed9ef358c51699a050 upstream. The block number in the quota tree on disk should be smaller than the v2_disk_dqinfo.dqi_blocks. If the quota file was corrupted, we may be allocating an 'allocated' block and that would lead to a loop in a tree, which will probably trigger oops later. This patch adds a check for the block number in the quota tree to prevent such potential issue. Link: https://lore.kernel.org/r/20211008093821.1001186-2-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26btrfs: fix lost error handling when replaying directory deletesFilipe Manana
commit 10adb1152d957a4d570ad630f93a88bb961616c1 upstream. At replay_dir_deletes(), if find_dir_range() returns an error we break out of the main while loop and then assign a value of 0 (success) to the 'ret' variable, resulting in completely ignoring that an error happened. Fix that by jumping to the 'out' label when find_dir_range() returns an error (negative value). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26ocfs2: fix data corruption on truncateJan Kara
commit 839b63860eb3835da165642923120d305925561d upstream. Patch series "ocfs2: Truncate data corruption fix". As further testing has shown, commit 5314454ea3f ("ocfs2: fix data corruption after conversion from inline format") didn't fix all the data corruption issues the customer started observing after 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") This time I have tracked them down to two bugs in ocfs2 truncation code. One bug (truncating page cache before clearing tail cluster and setting i_size) could cause data corruption even before 6dbf7bb55598, but before that commit it needed a race with page fault, after 6dbf7bb55598 it started to be pretty deterministic. Another bug (zeroing pages beyond old i_size) used to be harmless inefficiency before commit 6dbf7bb55598. But after commit 6dbf7bb55598 in combination with the first bug it resulted in deterministic data corruption. Although fixing only the first problem is needed to stop data corruption, I've fixed both issues to make the code more robust. This patch (of 2): ocfs2_truncate_file() did unmap invalidate page cache pages before zeroing partial tail cluster and setting i_size. Thus some pages could be left (and likely have left if the cluster zeroing happened) in the page cache beyond i_size after truncate finished letting user possibly see stale data once the file was extended again. Also the tail cluster zeroing was not guaranteed to finish before truncate finished causing possible stale data exposure. The problem started to be particularly easy to hit after commit 6dbf7bb55598 "fs: Don't invalidate page buffers in block_write_full_page()" stopped invalidation of pages beyond i_size from page writeback path. Fix these problems by unmapping and invalidating pages in the page cache after the i_size is reduced and tail cluster is zeroed out. Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-12isofs: Fix out of bound access for corrupted isofs imageJan Kara
commit e96a1866b40570b5950cda8602c2819189c62a48 upstream. When isofs image is suitably corrupted isofs_read_inode() can read data beyond the end of buffer. Sanity-check the directory entry length before using it. Reported-and-tested-by: syzbot+6fc7fb214625d82af7d1@syzkaller.appspotmail.com CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27ovl: fix missing negative dentry check in ovl_rename()Zheng Liang
commit a295aef603e109a47af355477326bd41151765b6 upstream. The following reproducer mkdir lower upper work merge touch lower/old touch lower/new mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge rm merge/new mv merge/old merge/new & unlink upper/new may result in this race: PROCESS A: rename("merge/old", "merge/new"); overwrite=true,ovl_lower_positive(old)=true, ovl_dentry_is_whiteout(new)=true -> flags |= RENAME_EXCHANGE PROCESS B: unlink("upper/new"); PROCESS A: lookup newdentry in new_upperdir call vfs_rename() with negative newdentry and RENAME_EXCHANGE Fix by adding the missing check for negative newdentry. Signed-off-by: Zheng Liang <zhengliang6@huawei.com> Fixes: e9be9d5e76e3 ("overlay filesystem") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Masami Ichikawa(CIP) <masami.ichikawa@cybertrust.co.jp>
2021-10-27ocfs2: mount fails with buffer overflow in strlenValentin Vidic
commit b15fa9224e6e1239414525d8d556d824701849fc upstream. Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the trace below. Problem seems to be that strings for cluster stack and cluster name are not guaranteed to be null terminated in the disk representation, while strlcpy assumes that the source string is always null terminated. This causes a read outside of the source string triggering the buffer overflow detection. detected buffer overflow in strlen ------------[ cut here ]------------ kernel BUG at lib/string.c:1149! invalid opcode: 0000 [#1] SMP PTI CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1 Debian 5.14.6-2 RIP: 0010:fortify_panic+0xf/0x11 ... Call Trace: ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2] ocfs2_fill_super+0x359/0x19b0 [ocfs2] mount_bdev+0x185/0x1b0 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x454/0xa20 __x64_sys_mount+0x103/0x140 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-27NFSD: Keep existing listeners on portlist errorBenjamin Coddington
[ Upstream commit c20106944eb679fa3ab7e686fe5f6ba30fbc51e5 ] If nfsd has existing listening sockets without any processes, then an error returned from svc_create_xprt() for an additional transport will remove those existing listeners. We're seeing this in practice when userspace attempts to create rpcrdma transports without having the rpcrdma modules present before creating nfsd kernel processes. Fix this by checking for existing sockets before calling nfsd_destroy(). Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-17nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zeroTrond Myklebust
commit f2e717d655040d632c9015f19aa4275f8b16e7f2 upstream. RFC3530 notes that the 'dircount' field may be zero, in which case the recommendation is to ignore it, and only enforce the 'maxcount' field. In RFC5661, this recommendation to ignore a zero valued field becomes a requirement. Fixes: aee377644146 ("nfsd4: fix rd_dircount enforcement") Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-09ext2: fix sleeping in atomic bugs on errorDan Carpenter
[ Upstream commit 372d1f3e1bfede719864d0d1fbf3146b1e638c88 ] The ext2_error() function syncs the filesystem so it sleeps. The caller is holding a spinlock so it's not allowed to sleep. ext2_statfs() <- disables preempt -> ext2_count_free_blocks() -> ext2_get_group_desc() Fix this by using WARN() to print an error message and a stack trace instead of using ext2_error(). Link: https://lore.kernel.org/r/20210921203233.GA16529@kili Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-06ext4: fix potential infinite loop in ext4_dx_readdir()yangerkun
commit 42cb447410d024e9d54139ae9c21ea132a8c384c upstream. When ext4_htree_fill_tree() fails, ext4_dx_readdir() can run into an infinite loop since if info->last_pos != ctx->pos this will reset the directory scan and reread the failing entry. For example: 1. a dx_dir which has 3 block, block 0 as dx_root block, block 1/2 as leaf block which own the ext4_dir_entry_2 2. block 1 read ok and call_filldir which will fill the dirent and update the ctx->pos 3. block 2 read fail, but we has already fill some dirent, so we will return back to userspace will a positive return val(see ksys_getdents64) 4. the second ext4_dx_readdir will reset the world since info->last_pos != ctx->pos, and will also init the curr_hash which pos to block 1 5. So we will read block1 too, and once block2 still read fail, we can only fill one dirent because the hash of the entry in block1(besides the last one) won't greater than curr_hash 6. this time, we forget update last_pos too since the read for block2 will fail, and since we has got the one entry, ksys_getdents64 can return success 7. Latter we will trapped in a loop with step 4~6 Cc: stable@kernel.org Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210914111415.3921954-1-yangerkun@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-06qnx4: work around gcc false positive warning bugLinus Torvalds
commit d5f6545934c47e97c0b48a645418e877b452a992 upstream. In commit b7213ffa0e58 ("qnx4: avoid stringop-overread errors") I tried to teach gcc about how the directory entry structure can be two different things depending on a status flag. It made the code clearer, and it seemed to make gcc happy. However, Arnd points to a gcc bug, where despite using two different members of a union, gcc then gets confused, and uses the size of one of the members to decide if a string overrun happens. And not necessarily the rigth one. End result: with some configurations, gcc-11 will still complain about the source buffer size being overread: fs/qnx4/dir.c: In function 'qnx4_readdir': fs/qnx4/dir.c:76:32: error: 'strnlen' specified bound [16, 48] exceeds source size 1 [-Werror=stringop-overread] 76 | size = strnlen(name, size); | ^~~~~~~~~~~~~~~~~~~ fs/qnx4/dir.c:26:22: note: source object declared here 26 | char de_name; | ^~~~~~~ because gcc will get confused about which union member entry is actually getting accessed, even when the source code is very clear about it. Gcc internally will have combined two "redundant" pointers (pointing to different union elements that are at the same offset), and takes the size checking from one or the other - not necessarily the right one. This is clearly a gcc bug, but we can work around it fairly easily. The biggest thing here is the big honking comment about why we do what we do. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578#c6 Reported-and-tested-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-06qnx4: avoid stringop-overread errorsLinus Torvalds
[ Upstream commit b7213ffa0e585feb1aee3e7173e965e66ee0abaa ] The qnx4 directory entries are 64-byte blocks that have different contents depending on the a status byte that is in the last byte of the block. In particular, a directory entry can be either a "link info" entry with a 48-byte name and pointers to the real inode information, or an "inode entry" with a smaller 16-byte name and the full inode information. But the code was written to always just treat the directory name as if it was part of that "inode entry", and just extend the name to the longer case if the status byte said it was a link entry. That work just fine and gives the right results, but now that gcc is tracking data structure accesses much more, the code can trigger a compiler error about using up to 48 bytes (the long name) in a structure that only has that shorter name in it: fs/qnx4/dir.c: In function ‘qnx4_readdir’: fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds source size 16 [-Werror=stringop-overread] 51 | size = strnlen(de->di_fname, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from fs/qnx4/qnx4.h:3, from fs/qnx4/dir.c:16: include/uapi/linux/qnx4_fs.h:45:25: note: source object declared here 45 | char di_fname[QNX4_SHORT_NAME_MAX]; | ^~~~~~~~ which is because the source code doesn't really make this whole "one of two different types" explicit. Fix this by introducing a very explicit union of the two types, and basically explaining to the compiler what is really going on. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-06cifs: fix incorrect check for null pointer in header_assembleSteve French
commit 9ed38fd4a15417cac83967360cf20b853bfab9b6 upstream. Although very unlikely that the tlink pointer would be null in this case, get_next_mid function can in theory return null (but not an error) so need to check for null (not for IS_ERR, which can not be returned here). Address warning: fs/smbfs_client/connect.c:2392 cifs_match_super() warn: 'tlink' isn't an ERR_PTR Pointed out by Dan Carpenter via smatch code analysis tool CC: stable@vger.kernel.org Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-26nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_groupNanyong Sun
[ Upstream commit 17243e1c3072b8417a5ebfc53065d0a87af7ca77 ] kobject_put() should be used to cleanup the memory associated with the kobject instead of kobject_del(). See the section "Kobject removal" of "Documentation/core-api/kobject.rst". Link: https://lkml.kernel.org/r/20210629022556.3985106-7-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-7-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_groupNanyong Sun
[ Upstream commit b2fe39c248f3fa4bbb2a20759b4fdd83504190f7 ] If kobject_init_and_add returns with error, kobject_put() is needed here to avoid memory leak, because kobject_init_and_add may return error without freeing the memory associated with the kobject it allocated. Link: https://lkml.kernel.org/r/20210629022556.3985106-6-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-6-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_groupNanyong Sun
[ Upstream commit a3e181259ddd61fd378390977a1e4e2316853afa ] The kobject_put() should be used to cleanup the memory associated with the kobject instead of kobject_del. See the section "Kobject removal" of "Documentation/core-api/kobject.rst". Link: https://lkml.kernel.org/r/20210629022556.3985106-5-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-5-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26nilfs2: fix memory leak in nilfs_sysfs_create_##name##_groupNanyong Sun
[ Upstream commit 24f8cb1ed057c840728167dab33b32e44147c86f ] If kobject_init_and_add return with error, kobject_put() is needed here to avoid memory leak, because kobject_init_and_add may return error without freeing the memory associated with the kobject it allocated. Link: https://lkml.kernel.org/r/20210629022556.3985106-4-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-4-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26nilfs2: fix NULL pointer in nilfs_##name##_attr_releaseNanyong Sun
[ Upstream commit dbc6e7d44a514f231a64d9d5676e001b660b6448 ] In nilfs_##name##_attr_release, kobj->parent should not be referenced because it is a NULL pointer. The release() method of kobject is always called in kobject_put(kobj), in the implementation of kobject_put(), the kobj->parent will be assigned as NULL before call the release() method. So just use kobj to get the subgroups, which is more efficient and can fix a NULL pointer reference problem. Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26nilfs2: fix memory leak in nilfs_sysfs_create_device_groupNanyong Sun
[ Upstream commit 5f5dec07aca7067216ed4c1342e464e7307a9197 ] Patch series "nilfs2: fix incorrect usage of kobject". This patchset from Nanyong Sun fixes memory leak issues and a NULL pointer dereference issue caused by incorrect usage of kboject in nilfs2 sysfs implementation. This patch (of 6): Reported by syzkaller: BUG: memory leak unreferenced object 0xffff888100ca8988 (size 8): comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s) hex dump (first 8 bytes): 6c 6f 6f 70 31 00 ff ff loop1... backtrace: kstrdup+0x36/0x70 mm/util.c:60 kstrdup_const+0x35/0x60 mm/util.c:83 kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48 kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xc9/0x150 lib/kobject.c:473 nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986 init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637 nilfs_fill_super fs/nilfs2/super.c:1046 [inline] nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316 legacy_get_tree+0x105/0x210 fs/fs_context.c:592 vfs_get_tree+0x8e/0x2d0 fs/super.c:1498 do_new_mount fs/namespace.c:2905 [inline] path_mount+0xf9b/0x1990 fs/namespace.c:3235 do_mount+0xea/0x100 fs/namespace.c:3248 __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount fs/namespace.c:3433 [inline] __x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae If kobject_init_and_add return with error, then the cleanup of kobject is needed because memory may be allocated in kobject_init_and_add without freeing. And the place of cleanup_dev_kobject should use kobject_put to free the memory associated with the kobject. As the section "Kobject removal" of "Documentation/core-api/kobject.rst" says, kobject_del() just makes the kobject "invisible", but it is not cleaned up. And no more cleanup will do after cleanup_dev_kobject, so kobject_put is needed here. Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com Reported-by: Hulk Robot <hulkci@huawei.com> Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26ceph: lockdep annotations for try_nonblocking_invalidateJeff Layton
[ Upstream commit 3eaf5aa1cfa8c97c72f5824e2e9263d6cc977b03 ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22cifs: fix wrong release in sess_alloc_buffer() failed pathDing Hui
[ Upstream commit d72c74197b70bc3c95152f351a568007bffa3e11 ] smb_buf is allocated by small_smb_init_no_tc(), and buf type is CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to release it in failed path. Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22gfs2: Don't call dlm after protocol is unmountedBob Peterson
[ Upstream commit d1340f80f0b8066321b499a376780da00560e857 ] In the gfs2 withdraw sequence, the dlm protocol is unmounted with a call to lm_unmount. After a withdraw, users are allowed to unmount the withdrawn file system. But at that point we may still have glocks left over that we need to free via unmount's call to gfs2_gl_hash_clear. These glocks may have never been completed because of whatever problem caused the withdraw (IO errors or whatever). Before this patch, function gdlm_put_lock would still try to call into dlm to unlock these leftover glocks, which resulted in dlm returning -EINVAL because the lock space was abandoned. These glocks were never freed because there was no mechanism after that to free them. This patch adds a check to gdlm_put_lock to see if the locking protocol was inactive (DFL_UNMOUNT flag) and if so, free the glock and not make the invalid call into dlm. I could have combined this "if" with the one that follows, related to leftover glock LVBs, but I felt the code was more readable with its own if clause. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22CIFS: Fix a potencially linear read overflowLen Baker
[ Upstream commit f980d055a0f858d73d9467bb0b570721bbfcdfb8 ] strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated. Also, the strnlen() call does not avoid the read overflow in the strlcpy function when a not NUL-terminated string is passed. So, replace this block by a call to kstrndup() that avoids this type of overflow and does the same. Fixes: 066ce6899484d ("cifs: rename cifs_strlcpy_to_host and make it use new functions") Signed-off-by: Len Baker <len.baker@gmx.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22udf_get_extendedattr() had no boundary checks.Stian Skjelstad
[ Upstream commit 58bc6d1be2f3b0ceecb6027dfa17513ec6aa2abb ] When parsing the ExtendedAttr data, malicous or corrupt attribute length could cause kernel hangs and buffer overruns in some special cases. Link: https://lore.kernel.org/r/20210822093332.25234-1-stian.skjelstad@gmail.com Signed-off-by: Stian Skjelstad <stian.skjelstad@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22Revert "btrfs: compression: don't try to compress if we don't have enough pages"Qu Wenruo
commit 4e9655763b82a91e4c341835bb504a2b1590f984 upstream. This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763. [BUG] It's no longer possible to create compressed inline extent after commit f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22ext4: fix race writing to an inline_data file while its xattrs are changingTheodore Ts'o
commit a54c4613dac1500b40e4ab55199f7c51f028e848 upstream. The location of the system.data extended attribute can change whenever xattr_sem is not taken. So we need to recalculate the i_inline_off field since it mgiht have changed between ext4_write_begin() and ext4_write_end(). This means that caching i_inline_off is probably not helpful, so in the long run we should probably get rid of it and shrink the in-memory ext4 inode slightly, but let's fix the race the simple way for now. Cc: stable@kernel.org Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data") Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-15ovl: prevent private clone if bind mount is not allowedMiklos Szeredi
commit 427215d85e8d1476da1a86b8d67aceb485eb3631 upstream. Add the following checks from __do_loopback() to clone_private_mount() as well: - verify that the mount is in the current namespace - verify that there are no locked children Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de> Fixes: c771d683a62e ("vfs: introduce clone_private_mount()") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>