From 820a185896b77814557302b981b092a9e7b36814 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 24 Jul 2024 15:15:35 +0200 Subject: fcntl: add F_CREATED_QUERY Systemd has a helper called openat_report_new() that returns whether a file was created anew or it already existed before for cases where O_CREAT has to be used without O_EXCL (cf. [1]). That apparently isn't something that's specific to systemd but it's where I noticed it. The current logic is that it first attempts to open the file without O_CREAT | O_EXCL and if it gets ENOENT the helper tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. The caller could try and use fanotify() to register to listen for creation events in the directory before calling openat(). The caller could then compare the returned tid to its own tid to ensure that even in threaded environments it actually created the file. That might work but is a lot of work for something that should be fairly simple and I'm uncertain about it's reliability. The caller could use a bpf lsm hook to hook into security_file_open() to figure out whether they created the file. That also seems a bit wild. So let's add F_CREATED_QUERY which allows the caller to check whether they actually did create the file. That has caveats of course but I don't think they are problematic: * In multi-threaded environments a thread can only be sure that it did create the file if it calls openat() with O_CREAT. In other words, it's obviously not enough to just go through it's fdtable and check these fds because another thread could've created the file. * If there's any codepaths where an openat() with O_CREAT would yield the same struct file as that of another thread it would obviously cause wrong results. I'm not aware of any such codepaths from openat() itself. Imho, that would be a bug. * Related to the previous point, calling the new fcntl() on files created and opened via special-purpose system calls or ioctl()s would cause wrong results only if the affected subsystem a) raises FMODE_CREATED and b) may return the same struct file for two different calls. I'm not seeing anything outside of regular VFS code that raises FMODE_CREATED. There is code for b) in e.g., the drm layer where the same struct file is resurfaced but again FMODE_CREATED isn't used and it would be very misleading if it did. Link: https://github.com/systemd/systemd/blob/11d5e2b5fbf9f6bfa5763fd45b56829ad4f0777f/src/basic/fs-util.c#L1078 [1] Link: https://lore.kernel.org/r/20240724-work-fcntl-v1-1-e8153a2f1991@kernel.org Reviewed-by: Jeff Layton Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/uapi/linux/fcntl.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include') diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index c0bcc185fa48..e55a3314bcb0 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -16,6 +16,9 @@ #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) +/* Was the file just created? */ +#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4) + /* * Cancel a blocking posix lock; internal use only until we expose an * asynchronous lock api to userspace: -- cgit v1.2.3 From c393eaa85349941badf3ce5f087400dbaf3bbe02 Mon Sep 17 00:00:00 2001 From: Haifeng Xu Date: Fri, 26 Jul 2024 11:05:25 +0800 Subject: fs: don't flush in-flight wb switches for superblocks without cgroup writeback When deactivating any type of superblock, it had to wait for the in-flight wb switches to be completed. wb switches are executed in inode_switch_wbs_work_fn() which needs to acquire the wb_switch_rwsem and races against sync_inodes_sb(). If there are too much dirty data in the superblock, the waiting time may increase significantly. For superblocks without cgroup writeback such as tmpfs, they have nothing to do with the wb swithes, so the flushing can be avoided. Signed-off-by: Haifeng Xu Link: https://lore.kernel.org/r/20240726030525.180330-1-haifeng.xu@shopee.com Reviewed-by: Jan Kara Suggested-by: Jan Kara Signed-off-by: Christian Brauner --- fs/fs-writeback.c | 7 ++++++- fs/super.c | 2 +- include/linux/writeback.h | 4 ++-- 3 files changed, 9 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index b865a3fa52f3..4451ecff37c4 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1132,6 +1132,7 @@ out_bdi_put: /** * cgroup_writeback_umount - flush inode wb switches for umount + * @sb: target super_block * * This function is called when a super_block is about to be destroyed and * flushes in-flight inode wb switches. An inode wb switch goes through @@ -1140,8 +1141,12 @@ out_bdi_put: * rare occurrences and synchronize_rcu() can take a while, perform * flushing iff wb switches are in flight. */ -void cgroup_writeback_umount(void) +void cgroup_writeback_umount(struct super_block *sb) { + + if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK)) + return; + /* * SB_ACTIVE should be reliably cleared before checking * isw_nr_in_flight, see generic_shutdown_super(). diff --git a/fs/super.c b/fs/super.c index 38d72a3cf6fc..28f42502bdf9 100644 --- a/fs/super.c +++ b/fs/super.c @@ -621,7 +621,7 @@ void generic_shutdown_super(struct super_block *sb) sync_filesystem(sb); sb->s_flags &= ~SB_ACTIVE; - cgroup_writeback_umount(); + cgroup_writeback_umount(sb); /* Evict all inodes with zero refcount. */ evict_inodes(sb); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 1a54676d843a..56b85841ae4c 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -217,7 +217,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page, size_t bytes); int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, enum wb_reason reason, struct wb_completion *done); -void cgroup_writeback_umount(void); +void cgroup_writeback_umount(struct super_block *sb); bool cleanup_offline_cgwb(struct bdi_writeback *wb); /** @@ -324,7 +324,7 @@ static inline void wbc_account_cgroup_owner(struct writeback_control *wbc, { } -static inline void cgroup_writeback_umount(void) +static inline void cgroup_writeback_umount(struct super_block *sb) { } -- cgit v1.2.3 From c01a5d89e5c8abe638107be2a4ea9e4c7fcdd7f6 Mon Sep 17 00:00:00 2001 From: Wang Long Date: Fri, 2 Aug 2024 17:19:01 +0800 Subject: percpu-rwsem: remove the unused parameter 'read' In the function percpu_rwsem_release, the parameter `read` is unused, so remove it. Signed-off-by: Wang Long Link: https://lore.kernel.org/r/20240802091901.2546797-1-w@laoqinren.net Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- fs/super.c | 2 +- include/linux/fs.h | 2 +- include/linux/percpu-rwsem.h | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/fs/super.c b/fs/super.c index 28f42502bdf9..33f35abb1990 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1905,7 +1905,7 @@ static void lockdep_sb_freeze_release(struct super_block *sb) int level; for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--) - percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_); + percpu_rwsem_release(sb->s_writers.rw_sem + level, _THIS_IP_); } /* diff --git a/include/linux/fs.h b/include/linux/fs.h index fb0426f349fc..c0286d479c9d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1683,7 +1683,7 @@ static inline bool __sb_start_write_trylock(struct super_block *sb, int level) #define __sb_writers_acquired(sb, lev) \ percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_) #define __sb_writers_release(sb, lev) \ - percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_) + percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], _THIS_IP_) /** * __sb_write_started - check if sb freeze level is held diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 36b942b67b7d..c012df33a9f0 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -145,7 +145,7 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *); #define percpu_rwsem_assert_held(sem) lockdep_assert_held(sem) static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem, - bool read, unsigned long ip) + unsigned long ip) { lock_release(&sem->dep_map, ip); } -- cgit v1.2.3 From 087adb4f0f91ee330446a70af899e6a996e5cc13 Mon Sep 17 00:00:00 2001 From: Mateusz Guzik Date: Tue, 6 Aug 2024 19:28:46 +0200 Subject: vfs: dodge smp_mb in break_lease and break_deleg in the common case These inlines show up in the fast path (e.g., in do_dentry_open()) and induce said full barrier regarding i_flctx access when in most cases the pointer is NULL. The pointer can be safely checked before issuing the barrier, dodging it in most cases as a result. It is plausible the consume fence would be sufficient, but I don't want to go audit all callers regarding what they before calling here. Signed-off-by: Mateusz Guzik Link: https://lore.kernel.org/r/20240806172846.886570-1-mjguzik@gmail.com Reviewed-by: Jeff Layton Signed-off-by: Christian Brauner --- include/linux/filelock.h | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/filelock.h b/include/linux/filelock.h index daee999d05f3..bb44224c6676 100644 --- a/include/linux/filelock.h +++ b/include/linux/filelock.h @@ -420,28 +420,38 @@ static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl) #ifdef CONFIG_FILE_LOCKING static inline int break_lease(struct inode *inode, unsigned int mode) { + struct file_lock_context *flctx; + /* * Since this check is lockless, we must ensure that any refcounts * taken are done before checking i_flctx->flc_lease. Otherwise, we * could end up racing with tasks trying to set a new lease on this * file. */ + flctx = READ_ONCE(inode->i_flctx); + if (!flctx) + return 0; smp_mb(); - if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) + if (!list_empty_careful(&flctx->flc_lease)) return __break_lease(inode, mode, FL_LEASE); return 0; } static inline int break_deleg(struct inode *inode, unsigned int mode) { + struct file_lock_context *flctx; + /* * Since this check is lockless, we must ensure that any refcounts * taken are done before checking i_flctx->flc_lease. Otherwise, we * could end up racing with tasks trying to set a new lease on this * file. */ + flctx = READ_ONCE(inode->i_flctx); + if (!flctx) + return 0; smp_mb(); - if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) + if (!list_empty_careful(&flctx->flc_lease)) return __break_lease(inode, mode, FL_DELEG); return 0; } -- cgit v1.2.3 From 8447d848e1dc6d42d1dcd00f133d7715fc732c47 Mon Sep 17 00:00:00 2001 From: Mateusz Guzik Date: Sat, 10 Aug 2024 08:47:53 +0200 Subject: vfs: only read fops once in fops_get/put In do_dentry_open() the usage is: f->f_op = fops_get(inode->i_fop); In generated asm the compiler emits 2 reads from inode->i_fop instead of just one. This popped up due to false-sharing where loads from that offset end up bouncing a cacheline during parallel open. While this is going to be fixed, the spurious load does not need to be there. This makes do_dentry_open() go down from 1177 to 1154 bytes. fops_put() is patched to maintain some consistency. No functional changes. Reviewed-by: Josef Bacik Signed-off-by: Mateusz Guzik Link: https://lore.kernel.org/r/20240810064753.1211441-1-mjguzik@gmail.com Signed-off-by: Christian Brauner --- include/linux/fs.h | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/fs.h b/include/linux/fs.h index c0286d479c9d..696c620f90e0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2553,10 +2553,17 @@ struct super_block *sget(struct file_system_type *type, struct super_block *sget_dev(struct fs_context *fc, dev_t dev); /* Alas, no aliases. Too much hassle with bringing module.h everywhere */ -#define fops_get(fops) \ - (((fops) && try_module_get((fops)->owner) ? (fops) : NULL)) -#define fops_put(fops) \ - do { if (fops) module_put((fops)->owner); } while(0) +#define fops_get(fops) ({ \ + const struct file_operations *_fops = (fops); \ + (((_fops) && try_module_get((_fops)->owner) ? (_fops) : NULL)); \ +}) + +#define fops_put(fops) ({ \ + const struct file_operations *_fops = (fops); \ + if (_fops) \ + module_put((_fops)->owner); \ +}) + /* * This one is to be used *ONLY* from ->open() instances. * fops must be non-NULL, pinned down *and* module dependencies -- cgit v1.2.3 From 641bb4394f405cba498b100b44541ffc0aed5be1 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 9 Aug 2024 12:38:56 +0200 Subject: fs: move FMODE_UNSIGNED_OFFSET to fop_flags This is another flag that is statically set and doesn't need to use up an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit. (1) mem_open() used from proc_mem_operations (2) adi_open() used from adi_fops (3) drm_open_helper(): (3.1) accel_open() used from DRM_ACCEL_FOPS (3.2) drm_open() used from (3.2.1) amdgpu_driver_kms_fops (3.2.2) psb_gem_fops (3.2.3) i915_driver_fops (3.2.4) nouveau_driver_fops (3.2.5) panthor_drm_driver_fops (3.2.6) radeon_driver_kms_fops (3.2.7) tegra_drm_fops (3.2.8) vmwgfx_driver_fops (3.2.9) xe_driver_fops (3.2.10) DRM_GEM_FOPS (3.2.11) DEFINE_DRM_GEM_DMA_FOPS (4) struct memdev sets fmode flags based on type of device opened. For devices using struct mem_fops unsigned offset is used. Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts into the open helper to ensure that the flag is always set. Link: https://lore.kernel.org/r/20240809-work-fop_unsigned-v1-1-658e054d893e@kernel.org Reviewed-by: Jeff Layton Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- drivers/char/adi.c | 8 +------- drivers/char/mem.c | 3 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 + drivers/gpu/drm/drm_file.c | 3 ++- drivers/gpu/drm/gma500/psb_drv.c | 1 + drivers/gpu/drm/i915/i915_driver.c | 1 + drivers/gpu/drm/nouveau/nouveau_drm.c | 1 + drivers/gpu/drm/radeon/radeon_drv.c | 1 + drivers/gpu/drm/tegra/drm.c | 1 + drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 1 + drivers/gpu/drm/xe/xe_device.c | 1 + fs/proc/base.c | 10 ++++------ fs/read_write.c | 2 +- include/drm/drm_accel.h | 3 ++- include/drm/drm_gem.h | 3 ++- include/drm/drm_gem_dma_helper.h | 1 + include/linux/fs.h | 5 +++-- mm/mmap.c | 2 +- 18 files changed, 27 insertions(+), 21 deletions(-) (limited to 'include') diff --git a/drivers/char/adi.c b/drivers/char/adi.c index 751d7cc0da1b..1c76c8758f0f 100644 --- a/drivers/char/adi.c +++ b/drivers/char/adi.c @@ -14,12 +14,6 @@ #define MAX_BUF_SZ PAGE_SIZE -static int adi_open(struct inode *inode, struct file *file) -{ - file->f_mode |= FMODE_UNSIGNED_OFFSET; - return 0; -} - static int read_mcd_tag(unsigned long addr) { long err; @@ -206,9 +200,9 @@ static loff_t adi_llseek(struct file *file, loff_t offset, int whence) static const struct file_operations adi_fops = { .owner = THIS_MODULE, .llseek = adi_llseek, - .open = adi_open, .read = adi_read, .write = adi_write, + .fop_flags = FOP_UNSIGNED_OFFSET, }; static struct miscdevice adi_miscdev = { diff --git a/drivers/char/mem.c b/drivers/char/mem.c index 7c359cc406d5..169eed162a7f 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -643,6 +643,7 @@ static const struct file_operations __maybe_unused mem_fops = { .get_unmapped_area = get_unmapped_area_mem, .mmap_capabilities = memory_mmap_capabilities, #endif + .fop_flags = FOP_UNSIGNED_OFFSET, }; static const struct file_operations null_fops = { @@ -693,7 +694,7 @@ static const struct memdev { umode_t mode; } devlist[] = { #ifdef CONFIG_DEVMEM - [DEVMEM_MINOR] = { "mem", &mem_fops, FMODE_UNSIGNED_OFFSET, 0 }, + [DEVMEM_MINOR] = { "mem", &mem_fops, 0, 0 }, #endif [3] = { "null", &null_fops, FMODE_NOWAIT, 0666 }, #ifdef CONFIG_DEVPORT diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 094498a0964b..d7ef8cbecf6c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -2908,6 +2908,7 @@ static const struct file_operations amdgpu_driver_kms_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = drm_show_fdinfo, #endif + .fop_flags = FOP_UNSIGNED_OFFSET, }; int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv) diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c index 714e42b05108..f8de3cba1a08 100644 --- a/drivers/gpu/drm/drm_file.c +++ b/drivers/gpu/drm/drm_file.c @@ -318,6 +318,8 @@ int drm_open_helper(struct file *filp, struct drm_minor *minor) if (dev->switch_power_state != DRM_SWITCH_POWER_ON && dev->switch_power_state != DRM_SWITCH_POWER_DYNAMIC_OFF) return -EINVAL; + if (WARN_ON_ONCE(!(filp->f_op->fop_flags & FOP_UNSIGNED_OFFSET))) + return -EINVAL; drm_dbg_core(dev, "comm=\"%s\", pid=%d, minor=%d\n", current->comm, task_pid_nr(current), minor->index); @@ -335,7 +337,6 @@ int drm_open_helper(struct file *filp, struct drm_minor *minor) } filp->private_data = priv; - filp->f_mode |= FMODE_UNSIGNED_OFFSET; priv->filp = filp; mutex_lock(&dev->filelist_mutex); diff --git a/drivers/gpu/drm/gma500/psb_drv.c b/drivers/gpu/drm/gma500/psb_drv.c index 8b64f61ffaf9..d67c2b3ad901 100644 --- a/drivers/gpu/drm/gma500/psb_drv.c +++ b/drivers/gpu/drm/gma500/psb_drv.c @@ -498,6 +498,7 @@ static const struct file_operations psb_gem_fops = { .mmap = drm_gem_mmap, .poll = drm_poll, .read = drm_read, + .fop_flags = FOP_UNSIGNED_OFFSET, }; static const struct drm_driver driver = { diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c index fb8e9c2fcea5..cf276299bccb 100644 --- a/drivers/gpu/drm/i915/i915_driver.c +++ b/drivers/gpu/drm/i915/i915_driver.c @@ -1671,6 +1671,7 @@ static const struct file_operations i915_driver_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = drm_show_fdinfo, #endif + .fop_flags = FOP_UNSIGNED_OFFSET, }; static int diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index a58c31089613..e243b42f8582 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -1274,6 +1274,7 @@ nouveau_driver_fops = { .compat_ioctl = nouveau_compat_ioctl, #endif .llseek = noop_llseek, + .fop_flags = FOP_UNSIGNED_OFFSET, }; static struct drm_driver diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c index 7bf08164140e..ac49779ed03d 100644 --- a/drivers/gpu/drm/radeon/radeon_drv.c +++ b/drivers/gpu/drm/radeon/radeon_drv.c @@ -520,6 +520,7 @@ static const struct file_operations radeon_driver_kms_fops = { #ifdef CONFIG_COMPAT .compat_ioctl = radeon_kms_compat_ioctl, #endif + .fop_flags = FOP_UNSIGNED_OFFSET, }; static const struct drm_ioctl_desc radeon_ioctls_kms[] = { diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index 03d1c76aec2d..108c26a33edb 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -801,6 +801,7 @@ static const struct file_operations tegra_drm_fops = { .read = drm_read, .compat_ioctl = drm_compat_ioctl, .llseek = noop_llseek, + .fop_flags = FOP_UNSIGNED_OFFSET, }; static int tegra_drm_context_cleanup(int id, void *p, void *data) diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c index 50ad3105c16e..2825dd3149ed 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.c @@ -1609,6 +1609,7 @@ static const struct file_operations vmwgfx_driver_fops = { .compat_ioctl = vmw_compat_ioctl, #endif .llseek = noop_llseek, + .fop_flags = FOP_UNSIGNED_OFFSET, }; static const struct drm_driver driver = { diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index f2f1d8ddb221..2e5537d6608d 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -238,6 +238,7 @@ static const struct file_operations xe_driver_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = drm_show_fdinfo, #endif + .fop_flags = FOP_UNSIGNED_OFFSET, }; static struct drm_driver driver = { diff --git a/fs/proc/base.c b/fs/proc/base.c index 72a1acd03675..1409d1003101 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -827,12 +827,9 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode) static int mem_open(struct inode *inode, struct file *file) { - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); - - /* OK to pass negative loff_t, we can catch out-of-range */ - file->f_mode |= FMODE_UNSIGNED_OFFSET; - - return ret; + if (WARN_ON_ONCE(!(file->f_op->fop_flags & FOP_UNSIGNED_OFFSET))) + return -EINVAL; + return __mem_open(inode, file, PTRACE_MODE_ATTACH); } static ssize_t mem_rw(struct file *file, char __user *buf, @@ -932,6 +929,7 @@ static const struct file_operations proc_mem_operations = { .write = mem_write, .open = mem_open, .release = mem_release, + .fop_flags = FOP_UNSIGNED_OFFSET, }; static int environ_open(struct inode *inode, struct file *file) diff --git a/fs/read_write.c b/fs/read_write.c index 90e283b31ca1..89d4af0e3b93 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -36,7 +36,7 @@ EXPORT_SYMBOL(generic_ro_fops); static inline bool unsigned_offsets(struct file *file) { - return file->f_mode & FMODE_UNSIGNED_OFFSET; + return file->f_op->fop_flags & FOP_UNSIGNED_OFFSET; } /** diff --git a/include/drm/drm_accel.h b/include/drm/drm_accel.h index f4d3784b1dce..41c78b7d712c 100644 --- a/include/drm/drm_accel.h +++ b/include/drm/drm_accel.h @@ -28,7 +28,8 @@ .poll = drm_poll,\ .read = drm_read,\ .llseek = noop_llseek, \ - .mmap = drm_gem_mmap + .mmap = drm_gem_mmap, \ + .fop_flags = FOP_UNSIGNED_OFFSET /** * DEFINE_DRM_ACCEL_FOPS() - macro to generate file operations for accelerators drivers diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index bae4865b2101..d8b86df2ec0d 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -447,7 +447,8 @@ struct drm_gem_object { .poll = drm_poll,\ .read = drm_read,\ .llseek = noop_llseek,\ - .mmap = drm_gem_mmap + .mmap = drm_gem_mmap, \ + .fop_flags = FOP_UNSIGNED_OFFSET /** * DEFINE_DRM_GEM_FOPS() - macro to generate file operations for GEM drivers diff --git a/include/drm/drm_gem_dma_helper.h b/include/drm/drm_gem_dma_helper.h index a827bde494f6..f2678e7ecb98 100644 --- a/include/drm/drm_gem_dma_helper.h +++ b/include/drm/drm_gem_dma_helper.h @@ -267,6 +267,7 @@ unsigned long drm_gem_dma_get_unmapped_area(struct file *filp, .read = drm_read,\ .llseek = noop_llseek,\ .mmap = drm_gem_mmap,\ + .fop_flags = FOP_UNSIGNED_OFFSET, \ DRM_GEM_DMA_UNMAPPED_AREA_FOPS \ } diff --git a/include/linux/fs.h b/include/linux/fs.h index 696c620f90e0..0145bda465ff 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -146,8 +146,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* Expect random access pattern */ #define FMODE_RANDOM ((__force fmode_t)(1 << 12)) -/* File is huge (eg. /dev/mem): treat loff_t as unsigned */ -#define FMODE_UNSIGNED_OFFSET ((__force fmode_t)(1 << 13)) +/* FMODE_* bit 13 */ /* File is opened with O_PATH; almost nothing can be done with it */ #define FMODE_PATH ((__force fmode_t)(1 << 14)) @@ -2073,6 +2072,8 @@ struct file_operations { #define FOP_DIO_PARALLEL_WRITE ((__force fop_flags_t)(1 << 3)) /* Contains huge pages */ #define FOP_HUGE_PAGES ((__force fop_flags_t)(1 << 4)) +/* Treat loff_t as unsigned (e.g., /dev/mem) */ +#define FOP_UNSIGNED_OFFSET ((__force fop_flags_t)(1 << 5)) /* Wrap a directory iterator that needs exclusive inode access */ int wrap_directory_iterator(struct file *, struct dir_context *, diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..6ddb278a5ee8 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1229,7 +1229,7 @@ static inline u64 file_mmap_size_max(struct file *file, struct inode *inode) return MAX_LFS_FILESIZE; /* Special "we do even unsigned file positions" case */ - if (file->f_mode & FMODE_UNSIGNED_OFFSET) + if (file->f_op->fop_flags & FOP_UNSIGNED_OFFSET) return 0; /* Yes, random drivers might want more. But I'm tired of buggy drivers */ -- cgit v1.2.3 From 433f9d76a01056dfeaefc15167b11e514e56f956 Mon Sep 17 00:00:00 2001 From: Ian Kent Date: Wed, 14 Aug 2024 17:02:31 +0800 Subject: autofs: add per dentry expire timeout Add ability to set per-dentry mount expire timeout to autofs. There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount). To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Now I have a request to add the "nounmount" option so I need to add the per-dentry expire handling to the kernel implementation to do this. The implementation uses the trailing path component to identify the mount (and is also used as the autofs map key) which is passed in the autofs_dev_ioctl structure path field. The expire timeout is passed in autofs_dev_ioctl timeout field (well, of the timeout union). If the passed in timeout is equal to -1 the per-dentry timeout and flag are cleared providing for the "unmount" option. If the timeout is greater than or equal to 0 the timeout is set to the value and the flag is also set. If the dentry timeout is 0 the dentry will not expire by timeout which enables the implementation of the "nounmount" option for the specific mount. When the dentry timeout is greater than zero it allows for the implementation of the "utimeout=" option. Signed-off-by: Ian Kent Link: https://lore.kernel.org/r/20240814090231.963520-1-raven@themaw.net Signed-off-by: Christian Brauner --- fs/autofs/autofs_i.h | 4 ++ fs/autofs/dev-ioctl.c | 97 +++++++++++++++++++++++++++++++++++++++++--- fs/autofs/expire.c | 7 +++- fs/autofs/inode.c | 2 + include/uapi/linux/auto_fs.h | 2 +- 5 files changed, 104 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/fs/autofs/autofs_i.h b/fs/autofs/autofs_i.h index 8c1d587b3eef..77c7991d89aa 100644 --- a/fs/autofs/autofs_i.h +++ b/fs/autofs/autofs_i.h @@ -62,6 +62,7 @@ struct autofs_info { struct list_head expiring; struct autofs_sb_info *sbi; + unsigned long exp_timeout; unsigned long last_used; int count; @@ -81,6 +82,9 @@ struct autofs_info { */ #define AUTOFS_INF_PENDING (1<<2) /* dentry pending mount */ +#define AUTOFS_INF_EXPIRE_SET (1<<3) /* per-dentry expire timeout set for + this mount point. + */ struct autofs_wait_queue { wait_queue_head_t queue; struct autofs_wait_queue *next; diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c index 5bf781ea6d67..f011e026358e 100644 --- a/fs/autofs/dev-ioctl.c +++ b/fs/autofs/dev-ioctl.c @@ -128,7 +128,13 @@ static int validate_dev_ioctl(int cmd, struct autofs_dev_ioctl *param) goto out; } + /* Setting the per-dentry expire timeout requires a trailing + * path component, ie. no '/', so invert the logic of the + * check_name() return for AUTOFS_DEV_IOCTL_TIMEOUT_CMD. + */ err = check_name(param->path); + if (cmd == AUTOFS_DEV_IOCTL_TIMEOUT_CMD) + err = err ? 0 : -EINVAL; if (err) { pr_warn("invalid path supplied for cmd(0x%08x)\n", cmd); @@ -396,16 +402,97 @@ static int autofs_dev_ioctl_catatonic(struct file *fp, return 0; } -/* Set the autofs mount timeout */ +/* + * Set the autofs mount expire timeout. + * + * There are two places an expire timeout can be set, in the autofs + * super block info. (this is all that's needed for direct and offset + * mounts because there's a distinct mount corresponding to each of + * these) and per-dentry within within the dentry info. If a per-dentry + * timeout is set it will override the expire timeout set in the parent + * autofs super block info. + * + * If setting the autofs super block expire timeout the autofs_dev_ioctl + * size field will be equal to the autofs_dev_ioctl structure size. If + * setting the per-dentry expire timeout the mount point name is passed + * in the autofs_dev_ioctl path field and the size field updated to + * reflect this. + * + * Setting the autofs mount expire timeout sets the timeout in the super + * block info. struct. Setting the per-dentry timeout does a little more. + * If the timeout is equal to -1 the per-dentry timeout (and flag) is + * cleared which reverts to using the super block timeout, otherwise if + * timeout is 0 the timeout is set to this value and the flag is left + * set which disables expiration for the mount point, lastly the flag + * and the timeout are set enabling the dentry to use this timeout. + */ static int autofs_dev_ioctl_timeout(struct file *fp, struct autofs_sb_info *sbi, struct autofs_dev_ioctl *param) { - unsigned long timeout; + unsigned long timeout = param->timeout.timeout; + + /* If setting the expire timeout for an individual indirect + * mount point dentry the mount trailing component path is + * placed in param->path and param->size adjusted to account + * for it otherwise param->size it is set to the structure + * size. + */ + if (param->size == AUTOFS_DEV_IOCTL_SIZE) { + param->timeout.timeout = sbi->exp_timeout / HZ; + sbi->exp_timeout = timeout * HZ; + } else { + struct dentry *base = fp->f_path.dentry; + struct inode *inode = base->d_inode; + int path_len = param->size - AUTOFS_DEV_IOCTL_SIZE - 1; + struct dentry *dentry; + struct autofs_info *ino; + + if (!autofs_type_indirect(sbi->type)) + return -EINVAL; + + /* An expire timeout greater than the superblock timeout + * could be a problem at shutdown but the super block + * timeout itself can change so all we can really do is + * warn the user. + */ + if (timeout >= sbi->exp_timeout) + pr_warn("per-mount expire timeout is greater than " + "the parent autofs mount timeout which could " + "prevent shutdown\n"); + + inode_lock_shared(inode); + dentry = try_lookup_one_len(param->path, base, path_len); + inode_unlock_shared(inode); + if (IS_ERR_OR_NULL(dentry)) + return dentry ? PTR_ERR(dentry) : -ENOENT; + ino = autofs_dentry_ino(dentry); + if (!ino) { + dput(dentry); + return -ENOENT; + } + + if (ino->exp_timeout && ino->flags & AUTOFS_INF_EXPIRE_SET) + param->timeout.timeout = ino->exp_timeout / HZ; + else + param->timeout.timeout = sbi->exp_timeout / HZ; + + if (timeout == -1) { + /* Revert to using the super block timeout */ + ino->flags &= ~AUTOFS_INF_EXPIRE_SET; + ino->exp_timeout = 0; + } else { + /* Set the dentry expire flag and timeout. + * + * If timeout is 0 it will prevent the expire + * of this particular automount. + */ + ino->flags |= AUTOFS_INF_EXPIRE_SET; + ino->exp_timeout = timeout * HZ; + } + dput(dentry); + } - timeout = param->timeout.timeout; - param->timeout.timeout = sbi->exp_timeout / HZ; - sbi->exp_timeout = timeout * HZ; return 0; } diff --git a/fs/autofs/expire.c b/fs/autofs/expire.c index 39d8c84c16f4..5c2d459e1e48 100644 --- a/fs/autofs/expire.c +++ b/fs/autofs/expire.c @@ -429,8 +429,6 @@ static struct dentry *autofs_expire_indirect(struct super_block *sb, if (!root) return NULL; - timeout = sbi->exp_timeout; - dentry = NULL; while ((dentry = get_next_positive_subdir(dentry, root))) { spin_lock(&sbi->fs_lock); @@ -441,6 +439,11 @@ static struct dentry *autofs_expire_indirect(struct super_block *sb, } spin_unlock(&sbi->fs_lock); + if (ino->flags & AUTOFS_INF_EXPIRE_SET) + timeout = ino->exp_timeout; + else + timeout = sbi->exp_timeout; + expired = should_expire(dentry, mnt, timeout, how); if (!expired) continue; diff --git a/fs/autofs/inode.c b/fs/autofs/inode.c index 64faa6c51f60..ee2edccaef70 100644 --- a/fs/autofs/inode.c +++ b/fs/autofs/inode.c @@ -19,6 +19,7 @@ struct autofs_info *autofs_new_ino(struct autofs_sb_info *sbi) INIT_LIST_HEAD(&ino->expiring); ino->last_used = jiffies; ino->sbi = sbi; + ino->exp_timeout = -1; ino->count = 1; } return ino; @@ -28,6 +29,7 @@ void autofs_clean_ino(struct autofs_info *ino) { ino->uid = GLOBAL_ROOT_UID; ino->gid = GLOBAL_ROOT_GID; + ino->exp_timeout = -1; ino->last_used = jiffies; } diff --git a/include/uapi/linux/auto_fs.h b/include/uapi/linux/auto_fs.h index 1f7925afad2d..8081df849743 100644 --- a/include/uapi/linux/auto_fs.h +++ b/include/uapi/linux/auto_fs.h @@ -23,7 +23,7 @@ #define AUTOFS_MIN_PROTO_VERSION 3 #define AUTOFS_MAX_PROTO_VERSION 5 -#define AUTOFS_PROTO_SUBVERSION 5 +#define AUTOFS_PROTO_SUBVERSION 6 /* * The wait_queue_token (autofs_wqt_t) is part of a structure which is passed -- cgit v1.2.3 From 1c48d441468c425e878d81c5c3e7ff8a9a594cc0 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 16 Aug 2024 16:35:52 +0200 Subject: inode: remove __I_DIO_WAKEUP Afaict, we can just rely on inode->i_dio_count for waiting instead of this awkward indirection through __I_DIO_WAKEUP. This survives LTP dio and xfstests dio tests. Link: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jeff Layton Signed-off-by: Christian Brauner --- fs/inode.c | 23 +++++++++++------------ fs/netfs/locking.c | 22 +++++----------------- include/linux/fs.h | 8 +++----- 3 files changed, 19 insertions(+), 34 deletions(-) (limited to 'include') diff --git a/fs/inode.c b/fs/inode.c index de0969d6009c..248a131a02c3 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2500,18 +2500,11 @@ EXPORT_SYMBOL(inode_owner_or_capable); /* * Direct i/o helper functions */ -static void __inode_dio_wait(struct inode *inode) +bool inode_dio_finished(const struct inode *inode) { - wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); - DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); - - do { - prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); - if (atomic_read(&inode->i_dio_count)) - schedule(); - } while (atomic_read(&inode->i_dio_count)); - finish_wait(wq, &q.wq_entry); + return atomic_read(&inode->i_dio_count) == 0; } +EXPORT_SYMBOL(inode_dio_finished); /** * inode_dio_wait - wait for outstanding DIO requests to finish @@ -2525,11 +2518,17 @@ static void __inode_dio_wait(struct inode *inode) */ void inode_dio_wait(struct inode *inode) { - if (atomic_read(&inode->i_dio_count)) - __inode_dio_wait(inode); + wait_var_event(&inode->i_dio_count, inode_dio_finished(inode)); } EXPORT_SYMBOL(inode_dio_wait); +void inode_dio_wait_interruptible(struct inode *inode) +{ + wait_var_event_interruptible(&inode->i_dio_count, + inode_dio_finished(inode)); +} +EXPORT_SYMBOL(inode_dio_wait_interruptible); + /* * inode_set_flags - atomically set some inode flags * diff --git a/fs/netfs/locking.c b/fs/netfs/locking.c index 75dc52a49b3a..21eab56ee2f9 100644 --- a/fs/netfs/locking.c +++ b/fs/netfs/locking.c @@ -19,25 +19,13 @@ * Must be called under a lock that serializes taking new references * to i_dio_count, usually by inode->i_mutex. */ -static int inode_dio_wait_interruptible(struct inode *inode) +static int netfs_inode_dio_wait_interruptible(struct inode *inode) { - if (!atomic_read(&inode->i_dio_count)) + if (inode_dio_finished(inode)) return 0; - wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); - DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); - - for (;;) { - prepare_to_wait(wq, &q.wq_entry, TASK_INTERRUPTIBLE); - if (!atomic_read(&inode->i_dio_count)) - break; - if (signal_pending(current)) - break; - schedule(); - } - finish_wait(wq, &q.wq_entry); - - return atomic_read(&inode->i_dio_count) ? -ERESTARTSYS : 0; + inode_dio_wait_interruptible(inode); + return !inode_dio_finished(inode) ? -ERESTARTSYS : 0; } /* Call with exclusively locked inode->i_rwsem */ @@ -46,7 +34,7 @@ static int netfs_block_o_direct(struct netfs_inode *ictx) if (!test_bit(NETFS_ICTX_ODIRECT, &ictx->flags)) return 0; clear_bit(NETFS_ICTX_ODIRECT, &ictx->flags); - return inode_dio_wait_interruptible(&ictx->inode); + return netfs_inode_dio_wait_interruptible(&ictx->inode); } /** diff --git a/include/linux/fs.h b/include/linux/fs.h index 0145bda465ff..8f8d274929d7 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2373,8 +2373,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, * * I_REFERENCED Marks the inode as recently references on the LRU list. * - * I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit(). - * * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to * synchronize competing switching instances and to tell * wb stat updates to grab the i_pages lock. See @@ -2409,8 +2407,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, #define __I_SYNC 7 #define I_SYNC (1 << __I_SYNC) #define I_REFERENCED (1 << 8) -#define __I_DIO_WAKEUP 9 -#define I_DIO_WAKEUP (1 << __I_DIO_WAKEUP) #define I_LINKABLE (1 << 10) #define I_DIRTY_TIME (1 << 11) #define I_WB_SWITCH (1 << 13) @@ -3227,7 +3223,9 @@ static inline ssize_t blockdev_direct_IO(struct kiocb *iocb, } #endif +bool inode_dio_finished(const struct inode *inode); void inode_dio_wait(struct inode *inode); +void inode_dio_wait_interruptible(struct inode *inode); /** * inode_dio_begin - signal start of a direct I/O requests @@ -3251,7 +3249,7 @@ static inline void inode_dio_begin(struct inode *inode) static inline void inode_dio_end(struct inode *inode) { if (atomic_dec_and_test(&inode->i_dio_count)) - wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); + wake_up_var(&inode->i_dio_count); } extern void inode_set_flags(struct inode *inode, unsigned int flags, -- cgit v1.2.3 From 029c3f27fe84c5fd29d49a99cc5176433bf12209 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 22 Aug 2024 11:28:38 +0200 Subject: fs: remove unused path_put_init() This helper has been unused for a while now. Link: https://lore.kernel.org/r/20240822-bewuchs-werktag-46672b3c0606@brauner Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/linux/path.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include') diff --git a/include/linux/path.h b/include/linux/path.h index ca073e70decd..7ea389dc764b 100644 --- a/include/linux/path.h +++ b/include/linux/path.h @@ -18,12 +18,6 @@ static inline int path_equal(const struct path *path1, const struct path *path2) return path1->mnt == path2->mnt && path1->dentry == path2->dentry; } -static inline void path_put_init(struct path *path) -{ - path_put(path); - *path = (struct path) { }; -} - /* * Cleanup macro for use with __free(path_put). Avoids dereference and * copying @path unlike DEFINE_FREE(). path_put() will handle the empty -- cgit v1.2.3 From 71ff58ce3428b2471efae5b00594e892b285c97c Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 22 Aug 2024 11:30:58 +0200 Subject: fs: s/__u32/u32/ for s_fsnotify_mask The underscore variants are for uapi whereas the non-underscore variants are for in-kernel consumers. Link: https://lore.kernel.org/r/20240822-anwerben-nutzung-1cd6c82a565f@brauner Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/linux/fs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/fs.h b/include/linux/fs.h index 8f8d274929d7..cda1f2545ea0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1266,7 +1266,7 @@ struct super_block { time64_t s_time_min; time64_t s_time_max; #ifdef CONFIG_FSNOTIFY - __u32 s_fsnotify_mask; + u32 s_fsnotify_mask; struct fsnotify_sb_info *s_fsnotify_info; #endif -- cgit v1.2.3 From da18ecbf0fb6fb73e6f9273e7aa15610f148ce17 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 23 Aug 2024 14:47:35 +0200 Subject: fs: add i_state helpers The i_state member is an unsigned long so that it can be used with the wait bit infrastructure which expects unsigned long. This wastes 4 bytes which we're unlikely to ever use. Switch to using the var event wait mechanism using the address of the bit. Thanks to Linus for the address idea. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-1-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- fs/inode.c | 11 +++++++++++ include/linux/fs.h | 15 +++++++++++++++ 2 files changed, 26 insertions(+) (limited to 'include') diff --git a/fs/inode.c b/fs/inode.c index ba1645a09603..1b0f52ebae27 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -472,6 +472,17 @@ static void __inode_add_lru(struct inode *inode, bool rotate) inode->i_state |= I_REFERENCED; } +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe, + struct inode *inode, u32 bit) +{ + void *bit_address; + + bit_address = inode_state_wait_address(inode, bit); + init_wait_var_entry(wqe, bit_address, 0); + return __var_waitqueue(bit_address); +} +EXPORT_SYMBOL(inode_bit_waitqueue); + /* * Add inode to LRU if needed (inode is unused and clean). * diff --git a/include/linux/fs.h b/include/linux/fs.h index cda1f2545ea0..d35de348d2ec 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -744,6 +744,21 @@ struct inode { void *i_private; /* fs or device private pointer */ } __randomize_layout; +/* + * Get bit address from inode->i_state to use with wait_var_event() + * infrastructre. + */ +#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit)) + +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe, + struct inode *inode, u32 bit); + +static inline void inode_wake_up_bit(struct inode *inode, u32 bit) +{ + /* Caller is responsible for correct memory barriers. */ + wake_up_var(inode_state_wait_address(inode, bit)); +} + struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode); static inline unsigned int i_blocksize(const struct inode *node) -- cgit v1.2.3 From 2ed634c96ed1d6e4ee0497be176f9980866e81cb Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 23 Aug 2024 14:47:36 +0200 Subject: fs: reorder i_state bits so that we can use the first bits to derive unique addresses from i_state. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-2-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jeff Layton Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/linux/fs.h | 38 +++++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 17 deletions(-) (limited to 'include') diff --git a/include/linux/fs.h b/include/linux/fs.h index d35de348d2ec..a868c9823c10 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2410,28 +2410,32 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, * i_count. * * Q: What is the difference between I_WILL_FREE and I_FREEING? + * + * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait + * upon. There's one free address left. */ -#define I_DIRTY_SYNC (1 << 0) -#define I_DIRTY_DATASYNC (1 << 1) -#define I_DIRTY_PAGES (1 << 2) -#define __I_NEW 3 +#define __I_NEW 0 #define I_NEW (1 << __I_NEW) -#define I_WILL_FREE (1 << 4) -#define I_FREEING (1 << 5) -#define I_CLEAR (1 << 6) -#define __I_SYNC 7 +#define __I_SYNC 1 #define I_SYNC (1 << __I_SYNC) -#define I_REFERENCED (1 << 8) +#define __I_LRU_ISOLATING 2 +#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING) + +#define I_DIRTY_SYNC (1 << 3) +#define I_DIRTY_DATASYNC (1 << 4) +#define I_DIRTY_PAGES (1 << 5) +#define I_WILL_FREE (1 << 6) +#define I_FREEING (1 << 7) +#define I_CLEAR (1 << 8) +#define I_REFERENCED (1 << 9) #define I_LINKABLE (1 << 10) #define I_DIRTY_TIME (1 << 11) -#define I_WB_SWITCH (1 << 13) -#define I_OVL_INUSE (1 << 14) -#define I_CREATING (1 << 15) -#define I_DONTCACHE (1 << 16) -#define I_SYNC_QUEUED (1 << 17) -#define I_PINNING_NETFS_WB (1 << 18) -#define __I_LRU_ISOLATING 19 -#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING) +#define I_WB_SWITCH (1 << 12) +#define I_OVL_INUSE (1 << 13) +#define I_CREATING (1 << 14) +#define I_DONTCACHE (1 << 15) +#define I_SYNC_QUEUED (1 << 16) +#define I_PINNING_NETFS_WB (1 << 17) #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) -- cgit v1.2.3 From 0fe340a98b584a31b07c664dc5ff0c923730581e Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 23 Aug 2024 14:47:38 +0200 Subject: inode: port __I_NEW to var event Port the __I_NEW mechanism to use the new var event mechanism. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-4-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- fs/bcachefs/fs.c | 10 ++++++---- fs/dcache.c | 7 ++++++- fs/inode.c | 32 ++++++++++++++++++++++++-------- include/linux/writeback.h | 3 ++- 4 files changed, 38 insertions(+), 14 deletions(-) (limited to 'include') diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c index 94c392abef65..c0900c0c0f8a 100644 --- a/fs/bcachefs/fs.c +++ b/fs/bcachefs/fs.c @@ -1644,14 +1644,16 @@ again: break; } } else if (clean_pass && this_pass_clean) { - wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW); - DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW); + struct wait_bit_queue_entry wqe; + struct wait_queue_head *wq_head; - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); + wq_head = inode_bit_waitqueue(&wqe, &inode->v, __I_NEW); + prepare_to_wait_event(wq_head, &wqe.wq_entry, + TASK_UNINTERRUPTIBLE); mutex_unlock(&c->vfs_inodes_lock); schedule(); - finish_wait(wq, &wait.wq_entry); + finish_wait(wq_head, &wqe.wq_entry); goto again; } } diff --git a/fs/dcache.c b/fs/dcache.c index 1af75fa68638..894e38cdf4d0 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1908,8 +1908,13 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode) __d_instantiate(entry, inode); WARN_ON(!(inode->i_state & I_NEW)); inode->i_state &= ~I_NEW & ~I_CREATING; + /* + * Pairs with the barrier in prepare_to_wait_event() to make sure + * ___wait_var_event() either sees the bit cleared or + * waitqueue_active() check in wake_up_var() sees the waiter. + */ smp_mb(); - wake_up_bit(&inode->i_state, __I_NEW); + inode_wake_up_bit(inode, __I_NEW); spin_unlock(&inode->i_lock); } EXPORT_SYMBOL(d_instantiate_new); diff --git a/fs/inode.c b/fs/inode.c index 1b0f52ebae27..1aa785421f13 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -734,7 +734,13 @@ static void evict(struct inode *inode) * used as an indicator whether blocking on it is safe. */ spin_lock(&inode->i_lock); - wake_up_bit(&inode->i_state, __I_NEW); + /* + * Pairs with the barrier in prepare_to_wait_event() to make sure + * ___wait_var_event() either sees the bit cleared or + * waitqueue_active() check in wake_up_var() sees the waiter. + */ + smp_mb(); + inode_wake_up_bit(inode, __I_NEW); BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); spin_unlock(&inode->i_lock); @@ -1146,8 +1152,13 @@ void unlock_new_inode(struct inode *inode) spin_lock(&inode->i_lock); WARN_ON(!(inode->i_state & I_NEW)); inode->i_state &= ~I_NEW & ~I_CREATING; + /* + * Pairs with the barrier in prepare_to_wait_event() to make sure + * ___wait_var_event() either sees the bit cleared or + * waitqueue_active() check in wake_up_var() sees the waiter. + */ smp_mb(); - wake_up_bit(&inode->i_state, __I_NEW); + inode_wake_up_bit(inode, __I_NEW); spin_unlock(&inode->i_lock); } EXPORT_SYMBOL(unlock_new_inode); @@ -1158,8 +1169,13 @@ void discard_new_inode(struct inode *inode) spin_lock(&inode->i_lock); WARN_ON(!(inode->i_state & I_NEW)); inode->i_state &= ~I_NEW; + /* + * Pairs with the barrier in prepare_to_wait_event() to make sure + * ___wait_var_event() either sees the bit cleared or + * waitqueue_active() check in wake_up_var() sees the waiter. + */ smp_mb(); - wake_up_bit(&inode->i_state, __I_NEW); + inode_wake_up_bit(inode, __I_NEW); spin_unlock(&inode->i_lock); iput(inode); } @@ -2348,8 +2364,8 @@ EXPORT_SYMBOL(inode_needs_sync); */ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked) { - wait_queue_head_t *wq; - DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW); + struct wait_bit_queue_entry wqe; + struct wait_queue_head *wq_head; /* * Handle racing against evict(), see that routine for more details. @@ -2360,14 +2376,14 @@ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_lock return; } - wq = bit_waitqueue(&inode->i_state, __I_NEW); - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); + wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW); + prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE); spin_unlock(&inode->i_lock); rcu_read_unlock(); if (is_inode_hash_locked) spin_unlock(&inode_hash_lock); schedule(); - finish_wait(wq, &wait.wq_entry); + finish_wait(wq_head, &wqe.wq_entry); if (is_inode_hash_locked) spin_lock(&inode_hash_lock); rcu_read_lock(); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 56b85841ae4c..8f651bb0a1a5 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -200,7 +200,8 @@ void inode_io_list_del(struct inode *inode); /* writeback.h requires fs.h; it, too, is not included from here. */ static inline void wait_on_inode(struct inode *inode) { - wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE); + wait_var_event(inode_state_wait_address(inode, __I_NEW), + !(READ_ONCE(inode->i_state) & I_NEW)); } #ifdef CONFIG_CGROUP_WRITEBACK -- cgit v1.2.3 From 2b111edbe0a9c441605be5cfb73001dc98ec686f Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 23 Aug 2024 14:47:40 +0200 Subject: inode: make i_state a u32 Now that we use the wait var event mechanism make i_state a u32 and free up 4 bytes. This means we currently have two 4 byte holes in struct inode which we can pack. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-6-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/linux/fs.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/fs.h b/include/linux/fs.h index a868c9823c10..a1ef8c65d828 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -681,7 +681,8 @@ struct inode { #endif /* Misc */ - unsigned long i_state; + u32 i_state; + /* 32-bit hole */ struct rw_semaphore i_rwsem; unsigned long dirtied_when; /* jiffies of first dirtying */ -- cgit v1.2.3 From 459ca85ae1feff78d1518344df88bb79a092780c Mon Sep 17 00:00:00 2001 From: Julian Sun Date: Wed, 28 Aug 2024 16:13:59 +0800 Subject: writeback: Refine the show_inode_state() macro definition MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, the show_inode_state() macro only prints part of the state of inode->i_state. Let’s improve it to display more of its state. Signed-off-by: Julian Sun Link: https://lore.kernel.org/r/20240828081359.62429-1-sunjunchao2870@gmail.com Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/trace/events/writeback.h | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 54e353c9f919..a261e86e61fa 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -20,7 +20,15 @@ {I_CLEAR, "I_CLEAR"}, \ {I_SYNC, "I_SYNC"}, \ {I_DIRTY_TIME, "I_DIRTY_TIME"}, \ - {I_REFERENCED, "I_REFERENCED"} \ + {I_REFERENCED, "I_REFERENCED"}, \ + {I_LINKABLE, "I_LINKABLE"}, \ + {I_WB_SWITCH, "I_WB_SWITCH"}, \ + {I_OVL_INUSE, "I_OVL_INUSE"}, \ + {I_CREATING, "I_CREATING"}, \ + {I_DONTCACHE, "I_DONTCACHE"}, \ + {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \ + {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \ + {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \ ) /* enums need to be exported to user space */ -- cgit v1.2.3 From 5c40e050e6ac0218af7c520095729d440cc87e6b Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 29 Aug 2024 15:06:40 +0200 Subject: fs: drop GFP_NOFAIL mode from alloc_page_buffers There is only one called of alloc_page_buffers and it doesn't require __GFP_NOFAIL so drop this allocation mode. Signed-off-by: Michal Hocko Link: https://lore.kernel.org/r/20240829130640.1397970-1-mhocko@kernel.org Acked-by: Song Liu Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- drivers/md/md-bitmap.c | 2 +- fs/buffer.c | 5 +---- include/linux/buffer_head.h | 3 +-- 3 files changed, 3 insertions(+), 7 deletions(-) (limited to 'include') diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 08232d8dc815..db5330d97348 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -360,7 +360,7 @@ static int read_file_page(struct file *file, unsigned long index, pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE, (unsigned long long)index << PAGE_SHIFT); - bh = alloc_page_buffers(page, blocksize, false); + bh = alloc_page_buffers(page, blocksize); if (!bh) { ret = -ENOMEM; goto out; diff --git a/fs/buffer.c b/fs/buffer.c index 31a9062cad7e..74f4eb5c7087 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -957,12 +957,9 @@ no_grow: } EXPORT_SYMBOL_GPL(folio_alloc_buffers); -struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, - bool retry) +struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size) { gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; - if (retry) - gfp |= __GFP_NOFAIL; return folio_alloc_buffers(page_folio(page), size, gfp); } diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 14acf1bbe0ce..7e903457967a 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -199,8 +199,7 @@ void folio_set_bh(struct buffer_head *bh, struct folio *folio, unsigned long offset); struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size, gfp_t gfp); -struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, - bool retry); +struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size); struct buffer_head *create_empty_buffers(struct folio *folio, unsigned long blocksize, unsigned long b_state); void end_buffer_read_sync(struct buffer_head *bh, int uptodate); -- cgit v1.2.3 From b4fef22c2fb97fa204f0c99c7c7f1c6b422ef0aa Mon Sep 17 00:00:00 2001 From: Aleksa Sarai Date: Wed, 28 Aug 2024 20:19:42 +1000 Subject: uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein Reviewed-by: Jeff Layton Reviewed-by: Josef Bacik Signed-off-by: Aleksa Sarai Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- include/uapi/linux/fcntl.h | 80 ++++++++++++++++++++++++++++++++-------------- 1 file changed, 56 insertions(+), 24 deletions(-) (limited to 'include') diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index e55a3314bcb0..38a6d66d9e88 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -90,37 +90,69 @@ #define DN_ATTRIB 0x00000020 /* File changed attibutes */ #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ +#define AT_FDCWD -100 /* Special value for dirfd used to + indicate openat should use the + current working directory. */ + + +/* Generic flags for the *at(2) family of syscalls. */ + +/* Reserved for per-syscall flags 0xff. */ +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic + links. */ +/* Reserved for per-syscall flags 0x200 */ +#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ +#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount + traversal. */ +#define AT_EMPTY_PATH 0x1000 /* Allow empty relative + pathname to operate on dirfd + directly. */ /* - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to - * unlinkat. The two functions do completely different things and therefore, - * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to - * faccessat would be undefined behavior and thus treating it equivalent to - * AT_EACCESS is valid undefined behavior. + * These flags are currently statx(2)-specific, but they could be made generic + * in the future and so they should not be used for other per-syscall flags. */ -#define AT_FDCWD -100 /* Special value used to indicate - openat should use the current - working directory. */ -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ +#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ +#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ +#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ +#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ + +#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ + +/* + * Per-syscall flags for the *at(2) family of syscalls. + * + * These are flags that are so syscall-specific that a user passing these flags + * to the wrong syscall is so "clearly wrong" that we can safely call such + * usage "undefined behaviour". + * + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value. + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful + * only to unlinkat. The two functions do completely different things and + * therefore, the flags can be allowed to overlap. For example, passing + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it + * equivalent to AT_EACCESS is valid undefined behavior. + * + * Note for implementers: When picking a new per-syscall AT_* flag, try to + * reuse already existing flags first. This leaves us with as many unused bits + * as possible, so we can use them for generic bits in the future if necessary. + */ + +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */ +#define AT_RENAME_NOREPLACE 0x0001 +#define AT_RENAME_EXCHANGE 0x0002 +#define AT_RENAME_WHITEOUT 0x0004 + +/* Flag for faccessat(2). */ #define AT_EACCESS 0x200 /* Test access permitted for effective IDs, not real IDs. */ +/* Flag for unlinkat(2). */ #define AT_REMOVEDIR 0x200 /* Remove directory instead of unlinking file. */ -#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ -#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ -#define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ - -#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ -#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ -#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ -#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ - -#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ +/* Flags for name_to_handle_at(2). */ +#define AT_HANDLE_FID 0x200 /* File handle is needed to compare + object identity and may not be + usable with open_by_handle_at(2). */ -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ -#define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to - compare object identity and may not - be usable to open_by_handle_at(2) */ #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 #endif -- cgit v1.2.3 From 4356d575ef0f39a3e8e0ce0c40d84ce900ac3b61 Mon Sep 17 00:00:00 2001 From: Aleksa Sarai Date: Wed, 28 Aug 2024 20:19:43 +1000 Subject: fhandle: expose u64 mount id to name_to_handle_at(2) Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Reviewed-by: Jeff Layton Signed-off-by: Aleksa Sarai Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-2-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara Reviewed-by: Josef Bacik Signed-off-by: Christian Brauner --- fs/fhandle.c | 29 ++++++++++++++++++++++------- include/linux/syscalls.h | 2 +- include/uapi/linux/fcntl.h | 1 + 3 files changed, 24 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/fs/fhandle.c b/fs/fhandle.c index 6e8cea16790e..8cb665629f4a 100644 --- a/fs/fhandle.c +++ b/fs/fhandle.c @@ -16,7 +16,8 @@ static long do_sys_name_to_handle(const struct path *path, struct file_handle __user *ufh, - int __user *mnt_id, int fh_flags) + void __user *mnt_id, bool unique_mntid, + int fh_flags) { long retval; struct file_handle f_handle; @@ -69,9 +70,19 @@ static long do_sys_name_to_handle(const struct path *path, } else retval = 0; /* copy the mount id */ - if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) || - copy_to_user(ufh, handle, - struct_size(handle, f_handle, handle_bytes))) + if (unique_mntid) { + if (put_user(real_mount(path->mnt)->mnt_id_unique, + (u64 __user *) mnt_id)) + retval = -EFAULT; + } else { + if (put_user(real_mount(path->mnt)->mnt_id, + (int __user *) mnt_id)) + retval = -EFAULT; + } + /* copy the handle */ + if (retval != -EFAULT && + copy_to_user(ufh, handle, + struct_size(handle, f_handle, handle_bytes))) retval = -EFAULT; kfree(handle); return retval; @@ -83,6 +94,7 @@ static long do_sys_name_to_handle(const struct path *path, * @name: name that should be converted to handle. * @handle: resulting file handle * @mnt_id: mount id of the file system containing the file + * (u64 if AT_HANDLE_MNT_ID_UNIQUE, otherwise int) * @flag: flag value to indicate whether to follow symlink or not * and whether a decodable file handle is required. * @@ -92,7 +104,7 @@ static long do_sys_name_to_handle(const struct path *path, * value required. */ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, - struct file_handle __user *, handle, int __user *, mnt_id, + struct file_handle __user *, handle, void __user *, mnt_id, int, flag) { struct path path; @@ -100,7 +112,8 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, int fh_flags; int err; - if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID)) + if (flag & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_HANDLE_FID | + AT_HANDLE_MNT_ID_UNIQUE)) return -EINVAL; lookup_flags = (flag & AT_SYMLINK_FOLLOW) ? LOOKUP_FOLLOW : 0; @@ -109,7 +122,9 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name, lookup_flags |= LOOKUP_EMPTY; err = user_path_at(dfd, name, lookup_flags, &path); if (!err) { - err = do_sys_name_to_handle(&path, handle, mnt_id, fh_flags); + err = do_sys_name_to_handle(&path, handle, mnt_id, + flag & AT_HANDLE_MNT_ID_UNIQUE, + fh_flags); path_put(&path); } return err; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 4bcf6754738d..5758104921e6 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -870,7 +870,7 @@ asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags, #endif asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name, struct file_handle __user *handle, - int __user *mnt_id, int flag); + void __user *mnt_id, int flag); asmlinkage long sys_open_by_handle_at(int mountdirfd, struct file_handle __user *handle, int flags); diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 38a6d66d9e88..87e2dec79fea 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -152,6 +152,7 @@ #define AT_HANDLE_FID 0x200 /* File handle is needed to compare object identity and may not be usable with open_by_handle_at(2). */ +#define AT_HANDLE_MNT_ID_UNIQUE 0x001 /* Return the u64 unique mount ID. */ #if defined(__KERNEL__) #define AT_GETATTR_NOSEC 0x80000000 -- cgit v1.2.3 From 2077006d4725c82c6e9612cec3a6c140921b067f Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Tue, 10 Sep 2024 10:16:39 +0200 Subject: uidgid: make sure we fit into one cacheline When I expanded uidgid mappings I intended for a struct uid_gid_map to fit into a single cacheline on x86 as they tend to be pretty performance sensitive (idmapped mounts etc). But a 4 byte hole was added that brought it over 64 bytes. Fix that by adding the static extent array and the extent counter into a substruct. C's type punning for unions guarantees that we can access ->nr_extents even if the last written to member wasn't within the same object. This is also what we rely on in struct_group() and friends. This of course relies on non-strict aliasing which we don't do. 99) If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org Reviewed-by: Jeff Layton Signed-off-by: Christian Brauner --- include/linux/user_namespace.h | 6 ++++-- kernel/user.c | 6 +++--- 2 files changed, 7 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 6030a8235617..3625096d5f85 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -21,9 +21,11 @@ struct uid_gid_extent { }; struct uid_gid_map { /* 64 bytes -- 1 cache line */ - u32 nr_extents; union { - struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS]; + struct { + struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS]; + u32 nr_extents; + }; struct { struct uid_gid_extent *forward; struct uid_gid_extent *reverse; diff --git a/kernel/user.c b/kernel/user.c index aa1162deafe4..f46b1d41163b 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -36,33 +36,33 @@ EXPORT_SYMBOL_GPL(init_binfmt_misc); */ struct user_namespace init_user_ns = { .uid_map = { - .nr_extents = 1, { .extent[0] = { .first = 0, .lower_first = 0, .count = 4294967295U, }, + .nr_extents = 1, }, }, .gid_map = { - .nr_extents = 1, { .extent[0] = { .first = 0, .lower_first = 0, .count = 4294967295U, }, + .nr_extents = 1, }, }, .projid_map = { - .nr_extents = 1, { .extent[0] = { .first = 0, .lower_first = 0, .count = 4294967295U, }, + .nr_extents = 1, }, }, .ns.count = REFCOUNT_INIT(3), -- cgit v1.2.3