Diffstat (limited to 'Documentation/filesystems')
22 files changed, 894 insertions, 813 deletions
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index 7ddb235aee9d..08194f194b94 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -116,7 +116,7 @@ cache_strategy=%s Select a strategy for cached decompression from now on: cluster for further reading. It still does in-place I/O decompression for the rest compressed physical clusters; - readaround Cache the both ends of incomplete compressed + readaround Cache both ends of incomplete compressed physical clusters for further reading. It still does in-place I/O decompression for the rest compressed physical clusters. diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst index aeb47ace738d..ae8995740aa8 100644 --- a/Documentation/filesystems/ext4/atomic_writes.rst +++ b/Documentation/filesystems/ext4/atomic_writes.rst @@ -14,7 +14,7 @@ I/O) on regular files with extents, provided the underlying storage device supports hardware atomic writes. This is supported in the following two ways: 1. **Single-fsblock Atomic Writes**: - EXT4's supports atomic write operations with a single filesystem block since + EXT4 supports atomic write operations with a single filesystem block since v6.13. In this the atomic write unit minimum and maximum sizes are both set to filesystem blocksize. e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB @@ -50,7 +50,7 @@ Multi-fsblock Implementation Details The bigalloc feature changes ext4 to allocate in units of multiple filesystem blocks, also known as clusters. With bigalloc each bit within block bitmap -represents cluster (power of 2 number of blocks) rather than individual +represents a cluster (power of 2 number of blocks) rather than individual filesystem blocks. EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the following constraints. 
The minimum atomic write size is the larger of the fs @@ -189,7 +189,7 @@ The write must be aligned to the filesystem's block size and not exceed the filesystem's maximum atomic write unit size. See ``generic_atomic_write_valid()`` for more details. -``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following +``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provide following details: * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request. diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst index 6eece8e31df8..9b003a4d453f 100644 --- a/Documentation/filesystems/ext4/directory.rst +++ b/Documentation/filesystems/ext4/directory.rst @@ -183,10 +183,10 @@ in the place where the name normally goes. The structure is - det_checksum - Directory leaf block checksum. -The leaf directory block checksum is calculated against the FS UUID, the -directory's inode number, the directory's inode generation number, and -the entire directory entry block up to (but not including) the fake -directory entry. +The leaf directory block checksum is calculated against the FS UUID (or +the checksum seed, if that feature is enabled for the fs), the directory's +inode number, the directory's inode generation number, and the entire +directory entry block up to (but not including) the fake directory entry. Hash Tree Directories ~~~~~~~~~~~~~~~~~~~~~ @@ -196,12 +196,12 @@ new feature was added to ext3 to provide a faster (but peculiar) balanced tree keyed off a hash of the directory entry name. If the EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a hashed btree (htree) to organize and find directory entries. For -backwards read-only compatibility with ext2, this tree is actually -hidden inside the directory file, masquerading as “empty” directory data -blocks! 
It was stated previously that the end of the linear directory -entry table was signified with an entry pointing to inode 0; this is -(ab)used to fool the old linear-scan algorithm into thinking that the -rest of the directory block is empty so that it moves on. +backwards read-only compatibility with ext2, interior tree nodes are actually +hidden inside the directory file, masquerading as “empty” directory entries +spanning the whole block. It was stated previously that directory entries +with the inode set to 0 are treated as unused entries; this is (ab)used to +fool the old linear-scan algorithm into skipping over those blocks containing +the interior tree node data. The root of the tree always lives in the first data block of the directory. By ext2 custom, the '.' and '..' entries must appear at the @@ -209,24 +209,24 @@ beginning of this first block, so they are put here as two ``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of the root node contains metadata about the tree and finally a hash->block map to find nodes that are lower in the htree. If -``dx_root.info.indirect_levels`` is non-zero then the htree has two -levels; the data block pointed to by the root node's map is an interior -node, which is indexed by a minor hash. Interior nodes in this tree -contains a zeroed out ``struct ext4_dir_entry_2`` followed by a -minor_hash->block map to find leafe nodes. Leaf nodes contain a linear -array of all ``struct ext4_dir_entry_2``; all of these entries -(presumably) hash to the same value. If there is an overflow, the -entries simply overflow into the next leaf node, and the -least-significant bit of the hash (in the interior node map) that gets -us to this next leaf node is set. - -To traverse the directory as a htree, the code calculates the hash of -the desired file name and uses it to find the corresponding block -number. 
If the tree is flat, the block is a linear array of directory -entries that can be searched; otherwise, the minor hash of the file name -is computed and used against this second block to find the corresponding -third block number. That third block number will be a linear array of -directory entries. +``dx_root.info.indirect_levels`` is non-zero then the htree has that many +levels and the blocks pointed to by the root node's map are interior nodes. +These interior nodes have a zeroed out ``struct ext4_dir_entry_2`` followed by +a hash->block map to find nodes of the next level. Leaf nodes look like +classic linear directory blocks, but all of its entries have a hash value +equal or greater than the indicated hash of the parent node. + +The actual hash value for an entry name is only 31 bits, the least-significant +bit is set to 0. However, if there is a hash collision between directory +entries, the least-significant bit may get set to 1 on interior nodes in the +case where these two (or more) hash-colliding entries do not fit into one leaf +node and must be split across multiple nodes. + +To look up a name in such a htree, the code calculates the hash of the desired +file name and uses it to find the leaf node with the range of hash values the +calculated hash falls into (in other words, a lookup works basically the same +as it would in a B-Tree keyed by the hash value), and possibly also scanning +the leaf nodes that follow (in tree order) in case of hash collisions. To traverse the directory as a linear array (such as the old code does), the code simply reads every data block in the directory. The blocks used @@ -319,7 +319,8 @@ of a data block: * - 0x24 - __le32 - block - - The block number (within the directory file) that goes with hash=0. + - The block number (within the directory file) that lead to the left-most + leaf node, i.e. the leaf containing entries with the lowest hash values. 
* - 0x28 - struct dx_entry - entries[0] @@ -442,7 +443,7 @@ The dx_tail structure is 8 bytes long and looks like this: * - 0x0 - u32 - dt_reserved - - Zero. + - Unused (but still part of the checksum curiously). * - 0x4 - __le32 - dt_checksum @@ -450,4 +451,4 @@ The dx_tail structure is 8 bytes long and looks like this: The checksum is calculated against the FS UUID, the htree index header (dx_root or dx_node), all of the htree indices (dx_entry) that are in -use, and the tail block (dx_tail). +use, and the tail block (dx_tail) with the dt_checksum initially set to 0. diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index e5bb89452aff..a8d02fe5be83 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -1,8 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================================== -WHAT IS Flash-Friendly File System (F2FS)? -========================================== +================================= +Flash-Friendly File System (F2FS) +================================= + +Overview +======== NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have been equipped on a variety systems ranging from mobile to server systems. Since @@ -173,9 +176,12 @@ data_flush Enable data flushing before checkpoint in order to persist data of regular and symlink. reserve_root=%d Support configuring reserved space which is used for allocation from a privileged user with specified uid or - gid, unit: 4KB, the default limit is 0.2% of user blocks. -resuid=%d The user ID which may use the reserved blocks. -resgid=%d The group ID which may use the reserved blocks. + gid, unit: 4KB, the default limit is 12.5% of user blocks. +reserve_node=%d Support configuring reserved nodes which are used for + allocation from a privileged user with specified uid or + gid, the default limit is 12.5% of all nodes. +resuid=%d The user ID which may use the reserved blocks and nodes. 
+resgid=%d The group ID which may use the reserved blocks and nodes. fault_injection=%d Enable fault injection in all supported types with specified injection rate. fault_type=%d Support configuring fault injection type, should be @@ -291,9 +297,13 @@ compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo" "lz4", "zstd" and "lzo-rle" algorithm. compress_algorithm=%s:%d Control compress algorithm and its compress level, now, only "lz4" and "zstd" support compress level config. + + ========= =========== algorithm level range + ========= =========== lz4 3 - 16 zstd 1 - 22 + ========= =========== compress_log_size=%u Support configuring compress cluster size. The size will be 4KB * (1 << %u). The default and minimum sizes are 16KB. compress_extension=%s Support adding specified extension, so that f2fs can enable @@ -357,6 +367,7 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes: panic immediately, continue without doing anything, and remount the partition in read-only mode. By default it uses "continue" mode. + ====================== =============== =============== ======== mode continue remount-ro panic ====================== =============== =============== ======== @@ -370,6 +381,25 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes: ====================== =============== =============== ======== nat_bits Enable nat_bits feature to enhance full/empty nat blocks access, by default it's disabled. +lookup_mode=%s Control the directory lookup behavior for casefolded + directories. This option has no effect on directories + that do not have the casefold feature enabled. + + ================== ======================================== + Value Description + ================== ======================================== + perf (Default) Enforces a hash-only lookup. + The linear search fallback is always + disabled, ignoring the on-disk flag. 
+ compat Enables the linear search fallback for + compatibility with directory entries + created by older kernel that used a + different case-folding algorithm. + This mode ignores the on-disk flag. + auto F2FS determines the mode based on the + on-disk `SB_ENC_NO_COMPAT_FALLBACK_FL` + flag. + ================== ======================================== ======================== ============================================================ Debugfs Entries @@ -795,11 +825,13 @@ ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME extension list " " -- buffered io +------------------------------------------------------------------ N/A COLD_DATA WRITE_LIFE_EXTREME N/A HOT_DATA WRITE_LIFE_SHORT N/A WARM_DATA WRITE_LIFE_NOT_SET -- direct io +------------------------------------------------------------------ WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET @@ -915,24 +947,26 @@ compression enabled files (refer to "Compression implementation" section for how enable compression on a regular inode). 1) compress_mode=fs -This is the default option. f2fs does automatic compression in the writeback of the -compression enabled files. + + This is the default option. f2fs does automatic compression in the writeback of the + compression enabled files. 2) compress_mode=user -This disables the automatic compression and gives the user discretion of choosing the -target file and the timing. The user can do manual compression/decompression on the -compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE -ioctls like the below. -To decompress a file, + This disables the automatic compression and gives the user discretion of choosing the + target file and the timing. The user can do manual compression/decompression on the + compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE + ioctls like the below. 
+ +To decompress a file:: -fd = open(filename, O_WRONLY, 0); -ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE); + fd = open(filename, O_WRONLY, 0); + ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE); -To compress a file, +To compress a file:: -fd = open(filename, O_WRONLY, 0); -ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE); + fd = open(filename, O_WRONLY, 0); + ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE); NVMe Zoned Namespace devices ---------------------------- @@ -962,32 +996,32 @@ reserved and used by another filesystem or for different purposes. Once that external usage is complete, the device aliasing file can be deleted, releasing the reserved space back to F2FS for its own use. -<use-case> - -# ls /dev/vd* -/dev/vdb (32GB) /dev/vdc (32GB) -# mkfs.ext4 /dev/vdc -# mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb -# mount /dev/vdb /mnt/f2fs -# ls -l /mnt/f2fs -vdc.file -# df -h -/dev/vdb 64G 33G 32G 52% /mnt/f2fs - -# mount -o loop /dev/vdc /mnt/ext4 -# df -h -/dev/vdb 64G 33G 32G 52% /mnt/f2fs -/dev/loop7 32G 24K 30G 1% /mnt/ext4 -# umount /mnt/ext4 - -# f2fs_io getflags /mnt/f2fs/vdc.file -get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable -# f2fs_io setflags noimmutable /mnt/f2fs/vdc.file -get a flag on noimmutable ret=0, flags=800010 -set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable -# rm /mnt/f2fs/vdc.file -# df -h -/dev/vdb 64G 753M 64G 2% /mnt/f2fs +.. 
code-block:: + + # ls /dev/vd* + /dev/vdb (32GB) /dev/vdc (32GB) + # mkfs.ext4 /dev/vdc + # mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb + # mount /dev/vdb /mnt/f2fs + # ls -l /mnt/f2fs + vdc.file + # df -h + /dev/vdb 64G 33G 32G 52% /mnt/f2fs + + # mount -o loop /dev/vdc /mnt/ext4 + # df -h + /dev/vdb 64G 33G 32G 52% /mnt/f2fs + /dev/loop7 32G 24K 30G 1% /mnt/ext4 + # umount /mnt/ext4 + + # f2fs_io getflags /mnt/f2fs/vdc.file + get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable + # f2fs_io setflags noimmutable /mnt/f2fs/vdc.file + get a flag on noimmutable ret=0, flags=800010 + set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable + # rm /mnt/f2fs/vdc.file + # df -h + /dev/vdb 64G 753M 64G 2% /mnt/f2fs So, the key idea is, user can do any file operations on /dev/vdc, and reclaim the space after the use, while the space is counted as /data. diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst index d73dd0dbd238..d73dd0dbd238 100644 --- a/Documentation/filesystems/fuse-io-uring.rst +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse/fuse-io.rst index 6464de4266ad..d736ac4cb483 100644 --- a/Documentation/filesystems/fuse-io.rst +++ b/Documentation/filesystems/fuse/fuse-io.rst @@ -1,7 +1,7 @@ .. 
SPDX-License-Identifier: GPL-2.0 ============== -Fuse I/O Modes +FUSE I/O Modes ============== Fuse supports the following I/O modes: diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse/fuse-passthrough.rst index 2b0e7c2da54a..2b0e7c2da54a 100644 --- a/Documentation/filesystems/fuse-passthrough.rst +++ b/Documentation/filesystems/fuse/fuse-passthrough.rst diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse/fuse.rst index 1e31e87aee68..0fbd5a03fdc9 100644 --- a/Documentation/filesystems/fuse.rst +++ b/Documentation/filesystems/fuse/fuse.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -==== -FUSE -==== +============= +FUSE Overview +============= Definitions =========== @@ -129,6 +129,20 @@ For each connection the following files exist within this directory: connection. This means that all waiting requests will be aborted an error returned for all aborted and new requests. + max_background + The maximum number of background requests that can be outstanding + at a time. When the number of background requests reaches this limit, + further requests will be blocked until some are completed, potentially + causing I/O operations to stall. + + congestion_threshold + The threshold of background requests at which the kernel considers + the filesystem to be congested. When the number of background requests + exceeds this value, the kernel will skip asynchronous readahead + operations, reducing read-ahead optimizations but preserving essential + I/O, as well as suspending non-synchronous writeback operations + (WB_SYNC_NONE), delaying page cache flushing to the filesystem. + Only the owner of the mount may read or write these files. Interrupting filesystem operations diff --git a/Documentation/filesystems/fuse/index.rst b/Documentation/filesystems/fuse/index.rst new file mode 100644 index 000000000000..393a845214da --- /dev/null +++ b/Documentation/filesystems/fuse/index.rst @@ -0,0 +1,14 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +====================================================== +FUSE (Filesystem in Userspace) Technical Documentation +====================================================== + +.. toctree:: + :maxdepth: 2 + :numbered: + + fuse + fuse-io + fuse-io-uring + fuse-passthrough diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2-glocks.rst index adc0d4c4d979..ce5ff08cbd59 100644 --- a/Documentation/filesystems/gfs2-glocks.rst +++ b/Documentation/filesystems/gfs2-glocks.rst @@ -105,7 +105,7 @@ go_unlocked Yes No Operations must not drop either the bit lock or the spinlock if its held on entry. go_dump and do_demote_ok must never block. Note that go_dump will only be called if the glock's state - indicates that it is caching uptodate data. + indicates that it is caching up-to-date data. Glock locking order within GFS2: diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst index 7e0dd2f4373e..0f9516b5eb07 100644 --- a/Documentation/filesystems/hpfs.rst +++ b/Documentation/filesystems/hpfs.rst @@ -65,7 +65,7 @@ are case sensitive, so for example when you create a file FOO, you can use 'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you also won't be able to compile linux kernel (and maybe other things) on HPFS because kernel creates different files with names like bootsect.S and -bootsect.s. When searching for file thats name has characters >= 128, codepages +bootsect.s. When searching for file whose name has characters >= 128, codepages are used - see below. OS/2 ignores dots and spaces at the end of file name, so this driver does as well. If you create 'a. 
...', the file 'a' will be created, but you can still diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 622187a96bdc..af516e528ded 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -95,10 +95,7 @@ Documentation for filesystem implementations. hfs hfsplus hpfs - fuse - fuse-io - fuse-io-uring - fuse-passthrough + fuse/index inotify isofs nilfs2 diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst index 067ed8e14ef3..387fd9cc72ca 100644 --- a/Documentation/filesystems/iomap/operations.rst +++ b/Documentation/filesystems/iomap/operations.rst @@ -321,7 +321,7 @@ The fields are as follows: - ``writeback_submit``: Submit the previous built writeback context. Block based file systems should use the iomap_ioend_writeback_submit helper, other file system can implement their own. - File systems can optionall to hook into writeback bio submission. + File systems can optionally hook into writeback bio submission. This might include pre-write space accounting updates, or installing a custom ``->bi_end_io`` function for internal purposes, such as deferring the ioend completion to a workqueue to run metadata update diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst index e149b89118c8..c99ab1f7fea4 100644 --- a/Documentation/filesystems/mount_api.rst +++ b/Documentation/filesystems/mount_api.rst @@ -506,8 +506,16 @@ returned. * :: + int vfs_parse_fs_qstr(struct fs_context *fc, const char *key, + const struct qstr *value); + + A wrapper around vfs_parse_fs_param() that copies the value string it is + passed. + + * :: + int vfs_parse_fs_string(struct fs_context *fc, const char *key, - const char *value, size_t v_size); + const char *value); A wrapper around vfs_parse_fs_param() that copies the value string it is passed. 
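Illustration for the mount_api.rst hunk above: the sketch below is a userspace model, not kernel code, of how the two documented entry points relate. It assumes the behaviour stated in the porting.rst hunk of this same diff (the string variant no longer takes a length and uses `value ? strlen(value) : 0`). Every `model_*` name, the fixed 64-byte buffer, and the `-1` error return are hypothetical stand-ins chosen for the example; the real functions operate on `struct fs_context` and `struct qstr` inside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct qstr: a name plus an explicit length. */
struct model_qstr {
    const char *name;
    size_t len;
};

/* Hypothetical stand-in for struct fs_context: it only records the last value parsed. */
struct model_fs_context {
    char value[64];
    size_t value_len;
};

/* Models vfs_parse_fs_qstr(): copies exactly value->len bytes of the value. */
static int model_parse_fs_qstr(struct model_fs_context *fc, const char *key,
                               const struct model_qstr *value)
{
    (void)key; /* the key is irrelevant to the length handling shown here */
    assert(fc != NULL && value != NULL);
    if (value->len >= sizeof(fc->value))
        return -1; /* assumption: the model simply rejects oversized values */
    if (value->len)
        memcpy(fc->value, value->name, value->len);
    fc->value[value->len] = '\0';
    fc->value_len = value->len;
    return 0;
}

/* Models the new vfs_parse_fs_string(): no length argument; the length is
 * derived as value ? strlen(value) : 0, as the porting note describes. */
static int model_parse_fs_string(struct model_fs_context *fc, const char *key,
                                 const char *value)
{
    struct model_qstr q = { value, value ? strlen(value) : 0 };
    return model_parse_fs_qstr(fc, key, &q);
}
```

In this model, a caller that used to pass an explicit length shorter than the string (for example, to parse only the first four bytes of `"ext4,ro"`) would switch to the qstr variant, while callers that always passed `strlen(value)` can simply drop the argument.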
diff --git a/Documentation/filesystems/ocfs2-online-filecheck.rst b/Documentation/filesystems/ocfs2-online-filecheck.rst index 2257bb53edc1..9e8449416e0b 100644 --- a/Documentation/filesystems/ocfs2-online-filecheck.rst +++ b/Documentation/filesystems/ocfs2-online-filecheck.rst @@ -58,33 +58,33 @@ inode, fixing inode and setting the size of result record history. # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check # cat /sys/fs/ocfs2/<devname>/filecheck/check -The output is like this:: + The output is like this:: INO DONE ERROR 39502 1 GENERATION - <INO> lists the inode numbers. - <DONE> indicates whether the operation has been finished. - <ERROR> says what kind of errors was found. For the detailed error numbers, - please refer to the file linux/fs/ocfs2/filecheck.h. + <INO> lists the inode numbers. + <DONE> indicates whether the operation has been finished. + <ERROR> says what kind of errors was found. For the detailed error numbers, + please refer to the file linux/fs/ocfs2/filecheck.h. 2. If you determine to fix this inode, do:: # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix # cat /sys/fs/ocfs2/<devname>/filecheck/fix -The output is like this::: + The output is like this:: INO DONE ERROR 39502 1 SUCCESS -This time, the <ERROR> column indicates whether this fix is successful or not. + This time, the <ERROR> column indicates whether this fix is successful or not. 3. The record cache is used to store the history of check/fix results. It's -default size is 10, and can be adjust between the range of 10 ~ 100. You can -adjust the size like this:: + default size is 10, and can be adjust between the range of 10 ~ 100. 
You can + adjust the size like this:: - # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set + # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set Fixing stuff ============ diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 78c3d07c0c08..7233b04668fc 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1297,3 +1297,15 @@ Several functions are renamed: - user_path_create -> start_creating_user_path - user_path_locked_at -> start_removing_user_path_at - done_path_create -> end_creating_path + +--- + +**mandatory** + +Calling conventions for vfs_parse_fs_string() have changed; it does *not* +take length anymore (value ? strlen(value) : 0 is used). If you want +a different length, use + + vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len)) + +instead. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 42f2fb9e3c8f..0b86a8022fa1 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -61,19 +61,6 @@ Preface 0.1 Introduction/Credits ------------------------ -This documentation is part of a soon (or so we hope) to be released book on -the SuSE Linux distribution. As there is no complete documentation for the -/proc file system and we've used many freely available sources to write these -chapters, it seems only fair to give the work back to the Linux community. -This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm -afraid it's still far from complete, but we hope it will be useful. As far as -we know, it is the first 'all-in-one' document about the /proc file system. It -is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, -SPARC, AXP, etc., features, you probably won't find what you are looking for. -It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. 
But -additions and patches are welcome and will be added to this document if you -mail them to Bodo. - We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of other people for help compiling this documentation. We'd also like to extend a special thank you to Andi Kleen for documentation, which we relied on heavily @@ -81,17 +68,9 @@ to create this document, as well as the additional information he provided. Thanks to everybody else who contributed source or docs to the Linux kernel and helped create a great piece of software... :) -If you have any comments, corrections or additions, please don't hesitate to -contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this -document. - The latest version of this document is available online at https://www.kernel.org/doc/html/latest/filesystems/proc.html -If the above direction does not works for you, you could try the kernel -mailing list at linux-kernel@vger.kernel.org and/or try to reach me at -comandante@zaralinux.com. - 0.2 Legal Stuff --------------- @@ -2180,6 +2159,20 @@ DMA Buffer files where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter. +VFIO Device files +~~~~~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 02000002 + mnt_id: 17 + ino: 5122 + vfio-device-syspath: /sys/devices/pci0000:e0/0000:e0:01.1/0000:e1:00.0/0000:e2:05.0/0000:e8:00.0 + +where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device +file. 
+ 3.9 /proc/<pid>/map_files - Information about memory mapped files --------------------------------------------------------------------- This directory contains symbolic links which represent memory mapped files diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt index c90349e5b889..9a7eb96df300 100644 --- a/Documentation/filesystems/propagate_umount.txt +++ b/Documentation/filesystems/propagate_umount.txt @@ -286,7 +286,7 @@ Trim_one(m) strip the "seen by Trim_ancestors" mark from m remove m from the Candidates list return - + remove_this = false found = false for each n in children(m) @@ -312,7 +312,7 @@ Trim_ancestors(m) } Terminating condition in the loop in Trim_ancestors() is correct, -since that that loop will never run into p belonging to U - p is always +since that loop will never run into p belonging to U - p is always an ancestor of argument of Trim_one() and since U is closed, the argument of Trim_one() would also have to belong to U. But Trim_one() is never called for elements of U. In other words, p belongs to S if and only @@ -361,7 +361,7 @@ such removals. Proof: suppose S was non-shifting, x is a locked element of S, parent of x is not in S and S - {x} is not non-shifting. Then there is an element m in S - {x} and a subtree mounted strictly inside m, such that m contains -an element not in in S - {x}. Since S is non-shifting, everything in +an element not in S - {x}. Since S is non-shifting, everything in that subtree must belong to S. But that means that this subtree must contain x somewhere *and* that parent of x either belongs that subtree or is equal to m. Either way it must belong to S. Contradiction. 
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst index 006d23af66e1..b7f35b07876a 100644 --- a/Documentation/filesystems/resctrl.rst +++ b/Documentation/filesystems/resctrl.rst @@ -769,7 +769,7 @@ this would be dependent on number of cores the benchmark is run on. depending on # of threads: For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 -thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although +thread, with 10% bandwidth' can consume up to 10GBps and 40GBps although they have same percentage bandwidth of 10%. This is simply because as threads start using more cores in an rdtgroup, the actual bandwidth may increase or vary although user specified bandwidth percentage is same. diff --git a/Documentation/filesystems/sharedsubtree.rst b/Documentation/filesystems/sharedsubtree.rst index 1cf56489ed48..8b7dc9159083 100644 --- a/Documentation/filesystems/sharedsubtree.rst +++ b/Documentation/filesystems/sharedsubtree.rst @@ -31,965 +31,960 @@ and versioned filesystem. ----------- Shared subtree provides four different flavors of mounts; struct vfsmount to be -precise +precise: - a. shared mount - b. slave mount - c. private mount - d. unbindable mount +a) A **shared mount** can be replicated to as many mountpoints and all the + replicas continue to be exactly same. -2a) A shared mount can be replicated to as many mountpoints and all the -replicas continue to be exactly same. + Here is an example: - Here is an example: + Let's say /mnt has a mount that is shared:: - Let's say /mnt has a mount that is shared:: + # mount --make-shared /mnt - mount --make-shared /mnt + .. note:: + mount(8) command now supports the --make-shared flag, + so the sample 'smount' program is no longer needed and has been + removed. - Note: mount(8) command now supports the --make-shared flag, - so the sample 'smount' program is no longer needed and has been - removed. 
+ :: - :: + # mount --bind /mnt /tmp - # mount --bind /mnt /tmp + The above command replicates the mount at /mnt to the mountpoint /tmp + and the contents of both the mounts remain identical. - The above command replicates the mount at /mnt to the mountpoint /tmp - and the contents of both the mounts remain identical. + :: - :: + #ls /mnt + a b c - #ls /mnt - a b c + #ls /tmp + a b c - #ls /tmp - a b c + Now let's say we mount a device at /tmp/a:: - Now let's say we mount a device at /tmp/a:: + # mount /dev/sd0 /tmp/a - # mount /dev/sd0 /tmp/a + # ls /tmp/a + t1 t2 t3 - #ls /tmp/a - t1 t2 t3 + # ls /mnt/a + t1 t2 t3 - #ls /mnt/a - t1 t2 t3 + Note that the mount has propagated to the mount at /mnt as well. - Note that the mount has propagated to the mount at /mnt as well. + And the same is true even when /dev/sd0 is mounted on /mnt/a. The + contents will be visible under /tmp/a too. - And the same is true even when /dev/sd0 is mounted on /mnt/a. The - contents will be visible under /tmp/a too. +b) A **slave mount** is like a shared mount except that mount and umount events + only propagate towards it. -2b) A slave mount is like a shared mount except that mount and umount events - only propagate towards it. + All slave mounts have a master mount which is a shared. - All slave mounts have a master mount which is a shared. + Here is an example: - Here is an example: + Let's say /mnt has a mount which is shared:: - Let's say /mnt has a mount which is shared. - # mount --make-shared /mnt + # mount --make-shared /mnt - Let's bind mount /mnt to /tmp - # mount --bind /mnt /tmp + Let's bind mount /mnt to /tmp:: - the new mount at /tmp becomes a shared mount and it is a replica of - the mount at /mnt. + # mount --bind /mnt /tmp - Now let's make the mount at /tmp; a slave of /mnt - # mount --make-slave /tmp + the new mount at /tmp becomes a shared mount and it is a replica of + the mount at /mnt. 
- let's mount /dev/sd0 on /mnt/a
- # mount /dev/sd0 /mnt/a
+ Now let's make the mount at /tmp; a slave of /mnt::

- #ls /mnt/a
- t1 t2 t3
+ # mount --make-slave /tmp

- #ls /tmp/a
- t1 t2 t3
+ let's mount /dev/sd0 on /mnt/a::

- Note the mount event has propagated to the mount at /tmp
+ # mount /dev/sd0 /mnt/a

- However let's see what happens if we mount something on the mount at /tmp
+ # ls /mnt/a
+ t1 t2 t3

- # mount /dev/sd1 /tmp/b
+ # ls /tmp/a
+ t1 t2 t3

- #ls /tmp/b
- s1 s2 s3
+ Note the mount event has propagated to the mount at /tmp

- #ls /mnt/b
+ However let's see what happens if we mount something on the mount at
+ /tmp::

- Note how the mount event has not propagated to the mount at
- /mnt
+ # mount /dev/sd1 /tmp/b
+ # ls /tmp/b
+ s1 s2 s3

-2c) A private mount does not forward or receive propagation.
+ # ls /mnt/b

- This is the mount we are familiar with. Its the default type.
+ Note how the mount event has not propagated to the mount at
+ /mnt

-2d) A unbindable mount is a unbindable private mount
+c) A **private mount** does not forward or receive propagation.

- let's say we have a mount at /mnt and we make it unbindable::
+ This is the mount we are familiar with. Its the default type.

- # mount --make-unbindable /mnt
- Let's try to bind mount this mount somewhere else::
+d) An **unbindable mount** is, as the name suggests, an unbindable private
+ mount.

- # mount --bind /mnt /tmp
- mount: wrong fs type, bad option, bad superblock on /mnt,
- or too many mounted file systems
+ let's say we have a mount at /mnt and we make it unbindable::

- Binding a unbindable mount is a invalid operation.
+ # mount --make-unbindable /mnt
+
+ Let's try to bind mount this mount somewhere else::
+
+ # mount --bind /mnt /tmp mount: wrong fs type, bad option, bad
+ superblock on /mnt, or too many mounted file systems
+
+ Binding a unbindable mount is a invalid operation.
 3) Setting mount states
 -----------------------
- The mount command (util-linux package) can be used to set mount
- states::
+The mount command (util-linux package) can be used to set mount
+states::

- mount --make-shared mountpoint
- mount --make-slave mountpoint
- mount --make-private mountpoint
- mount --make-unbindable mountpoint
+ mount --make-shared mountpoint
+ mount --make-slave mountpoint
+ mount --make-private mountpoint
+ mount --make-unbindable mountpoint
 4) Use cases
 ------------
- A) A process wants to clone its own namespace, but still wants to
- access the CD that got mounted recently.
+A) A process wants to clone its own namespace, but still wants to
+ access the CD that got mounted recently.

- Solution:
+ Solution:

- The system administrator can make the mount at /cdrom shared::
+ The system administrator can make the mount at /cdrom shared::

- mount --bind /cdrom /cdrom
- mount --make-shared /cdrom
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom

- Now any process that clones off a new namespace will have a
- mount at /cdrom which is a replica of the same mount in the
- parent namespace.
+ Now any process that clones off a new namespace will have a
+ mount at /cdrom which is a replica of the same mount in the
+ parent namespace.

- So when a CD is inserted and mounted at /cdrom that mount gets
- propagated to the other mount at /cdrom in all the other clone
- namespaces.
+ So when a CD is inserted and mounted at /cdrom that mount gets
+ propagated to the other mount at /cdrom in all the other clone
+ namespaces.

- B) A process wants its mounts invisible to any other process, but
- still be able to see the other system mounts.
+B) A process wants its mounts invisible to any other process, but
+ still be able to see the other system mounts.
- Solution:
+ Solution:

- To begin with, the administrator can mark the entire mount tree
- as shareable::
+ To begin with, the administrator can mark the entire mount tree
+ as shareable::

- mount --make-rshared /
+ mount --make-rshared /

- A new process can clone off a new namespace. And mark some part
- of its namespace as slave::
+ A new process can clone off a new namespace. And mark some part
+ of its namespace as slave::

- mount --make-rslave /myprivatetree
+ mount --make-rslave /myprivatetree

- Hence forth any mounts within the /myprivatetree done by the
- process will not show up in any other namespace. However mounts
- done in the parent namespace under /myprivatetree still shows
- up in the process's namespace.
+ Hence forth any mounts within the /myprivatetree done by the
+ process will not show up in any other namespace. However mounts
+ done in the parent namespace under /myprivatetree still shows
+ up in the process's namespace.

- Apart from the above semantics this feature provides the
- building blocks to solve the following problems:
+Apart from the above semantics this feature provides the
+building blocks to solve the following problems:

- C) Per-user namespace
+C) Per-user namespace

- The above semantics allows a way to share mounts across
- namespaces. But namespaces are associated with processes. If
- namespaces are made first class objects with user API to
- associate/disassociate a namespace with userid, then each user
- could have his/her own namespace and tailor it to his/her
- requirements. This needs to be supported in PAM.
+ The above semantics allows a way to share mounts across
+ namespaces. But namespaces are associated with processes. If
+ namespaces are made first class objects with user API to
+ associate/disassociate a namespace with userid, then each user
+ could have his/her own namespace and tailor it to his/her
+ requirements. This needs to be supported in PAM.
- D) Versioned files
+D) Versioned files

- If the entire mount tree is visible at multiple locations, then
- an underlying versioning file system can return different
- versions of the file depending on the path used to access that
- file.
+ If the entire mount tree is visible at multiple locations, then
+ an underlying versioning file system can return different
+ versions of the file depending on the path used to access that
+ file.

- An example is::
+ An example is::

- mount --make-shared /
- mount --rbind / /view/v1
- mount --rbind / /view/v2
- mount --rbind / /view/v3
- mount --rbind / /view/v4
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4

- and if /usr has a versioning filesystem mounted, then that
- mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
- /view/v4/usr too
+ and if /usr has a versioning filesystem mounted, then that
+ mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
+ /view/v4/usr too

- A user can request v3 version of the file /usr/fs/namespace.c
- by accessing /view/v3/usr/fs/namespace.c . The underlying
- versioning filesystem can then decipher that v3 version of the
- filesystem is being requested and return the corresponding
- inode.
+ A user can request v3 version of the file /usr/fs/namespace.c
+ by accessing /view/v3/usr/fs/namespace.c . The underlying
+ versioning filesystem can then decipher that v3 version of the
+ filesystem is being requested and return the corresponding
+ inode.
 5) Detailed semantics
 ---------------------
- The section below explains the detailed semantics of
- bind, rbind, move, mount, umount and clone-namespace operations.
-
- Note: the word 'vfsmount' and the noun 'mount' have been used
- to mean the same thing, throughout this document.
+The section below explains the detailed semantics of
+bind, rbind, move, mount, umount and clone-namespace operations.

-5a) Mount states
+.. Note::
+ the word 'vfsmount' and the noun 'mount' have been used
+ to mean the same thing, throughout this document.

- A given mount can be in one of the following states
+a) Mount states

- 1) shared
- 2) slave
- 3) shared and slave
- 4) private
- 5) unbindable
+ A **propagation event** is defined as event generated on a vfsmount
+ that leads to mount or unmount actions in other vfsmounts.

- A 'propagation event' is defined as event generated on a vfsmount
- that leads to mount or unmount actions in other vfsmounts.
+ A **peer group** is defined as a group of vfsmounts that propagate
+ events to each other.

- A 'peer group' is defined as a group of vfsmounts that propagate
- events to each other.
+ A given mount can be in one of the following states:

- (1) Shared mounts
+ (1) Shared mounts

- A 'shared mount' is defined as a vfsmount that belongs to a
- 'peer group'.
+ A **shared mount** is defined as a vfsmount that belongs to a
+ peer group.

- For example::
+ For example::

- mount --make-shared /mnt
- mount --bind /mnt /tmp
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp

- The mount at /mnt and that at /tmp are both shared and belong
- to the same peer group. Anything mounted or unmounted under
- /mnt or /tmp reflect in all the other mounts of its peer
- group.
+ The mount at /mnt and that at /tmp are both shared and belong
+ to the same peer group. Anything mounted or unmounted under
+ /mnt or /tmp reflect in all the other mounts of its peer
+ group.

- (2) Slave mounts
+ (2) Slave mounts

- A 'slave mount' is defined as a vfsmount that receives
- propagation events and does not forward propagation events.
+ A **slave mount** is defined as a vfsmount that receives
+ propagation events and does not forward propagation events.

- A slave mount as the name implies has a master mount from which
- mount/unmount events are received. Events do not propagate from
- the slave mount to the master. Only a shared mount can be made
- a slave by executing the following command::
+ A slave mount as the name implies has a master mount from which
+ mount/unmount events are received. Events do not propagate from
+ the slave mount to the master. Only a shared mount can be made
+ a slave by executing the following command::

- mount --make-slave mount
+ mount --make-slave mount

- A shared mount that is made as a slave is no more shared unless
- modified to become shared.
+ A shared mount that is made as a slave is no more shared unless
+ modified to become shared.

- (3) Shared and Slave
+ (3) Shared and Slave

- A vfsmount can be both shared as well as slave. This state
- indicates that the mount is a slave of some vfsmount, and
- has its own peer group too. This vfsmount receives propagation
- events from its master vfsmount, and also forwards propagation
- events to its 'peer group' and to its slave vfsmounts.
+ A vfsmount can be both **shared** as well as **slave**. This state
+ indicates that the mount is a slave of some vfsmount, and
+ has its own peer group too. This vfsmount receives propagation
+ events from its master vfsmount, and also forwards propagation
+ events to its 'peer group' and to its slave vfsmounts.

- Strictly speaking, the vfsmount is shared having its own
- peer group, and this peer-group is a slave of some other
- peer group.
+ Strictly speaking, the vfsmount is shared having its own
+ peer group, and this peer-group is a slave of some other
+ peer group.

- Only a slave vfsmount can be made as 'shared and slave' by
- either executing the following command::
+ Only a slave vfsmount can be made as 'shared and slave' by
+ either executing the following command::

- mount --make-shared mount
+ mount --make-shared mount

- or by moving the slave vfsmount under a shared vfsmount.
+ or by moving the slave vfsmount under a shared vfsmount.
- (4) Private mount
+ (4) Private mount

- A 'private mount' is defined as vfsmount that does not
- receive or forward any propagation events.
+ A **private mount** is defined as vfsmount that does not
+ receive or forward any propagation events.

- (5) Unbindable mount
+ (5) Unbindable mount

- A 'unbindable mount' is defined as vfsmount that does not
- receive or forward any propagation events and cannot
- be bind mounted.
+ A **unbindable mount** is defined as vfsmount that does not
+ receive or forward any propagation events and cannot
+ be bind mounted.

- State diagram:
+ State diagram:

- The state diagram below explains the state transition of a mount,
- in response to various commands::
+ The state diagram below explains the state transition of a mount,
+ in response to various commands::

- -----------------------------------------------------------------------
- |             |make-shared | make-slave   | make-private |make-unbindab|
- --------------|------------|--------------|--------------|-------------|
- |shared       |shared      |*slave/private| private      | unbindable  |
- |             |            |              |              |             |
- |-------------|------------|--------------|--------------|-------------|
- |slave        |shared      | **slave      | private      | unbindable  |
- |             |and slave   |              |              |             |
- |-------------|------------|--------------|--------------|-------------|
- |shared       |shared      | slave        | private      | unbindable  |
- |and slave    |and slave   |              |              |             |
- |-------------|------------|--------------|--------------|-------------|
- |private      |shared      | **private    | private      | unbindable  |
- |-------------|------------|--------------|--------------|-------------|
- |unbindable   |shared      |**unbindable  | private      | unbindable  |
- ------------------------------------------------------------------------
+ -----------------------------------------------------------------------
+ |             |make-shared | make-slave   | make-private |make-unbindab|
+ --------------|------------|--------------|--------------|-------------|
+ |shared       |shared      |*slave/private| private      | unbindable  |
+ |             |            |              |              |             |
+ |-------------|------------|--------------|--------------|-------------|
+ |slave        |shared      | **slave      | private      | unbindable  |
+ |             |and slave   |              |              |             |
+ |-------------|------------|--------------|--------------|-------------|
+ |shared       |shared      | slave        | private      | unbindable  |
+ |and slave    |and slave   |              |              |             |
+ |-------------|------------|--------------|--------------|-------------|
+ |private      |shared      | **private    | private      | unbindable  |
+ |-------------|------------|--------------|--------------|-------------|
+ |unbindable   |shared      |**unbindable  | private      | unbindable  |
+ ------------------------------------------------------------------------

- * if the shared mount is the only mount in its peer group, making it
- slave, makes it private automatically. Note that there is no master to
- which it can be slaved to.
+ * if the shared mount is the only mount in its peer group, making it
+ slave, makes it private automatically. Note that there is no master to
+ which it can be slaved to.

- ** slaving a non-shared mount has no effect on the mount.
+ ** slaving a non-shared mount has no effect on the mount.

- Apart from the commands listed below, the 'move' operation also changes
- the state of a mount depending on type of the destination mount. Its
- explained in section 5d.
+ Apart from the commands listed below, the 'move' operation also changes
+ the state of a mount depending on type of the destination mount. Its
+ explained in section 5d.

-5b) Bind semantics
+b) Bind semantics

- Consider the following command::
+ Consider the following command::

- mount --bind A/a B/b
+ mount --bind A/a B/b

- where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
- is the destination mount and 'b' is the dentry in the destination mount.
+ where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
+ is the destination mount and 'b' is the dentry in the destination mount.

- The outcome depends on the type of mount of 'A' and 'B'. The table
- below contains quick reference::
+ The outcome depends on the type of mount of 'A' and 'B'. The table
+ below contains quick reference::

- --------------------------------------------------------------------------
- |                          BIND MOUNT OPERATION                          |
- |************************************************************************|
- |source(A)->| shared      |     private    |     slave      | unbindable |
- | dest(B)  |              |                |                |            |
- |   |      |              |                |                |            |
- |   v      |              |                |                |            |
- |************************************************************************|
- |  shared  | shared       |     shared     | shared & slave |  invalid   |
- |          |              |                |                |            |
- |non-shared| shared       |     private    |     slave      |  invalid   |
- **************************************************************************
+ --------------------------------------------------------------------------
+ |                          BIND MOUNT OPERATION                          |
+ |************************************************************************|
+ |source(A)->| shared      |     private    |     slave      | unbindable |
+ | dest(B)  |              |                |                |            |
+ |   |      |              |                |                |            |
+ |   v      |              |                |                |            |
+ |************************************************************************|
+ |  shared  | shared       |     shared     | shared & slave |  invalid   |
+ |          |              |                |                |            |
+ |non-shared| shared       |     private    |     slave      |  invalid   |
+ **************************************************************************

- Details:
+ Details:

- 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
- which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
- are created and mounted at the dentry 'b' on all mounts where 'B'
- propagates to. A new propagation tree containing 'C1',..,'Cn' is
- created. This propagation tree is identical to the propagation tree of
- 'B'. And finally the peer-group of 'C' is merged with the peer group
- of 'A'.
+ 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
+ which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree containing 'C1',..,'Cn' is
+ created. This propagation tree is identical to the propagation tree of
+ 'B'. And finally the peer-group of 'C' is merged with the peer group
+ of 'A'.

- 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
- which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
- are created and mounted at the dentry 'b' on all mounts where 'B'
- propagates to. A new propagation tree is set containing all new mounts
- 'C', 'C1', .., 'Cn' with exactly the same configuration as the
- propagation tree for 'B'.
+ 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
+ which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree is set containing all new mounts
+ 'C', 'C1', .., 'Cn' with exactly the same configuration as the
+ propagation tree for 'B'.

- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
- mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
- 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
- 'C3' ... are created and mounted at the dentry 'b' on all mounts where
- 'B' propagates to. A new propagation tree containing the new mounts
- 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
- propagation tree for 'B'. And finally the mount 'C' and its peer group
- is made the slave of mount 'Z'. In other words, mount 'C' is in the
- state 'slave and shared'.
-
- 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
- invalid operation.
-
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
- unbindable) mount. A new mount 'C' which is clone of 'A', is created.
- Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
-
- 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
- which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
- mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
- peer-group of 'A'.
-
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
- new mount 'C' which is a clone of 'A' is created. Its root dentry is
- 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
- slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
- 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
- mount/unmount on 'A' do not propagate anywhere else. Similarly
- mount/unmount on 'C' do not propagate anywhere else.
-
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
- invalid operation. A unbindable mount cannot be bind mounted.
-
-5c) Rbind semantics
-
- rbind is same as bind. Bind replicates the specified mount. Rbind
- replicates all the mounts in the tree belonging to the specified mount.
- Rbind mount is bind mount applied to all the mounts in the tree.
-
- If the source tree that is rbind has some unbindable mounts,
- then the subtree under the unbindable mount is pruned in the new
- location.
-
- eg:
-
- let's say we have the following mount tree::
-
-        A
-      /   \
-      B   C
-     / \ / \
-     D E F G
-
- Let's say all the mount except the mount C in the tree are
- of a type other than unbindable.
-
- If this tree is rbound to say Z
-
- We will have the following tree at the new location::
-
-        Z
-        |
-        A'
-       /
-      B'     Note how the tree under C is pruned
-     / \     in the new location.
-    D' E'
-
-
-
-5d) Move semantics
-
- Consider the following command
-
- mount --move A B/b
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+ mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
+ 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
+ 'C3' ... are created and mounted at the dentry 'b' on all mounts where
+ 'B' propagates to. A new propagation tree containing the new mounts
+ 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
+ propagation tree for 'B'. And finally the mount 'C' and its peer group
+ is made the slave of mount 'Z'. In other words, mount 'C' is in the
+ state 'slave and shared'.
+
+ 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
+ invalid operation.
+
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ unbindable) mount. A new mount 'C' which is clone of 'A', is created.
+ Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
+
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
+ peer-group of 'A'.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
+ new mount 'C' which is a clone of 'A' is created. Its root dentry is
+ 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
+ slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
+ 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
+ mount/unmount on 'A' do not propagate anywhere else. Similarly
+ mount/unmount on 'C' do not propagate anywhere else.
+
+ 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
+ invalid operation. A unbindable mount cannot be bind mounted.
+
+c) Rbind semantics
+
+ rbind is same as bind. Bind replicates the specified mount. Rbind
+ replicates all the mounts in the tree belonging to the specified mount.
+ Rbind mount is bind mount applied to all the mounts in the tree.
+
+ If the source tree that is rbind has some unbindable mounts,
+ then the subtree under the unbindable mount is pruned in the new
+ location.
+
+ eg:
+
+ let's say we have the following mount tree::
+
+        A
+      /   \
+      B   C
+     / \ / \
+     D E F G
+
+ Let's say all the mount except the mount C in the tree are
+ of a type other than unbindable.
+
+ If this tree is rbound to say Z
+
+ We will have the following tree at the new location::
+
+        Z
+        |
+        A'
+       /
+      B'     Note how the tree under C is pruned
+     / \     in the new location.
+    D' E'
+
+
+
+d) Move semantics
+
+ Consider the following command::
+
+ mount --move A B/b

- where 'A' is the source mount, 'B' is the destination mount and 'b' is
- the dentry in the destination mount.
+ where 'A' is the source mount, 'B' is the destination mount and 'b' is
+ the dentry in the destination mount.

- The outcome depends on the type of the mount of 'A' and 'B'. The table
- below is a quick reference::
+ The outcome depends on the type of the mount of 'A' and 'B'. The table
+ below is a quick reference::

- ---------------------------------------------------------------------------
- |                          MOVE MOUNT OPERATION                          |
- |**************************************************************************
- | source(A)->| shared     |     private    |     slave      | unbindable |
- | dest(B)  |              |                |                |            |
- |   |      |              |                |                |            |
- |   v      |              |                |                |            |
- |**************************************************************************
- |  shared  | shared       |     shared     |shared and slave|  invalid   |
- |          |              |                |                |            |
- |non-shared| shared       |     private    |     slave      | unbindable |
- ***************************************************************************
+ ---------------------------------------------------------------------------
+ |                          MOVE MOUNT OPERATION                          |
+ |**************************************************************************
+ | source(A)->| shared     |     private    |     slave      | unbindable |
+ | dest(B)  |              |                |                |            |
+ |   |      |              |                |                |            |
+ |   v      |              |                |                |            |
+ |**************************************************************************
+ |  shared  | shared       |     shared     |shared and slave|  invalid   |
+ |          |              |                |                |            |
+ |non-shared| shared       |     private    |     slave      | unbindable |
+ ***************************************************************************

- .. Note:: moving a mount residing under a shared mount is invalid.
+ .. Note:: moving a mount residing under a shared mount is invalid.

- Details follow:
+ Details follow:

- 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
- mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
- are created and mounted at dentry 'b' on all mounts that receive
- propagation from mount 'B'. A new propagation tree is created in the
- exact same configuration as that of 'B'. This new propagation tree
- contains all the new mounts 'A1', 'A2'... 'An'. And this new
- propagation tree is appended to the already existing propagation tree
- of 'A'.
+ 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. A new propagation tree is created in the
+ exact same configuration as that of 'B'. This new propagation tree
+ contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree
+ of 'A'.

- 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
- are created and mounted at dentry 'b' on all mounts that receive
- propagation from mount 'B'. The mount 'A' becomes a shared mount and a
- propagation tree is created which is identical to that of
- 'B'. This new propagation tree contains all the new mounts 'A1',
- 'A2'... 'An'.
+ 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. The mount 'A' becomes a shared mount and a
+ propagation tree is created which is identical to that of
+ 'B'. This new propagation tree contains all the new mounts 'A1',
+ 'A2'... 'An'.

- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
- mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
- 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
- receive propagation from mount 'B'. A new propagation tree is created
- in the exact same configuration as that of 'B'. This new propagation
- tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
- propagation tree is appended to the already existing propagation tree of
- 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
- becomes 'shared'.
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
+ 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
+ receive propagation from mount 'B'. A new propagation tree is created
+ in the exact same configuration as that of 'B'. This new propagation
+ tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree of
+ 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
+ becomes 'shared'.

- 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
- is invalid. Because mounting anything on the shared mount 'B' can
- create new mounts that get mounted on the mounts that receive
- propagation from 'B'. And since the mount 'A' is unbindable, cloning
- it to mount at other mountpoints is not possible.
+ 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
+ is invalid. Because mounting anything on the shared mount 'B' can
+ create new mounts that get mounted on the mounts that receive
+ propagation from 'B'. And since the mount 'A' is unbindable, cloning
+ it to mount at other mountpoints is not possible.

- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
- unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.

- 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
- is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
- shared mount.
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ shared mount.

- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
- The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
- continues to be a slave mount of mount 'Z'.
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
+ The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
+ continues to be a slave mount of mount 'Z'.

- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
- 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
- unbindable mount.
+ 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
+ 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ unbindable mount.

-5e) Mount semantics
+e) Mount semantics

- Consider the following command::
+ Consider the following command::

- mount device B/b
+ mount device B/b

- 'B' is the destination mount and 'b' is the dentry in the destination
- mount.
+ 'B' is the destination mount and 'b' is the dentry in the destination
+ mount.

- The above operation is the same as bind operation with the exception
- that the source mount is always a private mount.
+ The above operation is the same as bind operation with the exception
+ that the source mount is always a private mount.

-5f) Unmount semantics
+f) Unmount semantics

- Consider the following command::
+ Consider the following command::

- umount A
+ umount A

- where 'A' is a mount mounted on mount 'B' at dentry 'b'.
+ where 'A' is a mount mounted on mount 'B' at dentry 'b'.

- If mount 'B' is shared, then all most-recently-mounted mounts at dentry
- 'b' on mounts that receive propagation from mount 'B' and does not have
- sub-mounts within them are unmounted.
+ If mount 'B' is shared, then all most-recently-mounted mounts at dentry
+ 'b' on mounts that receive propagation from mount 'B' and does not have
+ sub-mounts within them are unmounted.

- Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
- each other.
+ Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
+ each other.

- let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
- 'B1', 'B2' and 'B3' respectively.
+ let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount + 'B1', 'B2' and 'B3' respectively. - let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on - mount 'B1', 'B2' and 'B3' respectively. + let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on + mount 'B1', 'B2' and 'B3' respectively. - if 'C1' is unmounted, all the mounts that are most-recently-mounted on - 'B1' and on the mounts that 'B1' propagates-to are unmounted. + if 'C1' is unmounted, all the mounts that are most-recently-mounted on + 'B1' and on the mounts that 'B1' propagates-to are unmounted. - 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount - on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'. + 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount + on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'. - So all 'C1', 'C2' and 'C3' should be unmounted. + So all 'C1', 'C2' and 'C3' should be unmounted. - If any of 'C2' or 'C3' has some child mounts, then that mount is not - unmounted, but all other mounts are unmounted. However if 'C1' is told - to be unmounted and 'C1' has some sub-mounts, the umount operation is - failed entirely. + If any of 'C2' or 'C3' has some child mounts, then that mount is not + unmounted, but all other mounts are unmounted. However if 'C1' is told + to be unmounted and 'C1' has some sub-mounts, the umount operation is + failed entirely. -5g) Clone Namespace +g) Clone Namespace - A cloned namespace contains all the mounts as that of the parent - namespace. + A cloned namespace contains all the mounts as that of the parent + namespace. - Let's say 'A' and 'B' are the corresponding mounts in the parent and the - child namespace. + Let's say 'A' and 'B' are the corresponding mounts in the parent and the + child namespace. - If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to - each other. 
+ If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to + each other. - If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of - 'Z'. + If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of + 'Z'. - If 'A' is a private mount, then 'B' is a private mount too. + If 'A' is a private mount, then 'B' is a private mount too. - If 'A' is unbindable mount, then 'B' is a unbindable mount too. + If 'A' is unbindable mount, then 'B' is a unbindable mount too. 6) Quiz ------- - A. What is the result of the following command sequence? +A. What is the result of the following command sequence? - :: + :: - mount --bind /mnt /mnt - mount --make-shared /mnt - mount --bind /mnt /tmp - mount --move /tmp /mnt/1 + mount --bind /mnt /mnt + mount --make-shared /mnt + mount --bind /mnt /tmp + mount --move /tmp /mnt/1 - what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? - Should they all be identical? or should /mnt and /mnt/1 be - identical only? + what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? + Should they all be identical? or should /mnt and /mnt/1 be + identical only? - B. What is the result of the following command sequence? +B. What is the result of the following command sequence? - :: + :: - mount --make-rshared / - mkdir -p /v/1 - mount --rbind / /v/1 + mount --make-rshared / + mkdir -p /v/1 + mount --rbind / /v/1 - what should be the content of /v/1/v/1 be? + what should be the content of /v/1/v/1 be? - C. What is the result of the following command sequence? +C. What is the result of the following command sequence? 
- :: + :: - mount --bind /mnt /mnt - mount --make-shared /mnt - mkdir -p /mnt/1/2/3 /mnt/1/test - mount --bind /mnt/1 /tmp - mount --make-slave /mnt - mount --make-shared /mnt - mount --bind /mnt/1/2 /tmp1 - mount --make-slave /mnt + mount --bind /mnt /mnt + mount --make-shared /mnt + mkdir -p /mnt/1/2/3 /mnt/1/test + mount --bind /mnt/1 /tmp + mount --make-slave /mnt + mount --make-shared /mnt + mount --bind /mnt/1/2 /tmp1 + mount --make-slave /mnt - At this point we have the first mount at /tmp and - its root dentry is 1. Let's call this mount 'A' - And then we have a second mount at /tmp1 with root - dentry 2. Let's call this mount 'B' - Next we have a third mount at /mnt with root dentry - mnt. Let's call this mount 'C' + At this point we have the first mount at /tmp and + its root dentry is 1. Let's call this mount 'A' + And then we have a second mount at /tmp1 with root + dentry 2. Let's call this mount 'B' + Next we have a third mount at /mnt with root dentry + mnt. Let's call this mount 'C' - 'B' is the slave of 'A' and 'C' is a slave of 'B' - A -> B -> C + 'B' is the slave of 'A' and 'C' is a slave of 'B' + A -> B -> C - at this point if we execute the following command + at this point if we execute the following command:: - mount --bind /bin /tmp/test + mount --bind /bin /tmp/test - The mount is attempted on 'A' + The mount is attempted on 'A' - will the mount propagate to 'B' and 'C' ? + will the mount propagate to 'B' and 'C' ? - what would be the contents of - /mnt/1/test be? + what would be the contents of + /mnt/1/test be? 7) FAQ ------ - Q1. Why is bind mount needed? How is it different from symbolic links? - symbolic links can get stale if the destination mount gets - unmounted or moved. Bind mounts continue to exist even if the - other mount is unmounted or moved. +1. Why is bind mount needed? How is it different from symbolic links? - Q2. Why can't the shared subtree be implemented using exportfs? 
+ symbolic links can get stale if the destination mount gets + unmounted or moved. Bind mounts continue to exist even if the + other mount is unmounted or moved. - exportfs is a heavyweight way of accomplishing part of what - shared subtree can do. I cannot imagine a way to implement the - semantics of slave mount using exportfs? +2. Why can't the shared subtree be implemented using exportfs? - Q3 Why is unbindable mount needed? + exportfs is a heavyweight way of accomplishing part of what + shared subtree can do. I cannot imagine a way to implement the + semantics of slave mount using exportfs? - Let's say we want to replicate the mount tree at multiple - locations within the same subtree. +3. Why is unbindable mount needed? - if one rbind mounts a tree within the same subtree 'n' times - the number of mounts created is an exponential function of 'n'. - Having unbindable mount can help prune the unneeded bind - mounts. Here is an example. + Let's say we want to replicate the mount tree at multiple + locations within the same subtree. - step 1: - let's say the root tree has just two directories with - one vfsmount:: + if one rbind mounts a tree within the same subtree 'n' times + the number of mounts created is an exponential function of 'n'. + Having unbindable mount can help prune the unneeded bind + mounts. Here is an example. 
- root - / \ - tmp usr + step 1: + let's say the root tree has just two directories with + one vfsmount:: - And we want to replicate the tree at multiple - mountpoints under /root/tmp + root + / \ + tmp usr - step 2: - :: + And we want to replicate the tree at multiple + mountpoints under /root/tmp + step 2: + :: - mount --make-shared /root - mkdir -p /tmp/m1 + mount --make-shared /root - mount --rbind /root /tmp/m1 + mkdir -p /tmp/m1 - the new tree now looks like this:: + mount --rbind /root /tmp/m1 - root - / \ - tmp usr - / - m1 - / \ - tmp usr - / - m1 + the new tree now looks like this:: - it has two vfsmounts + root + / \ + tmp usr + / + m1 + / \ + tmp usr + / + m1 - step 3: - :: + it has two vfsmounts - mkdir -p /tmp/m2 - mount --rbind /root /tmp/m2 + step 3: + :: - the new tree now looks like this:: + mkdir -p /tmp/m2 + mount --rbind /root /tmp/m2 - root - / \ - tmp usr - / \ - m1 m2 - / \ / \ - tmp usr tmp usr - / \ / - m1 m2 m1 - / \ / \ - tmp usr tmp usr - / / \ - m1 m1 m2 - / \ - tmp usr - / \ - m1 m2 + the new tree now looks like this:: - it has 6 vfsmounts + root + / \ + tmp usr + / \ + m1 m2 + / \ / \ + tmp usr tmp usr + / \ / + m1 m2 m1 + / \ / \ + tmp usr tmp usr + / / \ + m1 m1 m2 + / \ + tmp usr + / \ + m1 m2 - step 4: - :: - mkdir -p /tmp/m3 - mount --rbind /root /tmp/m3 + it has 6 vfsmounts - I won't draw the tree..but it has 24 vfsmounts + step 4: + :: + mkdir -p /tmp/m3 + mount --rbind /root /tmp/m3 - at step i the number of vfsmounts is V[i] = i*V[i-1]. - This is an exponential function. And this tree has way more - mounts than what we really needed in the first place. + I won't draw the tree..but it has 24 vfsmounts - One could use a series of umount at each step to prune - out the unneeded mounts. But there is a better solution. - Unclonable mounts come in handy here. - step 1: - let's say the root tree has just two directories with - one vfsmount:: + at step i the number of vfsmounts is V[i] = i*V[i-1]. + This is an exponential function. 
And this tree has way more + mounts than what we really needed in the first place. - root - / \ - tmp usr + One could use a series of umount at each step to prune + out the unneeded mounts. But there is a better solution. + Unclonable mounts come in handy here. - How do we set up the same tree at multiple locations under - /root/tmp + step 1: + let's say the root tree has just two directories with + one vfsmount:: - step 2: - :: + root + / \ + tmp usr + How do we set up the same tree at multiple locations under + /root/tmp - mount --bind /root/tmp /root/tmp + step 2: + :: - mount --make-rshared /root - mount --make-unbindable /root/tmp - mkdir -p /tmp/m1 + mount --bind /root/tmp /root/tmp - mount --rbind /root /tmp/m1 + mount --make-rshared /root + mount --make-unbindable /root/tmp - the new tree now looks like this:: + mkdir -p /tmp/m1 - root - / \ - tmp usr - / - m1 - / \ - tmp usr + mount --rbind /root /tmp/m1 - step 3: - :: + the new tree now looks like this:: - mkdir -p /tmp/m2 - mount --rbind /root /tmp/m2 + root + / \ + tmp usr + / + m1 + / \ + tmp usr - the new tree now looks like this:: + step 3: + :: - root - / \ - tmp usr - / \ - m1 m2 - / \ / \ - tmp usr tmp usr + mkdir -p /tmp/m2 + mount --rbind /root /tmp/m2 - step 4: - :: + the new tree now looks like this:: - mkdir -p /tmp/m3 - mount --rbind /root /tmp/m3 + root + / \ + tmp usr + / \ + m1 m2 + / \ / \ + tmp usr tmp usr - the new tree now looks like this:: + step 4: + :: - root - / \ - tmp usr - / \ \ - m1 m2 m3 - / \ / \ / \ - tmp usr tmp usr tmp usr + mkdir -p /tmp/m3 + mount --rbind /root /tmp/m3 + + the new tree now looks like this:: + + root + / \ + tmp usr + / \ \ + m1 m2 m3 + / \ / \ / \ + tmp usr tmp usr tmp usr 8) Implementation ----------------- -8A) Datastructure +A) Datastructure + + Several new fields are introduced to struct vfsmount: + + ->mnt_share + Links together all the mount to/from which this vfsmount + send/receives propagation events. 
- 4 new fields are introduced to struct vfsmount: + ->mnt_slave_list + Links all the mounts to which this vfsmount propagates + to. - * ->mnt_share - * ->mnt_slave_list - * ->mnt_slave - * ->mnt_master + ->mnt_slave + Links together all the slaves that its master vfsmount + propagates to. - ->mnt_share - links together all the mount to/from which this vfsmount - send/receives propagation events. + ->mnt_master + Points to the master vfsmount from which this vfsmount + receives propagation. - ->mnt_slave_list - links all the mounts to which this vfsmount propagates - to. + ->mnt_flags + Takes two more flags to indicate the propagation status of + the vfsmount. MNT_SHARE indicates that the vfsmount is a shared + vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be + replicated. - ->mnt_slave - links together all the slaves that its master vfsmount - propagates to. + All the shared vfsmounts in a peer group form a cyclic list through + ->mnt_share. - ->mnt_master - points to the master vfsmount from which this vfsmount - receives propagation. + All vfsmounts with the same ->mnt_master form on a cyclic list anchored + in ->mnt_master->mnt_slave_list and going through ->mnt_slave. - ->mnt_flags - takes two more flags to indicate the propagation status of - the vfsmount. MNT_SHARE indicates that the vfsmount is a shared - vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be - replicated. + ->mnt_master can point to arbitrary (and possibly different) members + of master peer group. To find all immediate slaves of a peer group + you need to go through _all_ ->mnt_slave_list of its members. + Conceptually it's just a single set - distribution among the + individual lists does not affect propagation or the way propagation + tree is modified by operations. - All the shared vfsmounts in a peer group form a cyclic list through - ->mnt_share. + All vfsmounts in a peer group have the same ->mnt_master. 
If it is + non-NULL, they form a contiguous (ordered) segment of slave list. - All vfsmounts with the same ->mnt_master form on a cyclic list anchored - in ->mnt_master->mnt_slave_list and going through ->mnt_slave. + A example propagation tree looks as shown in the figure below. - ->mnt_master can point to arbitrary (and possibly different) members - of master peer group. To find all immediate slaves of a peer group - you need to go through _all_ ->mnt_slave_list of its members. - Conceptually it's just a single set - distribution among the - individual lists does not affect propagation or the way propagation - tree is modified by operations. + .. note:: + Though it looks like a forest, if we consider all the shared + mounts as a conceptual entity called 'pnode', it becomes a tree. - All vfsmounts in a peer group have the same ->mnt_master. If it is - non-NULL, they form a contiguous (ordered) segment of slave list. + :: - A example propagation tree looks as shown in the figure below. - [ NOTE: Though it looks like a forest, if we consider all the shared - mounts as a conceptual entity called 'pnode', it becomes a tree]:: + A <--> B <--> C <---> D + /|\ /| |\ + / F G J K H I + / + E<-->K + /|\ + M L N - A <--> B <--> C <---> D - /|\ /| |\ - / F G J K H I - / - E<-->K - /|\ - M L N + In the above figure A,B,C and D all are shared and propagate to each + other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave + mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'. + 'E' is also shared with 'K' and they propagate to each other. And + 'K' has 3 slaves 'M', 'L' and 'N' - In the above figure A,B,C and D all are shared and propagate to each - other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave - mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'. - 'E' is also shared with 'K' and they propagate to each other. 
And - 'K' has 3 slaves 'M', 'L' and 'N' + A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D' - A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D' + A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' - A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' + E's ->mnt_share links with ->mnt_share of K - E's ->mnt_share links with ->mnt_share of K + 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A' - 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A' + 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' - 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' + K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' - K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' + C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' - C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' + J and K's ->mnt_master points to struct vfsmount of C - J and K's ->mnt_master points to struct vfsmount of C + and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' - and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' + 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. - 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. + NOTE: The propagation tree is orthogonal to the mount tree. - NOTE: The propagation tree is orthogonal to the mount tree. +B) Locking: -8B Locking: + ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected + by namespace_sem (exclusive for modifications, shared for reading). - ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected - by namespace_sem (exclusive for modifications, shared for reading). + Normally we have ->mnt_flags modifications serialized by vfsmount_lock. + There are two exceptions: do_add_mount() and clone_mnt(). 
+ The former modifies a vfsmount that has not been visible in any shared + data structures yet. + The latter holds namespace_sem and the only references to vfsmount + are in lists that can't be traversed without namespace_sem. - Normally we have ->mnt_flags modifications serialized by vfsmount_lock. - There are two exceptions: do_add_mount() and clone_mnt(). - The former modifies a vfsmount that has not been visible in any shared - data structures yet. - The latter holds namespace_sem and the only references to vfsmount - are in lists that can't be traversed without namespace_sem. +C) Algorithm: -8C Algorithm: + The crux of the implementation resides in rbind/move operation. - The crux of the implementation resides in rbind/move operation. + The overall algorithm breaks the operation into 3 phases: (look at + attach_recursive_mnt() and propagate_mnt()) - The overall algorithm breaks the operation into 3 phases: (look at - attach_recursive_mnt() and propagate_mnt()) + 1. Prepare phase. - 1. prepare phase. - 2. commit phases. - 3. abort phases. + For each mount in the source tree: - Prepare phase: + a) Create the necessary number of mount trees to + be attached to each of the mounts that receive + propagation from the destination mount. + b) Do not attach any of the trees to its destination. + However note down its ->mnt_parent and ->mnt_mountpoint + c) Link all the new mounts to form a propagation tree that + is identical to the propagation tree of the destination + mount. - for each mount in the source tree: + If this phase is successful, there should be 'n' new + propagation trees; where 'n' is the number of mounts in the + source tree. Go to the commit phase - a) Create the necessary number of mount trees to - be attached to each of the mounts that receive - propagation from the destination mount. - b) Do not attach any of the trees to its destination. 
- However note down its ->mnt_parent and ->mnt_mountpoint - c) Link all the new mounts to form a propagation tree that - is identical to the propagation tree of the destination - mount. + Also there should be 'm' new mount trees, where 'm' is + the number of mounts to which the destination mount + propagates to. - If this phase is successful, there should be 'n' new - propagation trees; where 'n' is the number of mounts in the - source tree. Go to the commit phase + If any memory allocations fail, go to the abort phase. - Also there should be 'm' new mount trees, where 'm' is - the number of mounts to which the destination mount - propagates to. + 2. Commit phase. - if any memory allocations fail, go to the abort phase. + Attach each of the mount trees to their corresponding + destination mounts. - Commit phase - attach each of the mount trees to their corresponding - destination mounts. + 3. Abort phase. - Abort phase - delete all the newly created trees. + Delete all the newly created trees. - .. Note:: - all the propagation related functionality resides in the file pnode.c + .. Note:: + all the propagation related functionality resides in the file pnode.c ------------------------------------------------------------------------ diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst index c32993bc83c7..2703c04af7d0 100644 --- a/Documentation/filesystems/sysfs.rst +++ b/Documentation/filesystems/sysfs.rst @@ -243,8 +243,8 @@ Other notes: - show() methods should return the number of bytes printed into the buffer. -- show() should only use sysfs_emit() or sysfs_emit_at() when formatting - the value to be returned to user space. +- New implementations of show() methods should only use sysfs_emit() or + sysfs_emit_at() when formatting the value to be returned to user space. - store() should return the number of bytes used from the buffer. If the entire buffer has been used, just return the count argument. 
@@ -299,7 +299,6 @@ The top level sysfs directory looks like:: hypervisor/ kernel/ module/ - net/ power/ devices/ contains a filesystem representation of the device tree. It maps @@ -313,7 +312,7 @@ kernel. Each bus's directory contains two subdirectories:: drivers/ devices/ contains symlinks for each device discovered in the system -that point to the device's directory under root/. +that point to the device's directory under /sys/devices. drivers/ contains a directory for each device driver that is loaded for devices on that particular bus (this assumes that drivers do not @@ -321,22 +320,36 @@ span multiple bus types). fs/ contains a directory for some filesystems. Currently each filesystem wanting to export attributes must create its own hierarchy -below fs/ (see ./fuse.rst for an example). +below fs/ (see fuse/fuse.rst for an example). module/ contains parameter values and state information for all loaded system modules, for both builtin and loadable modules. dev/ contains two directories: char/ and block/. Inside these two directories there are symlinks named <major>:<minor>. These symlinks -point to the sysfs directory for the given device. /sys/dev provides a +point to the directories under /sys/devices for each device. /sys/dev provides a quick way to lookup the sysfs interface for a device from the result of a stat(2) operation. More information on driver-model specific features can be found in Documentation/driver-api/driver-model/. +block/ contains symlinks to all the block devices discovered on the system. +These symlinks point to directories under /sys/devices. -TODO: Finish this section. +class/ contains a directory for each device class, grouped by functional type. +Each directory in class/ contains symlinks to devices in the /sys/devices directory. + +firmware/ contains system firmware data and configuration such as firmware tables, +ACPI information, and device tree data. 
+ +hypervisor/ contains virtualization platform information and provides an interface to +the underlying hypervisor. It is only present when running on a virtual machine. + +kernel/ contains runtime kernel parameters, configuration settings, and status. + +power/ contains power management subsystem information including +sleep states, suspend/resume capabilities, and policies. Current Interfaces diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index e231d127cd40..8cbcd3c26434 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -454,7 +454,7 @@ filesystem so that it can apply pending filesystem updates to the staging information. Once the scan is done, the owning object is re-locked, the live data is used to write a new ondisk structure, and the repairs are committed atomically. -The hooks are disabled and the staging staging area is freed. +The hooks are disabled and the staging area is freed. Finally, the storage from the old data structure are carefully reaped. Introducing concurrency helps online repair avoid various locking problems, but @@ -475,7 +475,7 @@ operation, which may cause application failure or an unplanned filesystem shutdown. Inspiration for the secondary metadata repair strategy was drawn from section -2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File") +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File") and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for Creating Indexes for Very Large Tables Without Quiescing Updates" <https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992. 
@@ -2185,7 +2185,7 @@ The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
 checking and repairing of secondary metadata commonly requires coordination
 between a live metadata scan of the filesystem and writer threads that are
 updating that metadata.
-Keeping the scan data up to date requires requires the ability to propagate
+Keeping the scan data up to date requires the ability to propagate
 metadata updates from the filesystem into the data being collected by the scan.
 This *can* be done by appending concurrent updates into a separate log file and
 applying them before writing the new metadata to disk, but this leads to
@@ -4179,7 +4179,7 @@ When the exchange is initiated, the sequence of operations is as follows:
 This will be discussed in more detail in subsequent sections.
 
 If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished maping exchange log intent item and restart
+find the most recent unfinished mapping exchange log intent item and restart
 from there.
 This is how atomic file mapping exchanges guarantees that an outside observer
 will either see the old broken structure or the new one, and never a mismash of