Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/bcachefs/CodingStyle.rst 186
-rw-r--r-- Documentation/filesystems/bcachefs/SubmittingPatches.rst 105
-rw-r--r-- Documentation/filesystems/bcachefs/casefolding.rst 108
-rw-r--r-- Documentation/filesystems/bcachefs/errorcodes.rst 30
-rw-r--r-- Documentation/filesystems/bcachefs/future/idle_work.rst 78
-rw-r--r-- Documentation/filesystems/bcachefs/index.rst 38
-rw-r--r-- Documentation/filesystems/erofs.rst 2
-rw-r--r-- Documentation/filesystems/ext4/atomic_writes.rst 6
-rw-r--r-- Documentation/filesystems/f2fs.rst 122
-rw-r--r-- Documentation/filesystems/fuse/fuse-io-uring.rst (renamed from Documentation/filesystems/fuse-io-uring.rst) 0
-rw-r--r-- Documentation/filesystems/fuse/fuse-io.rst (renamed from Documentation/filesystems/fuse-io.rst) 2
-rw-r--r-- Documentation/filesystems/fuse/fuse-passthrough.rst (renamed from Documentation/filesystems/fuse-passthrough.rst) 0
-rw-r--r-- Documentation/filesystems/fuse/fuse.rst (renamed from Documentation/filesystems/fuse.rst) 20
-rw-r--r-- Documentation/filesystems/fuse/index.rst 14
-rw-r--r-- Documentation/filesystems/gfs2-glocks.rst 2
-rw-r--r-- Documentation/filesystems/hpfs.rst 2
-rw-r--r-- Documentation/filesystems/index.rst 6
-rw-r--r-- Documentation/filesystems/iomap/operations.rst 2
-rw-r--r-- Documentation/filesystems/locking.rst 2
-rw-r--r-- Documentation/filesystems/mount_api.rst 10
-rw-r--r-- Documentation/filesystems/ocfs2-online-filecheck.rst 20
-rw-r--r-- Documentation/filesystems/porting.rst 28
-rw-r--r-- Documentation/filesystems/proc.rst 47
-rw-r--r-- Documentation/filesystems/propagate_umount.txt 6
-rw-r--r-- Documentation/filesystems/resctrl.rst 327
-rw-r--r-- Documentation/filesystems/sharedsubtree.rst 1347
-rw-r--r-- Documentation/filesystems/sysfs.rst 27
-rw-r--r-- Documentation/filesystems/vfs.rst 31
-rw-r--r-- Documentation/filesystems/xfs/xfs-online-fsck-design.rst 8
29 files changed, 1216 insertions, 1360 deletions
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
deleted file mode 100644
index b29562a6bf55..000000000000
--- a/Documentation/filesystems/bcachefs/CodingStyle.rst
+++ /dev/null
@@ -1,186 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-bcachefs coding style
-=====================
-
-Good development is like gardening, and codebases are our gardens. Tend to them
-every day; look for little things that are out of place or in need of tidying.
-A little weeding here and there goes a long way; don't wait until things have
-spiraled out of control.
-
-Things don't always have to be perfect - nitpicking often does more harm than
-good. But appreciate beauty when you see it - and let people know.
-
-The code that you are afraid to touch is the code most in need of refactoring.
-
-A little organizing here and there goes a long way.
-
-Put real thought into how you organize things.
-
-Good code is readable code, where the structure is simple and leaves nowhere
-for bugs to hide.
-
-Assertions are one of our most important tools for writing reliable code. If in
-the course of writing a patchset you encounter a condition that shouldn't
-happen (and will have unpredictable or undefined behaviour if it does), or
-you're not sure if it can happen and not sure how to handle it yet - make it a
-BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
-
-By the time you finish the patchset, you should understand better which
-assertions need to be handled and turned into checks with error paths, and
-which should be logically impossible. Leave the BUG_ON()s in for the ones which
-are logically impossible. (Or, make them debug mode assertions if they're
-expensive - but don't turn everything into a debug mode assertion, so that
-we're not stuck debugging undefined behaviour should it turn out that you were
-wrong).
-
-Assertions are documentation that can't go out of date. Good assertions are
-wonderful.
-
-Good assertions drastically and dramatically reduce the amount of testing
-required to shake out bugs.
-
-Good assertions are based on state, not logic. To write good assertions, you
-have to think about what the invariants on your state are.
-
-Good invariants and assertions will hold everywhere in your codebase. This
-means that you can run them in only a few places in the checked in version, but
-should you need to debug something that caused the assertion to fail, you can
-quickly shotgun them everywhere to find the codepath that broke the invariant.
-
-A good assertion checks something that the compiler could check for us, and
-elide - if we were working in a language with embedded correctness proofs that
-the compiler could check. This is something that exists today, but it'll likely
-still be a few decades before it comes to systems programming languages. But we
-can still incorporate that kind of thinking into our code and document the
-invariants with runtime checks - much like the way people working in
-dynamically typed languages may add type annotations, gradually making their
-code statically typed.
-
-Looking for ways to make your assertions simpler - and higher level - will
-often nudge you towards making the entire system simpler and more robust.
-
-Good code is code where you can poke around and see what it's doing -
-introspection. We can't debug anything if we can't see what's going on.
-
-Whenever we're debugging, and the solution isn't immediately obvious, if the
-issue is that we don't know where the issue is because we can't see what's
-going on - fix that first.
-
-We have the tools to make anything visible at runtime, efficiently - RCU and
-percpu data structures among them. Don't let things stay hidden.
-
-The most important tool for introspection is the humble pretty printer - in
-bcachefs, this means `*_to_text()` functions, which output to printbufs.
-
-Pretty printers are wonderful, because they compose and you can use them
-everywhere. Having functions to print whatever object you're working with will
-make your error messages much easier to write (therefore they will actually
-exist) and much more informative. And they can be used from sysfs/debugfs, as
-well as tracepoints.
-
-Runtime info and debugging tools should come with clear descriptions and
-labels, and good structure - we don't want files with a list of bare integers,
-like in procfs. Part of the job of the debugging tools is to educate users and
-new developers as to how the system works.
-
-Error messages should, whenever possible, tell you everything you need to debug
-the issue. It's worth putting effort into them.
-
-Tracepoints shouldn't be the first thing you reach for. They're an important
-tool, but always look for more immediate ways to make things visible. When we
-have to rely on tracing, we have to know which tracepoints we're looking for,
-and then we have to run the troublesome workload, and then we have to sift
-through logs. This is a lot of steps to go through when a user is hitting
-something, and if it's intermittent it may not even be possible.
-
-The humble counter is an incredibly useful tool. They're cheap and simple to
-use, and many complicated internal operations with lots of things that can
-behave weirdly (anything involving memory reclaim, for example) become
-shockingly easy to debug once you have counters on every distinct codepath.
-
-Persistent counters are even better.
-
-When debugging, try to get the most out of every bug you come across; don't
-rush to fix the initial issue. Look for things that will make related bugs
-easier the next time around - introspection, new assertions, better error
-messages, new debug tools, and do those first. Look for ways to make the system
-better behaved; often one bug will uncover several other bugs through
-downstream effects.
-
-Fix all that first, and then the original bug last - even if that means keeping
-a user waiting. They'll thank you in the long run, and when they understand
-what you're doing you'll be amazed at how patient they're happy to be. Users
-like to help - otherwise they wouldn't be reporting the bug in the first place.
-
-Talk to your users. Don't isolate yourself.
-
-Users notice all sorts of interesting things, and by just talking to them and
-interacting with them you can benefit from their experience.
-
-Spend time doing support and helpdesk stuff. Don't just write code - code isn't
-finished until it's being used trouble free.
-
-This will also motivate you to make your debugging tools as good as possible,
-and perhaps even your documentation, too. Like anything else in life, the more
-time you spend at it the better you'll get, and you the developer are the
-person most able to improve the tools to make debugging quick and easy.
-
-Be wary of how you take on and commit to big projects. Don't let development
-become product-manager focused. Often time an idea is a good one but needs to
-wait for its proper time - but you won't know if it's the proper time for an
-idea until you start writing code.
-
-Expect to throw a lot of things away, or leave them half finished for later.
-Nobody writes all perfect code that all gets shipped, and you'll be much more
-productive in the long run if you notice this early and shift to something
-else. The experience gained and lessons learned will be valuable for all the
-other work you do.
-
-But don't be afraid to tackle projects that require significant rework of
-existing code. Sometimes these can be the best projects, because they can lead
-us to make existing code more general, more flexible, more multipurpose and
-perhaps more robust. Just don't hesitate to abandon the idea if it looks like
-it's going to make a mess of things.
-
-Complicated features can often be done as a series of refactorings, with the
-final change that actually implements the feature as a quite small patch at the
-end. It's wonderful when this happens, especially when those refactorings are
-things that improve the codebase in their own right. When that happens there's
-much less risk of wasted effort if the feature you were going for doesn't work
-out.
-
-Always strive to work incrementally. Always strive to turn the big projects
-into little bite sized projects that can prove their own merits.
-
-Instead of always tackling those big projects, look for little things that
-will be useful, and make the big projects easier.
-
-The question of what's likely to be useful is where junior developers most
-often go astray - doing something because it seems like it'll be useful often
-leads to overengineering. Knowing what's useful comes from many years of
-experience, or talking with people who have that experience - or from simply
-reading lots of code and looking for common patterns and issues. Don't be
-afraid to throw things away and do something simpler.
-
-Talk about your ideas with your fellow developers; often times the best things
-come from relaxed conversations where people aren't afraid to say "what if?".
-
-Don't neglect your tools.
-
-The most important tools (besides the compiler and our text editor) are the
-tools we use for testing. The shortest possible edit/test/debug cycle is
-essential for working productively. We learn, gain experience, and discover the
-errors in our thinking by running our code and seeing what happens. If your
-time is being wasted because your tools are bad or too slow - don't accept it,
-fix it.
-
-Put effort into your documentation, commit messages, and code comments - but
-don't go overboard. A good commit message is wonderful - but if the information
-was important enough to go in a commit message, ask yourself if it would be
-even better as a code comment.
-
-A good code comment is wonderful, but even better is the comment that didn't
-need to exist because the code was so straightforward as to be obvious;
-organized into small clean and tidy modules, with clear and descriptive names
-for functions and variables, where every line of code has a clear purpose.
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
deleted file mode 100644
index 18c79d548391..000000000000
--- a/Documentation/filesystems/bcachefs/SubmittingPatches.rst
+++ /dev/null
@@ -1,105 +0,0 @@
-Submitting patches to bcachefs
-==============================
-
-Here are suggestions for submitting patches to bcachefs subsystem.
-
-Submission checklist
---------------------
-
-Patches must be tested before being submitted, either with the xfstests suite
-[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
-touched. Note that ktest wraps xfstests and will be an easier method to running
-it for most users; it includes single-command wrappers for all the mainstream
-in-kernel local filesystems.
-
-Patches will undergo more testing after being merged (including
-lockdep/kasan/preempt/etc. variants), these are not generally required to be
-run by the submitter - but do put some thought into what you're changing and
-which tests might be relevant, e.g. are you dealing with tricky memory layout
-work? kasan, are you doing locking work? then lockdep; and ktest includes
-single-command variants for the debug build types you'll most likely need.
-
-The exception to this rule is incomplete WIP/RFC patches: if you're working on
-something nontrivial, it's encouraged to send out a WIP patch to let people
-know what you're doing and make sure you're on the right track. Just make sure
-it includes a brief note as to what's done and what's incomplete, to avoid
-confusion.
-
-Rigorous checkpatch.pl adherence is not required (many of its warnings are
-considered out of date), but try not to deviate too much without reason.
-
-Focus on writing code that reads well and is organized well; code should be
-aesthetically pleasing.
-
-CI
---
-
-Instead of running your tests locally, when running the full test suite it's
-preferable to let a server farm do it in parallel, and then have the results
-in a nice test dashboard (which can tell you which failures are new, and
-presents results in a git log view, avoiding the need for most bisecting).
-
-That exists [2]_, and community members may request an account. If you work for
-a big tech company, you'll need to help out with server costs to get access -
-but the CI is not restricted to running bcachefs tests: it runs any ktest test
-(which generally makes it easy to wrap other tests that can run in qemu).
-
-Other things to think about
----------------------------
-
-- How will we debug this code? Is there sufficient introspection to diagnose
- when something starts acting wonky on a user machine?
-
- We don't necessarily need every single field of every data structure visible
- with introspection, but having the important fields of all the core data
- types wired up makes debugging drastically easier - a bit of thoughtful
- foresight greatly reduces the need to have people build custom kernels with
- debug patches.
-
- More broadly, think about all the debug tooling that might be needed.
-
-- Does it make the codebase more or less of a mess? Can we also try to do some
- organizing, too?
-
-- Do new tests need to be written? New assertions? How do we know and verify
- that the code is correct, and what happens if something goes wrong?
-
- We don't yet have automated code coverage analysis or easy fault injection -
- but for now, pretend we did and ask what they might tell us.
-
- Assertions are hugely important, given that we don't yet have a systems
- language that can do ergonomic embedded correctness proofs. Hitting an assert
- in testing is much better than wandering off into undefined behaviour la-la
- land - use them. Use them judiciously, and not as a replacement for proper
- error handling, but use them.
-
-- Does it need to be performance tested? Should we add new performance counters?
-
- bcachefs has a set of persistent runtime counters which can be viewed with
- the 'bcachefs fs top' command; this should give users a basic idea of what
- their filesystem is currently doing. If you're doing a new feature or looking
- at old code, think if anything should be added.
-
-- If it's a new on disk format feature - have upgrades and downgrades been
- tested? (Automated tests exists but aren't in the CI, due to the hassle of
- disk image management; coordinate to have them run.)
-
-Mailing list, IRC
------------------
-
-Patches should hit the list [3]_, but much discussion and code review happens
-on IRC as well [4]_; many people appreciate the more conversational approach
-and quicker feedback.
-
-Additionally, we have a lively user community doing excellent QA work, which
-exists primarily on IRC. Please make use of that resource; user feedback is
-important for any nontrivial feature, and documenting it in commit messages
-would be a good idea.
-
-.. rubric:: References
-
-.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-.. [1] https://evilpiepirate.org/git/ktest.git/
-.. [2] https://evilpiepirate.org/~testdashboard/ci/
-.. [3] linux-bcachefs@vger.kernel.org
-.. [4] irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
deleted file mode 100644
index 871a38f557e8..000000000000
--- a/Documentation/filesystems/bcachefs/casefolding.rst
+++ /dev/null
@@ -1,108 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Casefolding
-===========
-
-bcachefs has support for case-insensitive file and directory
-lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
-casefolding attributes.
-
-The main usecase for casefolding is compatibility with software written
-against other filesystems that rely on casefolded lookups
-(eg. NTFS and Wine/Proton).
-Taking advantage of file-system level casefolding can lead to great
-loading time gains in many applications and games.
-
-Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
-Once a directory has been flagged for casefolding, a feature bit
-is enabled on the superblock which marks the filesystem as using
-casefolding.
-When the feature bit for casefolding is enabled, it is no longer possible
-to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
-
-On the lookup/query side: casefolding is implemented by allocating a new
-string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
-casefold the query string.
-
-On the dirent side: casefolding is implemented by ensuring the `bkey`'s
-hash is made from the casefolded string and storing the cached casefolded
-name with the regular name in the dirent.
-
-The structure looks like this:
-
-* Regular: [dirent data][regular name][nul][nul]...
-* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
-
-(Do note, the number of NULs here is merely for illustration; their count can
-vary per-key, and they may not even be present if the key is aligned to
-`sizeof(u64)`.)
-
-This is efficient as it means that for all file lookups that require casefolding,
-it has identical performance to a regular lookup:
-a hash comparison and a `memcmp` of the name.
-
-Rationale
----------
-
-Several designs were considered for this system:
-One was to introduce a dirent_v2, however that would be painful especially as
-the hash system only has support for a single key type. This would also need
-`BCH_NAME_MAX` to change between versions, and a new feature bit.
-
-Another option was to store without the two lengths, and just take the length of
-the regular name and casefolded name contiguously / 2 as the length. This would
-assume that the regular length == casefolded length, but that could potentially
-not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
-the lowercase unicode glyph.
-It would be possible to disregard the casefold cache for those cases, but it was
-decided to simply encode the two string lengths in the key to avoid random
-performance issues if this edgecase was ever hit.
-
-The option settled on was to use a free-bit in d_type to mark a dirent as having
-a casefold cache, and then treat the first 4 bytes the name block as lengths.
-You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
-
-The feature bit was used to allow casefolding support to be enabled for the majority
-of users, but some allow users who have no need for the feature to still use bcachefs as
-`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
-which may be decider between using bcachefs for eg. embedded platforms.
-
-Other filesystems like ext4 and f2fs have a super-block level option for casefolding
-encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
-any encodings than a single UTF-8 version. When future encodings are desirable,
-they will be added trivially using the opts mechanism.
-
-dentry/dcache considerations
-----------------------------
-
-Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
-negative dentry's.
-
-This is because currently doing so presents a problem in the following scenario:
-
- - Lookup file "blAH" in a casefolded directory
- - Creation of file "BLAH" in a casefolded directory
- - Lookup file "blAH" in a casefolded directory
-
-This would fail if negative dentry's were cached.
-
-This is slightly suboptimal, but could be fixed in future with some vfs work.
-
-
-References
-----------
-
-(from Peter Anvin, on the list)
-
-It is worth noting that Microsoft has basically declared their
-"recommended" case folding (upcase) table to be permanently frozen (for
-new filesystem instances in the case where they use an on-disk
-translation table created at format time.) As far as I know they have
-never supported anything other than 1:1 conversion of BMP code points,
-nor normalization.
-
-The exFAT specification enumerates the full recommended upcase table,
-although in a somewhat annoying format (basically a hex dump of
-compressed data):
-
-https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst
deleted file mode 100644
index 2cccaa0ba7cd..000000000000
--- a/Documentation/filesystems/bcachefs/errorcodes.rst
+++ /dev/null
@@ -1,30 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-bcachefs private error codes
-----------------------------
-
-In bcachefs, as a hard rule we do not throw or directly use standard error
-codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
-in fs/bcachefs/errcode.h.
-
-This gives us much better error messages and makes debugging much easier. Any
-direct uses of standard error codes you see in the source code are simply old
-code that has yet to be converted - feel free to clean it up!
-
-Private error codes may subtype another error code, this allows for grouping of
-related errors that should be handled similarly (e.g. transaction restart
-errors), as well as specifying which standard error code should be returned at
-the bcachefs module boundary.
-
-At the module boundary, we use bch2_err_class() to convert to a standard error
-code; this also emits a trace event so that the original error code be
-recovered even if it wasn't logged.
-
-Do not reuse error codes! Generally speaking, a private error code should only
-be thrown in one place. That means that when we see it in a log message we can
-see, unambiguously, exactly which file and line number it was returned from.
-
-Try to give error codes names that are as reasonably descriptive of the error
-as possible. Frequently, the error will be logged at a place far removed from
-where the error was generated; good names for error codes mean much more
-descriptive and useful error messages.
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
deleted file mode 100644
index 59a332509dcd..000000000000
--- a/Documentation/filesystems/bcachefs/future/idle_work.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-Idle/background work classes design doc:
-
-Right now, our behaviour at idle isn't ideal, it was designed for servers that
-would be under sustained load, to keep pending work at a "medium" level, to
-let work build up so we can process it in more efficient batches, while also
-giving headroom for bursts in load.
-
-But for desktops or mobile - scenarios where work is less sustained and power
-usage is more important - we want to operate differently, with a "rush to
-idle" so the system can go to sleep. We don't want to be dribbling out
-background work while the system should be idle.
-
-The complicating factor is that there are a number of background tasks, which
-form a heirarchy (or a digraph, depending on how you divide it up) - one
-background task may generate work for another.
-
-Thus proper idle detection needs to model this heirarchy.
-
-- Foreground writes
-- Page cache writeback
-- Copygc, rebalance
-- Journal reclaim
-
-When we implement idle detection and rush to idle, we need to be careful not
-to disturb too much the existing behaviour that works reasonably well when the
-system is under sustained load (or perhaps improve it in the case of
-rebalance, which currently does not actively attempt to let work batch up).
-
-SUSTAINED LOAD REGIME
----------------------
-
-When the system is under continuous load, we want these jobs to run
-continuously - this is perhaps best modelled with a P/D controller, where
-they'll be trying to keep a target value (i.e. fragmented disk space,
-available journal space) roughly in the middle of some range.
-
-The goal under sustained load is to balance our ability to handle load spikes
-without running out of x resource (free disk space, free space in the
-journal), while also letting some work accumululate to be batched (or become
-unnecessary).
-
-For example, we don't want to run copygc too aggressively, because then it
-will be evacuating buckets that would have become empty (been overwritten or
-deleted) anyways, and we don't want to wait until we're almost out of free
-space because then the system will behave unpredicably - suddenly we're doing
-a lot more work to service each write and the system becomes much slower.
-
-IDLE REGIME
------------
-
-When the system becomes idle, we should start flushing our pending work
-quicker so the system can go to sleep.
-
-Note that the definition of "idle" depends on where in the heirarchy a task
-is - a task should start flushing work more quickly when the task above it has
-stopped generating new work.
-
-e.g. rebalance should start flushing more quickly when page cache writeback is
-idle, and journal reclaim should only start flushing more quickly when both
-copygc and rebalance are idle.
-
-It's important to let work accumulate when more work is still incoming and we
-still have room, because flushing is always more efficient if we let it batch
-up. New writes may overwrite data before rebalance moves it, and tasks may be
-generating more updates for the btree nodes that journal reclaim needs to flush.
-
-On idle, how much work we do at each interval should be proportional to the
-length of time we have been idle for. If we're idle only for a short duration,
-we shouldn't flush everything right away; the system might wake up and start
-generating new work soon, and flushing immediately might end up doing a lot of
-work that would have been unnecessary if we'd allowed things to batch more.
-
-To summarize, we will need:
-
- - A list of classes for background tasks that generate work, which will
- include one "foreground" class.
- - Tracking for each class - "Am I doing work, or have I gone to sleep?"
- - And each class should check the class above it when deciding how much work to issue.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
deleted file mode 100644
index e5c4c2120b93..000000000000
--- a/Documentation/filesystems/bcachefs/index.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-======================
-bcachefs Documentation
-======================
-
-Subsystem-specific development process notes
---------------------------------------------
-
-Development notes specific to bcachefs. These are intended to supplement
-:doc:`general kernel development handbook </process/index>`.
-
-.. toctree::
- :maxdepth: 1
- :numbered:
-
- CodingStyle
- SubmittingPatches
-
-Filesystem implementation
--------------------------
-
-Documentation for filesystem features and their implementation details.
-At this moment, only a few of these are described here.
-
-.. toctree::
- :maxdepth: 1
- :numbered:
-
- casefolding
- errorcodes
-
-Future design
--------------
-.. toctree::
- :maxdepth: 1
-
- future/idle_work
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index 7ddb235aee9d..08194f194b94 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -116,7 +116,7 @@ cache_strategy=%s Select a strategy for cached decompression from now on:
cluster for further reading. It still does
in-place I/O decompression for the rest
compressed physical clusters;
- readaround Cache the both ends of incomplete compressed
+ readaround Cache both ends of incomplete compressed
physical clusters for further reading.
It still does in-place I/O decompression
for the rest compressed physical clusters.
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
index aeb47ace738d..ae8995740aa8 100644
--- a/Documentation/filesystems/ext4/atomic_writes.rst
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -14,7 +14,7 @@ I/O) on regular files with extents, provided the underlying storage device
supports hardware atomic writes. This is supported in the following two ways:
1. **Single-fsblock Atomic Writes**:
- EXT4's supports atomic write operations with a single filesystem block since
+ EXT4 supports atomic write operations with a single filesystem block since
v6.13. In this the atomic write unit minimum and maximum sizes are both set
to filesystem blocksize.
e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
@@ -50,7 +50,7 @@ Multi-fsblock Implementation Details
The bigalloc feature changes ext4 to allocate in units of multiple filesystem
blocks, also known as clusters. With bigalloc each bit within block bitmap
-represents cluster (power of 2 number of blocks) rather than individual
+represents a cluster (power of 2 number of blocks) rather than individual
filesystem blocks.
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
following constraints. The minimum atomic write size is the larger of the fs
@@ -189,7 +189,7 @@ The write must be aligned to the filesystem's block size and not exceed the
filesystem's maximum atomic write unit size.
See ``generic_atomic_write_valid()`` for more details.
-``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
+``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provide following
details:
* ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index e5bb89452aff..a8d02fe5be83 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -1,8 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
-==========================================
-WHAT IS Flash-Friendly File System (F2FS)?
-==========================================
+=================================
+Flash-Friendly File System (F2FS)
+=================================
+
+Overview
+========
NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
been equipped on a variety systems ranging from mobile to server systems. Since
@@ -173,9 +176,12 @@ data_flush Enable data flushing before checkpoint in order to
persist data of regular and symlink.
reserve_root=%d Support configuring reserved space which is used for
allocation from a privileged user with specified uid or
- gid, unit: 4KB, the default limit is 0.2% of user blocks.
-resuid=%d The user ID which may use the reserved blocks.
-resgid=%d The group ID which may use the reserved blocks.
+ gid, unit: 4KB, the default limit is 12.5% of user blocks.
+reserve_node=%d Support configuring reserved nodes which are used for
+ allocation from a privileged user with specified uid or
+ gid, the default limit is 12.5% of all nodes.
+resuid=%d The user ID which may use the reserved blocks and nodes.
+resgid=%d The group ID which may use the reserved blocks and nodes.
fault_injection=%d Enable fault injection in all supported types with
specified injection rate.
fault_type=%d Support configuring fault injection type, should be
@@ -291,9 +297,13 @@ compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo"
"lz4", "zstd" and "lzo-rle" algorithm.
compress_algorithm=%s:%d Control compress algorithm and its compress level, now, only
"lz4" and "zstd" support compress level config.
+
+ ========= ===========
algorithm level range
+ ========= ===========
lz4 3 - 16
zstd 1 - 22
+ ========= ===========
compress_log_size=%u Support configuring compress cluster size. The size will
be 4KB * (1 << %u). The default and minimum sizes are 16KB.
compress_extension=%s Support adding specified extension, so that f2fs can enable
@@ -357,6 +367,7 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes:
panic immediately, continue without doing anything, and remount
the partition in read-only mode. By default it uses "continue"
mode.
+
====================== =============== =============== ========
mode continue remount-ro panic
====================== =============== =============== ========
@@ -370,6 +381,25 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes:
====================== =============== =============== ========
nat_bits Enable nat_bits feature to enhance full/empty nat blocks access,
by default it's disabled.
+lookup_mode=%s Control the directory lookup behavior for casefolded
+ directories. This option has no effect on directories
+ that do not have the casefold feature enabled.
+
+ ================== ========================================
+ Value Description
+ ================== ========================================
+ perf (Default) Enforces a hash-only lookup.
+ The linear search fallback is always
+ disabled, ignoring the on-disk flag.
+ compat Enables the linear search fallback for
+ compatibility with directory entries
+ created by older kernels that used a
+ different case-folding algorithm.
+ This mode ignores the on-disk flag.
+ auto F2FS determines the mode based on the
+ on-disk `SB_ENC_NO_COMPAT_FALLBACK_FL`
+ flag.
+ ================== ========================================
======================== ============================================================
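The three ``lookup_mode`` values in the table above reduce to a single decision: is the linear search fallback used or not. The sketch below models that decision in user space. All names are hypothetical, and the ``auto`` branch assumes, from the flag's name only, that a set ``SB_ENC_NO_COMPAT_FALLBACK_FL`` means the fallback is disabled.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the lookup_mode= mount option values. */
enum lookup_mode { LOOKUP_PERF, LOOKUP_COMPAT, LOOKUP_AUTO };

/* Map the option string to a mode; returns 0 on success, -1 if unknown. */
int parse_lookup_mode(const char *s, enum lookup_mode *out)
{
	if (strcmp(s, "perf") == 0)
		*out = LOOKUP_PERF;
	else if (strcmp(s, "compat") == 0)
		*out = LOOKUP_COMPAT;
	else if (strcmp(s, "auto") == 0)
		*out = LOOKUP_AUTO;
	else
		return -1;
	return 0;
}

/*
 * Decide whether the linear search fallback applies.
 * no_compat_fallback models the on-disk flag; its polarity in "auto"
 * mode is an assumption inferred from the flag name.
 */
bool linear_fallback_enabled(enum lookup_mode mode, bool no_compat_fallback)
{
	switch (mode) {
	case LOOKUP_PERF:
		return false;			/* hash-only, flag ignored */
	case LOOKUP_COMPAT:
		return true;			/* fallback on, flag ignored */
	case LOOKUP_AUTO:
		return !no_compat_fallback;	/* follow the on-disk flag */
	}
	return false;
}
```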
Debugfs Entries
@@ -795,11 +825,13 @@ ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
extension list " "
-- buffered io
+------------------------------------------------------------------
N/A COLD_DATA WRITE_LIFE_EXTREME
N/A HOT_DATA WRITE_LIFE_SHORT
N/A WARM_DATA WRITE_LIFE_NOT_SET
-- direct io
+------------------------------------------------------------------
WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
@@ -915,24 +947,26 @@ compression enabled files (refer to "Compression implementation" section for how
enable compression on a regular inode).
1) compress_mode=fs
-This is the default option. f2fs does automatic compression in the writeback of the
-compression enabled files.
+
+ This is the default option. f2fs does automatic compression in the writeback of the
+ compression enabled files.
2) compress_mode=user
-This disables the automatic compression and gives the user discretion of choosing the
-target file and the timing. The user can do manual compression/decompression on the
-compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
-ioctls like the below.
-To decompress a file,
+ This disables the automatic compression and gives the user discretion of choosing the
+ target file and the timing. The user can do manual compression/decompression on the
+ compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
+ ioctls like the below.
+
+To decompress a file::
-fd = open(filename, O_WRONLY, 0);
-ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);
+ fd = open(filename, O_WRONLY, 0);
+ ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);
-To compress a file,
+To compress a file::
-fd = open(filename, O_WRONLY, 0);
-ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);
+ fd = open(filename, O_WRONLY, 0);
+ ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);
NVMe Zoned Namespace devices
----------------------------
@@ -962,32 +996,32 @@ reserved and used by another filesystem or for different purposes. Once that
external usage is complete, the device aliasing file can be deleted, releasing
the reserved space back to F2FS for its own use.
-<use-case>
-
-# ls /dev/vd*
-/dev/vdb (32GB) /dev/vdc (32GB)
-# mkfs.ext4 /dev/vdc
-# mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb
-# mount /dev/vdb /mnt/f2fs
-# ls -l /mnt/f2fs
-vdc.file
-# df -h
-/dev/vdb 64G 33G 32G 52% /mnt/f2fs
-
-# mount -o loop /dev/vdc /mnt/ext4
-# df -h
-/dev/vdb 64G 33G 32G 52% /mnt/f2fs
-/dev/loop7 32G 24K 30G 1% /mnt/ext4
-# umount /mnt/ext4
-
-# f2fs_io getflags /mnt/f2fs/vdc.file
-get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable
-# f2fs_io setflags noimmutable /mnt/f2fs/vdc.file
-get a flag on noimmutable ret=0, flags=800010
-set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable
-# rm /mnt/f2fs/vdc.file
-# df -h
-/dev/vdb 64G 753M 64G 2% /mnt/f2fs
+.. code-block::
+
+ # ls /dev/vd*
+ /dev/vdb (32GB) /dev/vdc (32GB)
+ # mkfs.ext4 /dev/vdc
+ # mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb
+ # mount /dev/vdb /mnt/f2fs
+ # ls -l /mnt/f2fs
+ vdc.file
+ # df -h
+ /dev/vdb 64G 33G 32G 52% /mnt/f2fs
+
+ # mount -o loop /dev/vdc /mnt/ext4
+ # df -h
+ /dev/vdb 64G 33G 32G 52% /mnt/f2fs
+ /dev/loop7 32G 24K 30G 1% /mnt/ext4
+ # umount /mnt/ext4
+
+ # f2fs_io getflags /mnt/f2fs/vdc.file
+ get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable
+ # f2fs_io setflags noimmutable /mnt/f2fs/vdc.file
+ get a flag on noimmutable ret=0, flags=800010
+ set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable
+ # rm /mnt/f2fs/vdc.file
+ # df -h
+ /dev/vdb 64G 753M 64G 2% /mnt/f2fs
So, the key idea is, user can do any file operations on /dev/vdc, and
reclaim the space after the use, while the space is counted as /data.
diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
index d73dd0dbd238..d73dd0dbd238 100644
--- a/Documentation/filesystems/fuse-io-uring.rst
+++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse/fuse-io.rst
index 6464de4266ad..d736ac4cb483 100644
--- a/Documentation/filesystems/fuse-io.rst
+++ b/Documentation/filesystems/fuse/fuse-io.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
==============
-Fuse I/O Modes
+FUSE I/O Modes
==============
Fuse supports the following I/O modes:
diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse/fuse-passthrough.rst
index 2b0e7c2da54a..2b0e7c2da54a 100644
--- a/Documentation/filesystems/fuse-passthrough.rst
+++ b/Documentation/filesystems/fuse/fuse-passthrough.rst
diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse/fuse.rst
index 1e31e87aee68..0fbd5a03fdc9 100644
--- a/Documentation/filesystems/fuse.rst
+++ b/Documentation/filesystems/fuse/fuse.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-====
-FUSE
-====
+=============
+FUSE Overview
+=============
Definitions
===========
@@ -129,6 +129,20 @@ For each connection the following files exist within this directory:
connection. This means that all waiting requests will be aborted an
error returned for all aborted and new requests.
+ max_background
+ The maximum number of background requests that can be outstanding
+ at a time. When the number of background requests reaches this limit,
+ further requests will be blocked until some are completed, potentially
+ causing I/O operations to stall.
+
+ congestion_threshold
+ The threshold of background requests at which the kernel considers
+ the filesystem to be congested. When the number of background requests
+ exceeds this value, the kernel will skip asynchronous readahead
+ operations, reducing read-ahead optimizations but preserving essential
+ I/O, as well as suspending non-synchronous writeback operations
+ (WB_SYNC_NONE), delaying page cache flushing to the filesystem.
+
Only the owner of the mount may read or write these files.
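The two limits added in this hunk can be read as two independent predicates on the current number of background requests. The following sketch follows the wording of the text literally ("reaches" for ``max_background``, "exceeds" for ``congestion_threshold``); the struct and function names are hypothetical, and the kernel's actual comparisons may differ in detail.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the two per-connection control files. */
struct fuse_bg_limits {
	unsigned int max_background;
	unsigned int congestion_threshold;
};

/* Congested: the background count exceeds congestion_threshold. */
bool fuse_bg_congested(const struct fuse_bg_limits *l,
		       unsigned int num_background)
{
	return num_background > l->congestion_threshold;
}

/* Blocked: the background count has reached max_background. */
bool fuse_bg_blocked(const struct fuse_bg_limits *l,
		     unsigned int num_background)
{
	return num_background >= l->max_background;
}
```

With illustrative limits of ``max_background`` = 12 and ``congestion_threshold`` = 9, a connection with 10 outstanding background requests counts as congested but not yet blocked.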
Interrupting filesystem operations
diff --git a/Documentation/filesystems/fuse/index.rst b/Documentation/filesystems/fuse/index.rst
new file mode 100644
index 000000000000..393a845214da
--- /dev/null
+++ b/Documentation/filesystems/fuse/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+FUSE (Filesystem in Userspace) Technical Documentation
+======================================================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ fuse
+ fuse-io
+ fuse-io-uring
+ fuse-passthrough
diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2-glocks.rst
index adc0d4c4d979..ce5ff08cbd59 100644
--- a/Documentation/filesystems/gfs2-glocks.rst
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -105,7 +105,7 @@ go_unlocked Yes No
Operations must not drop either the bit lock or the spinlock
if its held on entry. go_dump and do_demote_ok must never block.
Note that go_dump will only be called if the glock's state
- indicates that it is caching uptodate data.
+ indicates that it is caching up-to-date data.
Glock locking order within GFS2:
diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst
index 7e0dd2f4373e..0f9516b5eb07 100644
--- a/Documentation/filesystems/hpfs.rst
+++ b/Documentation/filesystems/hpfs.rst
@@ -65,7 +65,7 @@ are case sensitive, so for example when you create a file FOO, you can use
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
also won't be able to compile linux kernel (and maybe other things) on HPFS
because kernel creates different files with names like bootsect.S and
-bootsect.s. When searching for file thats name has characters >= 128, codepages
+bootsect.s. When searching for file whose name has characters >= 128, codepages
are used - see below.
OS/2 ignores dots and spaces at the end of file name, so this driver does as
well. If you create 'a. ...', the file 'a' will be created, but you can still
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 11a599387266..af516e528ded 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -72,7 +72,6 @@ Documentation for filesystem implementations.
afs
autofs
autofs-mount-control
- bcachefs/index
befs
bfs
btrfs
@@ -96,10 +95,7 @@ Documentation for filesystem implementations.
hfs
hfsplus
hpfs
- fuse
- fuse-io
- fuse-io-uring
- fuse-passthrough
+ fuse/index
inotify
isofs
nilfs2
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 067ed8e14ef3..387fd9cc72ca 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -321,7 +321,7 @@ The fields are as follows:
- ``writeback_submit``: Submit the previous built writeback context.
Block based file systems should use the iomap_ioend_writeback_submit
helper, other file system can implement their own.
- File systems can optionall to hook into writeback bio submission.
+ File systems can optionally hook into writeback bio submission.
This might include pre-write space accounting updates, or installing
a custom ``->bi_end_io`` function for internal purposes, such as
deferring the ioend completion to a workqueue to run metadata update
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index aa287ccdac2f..77704fde9845 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -443,7 +443,7 @@ prototypes::
int (*direct_access) (struct block_device *, sector_t, void **,
unsigned long *);
void (*unlock_native_capacity) (struct gendisk *);
- int (*getgeo)(struct block_device *, struct hd_geometry *);
+ int (*getgeo)(struct gendisk *, struct hd_geometry *);
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
locking rules:
diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst
index e149b89118c8..c99ab1f7fea4 100644
--- a/Documentation/filesystems/mount_api.rst
+++ b/Documentation/filesystems/mount_api.rst
@@ -506,8 +506,16 @@ returned.
* ::
+ int vfs_parse_fs_qstr(struct fs_context *fc, const char *key,
+ const struct qstr *value);
+
+ A wrapper around vfs_parse_fs_param() that copies the value string it is
+ passed.
+
+ * ::
+
int vfs_parse_fs_string(struct fs_context *fc, const char *key,
- const char *value, size_t v_size);
+ const char *value);
A wrapper around vfs_parse_fs_param() that copies the value string it is
passed.
diff --git a/Documentation/filesystems/ocfs2-online-filecheck.rst b/Documentation/filesystems/ocfs2-online-filecheck.rst
index 2257bb53edc1..9e8449416e0b 100644
--- a/Documentation/filesystems/ocfs2-online-filecheck.rst
+++ b/Documentation/filesystems/ocfs2-online-filecheck.rst
@@ -58,33 +58,33 @@ inode, fixing inode and setting the size of result record history.
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
# cat /sys/fs/ocfs2/<devname>/filecheck/check
-The output is like this::
+ The output is like this::
INO DONE ERROR
39502 1 GENERATION
- <INO> lists the inode numbers.
- <DONE> indicates whether the operation has been finished.
- <ERROR> says what kind of errors was found. For the detailed error numbers,
- please refer to the file linux/fs/ocfs2/filecheck.h.
+ <INO> lists the inode numbers.
+ <DONE> indicates whether the operation has been finished.
+ <ERROR> says what kind of errors were found. For the detailed error numbers,
+ please refer to the file linux/fs/ocfs2/filecheck.h.
2. If you determine to fix this inode, do::
# echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
# cat /sys/fs/ocfs2/<devname>/filecheck/fix
-The output is like this:::
+ The output is like this::
INO DONE ERROR
39502 1 SUCCESS
-This time, the <ERROR> column indicates whether this fix is successful or not.
+ This time, the <ERROR> column indicates whether this fix is successful or not.
3. The record cache is used to store the history of check/fix results. It's
-default size is 10, and can be adjust between the range of 10 ~ 100. You can
-adjust the size like this::
+ default size is 10, and can be adjusted within the range of 10 ~ 100. You can
+ adjust the size like this::
- # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
+ # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
Fixing stuff
============
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 85f590254f07..7233b04668fc 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -340,8 +340,8 @@ of those. Caller makes sure async writeback cannot be running for the inode whil
->drop_inode() returns int now; it's called on final iput() with
inode->i_lock held and it returns true if filesystems wants the inode to be
-dropped. As before, generic_drop_inode() is still the default and it's been
-updated appropriately. generic_delete_inode() is also alive and it consists
+dropped. As before, inode_generic_drop() is still the default and it's been
+updated appropriately. inode_just_drop() is also alive and it consists
simply of return 1. Note that all actual eviction work is done by caller after
->drop_inode() returns.
@@ -1285,3 +1285,27 @@ rather than a VMA, as the VMA at this stage is not yet valid.
The vm_area_desc provides the minimum required information for a filesystem
to initialise state upon memory mapping of a file-backed region, and output
parameters for the file system to set this state.
+
+---
+
+**mandatory**
+
+Several functions are renamed:
+
+- kern_path_locked -> start_removing_path
+- kern_path_create -> start_creating_path
+- user_path_create -> start_creating_user_path
+- user_path_locked_at -> start_removing_user_path_at
+- done_path_create -> end_creating_path
+
+---
+
+**mandatory**
+
+Calling conventions for vfs_parse_fs_string() have changed; it does *not*
+take length anymore (value ? strlen(value) : 0 is used). If you want
+a different length, use
+
+ vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len))
+
+instead.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b7235..3002258c9c7f 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -61,19 +61,6 @@ Preface
0.1 Introduction/Credits
------------------------
-This documentation is part of a soon (or so we hope) to be released book on
-the SuSE Linux distribution. As there is no complete documentation for the
-/proc file system and we've used many freely available sources to write these
-chapters, it seems only fair to give the work back to the Linux community.
-This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm
-afraid it's still far from complete, but we hope it will be useful. As far as
-we know, it is the first 'all-in-one' document about the /proc file system. It
-is focused on the Intel x86 hardware, so if you are looking for PPC, ARM,
-SPARC, AXP, etc., features, you probably won't find what you are looking for.
-It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But
-additions and patches are welcome and will be added to this document if you
-mail them to Bodo.
-
We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of
other people for help compiling this documentation. We'd also like to extend a
special thank you to Andi Kleen for documentation, which we relied on heavily
@@ -81,17 +68,9 @@ to create this document, as well as the additional information he provided.
Thanks to everybody else who contributed source or docs to the Linux kernel
and helped create a great piece of software... :)
-If you have any comments, corrections or additions, please don't hesitate to
-contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
-document.
-
The latest version of this document is available online at
https://www.kernel.org/doc/html/latest/filesystems/proc.html
-If the above direction does not works for you, you could try the kernel
-mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
-comandante@zaralinux.com.
-
0.2 Legal Stuff
---------------
@@ -291,8 +270,9 @@ It's slow but very precise.
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
- THP_enabled process is allowed to use THP (returns 0 when
- PR_SET_THP_DISABLE is set on the process
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process to disable
+ THP completely, not just partially)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
@@ -1008,6 +988,19 @@ number, module (if originates from a loadable module) and the function calling
the allocation. The number of bytes allocated and number of calls at each
location are reported. The first line indicates the version of the file, the
second line is the header listing fields in the file.
+If the file version is 2.0 or higher, each line may contain additional
+<key>:<value> pairs representing extra information about the call site.
+For example, if the counters are not accurate, an "accurate:no" pair is
+appended to the line.
+
+Supported markers in v2:
+accurate:no
+
+ Absolute values of the counters in this line are not accurate
+ because of the failure to allocate memory to track some of the
+ allocations made at this location. Deltas in these counters are
+ accurate, therefore counters can be used to track allocation size
+ and count changes.
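A consumer of this file could detect the ``accurate:no`` marker by scanning each line for an exact ``<key>:<value>`` token. The helper below is a minimal sketch under that assumption; ``line_has_marker()`` is a hypothetical name, and it treats the line as whitespace-separated tokens without relying on any particular column layout.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the line contains the exact
 * whitespace-delimited token "<key>:<value>".
 */
bool line_has_marker(const char *line, const char *key, const char *value)
{
	size_t klen = strlen(key);
	size_t vlen = strlen(value);
	const char *p = line;

	while (*p) {
		const char *tok;
		size_t tlen;

		/* Skip separators between tokens. */
		while (*p == ' ' || *p == '\t' || *p == '\n')
			p++;
		tok = p;
		/* Advance to the end of the current token. */
		while (*p && *p != ' ' && *p != '\t' && *p != '\n')
			p++;
		tlen = (size_t)(p - tok);

		if (tlen == klen + 1 + vlen &&
		    memcmp(tok, key, klen) == 0 &&
		    tok[klen] == ':' &&
		    memcmp(tok + klen + 1, value, vlen) == 0)
			return true;
	}
	return false;
}
```

A tool that sees the marker on a line should, per the text above, rely only on deltas between successive reads of that line's counters rather than on their absolute values.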
Example output.
@@ -2362,6 +2355,7 @@ The following mount options are supported:
hidepid= Set /proc/<pid>/ access mode.
gid= Set the group authorized to learn processes information.
subset= Show only the specified subset of procfs.
+ pidns= Specify the pid namespace used by this procfs.
========= ========================================================
hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2394,6 +2388,13 @@ information about processes information, just add identd to this group.
subset=pid hides all top level files and directories in the procfs that
are not related to tasks.
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).
+
Chapter 5: Filesystem behavior
==============================
diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
index c90349e5b889..9a7eb96df300 100644
--- a/Documentation/filesystems/propagate_umount.txt
+++ b/Documentation/filesystems/propagate_umount.txt
@@ -286,7 +286,7 @@ Trim_one(m)
strip the "seen by Trim_ancestors" mark from m
remove m from the Candidates list
return
-
+
remove_this = false
found = false
for each n in children(m)
@@ -312,7 +312,7 @@ Trim_ancestors(m)
}
Terminating condition in the loop in Trim_ancestors() is correct,
-since that that loop will never run into p belonging to U - p is always
+since that loop will never run into p belonging to U - p is always
an ancestor of argument of Trim_one() and since U is closed, the argument
of Trim_one() would also have to belong to U. But Trim_one() is never
called for elements of U. In other words, p belongs to S if and only
@@ -361,7 +361,7 @@ such removals.
Proof: suppose S was non-shifting, x is a locked element of S, parent of x
is not in S and S - {x} is not non-shifting. Then there is an element m
in S - {x} and a subtree mounted strictly inside m, such that m contains
-an element not in in S - {x}. Since S is non-shifting, everything in
+an element not in S - {x}. Since S is non-shifting, everything in
that subtree must belong to S. But that means that this subtree must
contain x somewhere *and* that parent of x either belongs that subtree
or is equal to m. Either way it must belong to S. Contradiction.
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..b7f35b07876a 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
=============================================== ================================
Historically, new features were made visible by default in /proc/cpuinfo. This
@@ -256,6 +257,144 @@ with the following files:
# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x30;1=0x30;3=0x15;4=0x15
+"mbm_assign_mode":
+ The supported counter assignment modes. The enclosed brackets indicate which mode
+ is enabled. The MBM events associated with counters may reset when "mbm_assign_mode"
+ is changed.
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ [mbm_event]
+ default
+
+ "mbm_event":
+
+ mbm_event mode allows users to assign a hardware counter to an RMID, event
+ pair and monitor the bandwidth usage as long as it is assigned. The hardware
+ continues to track the assigned counter until it is explicitly unassigned by
+ the user. Each event within a resctrl group can be assigned independently.
+
+ In this mode, a monitoring event can only accumulate data while it is backed
+ by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON
+ group to specify which of the events should have a counter assigned. The number
+ of counters available is described in the "num_mbm_cntrs" file. Changing the
+ mode may cause all counters on the resource to reset.
+
+ Moving to mbm_event counter assignment mode requires users to assign the counters
+ to the events. Otherwise, the MBM event counters will return 'Unassigned' when read.
+
+ The mode is beneficial for AMD platforms that support more CTRL_MON
+ and MON groups than available hardware counters. By default, this
+ feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
+ Monitoring Counters) capability, ensuring counters remain assigned even
+ when the corresponding RMID is not actively used by any processor.
+
+ "default":
+
+ In default mode, resctrl assumes there is a hardware counter for each
+ event within every CTRL_MON and MON group. On AMD platforms, it is
+ recommended to use the mbm_event mode, if supported, to prevent reset of MBM
+ events between reads resulting from hardware re-allocating counters. This can
+ result in misleading values or display "Unavailable" if no counter is assigned
+ to the event.
+
+ * To enable "mbm_event" counter assignment mode:
+ ::
+
+ # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+ * To enable "default" monitoring mode:
+ ::
+
+ # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+"num_mbm_cntrs":
+ The maximum number of counters (total of available and assigned counters) in
+ each domain when the system supports mbm_event mode.
+
+ For example, on a system with a maximum of 32 memory bandwidth monitoring
+ counters in each of its L3 domains:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+ 0=32;1=32
+
+"available_mbm_cntrs":
+ The number of counters available for assignment in each domain when mbm_event
+ mode is enabled on the system.
+
+ For example, on a system with 30 available [hardware] assignable counters
+ in each of its L3 domains:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+ 0=30;1=30
+
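Both ``num_mbm_cntrs`` and ``available_mbm_cntrs`` report one ``<domain>=<count>`` pair per domain, separated by semicolons. The sketch below parses such a string in user space; ``parse_domain_counts()`` is a hypothetical helper, and the format it accepts is inferred solely from the example output shown above.

```c
#include <stdlib.h>

/*
 * Hypothetical parser for "<id>=<count>;<id>=<count>..." strings such as
 * "0=30;1=30". Fills ids[] and counts[] with up to max entries and
 * returns the number of entries parsed, or -1 on malformed input or
 * when more than max entries are present.
 */
int parse_domain_counts(const char *s, int *ids, int *counts, int max)
{
	int n = 0;

	while (*s && *s != '\n') {
		char *end;
		long id, cnt;

		if (n == max)
			return -1;

		id = strtol(s, &end, 10);
		if (end == s || *end != '=')
			return -1;
		s = end + 1;

		cnt = strtol(s, &end, 10);
		if (end == s)
			return -1;

		ids[n] = (int)id;
		counts[n] = (int)cnt;
		n++;

		s = end;
		if (*s == ';')
			s++;		/* another pair follows */
		else if (*s && *s != '\n')
			return -1;	/* unexpected trailing character */
	}
	return n;
}
```

Comparing the parsed ``available_mbm_cntrs`` values with those from ``num_mbm_cntrs`` gives the number of counters currently assigned in each domain.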
+"event_configs":
+ Directory that exists when "mbm_event" counter assignment mode is supported.
+ Contains a sub-directory for each MBM event that can be assigned to a counter.
+
+ Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes.
+ Each MBM event's sub-directory contains a file named "event_filter" that is
+ used to view and modify which memory transactions the MBM event is configured
+ with. The file is accessible only when "mbm_event" counter assignment mode is
+ enabled.
+
+ List of memory transaction types supported:
+
+ ========================== ========================================================
+ Name Description
+ ========================== ========================================================
+ dirty_victim_writes_all Dirty Victims from the QOS domain to all types of memory
+ remote_reads_slow_memory Reads to slow memory in the non-local NUMA domain
+ local_reads_slow_memory Reads to slow memory in the local NUMA domain
+ remote_non_temporal_writes Non-temporal writes to non-local NUMA domain
+ local_non_temporal_writes Non-temporal writes to local NUMA domain
+ remote_reads Reads to memory in the non-local NUMA domain
+ local_reads Reads to memory in the local NUMA domain
+ ========================== ========================================================
+
+ For example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
+ local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory
+
+ Modify the event configuration by writing to the "event_filter" file within
+ the "event_configs" directory. The read/write "event_filter" file contains the
+ configuration of the event that reflects which memory transactions are counted by it.
+
+ For example::
+
+ # echo "local_reads, local_non_temporal_writes" > \
+ /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,local_non_temporal_writes
+
+"mbm_assign_on_mkdir":
+ Exists when "mbm_event" counter assignment mode is supported. Accessible
+ only when "mbm_event" counter assignment mode is enabled.
+
+ Determines if a counter will automatically be assigned to an RMID, MBM event
+ pair when its associated monitor group is created via mkdir. Enabled by default
+ on boot, also when switched from "default" mode to "mbm_event" counter assignment
+ mode. Users can disable this capability by writing to the interface.
+
+ "0":
+ Auto assignment is disabled.
+ "1":
+ Auto assignment is enabled.
+
+ Example::
+
+ # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
+ 0
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
@@ -380,10 +519,77 @@ When monitoring is enabled all MON groups will also contain:
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
where "YY" is the node number.
+ When the 'mbm_event' counter assignment mode is enabled, reading
+ an MBM event of a MON group returns 'Unassigned' if no hardware
+ counter is assigned to it. For CTRL_MON groups, 'Unassigned' is
+ returned if the MBM event does not have an assigned counter in the
+ CTRL_MON group nor in any of its associated MON groups.
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
+When monitoring is enabled all MON groups may also contain:
+
+"mbm_L3_assignments":
+ Exists when "mbm_event" counter assignment mode is supported and lists the
+ counter assignment states of the group.
+
+ The assignment list is displayed in the following format:
+
+ <Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state>
+
+ Event: A valid MBM event in the
+ /sys/fs/resctrl/info/L3_MON/event_configs directory.
+
+ Domain ID: A valid domain ID. When writing, '*' applies the changes
+ to all the domains.
+
+ Assignment states:
+
+ _ : No counter assigned.
+
+ e : Counter assigned exclusively.
+
+ Example:
+
+ To display the counter assignment states for the default group.
+ ::
+
+ # cd /sys/fs/resctrl
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+ Assignments can be modified by writing to the interface.
+
+ Examples:
+
+ To unassign the counter associated with the mbm_total_bytes event on domain 0:
+ ::
+
+ # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=e
+ mbm_local_bytes:0=e;1=e
+
+ To unassign the counter associated with the mbm_total_bytes event on all the domains:
+ ::
+
+ # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=_
+ mbm_local_bytes:0=e;1=e
+
+ To assign a counter associated with the mbm_total_bytes event on all domains in
+ exclusive mode:
+ ::
+
+ # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
"mba_MBps_event":
@@ -563,7 +769,7 @@ this would be dependent on number of cores the benchmark is run on.
depending on # of threads:
For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
-thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although
+thread, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
@@ -1429,6 +1635,125 @@ View the llc occupancy snapshot::
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000
+
+Examples on working with mbm_assign_mode
+========================================
+
+a. Check if MBM counter assignment mode is supported.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ [mbm_event]
+ default
+
+The "mbm_event" mode is detected and enabled.
+
+b. Check how many assignable counters are supported.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+ 0=32;1=32
+
+c. Check how many assignable counters are available for assignment in each domain.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+ 0=30;1=30
+
+d. To list the default group's assignment states.
+::
+
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+e. To unassign the counter associated with the mbm_total_bytes event on domain 0.
+::
+
+ # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=e
+ mbm_local_bytes:0=e;1=e
+
+f. To unassign the counter associated with the mbm_total_bytes event on all domains.
+::
+
+ # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=_
+ mbm_local_bytes:0=e;1=e
+
+g. To assign a counter associated with the mbm_total_bytes event on all domains in
+exclusive mode.
+::
+
+ # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+h. Read the events mbm_total_bytes and mbm_local_bytes of the default group. There is
+no change in reading the events with the assignment.
+::
+
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
+ 779247936
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
+ 562324232
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ 212122123
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ 121212144
+
+i. Check the event configurations.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
+ local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory
+
+j. Change the event configuration for mbm_local_bytes.
+::
+
+ # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" > \
+ /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads
+
+k. Now read the local events again. The first read may come back with "Unavailable"
+status. The subsequent read of mbm_local_bytes will display the current value.
+::
+
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ Unavailable
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ 2252323
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ Unavailable
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ 1566565
+
+l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be
+done using the following command. Note that switching the mbm_assign_mode may reset all
+the MBM counters (and thus all MBM events) of all the resctrl groups.
+::
+
+ # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ mbm_event
+ [default]
+
+m. Unmount the resctrl filesystem.
+::
+
+ # umount /sys/fs/resctrl/
+
Intel RDT Errata
================
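The "mbm_L3_assignments" format documented in the resctrl.rst hunks above (<Event>:<Domain ID>=<Assignment state>;...) is easy to post-process in a script. The following is only a minimal sketch, not part of the patch: the function name parse_mbm_assignments is hypothetical, and it assumes the file content is supplied on standard input in exactly the documented layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of resctrl): read lines of the form
#   <Event>:<Domain ID>=<state>;<Domain ID>=<state>
# from stdin and print one "event domain state" triple per line.
parse_mbm_assignments() {
    while IFS=: read -r event rest; do
        # Skip empty lines.
        [ -n "$event" ] || continue
        while [ -n "$rest" ]; do
            # Take the first "<domain>=<state>" pair.
            pair=${rest%%";"*}
            printf '%s %s %s\n' "$event" "${pair%%=*}" "${pair#*=}"
            # Drop the pair just handled; stop when none remain.
            case $rest in
                *";"*) rest=${rest#*";"} ;;
                *) rest= ;;
            esac
        done
    done
}
```

On a system with resctrl mounted, it could be fed with `parse_mbm_assignments < /sys/fs/resctrl/mbm_L3_assignments`. Splitting with parameter expansion rather than IFS avoids pathname expansion on the '*' wildcard that the interface accepts on writes.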
diff --git a/Documentation/filesystems/sharedsubtree.rst b/Documentation/filesystems/sharedsubtree.rst
index 1cf56489ed48..8b7dc9159083 100644
--- a/Documentation/filesystems/sharedsubtree.rst
+++ b/Documentation/filesystems/sharedsubtree.rst
@@ -31,965 +31,960 @@ and versioned filesystem.
-----------
Shared subtree provides four different flavors of mounts; struct vfsmount to be
-precise
+precise:
- a. shared mount
- b. slave mount
- c. private mount
- d. unbindable mount
+a) A **shared mount** can be replicated to any number of mountpoints, and all
+ the replicas continue to be exactly the same.
-2a) A shared mount can be replicated to as many mountpoints and all the
-replicas continue to be exactly same.
+ Here is an example:
- Here is an example:
+ Let's say /mnt has a mount that is shared::
- Let's say /mnt has a mount that is shared::
+ # mount --make-shared /mnt
- mount --make-shared /mnt
+ .. note::
+ mount(8) command now supports the --make-shared flag,
+ so the sample 'smount' program is no longer needed and has been
+ removed.
- Note: mount(8) command now supports the --make-shared flag,
- so the sample 'smount' program is no longer needed and has been
- removed.
+ ::
- ::
+ # mount --bind /mnt /tmp
- # mount --bind /mnt /tmp
+ The above command replicates the mount at /mnt to the mountpoint /tmp
+ and the contents of both the mounts remain identical.
- The above command replicates the mount at /mnt to the mountpoint /tmp
- and the contents of both the mounts remain identical.
+ ::
- ::
+ # ls /mnt
+ a b c
- #ls /mnt
- a b c
+ # ls /tmp
+ a b c
- #ls /tmp
- a b c
+ Now let's say we mount a device at /tmp/a::
- Now let's say we mount a device at /tmp/a::
+ # mount /dev/sd0 /tmp/a
- # mount /dev/sd0 /tmp/a
+ # ls /tmp/a
+ t1 t2 t3
- #ls /tmp/a
- t1 t2 t3
+ # ls /mnt/a
+ t1 t2 t3
- #ls /mnt/a
- t1 t2 t3
+ Note that the mount has propagated to the mount at /mnt as well.
- Note that the mount has propagated to the mount at /mnt as well.
+ And the same is true even when /dev/sd0 is mounted on /mnt/a. The
+ contents will be visible under /tmp/a too.
- And the same is true even when /dev/sd0 is mounted on /mnt/a. The
- contents will be visible under /tmp/a too.
+b) A **slave mount** is like a shared mount except that mount and umount events
+ only propagate towards it.
-2b) A slave mount is like a shared mount except that mount and umount events
- only propagate towards it.
+ All slave mounts have a master mount which is shared.
- All slave mounts have a master mount which is a shared.
+ Here is an example:
- Here is an example:
+ Let's say /mnt has a mount which is shared::
- Let's say /mnt has a mount which is shared.
- # mount --make-shared /mnt
+ # mount --make-shared /mnt
- Let's bind mount /mnt to /tmp
- # mount --bind /mnt /tmp
+ Let's bind mount /mnt to /tmp::
- the new mount at /tmp becomes a shared mount and it is a replica of
- the mount at /mnt.
+ # mount --bind /mnt /tmp
- Now let's make the mount at /tmp; a slave of /mnt
- # mount --make-slave /tmp
+ the new mount at /tmp becomes a shared mount and it is a replica of
+ the mount at /mnt.
- let's mount /dev/sd0 on /mnt/a
- # mount /dev/sd0 /mnt/a
+ Now let's make the mount at /tmp a slave of /mnt::
- #ls /mnt/a
- t1 t2 t3
+ # mount --make-slave /tmp
- #ls /tmp/a
- t1 t2 t3
+ let's mount /dev/sd0 on /mnt/a::
- Note the mount event has propagated to the mount at /tmp
+ # mount /dev/sd0 /mnt/a
- However let's see what happens if we mount something on the mount at /tmp
+ # ls /mnt/a
+ t1 t2 t3
- # mount /dev/sd1 /tmp/b
+ # ls /tmp/a
+ t1 t2 t3
- #ls /tmp/b
- s1 s2 s3
+ Note the mount event has propagated to the mount at /tmp
- #ls /mnt/b
+ However let's see what happens if we mount something on the mount at
+ /tmp::
- Note how the mount event has not propagated to the mount at
- /mnt
+ # mount /dev/sd1 /tmp/b
+ # ls /tmp/b
+ s1 s2 s3
-2c) A private mount does not forward or receive propagation.
+ # ls /mnt/b
- This is the mount we are familiar with. Its the default type.
+ Note how the mount event has not propagated to the mount at
+ /mnt
-2d) A unbindable mount is a unbindable private mount
+c) A **private mount** does not forward or receive propagation.
- let's say we have a mount at /mnt and we make it unbindable::
+ This is the mount we are familiar with. It's the default type.
- # mount --make-unbindable /mnt
- Let's try to bind mount this mount somewhere else::
+d) An **unbindable mount** is, as the name suggests, an unbindable private
+ mount.
- # mount --bind /mnt /tmp
- mount: wrong fs type, bad option, bad superblock on /mnt,
- or too many mounted file systems
+ let's say we have a mount at /mnt and we make it unbindable::
- Binding a unbindable mount is a invalid operation.
+ # mount --make-unbindable /mnt
+
+ Let's try to bind mount this mount somewhere else::
+
+ # mount --bind /mnt /tmp
+ mount: wrong fs type, bad option, bad superblock on /mnt,
+ or too many mounted file systems
+
+ Binding an unbindable mount is an invalid operation.
3) Setting mount states
-----------------------
- The mount command (util-linux package) can be used to set mount
- states::
+The mount command (util-linux package) can be used to set mount
+states::
- mount --make-shared mountpoint
- mount --make-slave mountpoint
- mount --make-private mountpoint
- mount --make-unbindable mountpoint
+ mount --make-shared mountpoint
+ mount --make-slave mountpoint
+ mount --make-private mountpoint
+ mount --make-unbindable mountpoint
4) Use cases
------------
- A) A process wants to clone its own namespace, but still wants to
- access the CD that got mounted recently.
+A) A process wants to clone its own namespace, but still wants to
+ access the CD that got mounted recently.
- Solution:
+ Solution:
- The system administrator can make the mount at /cdrom shared::
+ The system administrator can make the mount at /cdrom shared::
- mount --bind /cdrom /cdrom
- mount --make-shared /cdrom
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom
- Now any process that clones off a new namespace will have a
- mount at /cdrom which is a replica of the same mount in the
- parent namespace.
+ Now any process that clones off a new namespace will have a
+ mount at /cdrom which is a replica of the same mount in the
+ parent namespace.
- So when a CD is inserted and mounted at /cdrom that mount gets
- propagated to the other mount at /cdrom in all the other clone
- namespaces.
+ So when a CD is inserted and mounted at /cdrom that mount gets
+ propagated to the other mount at /cdrom in all the other clone
+ namespaces.
- B) A process wants its mounts invisible to any other process, but
- still be able to see the other system mounts.
+B) A process wants its mounts invisible to any other process, but
+ still be able to see the other system mounts.
- Solution:
+ Solution:
- To begin with, the administrator can mark the entire mount tree
- as shareable::
+ To begin with, the administrator can mark the entire mount tree
+ as shareable::
- mount --make-rshared /
+ mount --make-rshared /
- A new process can clone off a new namespace. And mark some part
- of its namespace as slave::
+ A new process can clone off a new namespace. And mark some part
+ of its namespace as slave::
- mount --make-rslave /myprivatetree
+ mount --make-rslave /myprivatetree
- Hence forth any mounts within the /myprivatetree done by the
- process will not show up in any other namespace. However mounts
- done in the parent namespace under /myprivatetree still shows
- up in the process's namespace.
+ Henceforth any mounts within /myprivatetree done by the
+ process will not show up in any other namespace. However, mounts
+ done in the parent namespace under /myprivatetree still show
+ up in the process's namespace.
- Apart from the above semantics this feature provides the
- building blocks to solve the following problems:
+Apart from the above semantics this feature provides the
+building blocks to solve the following problems:
- C) Per-user namespace
+C) Per-user namespace
- The above semantics allows a way to share mounts across
- namespaces. But namespaces are associated with processes. If
- namespaces are made first class objects with user API to
- associate/disassociate a namespace with userid, then each user
- could have his/her own namespace and tailor it to his/her
- requirements. This needs to be supported in PAM.
+ The above semantics provide a way to share mounts across
+ namespaces. But namespaces are associated with processes. If
+ namespaces are made first class objects with user API to
+ associate/disassociate a namespace with userid, then each user
+ could have his/her own namespace and tailor it to his/her
+ requirements. This needs to be supported in PAM.
- D) Versioned files
+D) Versioned files
- If the entire mount tree is visible at multiple locations, then
- an underlying versioning file system can return different
- versions of the file depending on the path used to access that
- file.
+ If the entire mount tree is visible at multiple locations, then
+ an underlying versioning file system can return different
+ versions of the file depending on the path used to access that
+ file.
- An example is::
+ An example is::
- mount --make-shared /
- mount --rbind / /view/v1
- mount --rbind / /view/v2
- mount --rbind / /view/v3
- mount --rbind / /view/v4
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4
- and if /usr has a versioning filesystem mounted, then that
- mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
- /view/v4/usr too
+ and if /usr has a versioning filesystem mounted, then that
+ mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
+ /view/v4/usr too
- A user can request v3 version of the file /usr/fs/namespace.c
- by accessing /view/v3/usr/fs/namespace.c . The underlying
- versioning filesystem can then decipher that v3 version of the
- filesystem is being requested and return the corresponding
- inode.
+ A user can request v3 version of the file /usr/fs/namespace.c
+ by accessing /view/v3/usr/fs/namespace.c . The underlying
+ versioning filesystem can then decipher that v3 version of the
+ filesystem is being requested and return the corresponding
+ inode.
5) Detailed semantics
---------------------
- The section below explains the detailed semantics of
- bind, rbind, move, mount, umount and clone-namespace operations.
-
- Note: the word 'vfsmount' and the noun 'mount' have been used
- to mean the same thing, throughout this document.
+The section below explains the detailed semantics of
+bind, rbind, move, mount, umount and clone-namespace operations.
-5a) Mount states
+.. Note::
+ the word 'vfsmount' and the noun 'mount' have been used
+ to mean the same thing, throughout this document.
- A given mount can be in one of the following states
+a) Mount states
- 1) shared
- 2) slave
- 3) shared and slave
- 4) private
- 5) unbindable
+ A **propagation event** is defined as an event generated on a vfsmount
+ that leads to mount or unmount actions in other vfsmounts.
- A 'propagation event' is defined as event generated on a vfsmount
- that leads to mount or unmount actions in other vfsmounts.
+ A **peer group** is defined as a group of vfsmounts that propagate
+ events to each other.
- A 'peer group' is defined as a group of vfsmounts that propagate
- events to each other.
+ A given mount can be in one of the following states:
- (1) Shared mounts
+ (1) Shared mounts
- A 'shared mount' is defined as a vfsmount that belongs to a
- 'peer group'.
+ A **shared mount** is defined as a vfsmount that belongs to a
+ peer group.
- For example::
+ For example::
- mount --make-shared /mnt
- mount --bind /mnt /tmp
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
- The mount at /mnt and that at /tmp are both shared and belong
- to the same peer group. Anything mounted or unmounted under
- /mnt or /tmp reflect in all the other mounts of its peer
- group.
+ The mount at /mnt and that at /tmp are both shared and belong
+ to the same peer group. Anything mounted or unmounted under
+ /mnt or /tmp is reflected in all the other mounts of its peer
+ group.
- (2) Slave mounts
+ (2) Slave mounts
- A 'slave mount' is defined as a vfsmount that receives
- propagation events and does not forward propagation events.
+ A **slave mount** is defined as a vfsmount that receives
+ propagation events and does not forward propagation events.
- A slave mount as the name implies has a master mount from which
- mount/unmount events are received. Events do not propagate from
- the slave mount to the master. Only a shared mount can be made
- a slave by executing the following command::
+ A slave mount as the name implies has a master mount from which
+ mount/unmount events are received. Events do not propagate from
+ the slave mount to the master. Only a shared mount can be made
+ a slave by executing the following command::
- mount --make-slave mount
+ mount --make-slave mount
- A shared mount that is made as a slave is no more shared unless
- modified to become shared.
+ A shared mount that is made a slave is no longer shared unless
+ modified to become shared.
- (3) Shared and Slave
+ (3) Shared and Slave
- A vfsmount can be both shared as well as slave. This state
- indicates that the mount is a slave of some vfsmount, and
- has its own peer group too. This vfsmount receives propagation
- events from its master vfsmount, and also forwards propagation
- events to its 'peer group' and to its slave vfsmounts.
+ A vfsmount can be both **shared** as well as **slave**. This state
+ indicates that the mount is a slave of some vfsmount, and
+ has its own peer group too. This vfsmount receives propagation
+ events from its master vfsmount, and also forwards propagation
+ events to its 'peer group' and to its slave vfsmounts.
- Strictly speaking, the vfsmount is shared having its own
- peer group, and this peer-group is a slave of some other
- peer group.
+ Strictly speaking, the vfsmount is shared having its own
+ peer group, and this peer-group is a slave of some other
+ peer group.
- Only a slave vfsmount can be made as 'shared and slave' by
- either executing the following command::
+ Only a slave vfsmount can be made as 'shared and slave' by
+ either executing the following command::
- mount --make-shared mount
+ mount --make-shared mount
- or by moving the slave vfsmount under a shared vfsmount.
+ or by moving the slave vfsmount under a shared vfsmount.
- (4) Private mount
+ (4) Private mount
- A 'private mount' is defined as vfsmount that does not
- receive or forward any propagation events.
+ A **private mount** is defined as a vfsmount that does not
+ receive or forward any propagation events.
- (5) Unbindable mount
+ (5) Unbindable mount
- A 'unbindable mount' is defined as vfsmount that does not
- receive or forward any propagation events and cannot
- be bind mounted.
+ An **unbindable mount** is defined as a vfsmount that does not
+ receive or forward any propagation events and cannot
+ be bind mounted.
- State diagram:
+ State diagram:
- The state diagram below explains the state transition of a mount,
- in response to various commands::
+ The state diagram below explains the state transition of a mount,
+ in response to various commands::
- -----------------------------------------------------------------------
- | |make-shared | make-slave | make-private |make-unbindab|
- --------------|------------|--------------|--------------|-------------|
- |shared |shared |*slave/private| private | unbindable |
- | | | | | |
- |-------------|------------|--------------|--------------|-------------|
- |slave |shared | **slave | private | unbindable |
- | |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |shared |shared | slave | private | unbindable |
- |and slave |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |private |shared | **private | private | unbindable |
- |-------------|------------|--------------|--------------|-------------|
- |unbindable |shared |**unbindable | private | unbindable |
- ------------------------------------------------------------------------
+ -----------------------------------------------------------------------
+ | |make-shared | make-slave | make-private |make-unbindab|
+ --------------|------------|--------------|--------------|-------------|
+ |shared |shared |*slave/private| private | unbindable |
+ | | | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |slave |shared | **slave | private | unbindable |
+ | |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |shared |shared | slave | private | unbindable |
+ |and slave |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |private |shared | **private | private | unbindable |
+ |-------------|------------|--------------|--------------|-------------|
+ |unbindable |shared |**unbindable | private | unbindable |
+ ------------------------------------------------------------------------
- * if the shared mount is the only mount in its peer group, making it
- slave, makes it private automatically. Note that there is no master to
- which it can be slaved to.
+ * if the shared mount is the only mount in its peer group, making it
+ slave, makes it private automatically. Note that there is no master to
+ which it can be slaved to.
- ** slaving a non-shared mount has no effect on the mount.
+ ** slaving a non-shared mount has no effect on the mount.
- Apart from the commands listed below, the 'move' operation also changes
- the state of a mount depending on type of the destination mount. Its
- explained in section 5d.
+ Apart from the commands listed below, the 'move' operation also changes
+ the state of a mount depending on type of the destination mount. It's
+ explained in section 5d.
-5b) Bind semantics
+b) Bind semantics
- Consider the following command::
+ Consider the following command::
- mount --bind A/a B/b
+ mount --bind A/a B/b
- where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
- is the destination mount and 'b' is the dentry in the destination mount.
+ where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
+ is the destination mount and 'b' is the dentry in the destination mount.
- The outcome depends on the type of mount of 'A' and 'B'. The table
- below contains quick reference::
+ The outcome depends on the type of mount of 'A' and 'B'. The table
+ below contains quick reference::
- --------------------------------------------------------------------------
- | BIND MOUNT OPERATION |
- |************************************************************************|
- |source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |************************************************************************|
- | shared | shared | shared | shared & slave | invalid |
- | | | | | |
- |non-shared| shared | private | slave | invalid |
- **************************************************************************
+ --------------------------------------------------------------------------
+ | BIND MOUNT OPERATION |
+ |************************************************************************|
+ |source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |************************************************************************|
+ | shared | shared | shared | shared & slave | invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | invalid |
+ **************************************************************************
- Details:
+ Details:
- 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
- which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
- are created and mounted at the dentry 'b' on all mounts where 'B'
- propagates to. A new propagation tree containing 'C1',..,'Cn' is
- created. This propagation tree is identical to the propagation tree of
- 'B'. And finally the peer-group of 'C' is merged with the peer group
- of 'A'.
+ 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
+ which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree containing 'C1',..,'Cn' is
+ created. This propagation tree is identical to the propagation tree of
+ 'B'. And finally the peer-group of 'C' is merged with the peer group
+ of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
- which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
- are created and mounted at the dentry 'b' on all mounts where 'B'
- propagates to. A new propagation tree is set containing all new mounts
- 'C', 'C1', .., 'Cn' with exactly the same configuration as the
- propagation tree for 'B'.
+ 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
+ which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree is set containing all new mounts
+ 'C', 'C1', .., 'Cn' with exactly the same configuration as the
+ propagation tree for 'B'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
- mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
- 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
- 'C3' ... are created and mounted at the dentry 'b' on all mounts where
- 'B' propagates to. A new propagation tree containing the new mounts
- 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
- propagation tree for 'B'. And finally the mount 'C' and its peer group
- is made the slave of mount 'Z'. In other words, mount 'C' is in the
- state 'slave and shared'.
-
- 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
- invalid operation.
-
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
- unbindable) mount. A new mount 'C' which is clone of 'A', is created.
- Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
-
- 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
- which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
- mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
- peer-group of 'A'.
-
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
- new mount 'C' which is a clone of 'A' is created. Its root dentry is
- 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
- slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
- 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
- mount/unmount on 'A' do not propagate anywhere else. Similarly
- mount/unmount on 'C' do not propagate anywhere else.
-
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
- invalid operation. A unbindable mount cannot be bind mounted.
-
-5c) Rbind semantics
-
- rbind is same as bind. Bind replicates the specified mount. Rbind
- replicates all the mounts in the tree belonging to the specified mount.
- Rbind mount is bind mount applied to all the mounts in the tree.
-
- If the source tree that is rbind has some unbindable mounts,
- then the subtree under the unbindable mount is pruned in the new
- location.
-
- eg:
-
- let's say we have the following mount tree::
-
- A
- / \
- B C
- / \ / \
- D E F G
-
- Let's say all the mount except the mount C in the tree are
- of a type other than unbindable.
-
- If this tree is rbound to say Z
-
- We will have the following tree at the new location::
-
- Z
- |
- A'
- /
- B' Note how the tree under C is pruned
- / \ in the new location.
- D' E'
-
-
-
-5d) Move semantics
-
- Consider the following command
-
- mount --move A B/b
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+ mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
+ 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
+ 'C3' ... are created and mounted at the dentry 'b' on all mounts where
+ 'B' propagates to. A new propagation tree containing the new mounts
+ 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
+ propagation tree for 'B'. And finally the mount 'C' and its peer group
+ is made the slave of mount 'Z'. In other words, mount 'C' is in the
+ state 'slave and shared'.
+
+ 4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
+ invalid operation.
+
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ unbindable) mount. A new mount 'C' which is clone of 'A', is created.
+ Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
+
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
+ peer-group of 'A'.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
+ new mount 'C' which is a clone of 'A' is created. Its root dentry is
+ 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
+ slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
+ 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
+ mount/unmount on 'A' do not propagate anywhere else. Similarly
+ mount/unmount on 'C' do not propagate anywhere else.
+
+ 8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
+ invalid operation. An unbindable mount cannot be bind mounted.
+
+c) Rbind semantics
+
+ rbind is same as bind. Bind replicates the specified mount. Rbind
+ replicates all the mounts in the tree belonging to the specified mount.
+ Rbind mount is bind mount applied to all the mounts in the tree.
+
+ If the source tree that is rbind has some unbindable mounts,
+ then the subtree under the unbindable mount is pruned in the new
+ location.
+
+ eg:
+
+ let's say we have the following mount tree::
+
+ A
+ / \
+ B C
+ / \ / \
+ D E F G
+
+ Let's say all the mounts except the mount C in the tree are
+ of a type other than unbindable.
+
+ If this tree is rbound to say Z
+
+ We will have the following tree at the new location::
+
+ Z
+ |
+ A'
+ /
+ B' Note how the tree under C is pruned
+ / \ in the new location.
+ D' E'
+
+
+
+d) Move semantics
+
+ Consider the following command::
+
+ mount --move A B/b
- where 'A' is the source mount, 'B' is the destination mount and 'b' is
- the dentry in the destination mount.
+ where 'A' is the source mount, 'B' is the destination mount and 'b' is
+ the dentry in the destination mount.
- The outcome depends on the type of the mount of 'A' and 'B'. The table
- below is a quick reference::
+ The outcome depends on the type of the mount of 'A' and 'B'. The table
+ below is a quick reference::
- ---------------------------------------------------------------------------
- | MOVE MOUNT OPERATION |
- |**************************************************************************
- | source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |**************************************************************************
- | shared | shared | shared |shared and slave| invalid |
- | | | | | |
- |non-shared| shared | private | slave | unbindable |
- ***************************************************************************
+ ---------------------------------------------------------------------------
+ | MOVE MOUNT OPERATION |
+ |**************************************************************************
+ | source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |**************************************************************************
+ | shared | shared | shared |shared and slave| invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | unbindable |
+ ***************************************************************************
- .. Note:: moving a mount residing under a shared mount is invalid.
+ .. Note:: moving a mount residing under a shared mount is invalid.
- Details follow:
+ Details follow:
- 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
- mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
- are created and mounted at dentry 'b' on all mounts that receive
- propagation from mount 'B'. A new propagation tree is created in the
- exact same configuration as that of 'B'. This new propagation tree
- contains all the new mounts 'A1', 'A2'... 'An'. And this new
- propagation tree is appended to the already existing propagation tree
- of 'A'.
+ 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. A new propagation tree is created in the
+ exact same configuration as that of 'B'. This new propagation tree
+ contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree
+ of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
- mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
- are created and mounted at dentry 'b' on all mounts that receive
- propagation from mount 'B'. The mount 'A' becomes a shared mount and a
- propagation tree is created which is identical to that of
- 'B'. This new propagation tree contains all the new mounts 'A1',
- 'A2'... 'An'.
+ 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. The mount 'A' becomes a shared mount and a
+ propagation tree is created which is identical to that of
+ 'B'. This new propagation tree contains all the new mounts 'A1',
+ 'A2'... 'An'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
- mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
- 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
- receive propagation from mount 'B'. A new propagation tree is created
- in the exact same configuration as that of 'B'. This new propagation
- tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
- propagation tree is appended to the already existing propagation tree of
- 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
- becomes 'shared'.
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
+ 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
+ receive propagation from mount 'B'. A new propagation tree is created
+ in the exact same configuration as that of 'B'. This new propagation
+ tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree of
+ 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
+ becomes 'shared'.
- 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
- is invalid. Because mounting anything on the shared mount 'B' can
- create new mounts that get mounted on the mounts that receive
- propagation from 'B'. And since the mount 'A' is unbindable, cloning
- it to mount at other mountpoints is not possible.
+ 4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
+ is invalid, because mounting anything on the shared mount 'B' can
+ create new mounts that get mounted on the mounts that receive
+ propagation from 'B'. Since the mount 'A' is unbindable, cloning
+ it to mount at other mountpoints is not possible.
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
- unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
- 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
- is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
- shared mount.
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ shared mount.
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
- The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
- continues to be a slave mount of mount 'Z'.
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
+ The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
+ continues to be a slave mount of mount 'Z'.
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
- 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
- unbindable mount.
+ 8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
+ 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
+ unbindable mount.
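The move quick-reference table can also be read as a lookup keyed by the destination and source types. The sketch below simply transcribes the table; the `MOVE_RESULT` dictionary and the `move_result` function are hypothetical names for this illustration and do not correspond to any kernel interface.

```python
# Resulting type of mount 'A' after `mount --move A B/b`,
# keyed by (type of destination 'B', type of source 'A').
# None marks the combination the table lists as invalid.
MOVE_RESULT = {
    ("shared", "shared"): "shared",
    ("shared", "private"): "shared",
    ("shared", "slave"): "shared and slave",
    ("shared", "unbindable"): None,
    ("non-shared", "shared"): "shared",
    ("non-shared", "private"): "private",
    ("non-shared", "slave"): "slave",
    ("non-shared", "unbindable"): "unbindable",
}


def move_result(source_type: str, dest_type: str) -> str:
    """Return the type of the moved mount, or raise if the move is invalid."""
    # Private, slave and unbindable destinations all count as non-shared.
    dest = "shared" if dest_type == "shared" else "non-shared"
    result = MOVE_RESULT[(dest, source_type)]
    if result is None:
        raise ValueError("cannot move an unbindable mount onto a shared mount")
    return result


print(move_result("slave", "shared"))    # shared and slave
print(move_result("private", "slave"))   # private
```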
-5e) Mount semantics
+e) Mount semantics
- Consider the following command::
+ Consider the following command::
- mount device B/b
+ mount device B/b
- 'B' is the destination mount and 'b' is the dentry in the destination
- mount.
+ 'B' is the destination mount and 'b' is the dentry in the destination
+ mount.
- The above operation is the same as bind operation with the exception
- that the source mount is always a private mount.
+ The above operation is the same as bind operation with the exception
+ that the source mount is always a private mount.
-5f) Unmount semantics
+f) Unmount semantics
- Consider the following command::
+ Consider the following command::
- umount A
+ umount A
- where 'A' is a mount mounted on mount 'B' at dentry 'b'.
+ where 'A' is a mount mounted on mount 'B' at dentry 'b'.
- If mount 'B' is shared, then all most-recently-mounted mounts at dentry
- 'b' on mounts that receive propagation from mount 'B' and does not have
- sub-mounts within them are unmounted.
+ If mount 'B' is shared, then all most-recently-mounted mounts at dentry
+ 'b' on mounts that receive propagation from mount 'B' and do not have
+ sub-mounts within them are unmounted.
- Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
- each other.
+ Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
+ each other.
- let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
- 'B1', 'B2' and 'B3' respectively.
+ let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
+ 'B1', 'B2' and 'B3' respectively.
- let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
- mount 'B1', 'B2' and 'B3' respectively.
+ let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
+ mount 'B1', 'B2' and 'B3' respectively.
- if 'C1' is unmounted, all the mounts that are most-recently-mounted on
- 'B1' and on the mounts that 'B1' propagates-to are unmounted.
+ if 'C1' is unmounted, all the mounts that are most-recently-mounted on
+ 'B1' and on the mounts that 'B1' propagates-to are unmounted.
- 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
- on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
+ 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
+ on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
- So all 'C1', 'C2' and 'C3' should be unmounted.
+ So all 'C1', 'C2' and 'C3' should be unmounted.
- If any of 'C2' or 'C3' has some child mounts, then that mount is not
- unmounted, but all other mounts are unmounted. However if 'C1' is told
- to be unmounted and 'C1' has some sub-mounts, the umount operation is
- failed entirely.
+ If any of 'C2' or 'C3' has some child mounts, then that mount is not
+ unmounted, but all other mounts are unmounted. However, if 'C1' is told
+ to be unmounted and 'C1' has some sub-mounts, the umount operation
+ fails entirely.
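The unmount rule described above can be modelled with one stack of mounts per parent. This is a simplified sketch of the paragraph, not of the kernel code: `umount_propagated`, the `stacks` dictionary and the `submounts` dictionary are hypothetical names, and every parent is assumed to stack its mounts at the same dentry 'b'.

```python
def umount_propagated(stacks, submounts, parent, receivers):
    """Unmount the top mount on `parent` and propagate to `receivers`.

    stacks:    parent name -> list of mounts at dentry 'b', most recent last
    submounts: mount name -> list of its child mounts (missing means none)
    Returns the list of mounts that were unmounted.
    """
    target = stacks[parent][-1]
    if submounts.get(target):
        # The mount named on the command line has sub-mounts: fail entirely.
        raise OSError("target is busy: " + target)
    unmounted = [stacks[parent].pop()]
    for peer in receivers:
        # Only the most recently mounted mount without sub-mounts goes away.
        if stacks[peer] and not submounts.get(stacks[peer][-1]):
            unmounted.append(stacks[peer].pop())
    return unmounted


stacks = {"B1": ["A1", "C1"], "B2": ["A2", "C2"], "B3": ["A3", "C3"]}
submounts = {"C3": ["X"]}  # assume C3 has a child mount
print(umount_propagated(stacks, submounts, "B1", ["B2", "B3"]))  # ['C1', 'C2']
```

With a child mount under 'C3', only 'C1' and 'C2' are unmounted and 'C3' stays in place, as the text describes.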
-5g) Clone Namespace
+g) Clone Namespace
- A cloned namespace contains all the mounts as that of the parent
- namespace.
+ A cloned namespace contains the same mounts as the parent
+ namespace.
- Let's say 'A' and 'B' are the corresponding mounts in the parent and the
- child namespace.
+ Let's say 'A' and 'B' are the corresponding mounts in the parent and the
+ child namespace.
- If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
- each other.
+ If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
+ each other.
- If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
- 'Z'.
+ If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
+ 'Z'.
- If 'A' is a private mount, then 'B' is a private mount too.
+ If 'A' is a private mount, then 'B' is a private mount too.
- If 'A' is unbindable mount, then 'B' is a unbindable mount too.
+ If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
6) Quiz
-------
- A. What is the result of the following command sequence?
+A. What is the result of the following command sequence?
- ::
+ ::
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mount --bind /mnt /tmp
- mount --move /tmp /mnt/1
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+ mount --move /tmp /mnt/1
- what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
- Should they all be identical? or should /mnt and /mnt/1 be
- identical only?
+ What should be the contents of /mnt, /mnt/1 and /mnt/1/1?
+ Should they all be identical, or should only /mnt and /mnt/1 be
+ identical?
- B. What is the result of the following command sequence?
+B. What is the result of the following command sequence?
- ::
+ ::
- mount --make-rshared /
- mkdir -p /v/1
- mount --rbind / /v/1
+ mount --make-rshared /
+ mkdir -p /v/1
+ mount --rbind / /v/1
- what should be the content of /v/1/v/1 be?
+ What should be the contents of /v/1/v/1?
- C. What is the result of the following command sequence?
+C. What is the result of the following command sequence?
- ::
+ ::
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mkdir -p /mnt/1/2/3 /mnt/1/test
- mount --bind /mnt/1 /tmp
- mount --make-slave /mnt
- mount --make-shared /mnt
- mount --bind /mnt/1/2 /tmp1
- mount --make-slave /mnt
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1/2/3 /mnt/1/test
+ mount --bind /mnt/1 /tmp
+ mount --make-slave /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt/1/2 /tmp1
+ mount --make-slave /mnt
- At this point we have the first mount at /tmp and
- its root dentry is 1. Let's call this mount 'A'
- And then we have a second mount at /tmp1 with root
- dentry 2. Let's call this mount 'B'
- Next we have a third mount at /mnt with root dentry
- mnt. Let's call this mount 'C'
+ At this point we have the first mount at /tmp and
+ its root dentry is 1. Let's call this mount 'A'
+ And then we have a second mount at /tmp1 with root
+ dentry 2. Let's call this mount 'B'
+ Next we have a third mount at /mnt with root dentry
+ mnt. Let's call this mount 'C'
- 'B' is the slave of 'A' and 'C' is a slave of 'B'
- A -> B -> C
+ 'B' is the slave of 'A' and 'C' is a slave of 'B'
+ A -> B -> C
- at this point if we execute the following command
+ at this point if we execute the following command::
- mount --bind /bin /tmp/test
+ mount --bind /bin /tmp/test
- The mount is attempted on 'A'
+ The mount is attempted on 'A'
- will the mount propagate to 'B' and 'C' ?
+ will the mount propagate to 'B' and 'C' ?
- what would be the contents of
- /mnt/1/test be?
+ What would be the contents of
+ /mnt/1/test?
7) FAQ
------
- Q1. Why is bind mount needed? How is it different from symbolic links?
- symbolic links can get stale if the destination mount gets
- unmounted or moved. Bind mounts continue to exist even if the
- other mount is unmounted or moved.
+1. Why is bind mount needed? How is it different from symbolic links?
- Q2. Why can't the shared subtree be implemented using exportfs?
+ symbolic links can get stale if the destination mount gets
+ unmounted or moved. Bind mounts continue to exist even if the
+ other mount is unmounted or moved.
- exportfs is a heavyweight way of accomplishing part of what
- shared subtree can do. I cannot imagine a way to implement the
- semantics of slave mount using exportfs?
+2. Why can't the shared subtree be implemented using exportfs?
- Q3 Why is unbindable mount needed?
+ exportfs is a heavyweight way of accomplishing part of what
+ shared subtree can do. I cannot imagine a way to implement the
+ semantics of slave mount using exportfs.
- Let's say we want to replicate the mount tree at multiple
- locations within the same subtree.
+3. Why is unbindable mount needed?
- if one rbind mounts a tree within the same subtree 'n' times
- the number of mounts created is an exponential function of 'n'.
- Having unbindable mount can help prune the unneeded bind
- mounts. Here is an example.
+ Let's say we want to replicate the mount tree at multiple
+ locations within the same subtree.
- step 1:
- let's say the root tree has just two directories with
- one vfsmount::
+ if one rbind mounts a tree within the same subtree 'n' times
+ the number of mounts created is an exponential function of 'n'.
+ Having unbindable mount can help prune the unneeded bind
+ mounts. Here is an example.
- root
- / \
- tmp usr
+ step 1:
+ let's say the root tree has just two directories with
+ one vfsmount::
- And we want to replicate the tree at multiple
- mountpoints under /root/tmp
+ root
+ / \
+ tmp usr
- step 2:
- ::
+ And we want to replicate the tree at multiple
+ mountpoints under /root/tmp
+ step 2:
+ ::
- mount --make-shared /root
- mkdir -p /tmp/m1
+ mount --make-shared /root
- mount --rbind /root /tmp/m1
+ mkdir -p /tmp/m1
- the new tree now looks like this::
+ mount --rbind /root /tmp/m1
- root
- / \
- tmp usr
- /
- m1
- / \
- tmp usr
- /
- m1
+ the new tree now looks like this::
- it has two vfsmounts
+ root
+ / \
+ tmp usr
+ /
+ m1
+ / \
+ tmp usr
+ /
+ m1
- step 3:
- ::
+ it has two vfsmounts
- mkdir -p /tmp/m2
- mount --rbind /root /tmp/m2
+ step 3:
+ ::
- the new tree now looks like this::
+ mkdir -p /tmp/m2
+ mount --rbind /root /tmp/m2
- root
- / \
- tmp usr
- / \
- m1 m2
- / \ / \
- tmp usr tmp usr
- / \ /
- m1 m2 m1
- / \ / \
- tmp usr tmp usr
- / / \
- m1 m1 m2
- / \
- tmp usr
- / \
- m1 m2
+ the new tree now looks like this::
- it has 6 vfsmounts
+ root
+ / \
+ tmp usr
+ / \
+ m1 m2
+ / \ / \
+ tmp usr tmp usr
+ / \ /
+ m1 m2 m1
+ / \ / \
+ tmp usr tmp usr
+ / / \
+ m1 m1 m2
+ / \
+ tmp usr
+ / \
+ m1 m2
- step 4:
- ::
- mkdir -p /tmp/m3
- mount --rbind /root /tmp/m3
+ it has 6 vfsmounts
- I won't draw the tree..but it has 24 vfsmounts
+ step 4:
+ ::
+ mkdir -p /tmp/m3
+ mount --rbind /root /tmp/m3
- at step i the number of vfsmounts is V[i] = i*V[i-1].
- This is an exponential function. And this tree has way more
- mounts than what we really needed in the first place.
+ I won't draw the tree..but it has 24 vfsmounts
- One could use a series of umount at each step to prune
- out the unneeded mounts. But there is a better solution.
- Unclonable mounts come in handy here.
- step 1:
- let's say the root tree has just two directories with
- one vfsmount::
+ At step i the number of vfsmounts is V[i] = i*V[i-1].
+ This grows factorially, even faster than an exponential function. And
+ this tree has way more mounts than what we really needed in the first place.
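As a quick check of the recurrence, the sketch below evaluates V[i] = i*V[i-1] starting from the single vfsmount of step 1. The function name `vfsmount_counts` is made up for this illustration.

```python
def vfsmount_counts(steps):
    """Return [V[1], ..., V[steps]] for V[1] = 1 and V[i] = i * V[i-1]."""
    counts = [1]  # step 1: the root tree has one vfsmount
    for i in range(2, steps + 1):
        counts.append(i * counts[-1])
    return counts


print(vfsmount_counts(4))  # [1, 2, 6, 24]
```

The values 1, 2, 6 and 24 match the counts given for steps 1 to 4, and the sequence is the factorial of the step number.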
- root
- / \
- tmp usr
+ One could use a series of umount at each step to prune
+ out the unneeded mounts. But there is a better solution.
+ Unclonable mounts come in handy here.
- How do we set up the same tree at multiple locations under
- /root/tmp
+ step 1:
+ let's say the root tree has just two directories with
+ one vfsmount::
- step 2:
- ::
+ root
+ / \
+ tmp usr
+ How do we set up the same tree at multiple locations under
+ /root/tmp
- mount --bind /root/tmp /root/tmp
+ step 2:
+ ::
- mount --make-rshared /root
- mount --make-unbindable /root/tmp
- mkdir -p /tmp/m1
+ mount --bind /root/tmp /root/tmp
- mount --rbind /root /tmp/m1
+ mount --make-rshared /root
+ mount --make-unbindable /root/tmp
- the new tree now looks like this::
+ mkdir -p /tmp/m1
- root
- / \
- tmp usr
- /
- m1
- / \
- tmp usr
+ mount --rbind /root /tmp/m1
- step 3:
- ::
+ the new tree now looks like this::
- mkdir -p /tmp/m2
- mount --rbind /root /tmp/m2
+ root
+ / \
+ tmp usr
+ /
+ m1
+ / \
+ tmp usr
- the new tree now looks like this::
+ step 3:
+ ::
- root
- / \
- tmp usr
- / \
- m1 m2
- / \ / \
- tmp usr tmp usr
+ mkdir -p /tmp/m2
+ mount --rbind /root /tmp/m2
- step 4:
- ::
+ the new tree now looks like this::
- mkdir -p /tmp/m3
- mount --rbind /root /tmp/m3
+ root
+ / \
+ tmp usr
+ / \
+ m1 m2
+ / \ / \
+ tmp usr tmp usr
- the new tree now looks like this::
+ step 4:
+ ::
- root
- / \
- tmp usr
- / \ \
- m1 m2 m3
- / \ / \ / \
- tmp usr tmp usr tmp usr
+ mkdir -p /tmp/m3
+ mount --rbind /root /tmp/m3
+
+ the new tree now looks like this::
+
+ root
+ / \
+ tmp usr
+ / \ \
+ m1 m2 m3
+ / \ / \ / \
+ tmp usr tmp usr tmp usr
8) Implementation
-----------------
-8A) Datastructure
+A) Datastructure
+
+ Several new fields are introduced to struct vfsmount:
+
+ ->mnt_share
+ Links together all the mounts to/from which this vfsmount
+ sends/receives propagation events.
- 4 new fields are introduced to struct vfsmount:
+ ->mnt_slave_list
+ Links all the mounts to which this vfsmount
+ propagates.
- * ->mnt_share
- * ->mnt_slave_list
- * ->mnt_slave
- * ->mnt_master
+ ->mnt_slave
+ Links together all the slaves that its master vfsmount
+ propagates to.
- ->mnt_share
- links together all the mount to/from which this vfsmount
- send/receives propagation events.
+ ->mnt_master
+ Points to the master vfsmount from which this vfsmount
+ receives propagation.
- ->mnt_slave_list
- links all the mounts to which this vfsmount propagates
- to.
+ ->mnt_flags
+ Takes two more flags to indicate the propagation status of
+ the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
+ vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
+ replicated.
- ->mnt_slave
- links together all the slaves that its master vfsmount
- propagates to.
+ All the shared vfsmounts in a peer group form a cyclic list through
+ ->mnt_share.
- ->mnt_master
- points to the master vfsmount from which this vfsmount
- receives propagation.
+ All vfsmounts with the same ->mnt_master form a cyclic list anchored
+ in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
- ->mnt_flags
- takes two more flags to indicate the propagation status of
- the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
- vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
- replicated.
+ ->mnt_master can point to arbitrary (and possibly different) members
+ of master peer group. To find all immediate slaves of a peer group
+ you need to go through _all_ ->mnt_slave_list of its members.
+ Conceptually it's just a single set - distribution among the
+ individual lists does not affect propagation or the way propagation
+ tree is modified by operations.
- All the shared vfsmounts in a peer group form a cyclic list through
- ->mnt_share.
+ All vfsmounts in a peer group have the same ->mnt_master. If it is
+ non-NULL, they form a contiguous (ordered) segment of slave list.
- All vfsmounts with the same ->mnt_master form on a cyclic list anchored
- in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
+ An example propagation tree looks as shown in the figure below.
- ->mnt_master can point to arbitrary (and possibly different) members
- of master peer group. To find all immediate slaves of a peer group
- you need to go through _all_ ->mnt_slave_list of its members.
- Conceptually it's just a single set - distribution among the
- individual lists does not affect propagation or the way propagation
- tree is modified by operations.
+ .. note::
+ Though it looks like a forest, if we consider all the shared
+ mounts as a conceptual entity called 'pnode', it becomes a tree.
- All vfsmounts in a peer group have the same ->mnt_master. If it is
- non-NULL, they form a contiguous (ordered) segment of slave list.
+ ::
- A example propagation tree looks as shown in the figure below.
- [ NOTE: Though it looks like a forest, if we consider all the shared
- mounts as a conceptual entity called 'pnode', it becomes a tree]::
+ A <--> B <--> C <---> D
+ /|\ /| |\
+ / F G J K H I
+ /
+ E<-->K
+ /|\
+ M L N
- A <--> B <--> C <---> D
- /|\ /| |\
- / F G J K H I
- /
- E<-->K
- /|\
- M L N
+ In the above figure 'A', 'B', 'C' and 'D' are all shared and propagate to
+ each other. 'A' has got 3 slave mounts 'E', 'F' and 'G'; 'C' has got 2 slave
+ mounts 'J' and 'K'; and 'D' has got two slave mounts 'H' and 'I'.
+ 'E' is also shared with 'K' and they propagate to each other. And
+ 'K' has 3 slaves 'M', 'L' and 'N'.
- In the above figure A,B,C and D all are shared and propagate to each
- other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave
- mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'.
- 'E' is also shared with 'K' and they propagate to each other. And
- 'K' has 3 slaves 'M', 'L' and 'N'
+ A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
- A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
+ A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
- A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
+ E's ->mnt_share links with ->mnt_share of K
- E's ->mnt_share links with ->mnt_share of K
+ 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
- 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
+ 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
- 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
+ K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
- K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
+ C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
- C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
+ J and K's ->mnt_master points to struct vfsmount of C
- J and K's ->mnt_master points to struct vfsmount of C
+ and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
- and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
+ 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
- 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
+ NOTE: The propagation tree is orthogonal to the mount tree.
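The relationships between these fields can be illustrated with a small userspace model. This is a sketch under simplifying assumptions, not the kernel's data structures: plain Python lists stand in for the cyclic ->mnt_share and ->mnt_slave_list lists, and `Mount`, `make_peers`, `make_slave` and `receivers` are hypothetical helpers. The example uses a smaller tree than the figure.

```python
class Mount:
    def __init__(self, name):
        self.name = name
        self.peers = [self]    # stands in for the ->mnt_share list
        self.slave_list = []   # stands in for ->mnt_slave_list
        self.master = None     # stands in for ->mnt_master


def make_peers(*mounts):
    """Put the given mounts into one peer group."""
    group = list(mounts)
    for m in mounts:
        m.peers = group


def make_slave(slave, master):
    """Make `slave` receive propagation from `master`."""
    slave.master = master
    master.slave_list.append(slave)


def receivers(mnt):
    """Names of all mounts that receive propagation from mnt."""
    seen = []

    def visit_group(group):
        for m in group:
            if m not in seen:
                seen.append(m)
        # Slaves hang off individual members, so walk every member's list.
        for m in group:
            for s in m.slave_list:
                if s not in seen:
                    visit_group(s.peers)

    visit_group(mnt.peers)
    return [m.name for m in seen if m is not mnt]


a, b, e, h, m = (Mount(n) for n in "ABEHM")
make_peers(a, b)      # A <--> B
make_slave(e, a)      # E is a slave of A
make_slave(h, b)      # H is a slave of B
make_slave(m, e)      # M is a slave of E
print(receivers(a))   # ['B', 'E', 'M', 'H']
print(receivers(e))   # ['M']
```

Note that events on 'E' reach 'M' but never travel up to 'A' or 'B', which is the one-way nature of a slave mount.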
- NOTE: The propagation tree is orthogonal to the mount tree.
+B) Locking:
-8B Locking:
+ ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
+ by namespace_sem (exclusive for modifications, shared for reading).
- ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
- by namespace_sem (exclusive for modifications, shared for reading).
+ Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
+ There are two exceptions: do_add_mount() and clone_mnt().
+ The former modifies a vfsmount that has not been visible in any shared
+ data structures yet.
+ The latter holds namespace_sem and the only references to vfsmount
+ are in lists that can't be traversed without namespace_sem.
- Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
- There are two exceptions: do_add_mount() and clone_mnt().
- The former modifies a vfsmount that has not been visible in any shared
- data structures yet.
- The latter holds namespace_sem and the only references to vfsmount
- are in lists that can't be traversed without namespace_sem.
+C) Algorithm:
-8C Algorithm:
+ The crux of the implementation resides in rbind/move operation.
- The crux of the implementation resides in rbind/move operation.
+ The overall algorithm breaks the operation into 3 phases: (look at
+ attach_recursive_mnt() and propagate_mnt())
- The overall algorithm breaks the operation into 3 phases: (look at
- attach_recursive_mnt() and propagate_mnt())
+ 1. Prepare phase.
- 1. prepare phase.
- 2. commit phases.
- 3. abort phases.
+ For each mount in the source tree:
- Prepare phase:
+ a) Create the necessary number of mount trees to
+ be attached to each of the mounts that receive
+ propagation from the destination mount.
+ b) Do not attach any of the trees to its destination.
+ However note down its ->mnt_parent and ->mnt_mountpoint
+ c) Link all the new mounts to form a propagation tree that
+ is identical to the propagation tree of the destination
+ mount.
- for each mount in the source tree:
+ If this phase is successful, there should be 'n' new
+ propagation trees, where 'n' is the number of mounts in the
+ source tree. Go to the commit phase.
- a) Create the necessary number of mount trees to
- be attached to each of the mounts that receive
- propagation from the destination mount.
- b) Do not attach any of the trees to its destination.
- However note down its ->mnt_parent and ->mnt_mountpoint
- c) Link all the new mounts to form a propagation tree that
- is identical to the propagation tree of the destination
- mount.
+ Also there should be 'm' new mount trees, where 'm' is
+ the number of mounts to which the destination mount
+ propagates.
- If this phase is successful, there should be 'n' new
- propagation trees; where 'n' is the number of mounts in the
- source tree. Go to the commit phase
+ If any memory allocations fail, go to the abort phase.
- Also there should be 'm' new mount trees, where 'm' is
- the number of mounts to which the destination mount
- propagates to.
+ 2. Commit phase.
- if any memory allocations fail, go to the abort phase.
+ Attach each of the mount trees to their corresponding
+ destination mounts.
- Commit phase
- attach each of the mount trees to their corresponding
- destination mounts.
+ 3. Abort phase.
- Abort phase
- delete all the newly created trees.
+ Delete all the newly created trees.
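The three phases follow a common prepare/commit/abort pattern, which can be sketched as below. This is only an outline of the control flow described above, not a rendering of attach_recursive_mnt() or propagate_mnt(): the function `attach_with_propagation` and its `clone`, `attach` and `discard` callbacks are hypothetical.

```python
def attach_with_propagation(source, receivers, clone, attach, discard):
    """Clone `source` once per receiver, then attach all clones or none."""
    prepared = []
    try:
        # Prepare phase: create every new tree, but attach nothing yet.
        for dest in receivers:
            prepared.append((dest, clone(source)))
    except MemoryError:
        # Abort phase: delete all the newly created trees.
        for _dest, tree in prepared:
            discard(tree)
        return False
    # Commit phase: attach each tree to its corresponding destination.
    for dest, tree in prepared:
        attach(dest, tree)
    return True


attached = []
ok = attach_with_propagation(
    "A", ["B", "B1", "B2"],
    clone=lambda src: src + "'",
    attach=lambda dest, tree: attached.append((dest, tree)),
    discard=lambda tree: None,
)
print(ok, attached)
```

Because nothing is attached until every allocation has succeeded, a failure in the prepare phase leaves the existing mount tree untouched.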
- .. Note::
- all the propagation related functionality resides in the file pnode.c
+ .. Note::
+ all the propagation related functionality resides in the file pnode.c
------------------------------------------------------------------------
diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst
index c32993bc83c7..2703c04af7d0 100644
--- a/Documentation/filesystems/sysfs.rst
+++ b/Documentation/filesystems/sysfs.rst
@@ -243,8 +243,8 @@ Other notes:
- show() methods should return the number of bytes printed into the
buffer.
-- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
- the value to be returned to user space.
+- New implementations of show() methods should only use sysfs_emit() or
+ sysfs_emit_at() when formatting the value to be returned to user space.
- store() should return the number of bytes used from the buffer. If the
entire buffer has been used, just return the count argument.
@@ -299,7 +299,6 @@ The top level sysfs directory looks like::
hypervisor/
kernel/
module/
- net/
power/
devices/ contains a filesystem representation of the device tree. It maps
@@ -313,7 +312,7 @@ kernel. Each bus's directory contains two subdirectories::
drivers/
devices/ contains symlinks for each device discovered in the system
-that point to the device's directory under root/.
+that point to the device's directory under /sys/devices.
drivers/ contains a directory for each device driver that is loaded
for devices on that particular bus (this assumes that drivers do not
@@ -321,22 +320,36 @@ span multiple bus types).
fs/ contains a directory for some filesystems. Currently each
filesystem wanting to export attributes must create its own hierarchy
-below fs/ (see ./fuse.rst for an example).
+below fs/ (see fuse/fuse.rst for an example).
module/ contains parameter values and state information for all
loaded system modules, for both builtin and loadable modules.
dev/ contains two directories: char/ and block/. Inside these two
directories there are symlinks named <major>:<minor>. These symlinks
-point to the sysfs directory for the given device. /sys/dev provides a
+point to the directories under /sys/devices for each device. /sys/dev provides a
quick way to lookup the sysfs interface for a device from the result of
a stat(2) operation.
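A minimal userspace sketch of that lookup, assuming a Linux system: the helper name `sysfs_dev_path` is made up for this example, while `os.stat()`, `os.major()`, `os.minor()` and the `stat.S_ISBLK()`/`stat.S_ISCHR()` tests are standard Python library calls.

```python
import os
import stat


def sysfs_dev_path(path):
    """Build the /sys/dev/<type>/<major>:<minor> path for a device node."""
    st = os.stat(path)
    if stat.S_ISBLK(st.st_mode):
        kind = "block"
    elif stat.S_ISCHR(st.st_mode):
        kind = "char"
    else:
        raise ValueError(path + " is not a device node")
    # st_rdev holds the device number of the node itself.
    return "/sys/dev/%s/%d:%d" % (kind, os.major(st.st_rdev), os.minor(st.st_rdev))


print(sysfs_dev_path("/dev/null"))
```

On a typical Linux system /dev/null is character device 1:3, so this prints /sys/dev/char/1:3. Where sysfs exposes that device, the resulting path is a symlink that can be resolved with os.path.realpath().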
More information on driver-model specific features can be found in
Documentation/driver-api/driver-model/.
+block/ contains symlinks to all the block devices discovered on the system.
+These symlinks point to directories under /sys/devices.
-TODO: Finish this section.
+class/ contains a directory for each device class, grouped by functional type.
+Each directory in class/ contains symlinks to devices in the /sys/devices directory.
+
+firmware/ contains system firmware data and configuration such as firmware tables,
+ACPI information, and device tree data.
+
+hypervisor/ contains virtualization platform information and provides an interface to
+the underlying hypervisor. It is only present when running on a virtual machine.
+
+kernel/ contains runtime kernel parameters, configuration settings, and status.
+
+power/ contains power management subsystem information including
+sleep states, suspend/resume capabilities, and policies.
Current Interfaces
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..4f13b01e42eb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -209,31 +209,8 @@ method fills in is the "s_op" field. This is a pointer to a "struct
super_operations" which describes the next level of the filesystem
implementation.
-Usually, a filesystem uses one of the generic mount() implementations
-and provides a fill_super() callback instead. The generic variants are:
-
-``mount_bdev``
- mount a filesystem residing on a block device
-
-``mount_nodev``
- mount a filesystem that is not backed by a device
-
-``mount_single``
- mount a filesystem which shares the instance between all mounts
-
-A fill_super() callback implementation has the following arguments:
-
-``struct super_block *sb``
- the superblock structure. The callback must initialize this
- properly.
-
-``void *data``
- arbitrary mount options, usually comes as an ASCII string (see
- "Mount Options" section)
-
-``int silent``
- whether or not to be silent on error
-
+For more information on mounting (and the new mount API), see
+Documentation/filesystems/mount_api.rst.
The Superblock Object
=====================
@@ -327,11 +304,11 @@ or bottom half).
inode->i_lock spinlock held.
This method should be either NULL (normal UNIX filesystem
- semantics) or "generic_delete_inode" (for filesystems that do
+ semantics) or "inode_just_drop" (for filesystems that do
not want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
- The "generic_delete_inode()" behavior is equivalent to the old
+ The "inode_just_drop()" behavior is equivalent to the old
practice of using "force_delete" in the put_inode() case, but
does not have the races that the "force_delete()" approach had.
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index e231d127cd40..8cbcd3c26434 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -454,7 +454,7 @@ filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
-The hooks are disabled and the staging staging area is freed.
+The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure are carefully reaped.
Introducing concurrency helps online repair avoid various locking problems, but
@@ -475,7 +475,7 @@ operation, which may cause application failure or an unplanned filesystem
shutdown.
Inspiration for the secondary metadata repair strategy was drawn from section
-2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File")
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
@@ -2185,7 +2185,7 @@ The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
checking and repairing of secondary metadata commonly requires coordination
between a live metadata scan of the filesystem and writer threads that are
updating that metadata.
-Keeping the scan data up to date requires requires the ability to propagate
+Keeping the scan data up to date requires the ability to propagate
metadata updates from the filesystem into the data being collected by the scan.
This *can* be done by appending concurrent updates into a separate log file and
applying them before writing the new metadata to disk, but this leads to
@@ -4179,7 +4179,7 @@ When the exchange is initiated, the sequence of operations is as follows:
This will be discussed in more detail in subsequent sections.
If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished maping exchange log intent item and restart
+find the most recent unfinished mapping exchange log intent item and restart
from there.
This is how atomic file mapping exchanges guarantees that an outside observer
will either see the old broken structure or the new one, and never a mismash of