summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/bcachefs/CodingStyle.rst186
-rw-r--r--Documentation/filesystems/bcachefs/SubmittingPatches.rst105
-rw-r--r--Documentation/filesystems/bcachefs/casefolding.rst108
-rw-r--r--Documentation/filesystems/bcachefs/errorcodes.rst30
-rw-r--r--Documentation/filesystems/bcachefs/future/idle_work.rst78
-rw-r--r--Documentation/filesystems/bcachefs/index.rst38
-rw-r--r--Documentation/filesystems/index.rst1
-rw-r--r--Documentation/filesystems/porting.rst16
-rw-r--r--Documentation/filesystems/proc.rst8
-rw-r--r--Documentation/filesystems/resctrl.rst325
-rw-r--r--Documentation/filesystems/vfs.rst31
11 files changed, 351 insertions, 575 deletions
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
deleted file mode 100644
index b29562a6bf55..000000000000
--- a/Documentation/filesystems/bcachefs/CodingStyle.rst
+++ /dev/null
@@ -1,186 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-bcachefs coding style
-=====================
-
-Good development is like gardening, and codebases are our gardens. Tend to them
-every day; look for little things that are out of place or in need of tidying.
-A little weeding here and there goes a long way; don't wait until things have
-spiraled out of control.
-
-Things don't always have to be perfect - nitpicking often does more harm than
-good. But appreciate beauty when you see it - and let people know.
-
-The code that you are afraid to touch is the code most in need of refactoring.
-
-A little organizing here and there goes a long way.
-
-Put real thought into how you organize things.
-
-Good code is readable code, where the structure is simple and leaves nowhere
-for bugs to hide.
-
-Assertions are one of our most important tools for writing reliable code. If in
-the course of writing a patchset you encounter a condition that shouldn't
-happen (and will have unpredictable or undefined behaviour if it does), or
-you're not sure if it can happen and not sure how to handle it yet - make it a
-BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
-
-By the time you finish the patchset, you should understand better which
-assertions need to be handled and turned into checks with error paths, and
-which should be logically impossible. Leave the BUG_ON()s in for the ones which
-are logically impossible. (Or, make them debug mode assertions if they're
-expensive - but don't turn everything into a debug mode assertion, so that
-we're not stuck debugging undefined behaviour should it turn out that you were
-wrong).
-
-Assertions are documentation that can't go out of date. Good assertions are
-wonderful.
-
-Good assertions drastically and dramatically reduce the amount of testing
-required to shake out bugs.
-
-Good assertions are based on state, not logic. To write good assertions, you
-have to think about what the invariants on your state are.
-
-Good invariants and assertions will hold everywhere in your codebase. This
-means that you can run them in only a few places in the checked in version, but
-should you need to debug something that caused the assertion to fail, you can
-quickly shotgun them everywhere to find the codepath that broke the invariant.
-
-A good assertion checks something that the compiler could check for us, and
-elide - if we were working in a language with embedded correctness proofs that
-the compiler could check. This is something that exists today, but it'll likely
-still be a few decades before it comes to systems programming languages. But we
-can still incorporate that kind of thinking into our code and document the
-invariants with runtime checks - much like the way people working in
-dynamically typed languages may add type annotations, gradually making their
-code statically typed.
-
-Looking for ways to make your assertions simpler - and higher level - will
-often nudge you towards making the entire system simpler and more robust.
-
-Good code is code where you can poke around and see what it's doing -
-introspection. We can't debug anything if we can't see what's going on.
-
-Whenever we're debugging, and the solution isn't immediately obvious, if the
-issue is that we don't know where the issue is because we can't see what's
-going on - fix that first.
-
-We have the tools to make anything visible at runtime, efficiently - RCU and
-percpu data structures among them. Don't let things stay hidden.
-
-The most important tool for introspection is the humble pretty printer - in
-bcachefs, this means `*_to_text()` functions, which output to printbufs.
-
-Pretty printers are wonderful, because they compose and you can use them
-everywhere. Having functions to print whatever object you're working with will
-make your error messages much easier to write (therefore they will actually
-exist) and much more informative. And they can be used from sysfs/debugfs, as
-well as tracepoints.
-
-Runtime info and debugging tools should come with clear descriptions and
-labels, and good structure - we don't want files with a list of bare integers,
-like in procfs. Part of the job of the debugging tools is to educate users and
-new developers as to how the system works.
-
-Error messages should, whenever possible, tell you everything you need to debug
-the issue. It's worth putting effort into them.
-
-Tracepoints shouldn't be the first thing you reach for. They're an important
-tool, but always look for more immediate ways to make things visible. When we
-have to rely on tracing, we have to know which tracepoints we're looking for,
-and then we have to run the troublesome workload, and then we have to sift
-through logs. This is a lot of steps to go through when a user is hitting
-something, and if it's intermittent it may not even be possible.
-
-The humble counter is an incredibly useful tool. They're cheap and simple to
-use, and many complicated internal operations with lots of things that can
-behave weirdly (anything involving memory reclaim, for example) become
-shockingly easy to debug once you have counters on every distinct codepath.
-
-Persistent counters are even better.
-
-When debugging, try to get the most out of every bug you come across; don't
-rush to fix the initial issue. Look for things that will make related bugs
-easier the next time around - introspection, new assertions, better error
-messages, new debug tools, and do those first. Look for ways to make the system
-better behaved; often one bug will uncover several other bugs through
-downstream effects.
-
-Fix all that first, and then the original bug last - even if that means keeping
-a user waiting. They'll thank you in the long run, and when they understand
-what you're doing you'll be amazed at how patient they're happy to be. Users
-like to help - otherwise they wouldn't be reporting the bug in the first place.
-
-Talk to your users. Don't isolate yourself.
-
-Users notice all sorts of interesting things, and by just talking to them and
-interacting with them you can benefit from their experience.
-
-Spend time doing support and helpdesk stuff. Don't just write code - code isn't
-finished until it's being used trouble free.
-
-This will also motivate you to make your debugging tools as good as possible,
-and perhaps even your documentation, too. Like anything else in life, the more
-time you spend at it the better you'll get, and you the developer are the
-person most able to improve the tools to make debugging quick and easy.
-
-Be wary of how you take on and commit to big projects. Don't let development
-become product-manager focused. Often time an idea is a good one but needs to
-wait for its proper time - but you won't know if it's the proper time for an
-idea until you start writing code.
-
-Expect to throw a lot of things away, or leave them half finished for later.
-Nobody writes all perfect code that all gets shipped, and you'll be much more
-productive in the long run if you notice this early and shift to something
-else. The experience gained and lessons learned will be valuable for all the
-other work you do.
-
-But don't be afraid to tackle projects that require significant rework of
-existing code. Sometimes these can be the best projects, because they can lead
-us to make existing code more general, more flexible, more multipurpose and
-perhaps more robust. Just don't hesitate to abandon the idea if it looks like
-it's going to make a mess of things.
-
-Complicated features can often be done as a series of refactorings, with the
-final change that actually implements the feature as a quite small patch at the
-end. It's wonderful when this happens, especially when those refactorings are
-things that improve the codebase in their own right. When that happens there's
-much less risk of wasted effort if the feature you were going for doesn't work
-out.
-
-Always strive to work incrementally. Always strive to turn the big projects
-into little bite sized projects that can prove their own merits.
-
-Instead of always tackling those big projects, look for little things that
-will be useful, and make the big projects easier.
-
-The question of what's likely to be useful is where junior developers most
-often go astray - doing something because it seems like it'll be useful often
-leads to overengineering. Knowing what's useful comes from many years of
-experience, or talking with people who have that experience - or from simply
-reading lots of code and looking for common patterns and issues. Don't be
-afraid to throw things away and do something simpler.
-
-Talk about your ideas with your fellow developers; often times the best things
-come from relaxed conversations where people aren't afraid to say "what if?".
-
-Don't neglect your tools.
-
-The most important tools (besides the compiler and our text editor) are the
-tools we use for testing. The shortest possible edit/test/debug cycle is
-essential for working productively. We learn, gain experience, and discover the
-errors in our thinking by running our code and seeing what happens. If your
-time is being wasted because your tools are bad or too slow - don't accept it,
-fix it.
-
-Put effort into your documentation, commit messages, and code comments - but
-don't go overboard. A good commit message is wonderful - but if the information
-was important enough to go in a commit message, ask yourself if it would be
-even better as a code comment.
-
-A good code comment is wonderful, but even better is the comment that didn't
-need to exist because the code was so straightforward as to be obvious;
-organized into small clean and tidy modules, with clear and descriptive names
-for functions and variables, where every line of code has a clear purpose.
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
deleted file mode 100644
index 18c79d548391..000000000000
--- a/Documentation/filesystems/bcachefs/SubmittingPatches.rst
+++ /dev/null
@@ -1,105 +0,0 @@
-Submitting patches to bcachefs
-==============================
-
-Here are suggestions for submitting patches to bcachefs subsystem.
-
-Submission checklist
---------------------
-
-Patches must be tested before being submitted, either with the xfstests suite
-[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
-touched. Note that ktest wraps xfstests and will be an easier method to running
-it for most users; it includes single-command wrappers for all the mainstream
-in-kernel local filesystems.
-
-Patches will undergo more testing after being merged (including
-lockdep/kasan/preempt/etc. variants), these are not generally required to be
-run by the submitter - but do put some thought into what you're changing and
-which tests might be relevant, e.g. are you dealing with tricky memory layout
-work? kasan, are you doing locking work? then lockdep; and ktest includes
-single-command variants for the debug build types you'll most likely need.
-
-The exception to this rule is incomplete WIP/RFC patches: if you're working on
-something nontrivial, it's encouraged to send out a WIP patch to let people
-know what you're doing and make sure you're on the right track. Just make sure
-it includes a brief note as to what's done and what's incomplete, to avoid
-confusion.
-
-Rigorous checkpatch.pl adherence is not required (many of its warnings are
-considered out of date), but try not to deviate too much without reason.
-
-Focus on writing code that reads well and is organized well; code should be
-aesthetically pleasing.
-
-CI
---
-
-Instead of running your tests locally, when running the full test suite it's
-preferable to let a server farm do it in parallel, and then have the results
-in a nice test dashboard (which can tell you which failures are new, and
-presents results in a git log view, avoiding the need for most bisecting).
-
-That exists [2]_, and community members may request an account. If you work for
-a big tech company, you'll need to help out with server costs to get access -
-but the CI is not restricted to running bcachefs tests: it runs any ktest test
-(which generally makes it easy to wrap other tests that can run in qemu).
-
-Other things to think about
----------------------------
-
-- How will we debug this code? Is there sufficient introspection to diagnose
- when something starts acting wonky on a user machine?
-
- We don't necessarily need every single field of every data structure visible
- with introspection, but having the important fields of all the core data
- types wired up makes debugging drastically easier - a bit of thoughtful
- foresight greatly reduces the need to have people build custom kernels with
- debug patches.
-
- More broadly, think about all the debug tooling that might be needed.
-
-- Does it make the codebase more or less of a mess? Can we also try to do some
- organizing, too?
-
-- Do new tests need to be written? New assertions? How do we know and verify
- that the code is correct, and what happens if something goes wrong?
-
- We don't yet have automated code coverage analysis or easy fault injection -
- but for now, pretend we did and ask what they might tell us.
-
- Assertions are hugely important, given that we don't yet have a systems
- language that can do ergonomic embedded correctness proofs. Hitting an assert
- in testing is much better than wandering off into undefined behaviour la-la
- land - use them. Use them judiciously, and not as a replacement for proper
- error handling, but use them.
-
-- Does it need to be performance tested? Should we add new performance counters?
-
- bcachefs has a set of persistent runtime counters which can be viewed with
- the 'bcachefs fs top' command; this should give users a basic idea of what
- their filesystem is currently doing. If you're doing a new feature or looking
- at old code, think if anything should be added.
-
-- If it's a new on disk format feature - have upgrades and downgrades been
- tested? (Automated tests exists but aren't in the CI, due to the hassle of
- disk image management; coordinate to have them run.)
-
-Mailing list, IRC
------------------
-
-Patches should hit the list [3]_, but much discussion and code review happens
-on IRC as well [4]_; many people appreciate the more conversational approach
-and quicker feedback.
-
-Additionally, we have a lively user community doing excellent QA work, which
-exists primarily on IRC. Please make use of that resource; user feedback is
-important for any nontrivial feature, and documenting it in commit messages
-would be a good idea.
-
-.. rubric:: References
-
-.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-.. [1] https://evilpiepirate.org/git/ktest.git/
-.. [2] https://evilpiepirate.org/~testdashboard/ci/
-.. [3] linux-bcachefs@vger.kernel.org
-.. [4] irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
deleted file mode 100644
index 871a38f557e8..000000000000
--- a/Documentation/filesystems/bcachefs/casefolding.rst
+++ /dev/null
@@ -1,108 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Casefolding
-===========
-
-bcachefs has support for case-insensitive file and directory
-lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
-casefolding attributes.
-
-The main usecase for casefolding is compatibility with software written
-against other filesystems that rely on casefolded lookups
-(eg. NTFS and Wine/Proton).
-Taking advantage of file-system level casefolding can lead to great
-loading time gains in many applications and games.
-
-Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
-Once a directory has been flagged for casefolding, a feature bit
-is enabled on the superblock which marks the filesystem as using
-casefolding.
-When the feature bit for casefolding is enabled, it is no longer possible
-to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
-
-On the lookup/query side: casefolding is implemented by allocating a new
-string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
-casefold the query string.
-
-On the dirent side: casefolding is implemented by ensuring the `bkey`'s
-hash is made from the casefolded string and storing the cached casefolded
-name with the regular name in the dirent.
-
-The structure looks like this:
-
-* Regular: [dirent data][regular name][nul][nul]...
-* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
-
-(Do note, the number of NULs here is merely for illustration; their count can
-vary per-key, and they may not even be present if the key is aligned to
-`sizeof(u64)`.)
-
-This is efficient as it means that for all file lookups that require casefolding,
-it has identical performance to a regular lookup:
-a hash comparison and a `memcmp` of the name.
-
-Rationale
----------
-
-Several designs were considered for this system:
-One was to introduce a dirent_v2, however that would be painful especially as
-the hash system only has support for a single key type. This would also need
-`BCH_NAME_MAX` to change between versions, and a new feature bit.
-
-Another option was to store without the two lengths, and just take the length of
-the regular name and casefolded name contiguously / 2 as the length. This would
-assume that the regular length == casefolded length, but that could potentially
-not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
-the lowercase unicode glyph.
-It would be possible to disregard the casefold cache for those cases, but it was
-decided to simply encode the two string lengths in the key to avoid random
-performance issues if this edgecase was ever hit.
-
-The option settled on was to use a free-bit in d_type to mark a dirent as having
-a casefold cache, and then treat the first 4 bytes the name block as lengths.
-You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
-
-The feature bit was used to allow casefolding support to be enabled for the majority
-of users, but some allow users who have no need for the feature to still use bcachefs as
-`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
-which may be decider between using bcachefs for eg. embedded platforms.
-
-Other filesystems like ext4 and f2fs have a super-block level option for casefolding
-encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
-any encodings than a single UTF-8 version. When future encodings are desirable,
-they will be added trivially using the opts mechanism.
-
-dentry/dcache considerations
-----------------------------
-
-Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
-negative dentry's.
-
-This is because currently doing so presents a problem in the following scenario:
-
- - Lookup file "blAH" in a casefolded directory
- - Creation of file "BLAH" in a casefolded directory
- - Lookup file "blAH" in a casefolded directory
-
-This would fail if negative dentry's were cached.
-
-This is slightly suboptimal, but could be fixed in future with some vfs work.
-
-
-References
-----------
-
-(from Peter Anvin, on the list)
-
-It is worth noting that Microsoft has basically declared their
-"recommended" case folding (upcase) table to be permanently frozen (for
-new filesystem instances in the case where they use an on-disk
-translation table created at format time.) As far as I know they have
-never supported anything other than 1:1 conversion of BMP code points,
-nor normalization.
-
-The exFAT specification enumerates the full recommended upcase table,
-although in a somewhat annoying format (basically a hex dump of
-compressed data):
-
-https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst
deleted file mode 100644
index 2cccaa0ba7cd..000000000000
--- a/Documentation/filesystems/bcachefs/errorcodes.rst
+++ /dev/null
@@ -1,30 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-bcachefs private error codes
-----------------------------
-
-In bcachefs, as a hard rule we do not throw or directly use standard error
-codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
-in fs/bcachefs/errcode.h.
-
-This gives us much better error messages and makes debugging much easier. Any
-direct uses of standard error codes you see in the source code are simply old
-code that has yet to be converted - feel free to clean it up!
-
-Private error codes may subtype another error code, this allows for grouping of
-related errors that should be handled similarly (e.g. transaction restart
-errors), as well as specifying which standard error code should be returned at
-the bcachefs module boundary.
-
-At the module boundary, we use bch2_err_class() to convert to a standard error
-code; this also emits a trace event so that the original error code be
-recovered even if it wasn't logged.
-
-Do not reuse error codes! Generally speaking, a private error code should only
-be thrown in one place. That means that when we see it in a log message we can
-see, unambiguously, exactly which file and line number it was returned from.
-
-Try to give error codes names that are as reasonably descriptive of the error
-as possible. Frequently, the error will be logged at a place far removed from
-where the error was generated; good names for error codes mean much more
-descriptive and useful error messages.
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
deleted file mode 100644
index 59a332509dcd..000000000000
--- a/Documentation/filesystems/bcachefs/future/idle_work.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-Idle/background work classes design doc:
-
-Right now, our behaviour at idle isn't ideal, it was designed for servers that
-would be under sustained load, to keep pending work at a "medium" level, to
-let work build up so we can process it in more efficient batches, while also
-giving headroom for bursts in load.
-
-But for desktops or mobile - scenarios where work is less sustained and power
-usage is more important - we want to operate differently, with a "rush to
-idle" so the system can go to sleep. We don't want to be dribbling out
-background work while the system should be idle.
-
-The complicating factor is that there are a number of background tasks, which
-form a heirarchy (or a digraph, depending on how you divide it up) - one
-background task may generate work for another.
-
-Thus proper idle detection needs to model this heirarchy.
-
-- Foreground writes
-- Page cache writeback
-- Copygc, rebalance
-- Journal reclaim
-
-When we implement idle detection and rush to idle, we need to be careful not
-to disturb too much the existing behaviour that works reasonably well when the
-system is under sustained load (or perhaps improve it in the case of
-rebalance, which currently does not actively attempt to let work batch up).
-
-SUSTAINED LOAD REGIME
----------------------
-
-When the system is under continuous load, we want these jobs to run
-continuously - this is perhaps best modelled with a P/D controller, where
-they'll be trying to keep a target value (i.e. fragmented disk space,
-available journal space) roughly in the middle of some range.
-
-The goal under sustained load is to balance our ability to handle load spikes
-without running out of x resource (free disk space, free space in the
-journal), while also letting some work accumululate to be batched (or become
-unnecessary).
-
-For example, we don't want to run copygc too aggressively, because then it
-will be evacuating buckets that would have become empty (been overwritten or
-deleted) anyways, and we don't want to wait until we're almost out of free
-space because then the system will behave unpredicably - suddenly we're doing
-a lot more work to service each write and the system becomes much slower.
-
-IDLE REGIME
------------
-
-When the system becomes idle, we should start flushing our pending work
-quicker so the system can go to sleep.
-
-Note that the definition of "idle" depends on where in the heirarchy a task
-is - a task should start flushing work more quickly when the task above it has
-stopped generating new work.
-
-e.g. rebalance should start flushing more quickly when page cache writeback is
-idle, and journal reclaim should only start flushing more quickly when both
-copygc and rebalance are idle.
-
-It's important to let work accumulate when more work is still incoming and we
-still have room, because flushing is always more efficient if we let it batch
-up. New writes may overwrite data before rebalance moves it, and tasks may be
-generating more updates for the btree nodes that journal reclaim needs to flush.
-
-On idle, how much work we do at each interval should be proportional to the
-length of time we have been idle for. If we're idle only for a short duration,
-we shouldn't flush everything right away; the system might wake up and start
-generating new work soon, and flushing immediately might end up doing a lot of
-work that would have been unnecessary if we'd allowed things to batch more.
-
-To summarize, we will need:
-
- - A list of classes for background tasks that generate work, which will
- include one "foreground" class.
- - Tracking for each class - "Am I doing work, or have I gone to sleep?"
- - And each class should check the class above it when deciding how much work to issue.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
deleted file mode 100644
index e5c4c2120b93..000000000000
--- a/Documentation/filesystems/bcachefs/index.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-======================
-bcachefs Documentation
-======================
-
-Subsystem-specific development process notes
---------------------------------------------
-
-Development notes specific to bcachefs. These are intended to supplement
-:doc:`general kernel development handbook </process/index>`.
-
-.. toctree::
- :maxdepth: 1
- :numbered:
-
- CodingStyle
- SubmittingPatches
-
-Filesystem implementation
--------------------------
-
-Documentation for filesystem features and their implementation details.
-At this moment, only a few of these are described here.
-
-.. toctree::
- :maxdepth: 1
- :numbered:
-
- casefolding
- errorcodes
-
-Future design
--------------
-.. toctree::
- :maxdepth: 1
-
- future/idle_work
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 11a599387266..622187a96bdc 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -72,7 +72,6 @@ Documentation for filesystem implementations.
afs
autofs
autofs-mount-control
- bcachefs/index
befs
bfs
btrfs
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 85f590254f07..78c3d07c0c08 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -340,8 +340,8 @@ of those. Caller makes sure async writeback cannot be running for the inode whil
->drop_inode() returns int now; it's called on final iput() with
inode->i_lock held and it returns true if filesystems wants the inode to be
-dropped. As before, generic_drop_inode() is still the default and it's been
-updated appropriately. generic_delete_inode() is also alive and it consists
+dropped. As before, inode_generic_drop() is still the default and it's been
+updated appropriately. inode_just_drop() is also alive and it consists
simply of return 1. Note that all actual eviction work is done by caller after
->drop_inode() returns.
@@ -1285,3 +1285,15 @@ rather than a VMA, as the VMA at this stage is not yet valid.
The vm_area_desc provides the minimum required information for a filesystem
to initialise state upon memory mapping of a file-backed region, and output
parameters for the file system to set this state.
+
+---
+
+**mandatory**
+
+Several functions are renamed:
+
+- kern_path_locked -> start_removing_path
+- kern_path_create -> start_creating_path
+- user_path_create -> start_creating_user_path
+- user_path_locked_at -> start_removing_user_path_at
+- done_path_create -> end_creating_path
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b7235..b7e3147ba3d4 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2362,6 +2362,7 @@ The following mount options are supported:
hidepid= Set /proc/<pid>/ access mode.
gid= Set the group authorized to learn processes information.
subset= Show only the specified subset of procfs.
+ pidns= Specify a the namespace used by this procfs.
========= ========================================================
hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2394,6 +2395,13 @@ information about processes information, just add identd to this group.
subset=pid hides all top level files and directories in the procfs that
are not related to tasks.
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).
+
Chapter 5: Filesystem behavior
==============================
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index c7949dd44f2f..006d23af66e1 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
+ABMC (Assignable Bandwidth Monitoring Counters) ""
=============================================== ================================
Historically, new features were made visible by default in /proc/cpuinfo. This
@@ -256,6 +257,144 @@ with the following files:
# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x30;1=0x30;3=0x15;4=0x15
+"mbm_assign_mode":
+ The supported counter assignment modes. The enclosed brackets indicate which mode
+ is enabled. The MBM events associated with counters may reset when "mbm_assign_mode"
+ is changed.
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ [mbm_event]
+ default
+
+ "mbm_event":
+
+ mbm_event mode allows users to assign a hardware counter to an RMID, event
+ pair and monitor the bandwidth usage as long as it is assigned. The hardware
+ continues to track the assigned counter until it is explicitly unassigned by
+ the user. Each event within a resctrl group can be assigned independently.
+
+ In this mode, a monitoring event can only accumulate data while it is backed
+ by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON
+ group to specify which of the events should have a counter assigned. The number
+ of counters available is described in the "num_mbm_cntrs" file. Changing the
+ mode may cause all counters on the resource to reset.
+
+ Moving to mbm_event counter assignment mode requires users to assign the counters
+ to the events. Otherwise, the MBM event counters will return 'Unassigned' when read.
+
+ The mode is beneficial for AMD platforms that support more CTRL_MON
+ and MON groups than available hardware counters. By default, this
+ feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
+ Monitoring Counters) capability, ensuring counters remain assigned even
+ when the corresponding RMID is not actively used by any processor.
+
+ "default":
+
+ In default mode, resctrl assumes there is a hardware counter for each
+ event within every CTRL_MON and MON group. On AMD platforms, it is
+ recommended to use the mbm_event mode, if supported, to prevent reset of MBM
+ events between reads resulting from hardware re-allocating counters. This can
+ result in misleading values or display "Unavailable" if no counter is assigned
+ to the event.
+
+ * To enable "mbm_event" counter assignment mode:
+ ::
+
+ # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+ * To enable "default" monitoring mode:
+ ::
+
+ # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+
+"num_mbm_cntrs":
+ The maximum number of counters (total of available and assigned counters) in
+ each domain when the system supports mbm_event mode.
+
+ For example, on a system with maximum of 32 memory bandwidth monitoring
+ counters in each of its L3 domains:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+ 0=32;1=32
+
+"available_mbm_cntrs":
+ The number of counters available for assignment in each domain when mbm_event
+ mode is enabled on the system.
+
+ For example, on a system with 30 available [hardware] assignable counters
+ in each of its L3 domains:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+ 0=30;1=30
+
+"event_configs":
+ Directory that exists when "mbm_event" counter assignment mode is supported.
+ Contains a sub-directory for each MBM event that can be assigned to a counter.
+
+ Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes.
+ Each MBM event's sub-directory contains a file named "event_filter" that is
+ used to view and modify which memory transactions the MBM event is configured
+ with. The file is accessible only when "mbm_event" counter assignment mode is
+ enabled.
+
+ List of memory transaction types supported:
+
+ ========================== ========================================================
+ Name Description
+ ========================== ========================================================
+ dirty_victim_writes_all Dirty Victims from the QOS domain to all types of memory
+ remote_reads_slow_memory Reads to slow memory in the non-local NUMA domain
+ local_reads_slow_memory Reads to slow memory in the local NUMA domain
+ remote_non_temporal_writes Non-temporal writes to non-local NUMA domain
+ local_non_temporal_writes Non-temporal writes to local NUMA domain
+ remote_reads Reads to memory in the non-local NUMA domain
+ local_reads Reads to memory in the local NUMA domain
+ ========================== ========================================================
+
+ For example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
+ local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory
+
+ Modify the event configuration by writing to the "event_filter" file within
+ the "event_configs" directory. The read/write "event_filter" file contains the
+ configuration of the event that reflects which memory transactions are counted by it.
+
+ For example::
+
+ # echo "local_reads, local_non_temporal_writes" >
+ /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,local_non_temporal_writes
+
+"mbm_assign_on_mkdir":
+ Exists when "mbm_event" counter assignment mode is supported. Accessible
+ only when "mbm_event" counter assignment mode is enabled.
+
+ Determines if a counter will automatically be assigned to an RMID, MBM event
+ pair when its associated monitor group is created via mkdir. Enabled by default
+ on boot, also when switched from "default" mode to "mbm_event" counter assignment
+ mode. Users can disable this capability by writing to the interface.
+
+ "0":
+ Auto assignment is disabled.
+ "1":
+ Auto assignment is enabled.
+
+ Example::
+
+ # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
+ 0
+
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
@@ -380,10 +519,77 @@ When monitoring is enabled all MON groups will also contain:
for the L3 cache they occupy). These are named "mon_sub_L3_YY"
where "YY" is the node number.
+ When the 'mbm_event' counter assignment mode is enabled, reading
+ an MBM event of a MON group returns 'Unassigned' if no hardware
+ counter is assigned to it. For CTRL_MON groups, 'Unassigned' is
+ returned if the MBM event does not have an assigned counter in the
+ CTRL_MON group nor in any of its associated MON groups.
+
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
+When monitoring is enabled all MON groups may also contain:
+
+"mbm_L3_assignments":
+ Exists when "mbm_event" counter assignment mode is supported and lists the
+ counter assignment states of the group.
+
+ The assignment list is displayed in the following format:
+
+ <Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state>
+
+ Event: A valid MBM event in the
+ /sys/fs/resctrl/info/L3_MON/event_configs directory.
+
+ Domain ID: A valid domain ID. When writing, '*' applies the changes
+ to all the domains.
+
+ Assignment states:
+
+ _ : No counter assigned.
+
+ e : Counter assigned exclusively.
+
+ Example:
+
+ To display the counter assignment states for the default group.
+ ::
+
+ # cd /sys/fs/resctrl
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+ Assignments can be modified by writing to the interface.
+
+ Examples:
+
+ To unassign the counter associated with the mbm_total_bytes event on domain 0:
+ ::
+
+ # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=e
+ mbm_local_bytes:0=e;1=e
+
+ To unassign the counter associated with the mbm_total_bytes event on all the domains:
+ ::
+
+ # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=_
+ mbm_local_bytes:0=e;1=e
+
+ To assign a counter associated with the mbm_total_bytes event on all domains in
+ exclusive mode:
+ ::
+
+ # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
"mba_MBps_event":
@@ -1429,6 +1635,125 @@ View the llc occupancy snapshot::
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000
+
+Examples on working with mbm_assign_mode
+========================================
+
+a. Check if MBM counter assignment mode is supported.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ [mbm_event]
+ default
+
+The "mbm_event" mode is detected and enabled.
+
+b. Check how many assignable counters are supported.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
+ 0=32;1=32
+
+c. Check how many assignable counters are available for assignment in each domain.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
+ 0=30;1=30
+
+d. To list the default group's assign states.
+::
+
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+e. To unassign the counter associated with the mbm_total_bytes event on domain 0.
+::
+
+ # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=_;1=e
+ mbm_local_bytes:0=e;1=e
+
+f. To unassign the counter associated with the mbm_total_bytes event on all domains.
+::
+
+ # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignment
+ mbm_total_bytes:0=_;1=_
+ mbm_local_bytes:0=e;1=e
+
+g. To assign a counter associated with the mbm_total_bytes event on all domains in
+exclusive mode.
+::
+
+ # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
+ # cat /sys/fs/resctrl/mbm_L3_assignments
+ mbm_total_bytes:0=e;1=e
+ mbm_local_bytes:0=e;1=e
+
+h. Read the events mbm_total_bytes and mbm_local_bytes of the default group. There is
+no change in reading the events with the assignment.
+::
+
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
+ 779247936
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
+ 562324232
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ 212122123
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ 121212144
+
+i. Check the event configurations.
+::
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
+ local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
+ local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory
+
+j. Change the event configuration for mbm_local_bytes.
+::
+
+ # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" >
+ /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+
+ # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
+ local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads
+
+k. Now read the local events again. The first read may come back with "Unavailable"
+status. The subsequent read of mbm_local_bytes will display the current value.
+::
+
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ Unavailable
+ # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
+ 2252323
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ Unavailable
+ # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
+ 1566565
+
+l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be
+done using the following command. Note that switching the mbm_assign_mode may reset all
+the MBM counters (and thus all MBM events) of all the resctrl groups.
+::
+
+ # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
+ mbm_event
+ [default]
+
+m. Unmount the resctrl filesystem.
+::
+
+ # umount /sys/fs/resctrl/
+
Intel RDT Errata
================
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..4f13b01e42eb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -209,31 +209,8 @@ method fills in is the "s_op" field. This is a pointer to a "struct
super_operations" which describes the next level of the filesystem
implementation.
-Usually, a filesystem uses one of the generic mount() implementations
-and provides a fill_super() callback instead. The generic variants are:
-
-``mount_bdev``
- mount a filesystem residing on a block device
-
-``mount_nodev``
- mount a filesystem that is not backed by a device
-
-``mount_single``
- mount a filesystem which shares the instance between all mounts
-
-A fill_super() callback implementation has the following arguments:
-
-``struct super_block *sb``
- the superblock structure. The callback must initialize this
- properly.
-
-``void *data``
- arbitrary mount options, usually comes as an ASCII string (see
- "Mount Options" section)
-
-``int silent``
- whether or not to be silent on error
-
+For more information on mounting (and the new mount API), see
+Documentation/filesystems/mount_api.rst.
The Superblock Object
=====================
@@ -327,11 +304,11 @@ or bottom half).
inode->i_lock spinlock held.
This method should be either NULL (normal UNIX filesystem
- semantics) or "generic_delete_inode" (for filesystems that do
+ semantics) or "inode_just_drop" (for filesystems that do
not want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
- The "generic_delete_inode()" behavior is equivalent to the old
+ The "inode_just_drop()" behavior is equivalent to the old
practice of using "force_delete" in the put_inode() case, but
does not have the races that the "force_delete()" approach had.