diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/bcachefs/CodingStyle.rst | 186 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/SubmittingPatches.rst | 105 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/casefolding.rst | 108 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/errorcodes.rst | 30 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/future/idle_work.rst | 78 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/index.rst | 38 | ||||
-rw-r--r-- | Documentation/filesystems/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/porting.rst | 16 | ||||
-rw-r--r-- | Documentation/filesystems/proc.rst | 8 | ||||
-rw-r--r-- | Documentation/filesystems/resctrl.rst | 325 | ||||
-rw-r--r-- | Documentation/filesystems/vfs.rst | 31 |
11 files changed, 351 insertions, 575 deletions
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst deleted file mode 100644 index b29562a6bf55..000000000000 --- a/Documentation/filesystems/bcachefs/CodingStyle.rst +++ /dev/null @@ -1,186 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -bcachefs coding style -===================== - -Good development is like gardening, and codebases are our gardens. Tend to them -every day; look for little things that are out of place or in need of tidying. -A little weeding here and there goes a long way; don't wait until things have -spiraled out of control. - -Things don't always have to be perfect - nitpicking often does more harm than -good. But appreciate beauty when you see it - and let people know. - -The code that you are afraid to touch is the code most in need of refactoring. - -A little organizing here and there goes a long way. - -Put real thought into how you organize things. - -Good code is readable code, where the structure is simple and leaves nowhere -for bugs to hide. - -Assertions are one of our most important tools for writing reliable code. If in -the course of writing a patchset you encounter a condition that shouldn't -happen (and will have unpredictable or undefined behaviour if it does), or -you're not sure if it can happen and not sure how to handle it yet - make it a -BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase. - -By the time you finish the patchset, you should understand better which -assertions need to be handled and turned into checks with error paths, and -which should be logically impossible. Leave the BUG_ON()s in for the ones which -are logically impossible. (Or, make them debug mode assertions if they're -expensive - but don't turn everything into a debug mode assertion, so that -we're not stuck debugging undefined behaviour should it turn out that you were -wrong). - -Assertions are documentation that can't go out of date. Good assertions are -wonderful. - -Good assertions drastically and dramatically reduce the amount of testing -required to shake out bugs. - -Good assertions are based on state, not logic. To write good assertions, you -have to think about what the invariants on your state are. - -Good invariants and assertions will hold everywhere in your codebase. This -means that you can run them in only a few places in the checked in version, but -should you need to debug something that caused the assertion to fail, you can -quickly shotgun them everywhere to find the codepath that broke the invariant. - -A good assertion checks something that the compiler could check for us, and -elide - if we were working in a language with embedded correctness proofs that -the compiler could check. This is something that exists today, but it'll likely -still be a few decades before it comes to systems programming languages. But we -can still incorporate that kind of thinking into our code and document the -invariants with runtime checks - much like the way people working in -dynamically typed languages may add type annotations, gradually making their -code statically typed. - -Looking for ways to make your assertions simpler - and higher level - will -often nudge you towards making the entire system simpler and more robust. - -Good code is code where you can poke around and see what it's doing - -introspection. We can't debug anything if we can't see what's going on. - -Whenever we're debugging, and the solution isn't immediately obvious, if the -issue is that we don't know where the issue is because we can't see what's -going on - fix that first. - -We have the tools to make anything visible at runtime, efficiently - RCU and -percpu data structures among them. Don't let things stay hidden. - -The most important tool for introspection is the humble pretty printer - in -bcachefs, this means `*_to_text()` functions, which output to printbufs. - -Pretty printers are wonderful, because they compose and you can use them -everywhere. Having functions to print whatever object you're working with will -make your error messages much easier to write (therefore they will actually -exist) and much more informative. And they can be used from sysfs/debugfs, as -well as tracepoints. - -Runtime info and debugging tools should come with clear descriptions and -labels, and good structure - we don't want files with a list of bare integers, -like in procfs. Part of the job of the debugging tools is to educate users and -new developers as to how the system works. - -Error messages should, whenever possible, tell you everything you need to debug -the issue. It's worth putting effort into them. - -Tracepoints shouldn't be the first thing you reach for. They're an important -tool, but always look for more immediate ways to make things visible. When we -have to rely on tracing, we have to know which tracepoints we're looking for, -and then we have to run the troublesome workload, and then we have to sift -through logs. This is a lot of steps to go through when a user is hitting -something, and if it's intermittent it may not even be possible. - -The humble counter is an incredibly useful tool. They're cheap and simple to -use, and many complicated internal operations with lots of things that can -behave weirdly (anything involving memory reclaim, for example) become -shockingly easy to debug once you have counters on every distinct codepath. - -Persistent counters are even better. - -When debugging, try to get the most out of every bug you come across; don't -rush to fix the initial issue. Look for things that will make related bugs -easier the next time around - introspection, new assertions, better error -messages, new debug tools, and do those first. Look for ways to make the system -better behaved; often one bug will uncover several other bugs through -downstream effects. - -Fix all that first, and then the original bug last - even if that means keeping -a user waiting. They'll thank you in the long run, and when they understand -what you're doing you'll be amazed at how patient they're happy to be. Users -like to help - otherwise they wouldn't be reporting the bug in the first place. - -Talk to your users. Don't isolate yourself. - -Users notice all sorts of interesting things, and by just talking to them and -interacting with them you can benefit from their experience. - -Spend time doing support and helpdesk stuff. Don't just write code - code isn't -finished until it's being used trouble free. - -This will also motivate you to make your debugging tools as good as possible, -and perhaps even your documentation, too. Like anything else in life, the more -time you spend at it the better you'll get, and you the developer are the -person most able to improve the tools to make debugging quick and easy. - -Be wary of how you take on and commit to big projects. Don't let development -become product-manager focused. Often time an idea is a good one but needs to -wait for its proper time - but you won't know if it's the proper time for an -idea until you start writing code. - -Expect to throw a lot of things away, or leave them half finished for later. -Nobody writes all perfect code that all gets shipped, and you'll be much more -productive in the long run if you notice this early and shift to something -else. The experience gained and lessons learned will be valuable for all the -other work you do. - -But don't be afraid to tackle projects that require significant rework of -existing code. Sometimes these can be the best projects, because they can lead -us to make existing code more general, more flexible, more multipurpose and -perhaps more robust. Just don't hesitate to abandon the idea if it looks like -it's going to make a mess of things. - -Complicated features can often be done as a series of refactorings, with the -final change that actually implements the feature as a quite small patch at the -end. It's wonderful when this happens, especially when those refactorings are -things that improve the codebase in their own right. When that happens there's -much less risk of wasted effort if the feature you were going for doesn't work -out. - -Always strive to work incrementally. Always strive to turn the big projects -into little bite sized projects that can prove their own merits. - -Instead of always tackling those big projects, look for little things that -will be useful, and make the big projects easier. - -The question of what's likely to be useful is where junior developers most -often go astray - doing something because it seems like it'll be useful often -leads to overengineering. Knowing what's useful comes from many years of -experience, or talking with people who have that experience - or from simply -reading lots of code and looking for common patterns and issues. Don't be -afraid to throw things away and do something simpler. - -Talk about your ideas with your fellow developers; often times the best things -come from relaxed conversations where people aren't afraid to say "what if?". - -Don't neglect your tools. - -The most important tools (besides the compiler and our text editor) are the -tools we use for testing. The shortest possible edit/test/debug cycle is -essential for working productively. We learn, gain experience, and discover the -errors in our thinking by running our code and seeing what happens. If your -time is being wasted because your tools are bad or too slow - don't accept it, -fix it. - -Put effort into your documentation, commit messages, and code comments - but -don't go overboard. A good commit message is wonderful - but if the information -was important enough to go in a commit message, ask yourself if it would be -even better as a code comment. - -A good code comment is wonderful, but even better is the comment that didn't -need to exist because the code was so straightforward as to be obvious; -organized into small clean and tidy modules, with clear and descriptive names -for functions and variables, where every line of code has a clear purpose. diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst deleted file mode 100644 index 18c79d548391..000000000000 --- a/Documentation/filesystems/bcachefs/SubmittingPatches.rst +++ /dev/null @@ -1,105 +0,0 @@ -Submitting patches to bcachefs -============================== - -Here are suggestions for submitting patches to bcachefs subsystem. - -Submission checklist --------------------- - -Patches must be tested before being submitted, either with the xfstests suite -[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being -touched. Note that ktest wraps xfstests and will be an easier method to running -it for most users; it includes single-command wrappers for all the mainstream -in-kernel local filesystems. - -Patches will undergo more testing after being merged (including -lockdep/kasan/preempt/etc. variants), these are not generally required to be -run by the submitter - but do put some thought into what you're changing and -which tests might be relevant, e.g. are you dealing with tricky memory layout -work? kasan, are you doing locking work? then lockdep; and ktest includes -single-command variants for the debug build types you'll most likely need. - -The exception to this rule is incomplete WIP/RFC patches: if you're working on -something nontrivial, it's encouraged to send out a WIP patch to let people -know what you're doing and make sure you're on the right track. Just make sure -it includes a brief note as to what's done and what's incomplete, to avoid -confusion. - -Rigorous checkpatch.pl adherence is not required (many of its warnings are -considered out of date), but try not to deviate too much without reason. - -Focus on writing code that reads well and is organized well; code should be -aesthetically pleasing. - -CI --- - -Instead of running your tests locally, when running the full test suite it's -preferable to let a server farm do it in parallel, and then have the results -in a nice test dashboard (which can tell you which failures are new, and -presents results in a git log view, avoiding the need for most bisecting). - -That exists [2]_, and community members may request an account. If you work for -a big tech company, you'll need to help out with server costs to get access - -but the CI is not restricted to running bcachefs tests: it runs any ktest test -(which generally makes it easy to wrap other tests that can run in qemu). - -Other things to think about ---------------------------- - -- How will we debug this code? Is there sufficient introspection to diagnose - when something starts acting wonky on a user machine? - - We don't necessarily need every single field of every data structure visible - with introspection, but having the important fields of all the core data - types wired up makes debugging drastically easier - a bit of thoughtful - foresight greatly reduces the need to have people build custom kernels with - debug patches. - - More broadly, think about all the debug tooling that might be needed. - -- Does it make the codebase more or less of a mess? Can we also try to do some - organizing, too? - -- Do new tests need to be written? New assertions? How do we know and verify - that the code is correct, and what happens if something goes wrong? - - We don't yet have automated code coverage analysis or easy fault injection - - but for now, pretend we did and ask what they might tell us. - - Assertions are hugely important, given that we don't yet have a systems - language that can do ergonomic embedded correctness proofs. Hitting an assert - in testing is much better than wandering off into undefined behaviour la-la - land - use them. Use them judiciously, and not as a replacement for proper - error handling, but use them. - -- Does it need to be performance tested? Should we add new performance counters? - - bcachefs has a set of persistent runtime counters which can be viewed with - the 'bcachefs fs top' command; this should give users a basic idea of what - their filesystem is currently doing. If you're doing a new feature or looking - at old code, think if anything should be added. - -- If it's a new on disk format feature - have upgrades and downgrades been - tested? (Automated tests exists but aren't in the CI, due to the hassle of - disk image management; coordinate to have them run.) - -Mailing list, IRC ------------------ - -Patches should hit the list [3]_, but much discussion and code review happens -on IRC as well [4]_; many people appreciate the more conversational approach -and quicker feedback. - -Additionally, we have a lively user community doing excellent QA work, which -exists primarily on IRC. Please make use of that resource; user feedback is -important for any nontrivial feature, and documenting it in commit messages -would be a good idea. - -.. rubric:: References - -.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git -.. [1] https://evilpiepirate.org/git/ktest.git/ -.. [2] https://evilpiepirate.org/~testdashboard/ci/ -.. [3] linux-bcachefs@vger.kernel.org -.. [4] irc.oftc.net#bcache, #bcachefs-dev diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst deleted file mode 100644 index 871a38f557e8..000000000000 --- a/Documentation/filesystems/bcachefs/casefolding.rst +++ /dev/null @@ -1,108 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Casefolding -=========== - -bcachefs has support for case-insensitive file and directory -lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) -casefolding attributes. - -The main usecase for casefolding is compatibility with software written -against other filesystems that rely on casefolded lookups -(eg. NTFS and Wine/Proton). -Taking advantage of file-system level casefolding can lead to great -loading time gains in many applications and games. - -Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. -Once a directory has been flagged for casefolding, a feature bit -is enabled on the superblock which marks the filesystem as using -casefolding. -When the feature bit for casefolding is enabled, it is no longer possible -to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. - -On the lookup/query side: casefolding is implemented by allocating a new -string of `BCH_NAME_MAX` length using the `utf8_casefold` function to -casefold the query string. - -On the dirent side: casefolding is implemented by ensuring the `bkey`'s -hash is made from the casefolded string and storing the cached casefolded -name with the regular name in the dirent. - -The structure looks like this: - -* Regular: [dirent data][regular name][nul][nul]... -* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... - -(Do note, the number of NULs here is merely for illustration; their count can -vary per-key, and they may not even be present if the key is aligned to -`sizeof(u64)`.) - -This is efficient as it means that for all file lookups that require casefolding, -it has identical performance to a regular lookup: -a hash comparison and a `memcmp` of the name. - -Rationale ---------- - -Several designs were considered for this system: -One was to introduce a dirent_v2, however that would be painful especially as -the hash system only has support for a single key type. This would also need -`BCH_NAME_MAX` to change between versions, and a new feature bit. - -Another option was to store without the two lengths, and just take the length of -the regular name and casefolded name contiguously / 2 as the length. This would -assume that the regular length == casefolded length, but that could potentially -not be true, if the uppercase unicode glyph had a different UTF-8 encoding than -the lowercase unicode glyph. -It would be possible to disregard the casefold cache for those cases, but it was -decided to simply encode the two string lengths in the key to avoid random -performance issues if this edgecase was ever hit. - -The option settled on was to use a free-bit in d_type to mark a dirent as having -a casefold cache, and then treat the first 4 bytes the name block as lengths. -You can see this in the `d_cf_name_block` member of union in `bch_dirent`. - -The feature bit was used to allow casefolding support to be enabled for the majority -of users, but some allow users who have no need for the feature to still use bcachefs as -`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, -which may be decider between using bcachefs for eg. embedded platforms. - -Other filesystems like ext4 and f2fs have a super-block level option for casefolding -encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose -any encodings than a single UTF-8 version. When future encodings are desirable, -they will be added trivially using the opts mechanism. - -dentry/dcache considerations ----------------------------- - -Currently, in casefolded directories, bcachefs (like other filesystems) will not cache -negative dentry's. - -This is because currently doing so presents a problem in the following scenario: - - - Lookup file "blAH" in a casefolded directory - - Creation of file "BLAH" in a casefolded directory - - Lookup file "blAH" in a casefolded directory - -This would fail if negative dentry's were cached. - -This is slightly suboptimal, but could be fixed in future with some vfs work. - - -References ----------- - -(from Peter Anvin, on the list) - -It is worth noting that Microsoft has basically declared their -"recommended" case folding (upcase) table to be permanently frozen (for -new filesystem instances in the case where they use an on-disk -translation table created at format time.) As far as I know they have -never supported anything other than 1:1 conversion of BMP code points, -nor normalization. - -The exFAT specification enumerates the full recommended upcase table, -although in a somewhat annoying format (basically a hex dump of -compressed data): - -https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst deleted file mode 100644 index 2cccaa0ba7cd..000000000000 --- a/Documentation/filesystems/bcachefs/errorcodes.rst +++ /dev/null @@ -1,30 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -bcachefs private error codes ----------------------------- - -In bcachefs, as a hard rule we do not throw or directly use standard error -codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed -in fs/bcachefs/errcode.h. - -This gives us much better error messages and makes debugging much easier. Any -direct uses of standard error codes you see in the source code are simply old -code that has yet to be converted - feel free to clean it up! - -Private error codes may subtype another error code, this allows for grouping of -related errors that should be handled similarly (e.g. transaction restart -errors), as well as specifying which standard error code should be returned at -the bcachefs module boundary. - -At the module boundary, we use bch2_err_class() to convert to a standard error -code; this also emits a trace event so that the original error code be -recovered even if it wasn't logged. - -Do not reuse error codes! Generally speaking, a private error code should only -be thrown in one place. That means that when we see it in a log message we can -see, unambiguously, exactly which file and line number it was returned from. - -Try to give error codes names that are as reasonably descriptive of the error -as possible. Frequently, the error will be logged at a place far removed from -where the error was generated; good names for error codes mean much more -descriptive and useful error messages. diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst deleted file mode 100644 index 59a332509dcd..000000000000 --- a/Documentation/filesystems/bcachefs/future/idle_work.rst +++ /dev/null @@ -1,78 +0,0 @@ -Idle/background work classes design doc: - -Right now, our behaviour at idle isn't ideal, it was designed for servers that -would be under sustained load, to keep pending work at a "medium" level, to -let work build up so we can process it in more efficient batches, while also -giving headroom for bursts in load. - -But for desktops or mobile - scenarios where work is less sustained and power -usage is more important - we want to operate differently, with a "rush to -idle" so the system can go to sleep. We don't want to be dribbling out -background work while the system should be idle. - -The complicating factor is that there are a number of background tasks, which -form a heirarchy (or a digraph, depending on how you divide it up) - one -background task may generate work for another. - -Thus proper idle detection needs to model this heirarchy. - -- Foreground writes -- Page cache writeback -- Copygc, rebalance -- Journal reclaim - -When we implement idle detection and rush to idle, we need to be careful not -to disturb too much the existing behaviour that works reasonably well when the -system is under sustained load (or perhaps improve it in the case of -rebalance, which currently does not actively attempt to let work batch up). - -SUSTAINED LOAD REGIME ---------------------- - -When the system is under continuous load, we want these jobs to run -continuously - this is perhaps best modelled with a P/D controller, where -they'll be trying to keep a target value (i.e. fragmented disk space, -available journal space) roughly in the middle of some range. - -The goal under sustained load is to balance our ability to handle load spikes -without running out of x resource (free disk space, free space in the -journal), while also letting some work accumululate to be batched (or become -unnecessary). - -For example, we don't want to run copygc too aggressively, because then it -will be evacuating buckets that would have become empty (been overwritten or -deleted) anyways, and we don't want to wait until we're almost out of free -space because then the system will behave unpredicably - suddenly we're doing -a lot more work to service each write and the system becomes much slower. - -IDLE REGIME ------------ - -When the system becomes idle, we should start flushing our pending work -quicker so the system can go to sleep. - -Note that the definition of "idle" depends on where in the heirarchy a task -is - a task should start flushing work more quickly when the task above it has -stopped generating new work. - -e.g. rebalance should start flushing more quickly when page cache writeback is -idle, and journal reclaim should only start flushing more quickly when both -copygc and rebalance are idle. - -It's important to let work accumulate when more work is still incoming and we -still have room, because flushing is always more efficient if we let it batch -up. New writes may overwrite data before rebalance moves it, and tasks may be -generating more updates for the btree nodes that journal reclaim needs to flush. - -On idle, how much work we do at each interval should be proportional to the -length of time we have been idle for. If we're idle only for a short duration, -we shouldn't flush everything right away; the system might wake up and start -generating new work soon, and flushing immediately might end up doing a lot of -work that would have been unnecessary if we'd allowed things to batch more. - -To summarize, we will need: - - - A list of classes for background tasks that generate work, which will - include one "foreground" class. - - Tracking for each class - "Am I doing work, or have I gone to sleep?" - - And each class should check the class above it when deciding how much work to issue. diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst deleted file mode 100644 index e5c4c2120b93..000000000000 --- a/Documentation/filesystems/bcachefs/index.rst +++ /dev/null @@ -1,38 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -====================== -bcachefs Documentation -====================== - -Subsystem-specific development process notes --------------------------------------------- - -Development notes specific to bcachefs. These are intended to supplement -:doc:`general kernel development handbook </process/index>`. - -.. toctree:: - :maxdepth: 1 - :numbered: - - CodingStyle - SubmittingPatches - -Filesystem implementation -------------------------- - -Documentation for filesystem features and their implementation details. -At this moment, only a few of these are described here. - -.. toctree:: - :maxdepth: 1 - :numbered: - - casefolding - errorcodes - -Future design -------------- -.. toctree:: - :maxdepth: 1 - - future/idle_work diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 11a599387266..622187a96bdc 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -72,7 +72,6 @@ Documentation for filesystem implementations. afs autofs autofs-mount-control - bcachefs/index befs bfs btrfs diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 85f590254f07..78c3d07c0c08 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -340,8 +340,8 @@ of those. Caller makes sure async writeback cannot be running for the inode whil ->drop_inode() returns int now; it's called on final iput() with inode->i_lock held and it returns true if filesystems wants the inode to be -dropped. As before, generic_drop_inode() is still the default and it's been -updated appropriately. generic_delete_inode() is also alive and it consists +dropped. As before, inode_generic_drop() is still the default and it's been +updated appropriately. inode_just_drop() is also alive and it consists simply of return 1. Note that all actual eviction work is done by caller after ->drop_inode() returns. @@ -1285,3 +1285,15 @@ rather than a VMA, as the VMA at this stage is not yet valid. The vm_area_desc provides the minimum required information for a filesystem to initialise state upon memory mapping of a file-backed region, and output parameters for the file system to set this state. + +--- + +**mandatory** + +Several functions are renamed: + +- kern_path_locked -> start_removing_path +- kern_path_create -> start_creating_path +- user_path_create -> start_creating_user_path +- user_path_locked_at -> start_removing_user_path_at +- done_path_create -> end_creating_path diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2971551b7235..b7e3147ba3d4 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -2362,6 +2362,7 @@ The following mount options are supported: hidepid= Set /proc/<pid>/ access mode. gid= Set the group authorized to learn processes information. subset= Show only the specified subset of procfs. + pidns= Specify a the namespace used by this procfs. ========= ======================================================== hidepid=off or hidepid=0 means classic mode - everybody may access all @@ -2394,6 +2395,13 @@ information about processes information, just add identd to this group. subset=pid hides all top level files and directories in the procfs that are not related to tasks. +pidns= specifies a pid namespace (either as a string path to something like +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that +will be used by the procfs instance when translating pids. By default, procfs +will use the calling process's active pid namespace. Note that the pid +namespace of an existing procfs instance cannot be modified (attempting to do +so will give an `-EBUSY` error). + Chapter 5: Filesystem behavior ============================== diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst index c7949dd44f2f..006d23af66e1 100644 --- a/Documentation/filesystems/resctrl.rst +++ b/Documentation/filesystems/resctrl.rst @@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" MBA (Memory Bandwidth Allocation) "mba" SMBA (Slow Memory Bandwidth Allocation) "" BMEC (Bandwidth Monitoring Event Configuration) "" +ABMC (Assignable Bandwidth Monitoring Counters) "" =============================================== ================================ Historically, new features were made visible by default in /proc/cpuinfo. This @@ -256,6 +257,144 @@ with the following files: # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 0=0x30;1=0x30;3=0x15;4=0x15 +"mbm_assign_mode": + The supported counter assignment modes. The enclosed brackets indicate which mode + is enabled. The MBM events associated with counters may reset when "mbm_assign_mode" + is changed. + :: + + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + [mbm_event] + default + + "mbm_event": + + mbm_event mode allows users to assign a hardware counter to an RMID, event + pair and monitor the bandwidth usage as long as it is assigned. The hardware + continues to track the assigned counter until it is explicitly unassigned by + the user. Each event within a resctrl group can be assigned independently. + + In this mode, a monitoring event can only accumulate data while it is backed + by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON + group to specify which of the events should have a counter assigned. The number + of counters available is described in the "num_mbm_cntrs" file. Changing the + mode may cause all counters on the resource to reset. + + Moving to mbm_event counter assignment mode requires users to assign the counters + to the events. Otherwise, the MBM event counters will return 'Unassigned' when read. + + The mode is beneficial for AMD platforms that support more CTRL_MON + and MON groups than available hardware counters. By default, this + feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth + Monitoring Counters) capability, ensuring counters remain assigned even + when the corresponding RMID is not actively used by any processor. + + "default": + + In default mode, resctrl assumes there is a hardware counter for each + event within every CTRL_MON and MON group. On AMD platforms, it is + recommended to use the mbm_event mode, if supported, to prevent reset of MBM + events between reads resulting from hardware re-allocating counters. This can + result in misleading values or display "Unavailable" if no counter is assigned + to the event. + + * To enable "mbm_event" counter assignment mode: + :: + + # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + + * To enable "default" monitoring mode: + :: + + # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + +"num_mbm_cntrs": + The maximum number of counters (total of available and assigned counters) in + each domain when the system supports mbm_event mode. + + For example, on a system with maximum of 32 memory bandwidth monitoring + counters in each of its L3 domains: + :: + + # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs + 0=32;1=32 + +"available_mbm_cntrs": + The number of counters available for assignment in each domain when mbm_event + mode is enabled on the system. + + For example, on a system with 30 available [hardware] assignable counters + in each of its L3 domains: + :: + + # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs + 0=30;1=30 + +"event_configs": + Directory that exists when "mbm_event" counter assignment mode is supported. + Contains a sub-directory for each MBM event that can be assigned to a counter. + + Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes. + Each MBM event's sub-directory contains a file named "event_filter" that is + used to view and modify which memory transactions the MBM event is configured + with. The file is accessible only when "mbm_event" counter assignment mode is + enabled. + + List of memory transaction types supported: + + ========================== ======================================================== + Name Description + ========================== ======================================================== + dirty_victim_writes_all Dirty Victims from the QOS domain to all types of memory + remote_reads_slow_memory Reads to slow memory in the non-local NUMA domain + local_reads_slow_memory Reads to slow memory in the local NUMA domain + remote_non_temporal_writes Non-temporal writes to non-local NUMA domain + local_non_temporal_writes Non-temporal writes to local NUMA domain + remote_reads Reads to memory in the non-local NUMA domain + local_reads Reads to memory in the local NUMA domain + ========================== ======================================================== + + For example:: + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes, + local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + local_reads,local_non_temporal_writes,local_reads_slow_memory + + Modify the event configuration by writing to the "event_filter" file within + the "event_configs" directory. The read/write "event_filter" file contains the + configuration of the event that reflects which memory transactions are counted by it. + + For example:: + + # echo "local_reads, local_non_temporal_writes" > + /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + local_reads,local_non_temporal_writes + +"mbm_assign_on_mkdir": + Exists when "mbm_event" counter assignment mode is supported. Accessible + only when "mbm_event" counter assignment mode is enabled. + + Determines if a counter will automatically be assigned to an RMID, MBM event + pair when its associated monitor group is created via mkdir. Enabled by default + on boot, also when switched from "default" mode to "mbm_event" counter assignment + mode. Users can disable this capability by writing to the interface. + + "0": + Auto assignment is disabled. + "1": + Auto assignment is enabled. + + Example:: + + # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir + 0 + "max_threshold_occupancy": Read/write file provides the largest value (in bytes) at which a previously used LLC_occupancy @@ -380,10 +519,77 @@ When monitoring is enabled all MON groups will also contain: for the L3 cache they occupy). These are named "mon_sub_L3_YY" where "YY" is the node number. + When the 'mbm_event' counter assignment mode is enabled, reading + an MBM event of a MON group returns 'Unassigned' if no hardware + counter is assigned to it. For CTRL_MON groups, 'Unassigned' is + returned if the MBM event does not have an assigned counter in the + CTRL_MON group nor in any of its associated MON groups. + "mon_hw_id": Available only with debug option. The identifier used by hardware for the monitor group. On x86 this is the RMID. +When monitoring is enabled all MON groups may also contain: + +"mbm_L3_assignments": + Exists when "mbm_event" counter assignment mode is supported and lists the + counter assignment states of the group. + + The assignment list is displayed in the following format: + + <Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state> + + Event: A valid MBM event in the + /sys/fs/resctrl/info/L3_MON/event_configs directory. + + Domain ID: A valid domain ID. When writing, '*' applies the changes + to all the domains. + + Assignment states: + + _ : No counter assigned. + + e : Counter assigned exclusively. + + Example: + + To display the counter assignment states for the default group. + :: + + # cd /sys/fs/resctrl + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + + Assignments can be modified by writing to the interface. + + Examples: + + To unassign the counter associated with the mbm_total_bytes event on domain 0: + :: + + # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=e + mbm_local_bytes:0=e;1=e + + To unassign the counter associated with the mbm_total_bytes event on all the domains: + :: + + # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=_ + mbm_local_bytes:0=e;1=e + + To assign a counter associated with the mbm_total_bytes event on all domains in + exclusive mode: + :: + + # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + When the "mba_MBps" mount option is used all CTRL_MON groups will also contain: "mba_MBps_event": @@ -1429,6 +1635,125 @@ View the llc occupancy snapshot:: # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy 11234000 + +Examples on working with mbm_assign_mode +======================================== + +a. Check if MBM counter assignment mode is supported. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + [mbm_event] + default + +The "mbm_event" mode is detected and enabled. + +b. Check how many assignable counters are supported. +:: + + # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs + 0=32;1=32 + +c. Check how many assignable counters are available for assignment in each domain. +:: + + # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs + 0=30;1=30 + +d. To list the default group's assign states. +:: + + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + +e. To unassign the counter associated with the mbm_total_bytes event on domain 0. +:: + + # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=e + mbm_local_bytes:0=e;1=e + +f. To unassign the counter associated with the mbm_total_bytes event on all domains. +:: + + # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignment + mbm_total_bytes:0=_;1=_ + mbm_local_bytes:0=e;1=e + +g. To assign a counter associated with the mbm_total_bytes event on all domains in +exclusive mode. +:: + + # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + +h. Read the events mbm_total_bytes and mbm_local_bytes of the default group. There is +no change in reading the events with the assignment. +:: + + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes + 779247936 + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes + 562324232 + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes + 212122123 + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes + 121212144 + +i. Check the event configurations. +:: + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes, + local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + local_reads,local_non_temporal_writes,local_reads_slow_memory + +j. Change the event configuration for mbm_local_bytes. +:: + + # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" > + /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads + +k. Now read the local events again. The first read may come back with "Unavailable" +status. The subsequent read of mbm_local_bytes will display the current value. +:: + + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes + Unavailable + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes + 2252323 + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes + Unavailable + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes + 1566565 + +l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be +done using the following command. Note that switching the mbm_assign_mode may reset all +the MBM counters (and thus all MBM events) of all the resctrl groups. +:: + + # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + mbm_event + [default] + +m. Unmount the resctrl filesystem. +:: + + # umount /sys/fs/resctrl/ + Intel RDT Errata ================ diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 486a91633474..4f13b01e42eb 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -209,31 +209,8 @@ method fills in is the "s_op" field. This is a pointer to a "struct super_operations" which describes the next level of the filesystem implementation. -Usually, a filesystem uses one of the generic mount() implementations -and provides a fill_super() callback instead. The generic variants are: - -``mount_bdev`` - mount a filesystem residing on a block device - -``mount_nodev`` - mount a filesystem that is not backed by a device - -``mount_single`` - mount a filesystem which shares the instance between all mounts - -A fill_super() callback implementation has the following arguments: - -``struct super_block *sb`` - the superblock structure. The callback must initialize this - properly. - -``void *data`` - arbitrary mount options, usually comes as an ASCII string (see - "Mount Options" section) - -``int silent`` - whether or not to be silent on error - +For more information on mounting (and the new mount API), see +Documentation/filesystems/mount_api.rst. The Superblock Object ===================== @@ -327,11 +304,11 @@ or bottom half). inode->i_lock spinlock held. This method should be either NULL (normal UNIX filesystem - semantics) or "generic_delete_inode" (for filesystems that do + semantics) or "inode_just_drop" (for filesystems that do not want to cache inodes - causing "delete_inode" to always be called regardless of the value of i_nlink) - The "generic_delete_inode()" behavior is equivalent to the old + The "inode_just_drop()" behavior is equivalent to the old practice of using "force_delete" in the put_inode() case, but does not have the races that the "force_delete()" approach had. |