Diffstat (limited to 'Documentation/filesystems')
20 files changed, 501 insertions, 632 deletions
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
deleted file mode 100644
index b29562a6bf55..000000000000
--- a/Documentation/filesystems/bcachefs/CodingStyle.rst
+++ /dev/null
@@ -1,186 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-bcachefs coding style
-=====================
-
-Good development is like gardening, and codebases are our gardens. Tend to them
-every day; look for little things that are out of place or in need of tidying.
-A little weeding here and there goes a long way; don't wait until things have
-spiraled out of control.
-
-Things don't always have to be perfect - nitpicking often does more harm than
-good. But appreciate beauty when you see it - and let people know.
-
-The code that you are afraid to touch is the code most in need of refactoring.
-
-A little organizing here and there goes a long way.
-
-Put real thought into how you organize things.
-
-Good code is readable code, where the structure is simple and leaves nowhere
-for bugs to hide.
-
-Assertions are one of our most important tools for writing reliable code. If in
-the course of writing a patchset you encounter a condition that shouldn't
-happen (and will have unpredictable or undefined behaviour if it does), or
-you're not sure if it can happen and not sure how to handle it yet - make it a
-BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
-
-By the time you finish the patchset, you should understand better which
-assertions need to be handled and turned into checks with error paths, and
-which should be logically impossible. Leave the BUG_ON()s in for the ones which
-are logically impossible. (Or, make them debug mode assertions if they're
-expensive - but don't turn everything into a debug mode assertion, so that
-we're not stuck debugging undefined behaviour should it turn out that you were
-wrong).
-
-Assertions are documentation that can't go out of date.
Good assertions are
-wonderful.
-
-Good assertions drastically and dramatically reduce the amount of testing
-required to shake out bugs.
-
-Good assertions are based on state, not logic. To write good assertions, you
-have to think about what the invariants on your state are.
-
-Good invariants and assertions will hold everywhere in your codebase. This
-means that you can run them in only a few places in the checked in version, but
-should you need to debug something that caused the assertion to fail, you can
-quickly shotgun them everywhere to find the codepath that broke the invariant.
-
-A good assertion checks something that the compiler could check for us, and
-elide - if we were working in a language with embedded correctness proofs that
-the compiler could check. This is something that exists today, but it'll likely
-still be a few decades before it comes to systems programming languages. But we
-can still incorporate that kind of thinking into our code and document the
-invariants with runtime checks - much like the way people working in
-dynamically typed languages may add type annotations, gradually making their
-code statically typed.
-
-Looking for ways to make your assertions simpler - and higher level - will
-often nudge you towards making the entire system simpler and more robust.
-
-Good code is code where you can poke around and see what it's doing -
-introspection. We can't debug anything if we can't see what's going on.
-
-Whenever we're debugging, and the solution isn't immediately obvious, if the
-issue is that we don't know where the issue is because we can't see what's
-going on - fix that first.
-
-We have the tools to make anything visible at runtime, efficiently - RCU and
-percpu data structures among them. Don't let things stay hidden.
-
-The most important tool for introspection is the humble pretty printer - in
-bcachefs, this means `*_to_text()` functions, which output to printbufs.
-
-Pretty printers are wonderful, because they compose and you can use them
-everywhere. Having functions to print whatever object you're working with will
-make your error messages much easier to write (therefore they will actually
-exist) and much more informative. And they can be used from sysfs/debugfs, as
-well as tracepoints.
-
-Runtime info and debugging tools should come with clear descriptions and
-labels, and good structure - we don't want files with a list of bare integers,
-like in procfs. Part of the job of the debugging tools is to educate users and
-new developers as to how the system works.
-
-Error messages should, whenever possible, tell you everything you need to debug
-the issue. It's worth putting effort into them.
-
-Tracepoints shouldn't be the first thing you reach for. They're an important
-tool, but always look for more immediate ways to make things visible. When we
-have to rely on tracing, we have to know which tracepoints we're looking for,
-and then we have to run the troublesome workload, and then we have to sift
-through logs. This is a lot of steps to go through when a user is hitting
-something, and if it's intermittent it may not even be possible.
-
-The humble counter is an incredibly useful tool. They're cheap and simple to
-use, and many complicated internal operations with lots of things that can
-behave weirdly (anything involving memory reclaim, for example) become
-shockingly easy to debug once you have counters on every distinct codepath.
-
-Persistent counters are even better.
-
-When debugging, try to get the most out of every bug you come across; don't
-rush to fix the initial issue. Look for things that will make related bugs
-easier the next time around - introspection, new assertions, better error
-messages, new debug tools, and do those first. Look for ways to make the system
-better behaved; often one bug will uncover several other bugs through
-downstream effects.
-
-Fix all that first, and then the original bug last - even if that means keeping
-a user waiting. They'll thank you in the long run, and when they understand
-what you're doing you'll be amazed at how patient they're happy to be. Users
-like to help - otherwise they wouldn't be reporting the bug in the first place.
-
-Talk to your users. Don't isolate yourself.
-
-Users notice all sorts of interesting things, and by just talking to them and
-interacting with them you can benefit from their experience.
-
-Spend time doing support and helpdesk stuff. Don't just write code - code isn't
-finished until it's being used trouble free.
-
-This will also motivate you to make your debugging tools as good as possible,
-and perhaps even your documentation, too. Like anything else in life, the more
-time you spend at it the better you'll get, and you the developer are the
-person most able to improve the tools to make debugging quick and easy.
-
-Be wary of how you take on and commit to big projects. Don't let development
-become product-manager focused. Often time an idea is a good one but needs to
-wait for its proper time - but you won't know if it's the proper time for an
-idea until you start writing code.
-
-Expect to throw a lot of things away, or leave them half finished for later.
-Nobody writes all perfect code that all gets shipped, and you'll be much more
-productive in the long run if you notice this early and shift to something
-else. The experience gained and lessons learned will be valuable for all the
-other work you do.
-
-But don't be afraid to tackle projects that require significant rework of
-existing code. Sometimes these can be the best projects, because they can lead
-us to make existing code more general, more flexible, more multipurpose and
-perhaps more robust. Just don't hesitate to abandon the idea if it looks like
-it's going to make a mess of things.
-
-Complicated features can often be done as a series of refactorings, with the
-final change that actually implements the feature as a quite small patch at the
-end. It's wonderful when this happens, especially when those refactorings are
-things that improve the codebase in their own right. When that happens there's
-much less risk of wasted effort if the feature you were going for doesn't work
-out.
-
-Always strive to work incrementally. Always strive to turn the big projects
-into little bite sized projects that can prove their own merits.
-
-Instead of always tackling those big projects, look for little things that
-will be useful, and make the big projects easier.
-
-The question of what's likely to be useful is where junior developers most
-often go astray - doing something because it seems like it'll be useful often
-leads to overengineering. Knowing what's useful comes from many years of
-experience, or talking with people who have that experience - or from simply
-reading lots of code and looking for common patterns and issues. Don't be
-afraid to throw things away and do something simpler.
-
-Talk about your ideas with your fellow developers; often times the best things
-come from relaxed conversations where people aren't afraid to say "what if?".
-
-Don't neglect your tools.
-
-The most important tools (besides the compiler and our text editor) are the
-tools we use for testing. The shortest possible edit/test/debug cycle is
-essential for working productively. We learn, gain experience, and discover the
-errors in our thinking by running our code and seeing what happens. If your
-time is being wasted because your tools are bad or too slow - don't accept it,
-fix it.
-
-Put effort into your documentation, commit messages, and code comments - but
-don't go overboard. A good commit message is wonderful - but if the information
-was important enough to go in a commit message, ask yourself if it would be
-even better as a code comment.
-
-A good code comment is wonderful, but even better is the comment that didn't
-need to exist because the code was so straightforward as to be obvious;
-organized into small clean and tidy modules, with clear and descriptive names
-for functions and variables, where every line of code has a clear purpose.
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
deleted file mode 100644
index 18c79d548391..000000000000
--- a/Documentation/filesystems/bcachefs/SubmittingPatches.rst
+++ /dev/null
@@ -1,105 +0,0 @@
-Submitting patches to bcachefs
-==============================
-
-Here are suggestions for submitting patches to bcachefs subsystem.
-
-Submission checklist
---------------------
-
-Patches must be tested before being submitted, either with the xfstests suite
-[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
-touched. Note that ktest wraps xfstests and will be an easier method to running
-it for most users; it includes single-command wrappers for all the mainstream
-in-kernel local filesystems.
-
-Patches will undergo more testing after being merged (including
-lockdep/kasan/preempt/etc. variants), these are not generally required to be
-run by the submitter - but do put some thought into what you're changing and
-which tests might be relevant, e.g. are you dealing with tricky memory layout
-work? kasan, are you doing locking work? then lockdep; and ktest includes
-single-command variants for the debug build types you'll most likely need.
-
-The exception to this rule is incomplete WIP/RFC patches: if you're working on
-something nontrivial, it's encouraged to send out a WIP patch to let people
-know what you're doing and make sure you're on the right track. Just make sure
-it includes a brief note as to what's done and what's incomplete, to avoid
-confusion.
-
-Rigorous checkpatch.pl adherence is not required (many of its warnings are
-considered out of date), but try not to deviate too much without reason.
-
-Focus on writing code that reads well and is organized well; code should be
-aesthetically pleasing.
-
-CI
---
-
-Instead of running your tests locally, when running the full test suite it's
-preferable to let a server farm do it in parallel, and then have the results
-in a nice test dashboard (which can tell you which failures are new, and
-presents results in a git log view, avoiding the need for most bisecting).
-
-That exists [2]_, and community members may request an account. If you work for
-a big tech company, you'll need to help out with server costs to get access -
-but the CI is not restricted to running bcachefs tests: it runs any ktest test
-(which generally makes it easy to wrap other tests that can run in qemu).
-
-Other things to think about
----------------------------
-
-- How will we debug this code? Is there sufficient introspection to diagnose
-  when something starts acting wonky on a user machine?
-
-  We don't necessarily need every single field of every data structure visible
-  with introspection, but having the important fields of all the core data
-  types wired up makes debugging drastically easier - a bit of thoughtful
-  foresight greatly reduces the need to have people build custom kernels with
-  debug patches.
-
-  More broadly, think about all the debug tooling that might be needed.
-
-- Does it make the codebase more or less of a mess? Can we also try to do some
-  organizing, too?
-
-- Do new tests need to be written? New assertions? How do we know and verify
-  that the code is correct, and what happens if something goes wrong?
-
-  We don't yet have automated code coverage analysis or easy fault injection -
-  but for now, pretend we did and ask what they might tell us.
-
-  Assertions are hugely important, given that we don't yet have a systems
-  language that can do ergonomic embedded correctness proofs. Hitting an assert
-  in testing is much better than wandering off into undefined behaviour la-la
-  land - use them. Use them judiciously, and not as a replacement for proper
-  error handling, but use them.
-
-- Does it need to be performance tested? Should we add new performance counters?
-
-  bcachefs has a set of persistent runtime counters which can be viewed with
-  the 'bcachefs fs top' command; this should give users a basic idea of what
-  their filesystem is currently doing. If you're doing a new feature or looking
-  at old code, think if anything should be added.
-
-- If it's a new on disk format feature - have upgrades and downgrades been
-  tested? (Automated tests exists but aren't in the CI, due to the hassle of
-  disk image management; coordinate to have them run.)
-
-Mailing list, IRC
------------------
-
-Patches should hit the list [3]_, but much discussion and code review happens
-on IRC as well [4]_; many people appreciate the more conversational approach
-and quicker feedback.
-
-Additionally, we have a lively user community doing excellent QA work, which
-exists primarily on IRC. Please make use of that resource; user feedback is
-important for any nontrivial feature, and documenting it in commit messages
-would be a good idea.
-
-.. rubric:: References
-
-.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-.. [1] https://evilpiepirate.org/git/ktest.git/
-.. [2] https://evilpiepirate.org/~testdashboard/ci/
-.. [3] linux-bcachefs@vger.kernel.org
-.. [4] irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
deleted file mode 100644
index 871a38f557e8..000000000000
--- a/Documentation/filesystems/bcachefs/casefolding.rst
+++ /dev/null
@@ -1,108 +0,0 @@
-..
SPDX-License-Identifier: GPL-2.0
-
-Casefolding
-===========
-
-bcachefs has support for case-insensitive file and directory
-lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
-casefolding attributes.
-
-The main usecase for casefolding is compatibility with software written
-against other filesystems that rely on casefolded lookups
-(eg. NTFS and Wine/Proton).
-Taking advantage of file-system level casefolding can lead to great
-loading time gains in many applications and games.
-
-Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
-Once a directory has been flagged for casefolding, a feature bit
-is enabled on the superblock which marks the filesystem as using
-casefolding.
-When the feature bit for casefolding is enabled, it is no longer possible
-to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
-
-On the lookup/query side: casefolding is implemented by allocating a new
-string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
-casefold the query string.
-
-On the dirent side: casefolding is implemented by ensuring the `bkey`'s
-hash is made from the casefolded string and storing the cached casefolded
-name with the regular name in the dirent.
-
-The structure looks like this:
-
-* Regular: [dirent data][regular name][nul][nul]...
-* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
-
-(Do note, the number of NULs here is merely for illustration; their count can
-vary per-key, and they may not even be present if the key is aligned to
-`sizeof(u64)`.)
-
-This is efficient as it means that for all file lookups that require casefolding,
-it has identical performance to a regular lookup:
-a hash comparison and a `memcmp` of the name.
-
-Rationale
----------
-
-Several designs were considered for this system:
-One was to introduce a dirent_v2, however that would be painful especially as
-the hash system only has support for a single key type.
This would also need
-`BCH_NAME_MAX` to change between versions, and a new feature bit.
-
-Another option was to store without the two lengths, and just take the length of
-the regular name and casefolded name contiguously / 2 as the length. This would
-assume that the regular length == casefolded length, but that could potentially
-not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
-the lowercase unicode glyph.
-It would be possible to disregard the casefold cache for those cases, but it was
-decided to simply encode the two string lengths in the key to avoid random
-performance issues if this edgecase was ever hit.
-
-The option settled on was to use a free-bit in d_type to mark a dirent as having
-a casefold cache, and then treat the first 4 bytes the name block as lengths.
-You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
-
-The feature bit was used to allow casefolding support to be enabled for the majority
-of users, but some allow users who have no need for the feature to still use bcachefs as
-`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
-which may be decider between using bcachefs for eg. embedded platforms.
-
-Other filesystems like ext4 and f2fs have a super-block level option for casefolding
-encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
-any encodings than a single UTF-8 version. When future encodings are desirable,
-they will be added trivially using the opts mechanism.
-
-dentry/dcache considerations
-----------------------------
-
-Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
-negative dentry's.
-
-This is because currently doing so presents a problem in the following scenario:
-
-  - Lookup file "blAH" in a casefolded directory
-  - Creation of file "BLAH" in a casefolded directory
-  - Lookup file "blAH" in a casefolded directory
-
-This would fail if negative dentry's were cached.
-
-This is slightly suboptimal, but could be fixed in future with some vfs work.
-
-
-References
-----------
-
-(from Peter Anvin, on the list)
-
-It is worth noting that Microsoft has basically declared their
-"recommended" case folding (upcase) table to be permanently frozen (for
-new filesystem instances in the case where they use an on-disk
-translation table created at format time.) As far as I know they have
-never supported anything other than 1:1 conversion of BMP code points,
-nor normalization.
-
-The exFAT specification enumerates the full recommended upcase table,
-although in a somewhat annoying format (basically a hex dump of
-compressed data):
-
-https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst
deleted file mode 100644
index 2cccaa0ba7cd..000000000000
--- a/Documentation/filesystems/bcachefs/errorcodes.rst
+++ /dev/null
@@ -1,30 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-bcachefs private error codes
-----------------------------
-
-In bcachefs, as a hard rule we do not throw or directly use standard error
-codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
-in fs/bcachefs/errcode.h.
-
-This gives us much better error messages and makes debugging much easier. Any
-direct uses of standard error codes you see in the source code are simply old
-code that has yet to be converted - feel free to clean it up!
-
-Private error codes may subtype another error code, this allows for grouping of
-related errors that should be handled similarly (e.g.
transaction restart
-errors), as well as specifying which standard error code should be returned at
-the bcachefs module boundary.
-
-At the module boundary, we use bch2_err_class() to convert to a standard error
-code; this also emits a trace event so that the original error code be
-recovered even if it wasn't logged.
-
-Do not reuse error codes! Generally speaking, a private error code should only
-be thrown in one place. That means that when we see it in a log message we can
-see, unambiguously, exactly which file and line number it was returned from.
-
-Try to give error codes names that are as reasonably descriptive of the error
-as possible. Frequently, the error will be logged at a place far removed from
-where the error was generated; good names for error codes mean much more
-descriptive and useful error messages.
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
deleted file mode 100644
index 59a332509dcd..000000000000
--- a/Documentation/filesystems/bcachefs/future/idle_work.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-Idle/background work classes design doc:
-
-Right now, our behaviour at idle isn't ideal, it was designed for servers that
-would be under sustained load, to keep pending work at a "medium" level, to
-let work build up so we can process it in more efficient batches, while also
-giving headroom for bursts in load.
-
-But for desktops or mobile - scenarios where work is less sustained and power
-usage is more important - we want to operate differently, with a "rush to
-idle" so the system can go to sleep. We don't want to be dribbling out
-background work while the system should be idle.
-
-The complicating factor is that there are a number of background tasks, which
-form a heirarchy (or a digraph, depending on how you divide it up) - one
-background task may generate work for another.
-
-Thus proper idle detection needs to model this heirarchy.
-
-- Foreground writes
-- Page cache writeback
-- Copygc, rebalance
-- Journal reclaim
-
-When we implement idle detection and rush to idle, we need to be careful not
-to disturb too much the existing behaviour that works reasonably well when the
-system is under sustained load (or perhaps improve it in the case of
-rebalance, which currently does not actively attempt to let work batch up).
-
-SUSTAINED LOAD REGIME
----------------------
-
-When the system is under continuous load, we want these jobs to run
-continuously - this is perhaps best modelled with a P/D controller, where
-they'll be trying to keep a target value (i.e. fragmented disk space,
-available journal space) roughly in the middle of some range.
-
-The goal under sustained load is to balance our ability to handle load spikes
-without running out of x resource (free disk space, free space in the
-journal), while also letting some work accumululate to be batched (or become
-unnecessary).
-
-For example, we don't want to run copygc too aggressively, because then it
-will be evacuating buckets that would have become empty (been overwritten or
-deleted) anyways, and we don't want to wait until we're almost out of free
-space because then the system will behave unpredicably - suddenly we're doing
-a lot more work to service each write and the system becomes much slower.
-
-IDLE REGIME
------------
-
-When the system becomes idle, we should start flushing our pending work
-quicker so the system can go to sleep.
-
-Note that the definition of "idle" depends on where in the heirarchy a task
-is - a task should start flushing work more quickly when the task above it has
-stopped generating new work.
-
-e.g. rebalance should start flushing more quickly when page cache writeback is
-idle, and journal reclaim should only start flushing more quickly when both
-copygc and rebalance are idle.
-
-It's important to let work accumulate when more work is still incoming and we
-still have room, because flushing is always more efficient if we let it batch
-up. New writes may overwrite data before rebalance moves it, and tasks may be
-generating more updates for the btree nodes that journal reclaim needs to flush.
-
-On idle, how much work we do at each interval should be proportional to the
-length of time we have been idle for. If we're idle only for a short duration,
-we shouldn't flush everything right away; the system might wake up and start
-generating new work soon, and flushing immediately might end up doing a lot of
-work that would have been unnecessary if we'd allowed things to batch more.
-
-To summarize, we will need:
-
-  - A list of classes for background tasks that generate work, which will
-    include one "foreground" class.
-  - Tracking for each class - "Am I doing work, or have I gone to sleep?"
-  - And each class should check the class above it when deciding how much work to issue.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
deleted file mode 100644
index e5c4c2120b93..000000000000
--- a/Documentation/filesystems/bcachefs/index.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-======================
-bcachefs Documentation
-======================
-
-Subsystem-specific development process notes
---------------------------------------------
-
-Development notes specific to bcachefs. These are intended to supplement
-:doc:`general kernel development handbook </process/index>`.
-
-.. toctree::
-   :maxdepth: 1
-   :numbered:
-
-   CodingStyle
-   SubmittingPatches
-
-Filesystem implementation
--------------------------
-
-Documentation for filesystem features and their implementation details.
-At this moment, only a few of these are described here.
-
-.. toctree::
-   :maxdepth: 1
-   :numbered:
-
-   casefolding
-   errorcodes
-
-Future design
--------------
-..
toctree::
-   :maxdepth: 1
-
-   future/idle_work
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index e5bb89452aff..a8d02fe5be83 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -1,8 +1,11 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-==========================================
-WHAT IS Flash-Friendly File System (F2FS)?
-==========================================
+=================================
+Flash-Friendly File System (F2FS)
+=================================
+
+Overview
+========
 
 NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
 been equipped on a variety systems ranging from mobile to server systems. Since
@@ -173,9 +176,12 @@ data_flush               Enable data flushing before checkpoint in order to
                          persist data of regular and symlink.
 reserve_root=%d          Support configuring reserved space which is used for
                          allocation from a privileged user with specified uid or
-                         gid, unit: 4KB, the default limit is 0.2% of user blocks.
-resuid=%d                The user ID which may use the reserved blocks.
-resgid=%d                The group ID which may use the reserved blocks.
+                         gid, unit: 4KB, the default limit is 12.5% of user blocks.
+reserve_node=%d          Support configuring reserved nodes which are used for
+                         allocation from a privileged user with specified uid or
+                         gid, the default limit is 12.5% of all nodes.
+resuid=%d                The user ID which may use the reserved blocks and nodes.
+resgid=%d                The group ID which may use the reserved blocks and nodes.
 fault_injection=%d       Enable fault injection in all supported types with
                          specified injection rate.
 fault_type=%d            Support configuring fault injection type, should be
@@ -291,9 +297,13 @@ compress_algorithm=%s    Control compress algorithm, currently f2fs supports "lzo"
                          "lz4", "zstd" and "lzo-rle" algorithm.
 compress_algorithm=%s:%d Control compress algorithm and its compress level, now, only
                          "lz4" and "zstd" support compress level config.
+
+                         ========= ===========
                          algorithm level range
+                         ========= ===========
                          lz4       3 - 16
                          zstd      1 - 22
+                         ========= ===========
 compress_log_size=%u     Support configuring compress cluster size. The size will
                          be 4KB * (1 << %u). The default and minimum sizes are 16KB.
 compress_extension=%s    Support adding specified extension, so that f2fs can enable
@@ -357,6 +367,7 @@ errors=%s                Specify f2fs behavior on critical errors. This supports modes:
                          panic immediately, continue without doing anything, and remount
                          the partition in read-only mode. By default it uses "continue"
                          mode.
+
                          ====================== =============== =============== ========
                          mode                   continue        remount-ro      panic
                          ====================== =============== =============== ========
@@ -370,6 +381,25 @@ errors=%s                Specify f2fs behavior on critical errors. This supports modes:
                          ====================== =============== =============== ========
 nat_bits                 Enable nat_bits feature to enhance full/empty nat blocks access,
                          by default it's disabled.
+lookup_mode=%s           Control the directory lookup behavior for casefolded
+                         directories. This option has no effect on directories
+                         that do not have the casefold feature enabled.
+
+                         ================== ========================================
+                         Value              Description
+                         ================== ========================================
+                         perf               (Default) Enforces a hash-only lookup.
+                                            The linear search fallback is always
+                                            disabled, ignoring the on-disk flag.
+                         compat             Enables the linear search fallback for
+                                            compatibility with directory entries
+                                            created by older kernel that used a
+                                            different case-folding algorithm.
+                                            This mode ignores the on-disk flag.
+                         auto               F2FS determines the mode based on the
+                                            on-disk `SB_ENC_NO_COMPAT_FALLBACK_FL`
+                                            flag.
+                         ================== ========================================
 ======================== ============================================================
 
 Debugfs Entries
@@ -795,11 +825,13 @@ ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
 extension list        "                        "
 
 -- buffered io
+------------------------------------------------------------------
 N/A                   COLD_DATA                WRITE_LIFE_EXTREME
 N/A                   HOT_DATA                 WRITE_LIFE_SHORT
 N/A                   WARM_DATA                WRITE_LIFE_NOT_SET
 
 -- direct io
+------------------------------------------------------------------
 WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
 WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
 WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
@@ -915,24 +947,26 @@ compression enabled files (refer to "Compression implementation" section for how
 enable compression on a regular inode).
 
 1) compress_mode=fs
-This is the default option. f2fs does automatic compression in the writeback of the
-compression enabled files.
+
+   This is the default option. f2fs does automatic compression in the writeback of the
+   compression enabled files.
 
 2) compress_mode=user
-This disables the automatic compression and gives the user discretion of choosing the
-target file and the timing. The user can do manual compression/decompression on the
-compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
-ioctls like the below.
 
-To decompress a file,
+   This disables the automatic compression and gives the user discretion of choosing the
+   target file and the timing. The user can do manual compression/decompression on the
+   compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
+   ioctls like the below.
+ +To decompress a file:: -fd = open(filename, O_WRONLY, 0); -ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE); + fd = open(filename, O_WRONLY, 0); + ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE); -To compress a file, +To compress a file:: -fd = open(filename, O_WRONLY, 0); -ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE); + fd = open(filename, O_WRONLY, 0); + ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE); NVMe Zoned Namespace devices ---------------------------- @@ -962,32 +996,32 @@ reserved and used by another filesystem or for different purposes. Once that external usage is complete, the device aliasing file can be deleted, releasing the reserved space back to F2FS for its own use. -<use-case> - -# ls /dev/vd* -/dev/vdb (32GB) /dev/vdc (32GB) -# mkfs.ext4 /dev/vdc -# mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb -# mount /dev/vdb /mnt/f2fs -# ls -l /mnt/f2fs -vdc.file -# df -h -/dev/vdb 64G 33G 32G 52% /mnt/f2fs - -# mount -o loop /dev/vdc /mnt/ext4 -# df -h -/dev/vdb 64G 33G 32G 52% /mnt/f2fs -/dev/loop7 32G 24K 30G 1% /mnt/ext4 -# umount /mnt/ext4 - -# f2fs_io getflags /mnt/f2fs/vdc.file -get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable -# f2fs_io setflags noimmutable /mnt/f2fs/vdc.file -get a flag on noimmutable ret=0, flags=800010 -set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable -# rm /mnt/f2fs/vdc.file -# df -h -/dev/vdb 64G 753M 64G 2% /mnt/f2fs +.. 
code-block:: + + # ls /dev/vd* + /dev/vdb (32GB) /dev/vdc (32GB) + # mkfs.ext4 /dev/vdc + # mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb + # mount /dev/vdb /mnt/f2fs + # ls -l /mnt/f2fs + vdc.file + # df -h + /dev/vdb 64G 33G 32G 52% /mnt/f2fs + + # mount -o loop /dev/vdc /mnt/ext4 + # df -h + /dev/vdb 64G 33G 32G 52% /mnt/f2fs + /dev/loop7 32G 24K 30G 1% /mnt/ext4 + # umount /mnt/ext4 + + # f2fs_io getflags /mnt/f2fs/vdc.file + get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable + # f2fs_io setflags noimmutable /mnt/f2fs/vdc.file + get a flag on noimmutable ret=0, flags=800010 + set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable + # rm /mnt/f2fs/vdc.file + # df -h + /dev/vdb 64G 753M 64G 2% /mnt/f2fs So, the key idea is, user can do any file operations on /dev/vdc, and reclaim the space after the use, while the space is counted as /data. diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst index d73dd0dbd238..d73dd0dbd238 100644 --- a/Documentation/filesystems/fuse-io-uring.rst +++ b/Documentation/filesystems/fuse/fuse-io-uring.rst diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse/fuse-io.rst index 6464de4266ad..d736ac4cb483 100644 --- a/Documentation/filesystems/fuse-io.rst +++ b/Documentation/filesystems/fuse/fuse-io.rst @@ -1,7 +1,7 @@ .. 
SPDX-License-Identifier: GPL-2.0 ============== -Fuse I/O Modes +FUSE I/O Modes ============== Fuse supports the following I/O modes: diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse/fuse-passthrough.rst index 2b0e7c2da54a..2b0e7c2da54a 100644 --- a/Documentation/filesystems/fuse-passthrough.rst +++ b/Documentation/filesystems/fuse/fuse-passthrough.rst diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse/fuse.rst index 1e31e87aee68..0fbd5a03fdc9 100644 --- a/Documentation/filesystems/fuse.rst +++ b/Documentation/filesystems/fuse/fuse.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -==== -FUSE -==== +============= +FUSE Overview +============= Definitions =========== @@ -129,6 +129,20 @@ For each connection the following files exist within this directory: connection. This means that all waiting requests will be aborted an error returned for all aborted and new requests. + max_background + The maximum number of background requests that can be outstanding + at a time. When the number of background requests reaches this limit, + further requests will be blocked until some are completed, potentially + causing I/O operations to stall. + + congestion_threshold + The threshold of background requests at which the kernel considers + the filesystem to be congested. When the number of background requests + exceeds this value, the kernel will skip asynchronous readahead + operations, reducing read-ahead optimizations but preserving essential + I/O, as well as suspending non-synchronous writeback operations + (WB_SYNC_NONE), delaying page cache flushing to the filesystem. + Only the owner of the mount may read or write these files. Interrupting filesystem operations diff --git a/Documentation/filesystems/fuse/index.rst b/Documentation/filesystems/fuse/index.rst new file mode 100644 index 000000000000..393a845214da --- /dev/null +++ b/Documentation/filesystems/fuse/index.rst @@ -0,0 +1,14 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +====================================================== +FUSE (Filesystem in Userspace) Technical Documentation +====================================================== + +.. toctree:: + :maxdepth: 2 + :numbered: + + fuse + fuse-io + fuse-io-uring + fuse-passthrough diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 11a599387266..af516e528ded 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -72,7 +72,6 @@ Documentation for filesystem implementations. afs autofs autofs-mount-control - bcachefs/index befs bfs btrfs @@ -96,10 +95,7 @@ Documentation for filesystem implementations. hfs hfsplus hpfs - fuse - fuse-io - fuse-io-uring - fuse-passthrough + fuse/index inotify isofs nilfs2 diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index aa287ccdac2f..77704fde9845 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -443,7 +443,7 @@ prototypes:: int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *); void (*unlock_native_capacity) (struct gendisk *); - int (*getgeo)(struct block_device *, struct hd_geometry *); + int (*getgeo)(struct gendisk *, struct hd_geometry *); void (*swap_slot_free_notify) (struct block_device *, unsigned long); locking rules: diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst index e149b89118c8..c99ab1f7fea4 100644 --- a/Documentation/filesystems/mount_api.rst +++ b/Documentation/filesystems/mount_api.rst @@ -506,8 +506,16 @@ returned. * :: + int vfs_parse_fs_qstr(struct fs_context *fc, const char *key, + const struct qstr *value); + + A wrapper around vfs_parse_fs_param() that copies the value string it is + passed. 
+ + * :: + int vfs_parse_fs_string(struct fs_context *fc, const char *key, - const char *value, size_t v_size); + const char *value); A wrapper around vfs_parse_fs_param() that copies the value string it is passed. diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 85f590254f07..7233b04668fc 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -340,8 +340,8 @@ of those. Caller makes sure async writeback cannot be running for the inode whil ->drop_inode() returns int now; it's called on final iput() with inode->i_lock held and it returns true if filesystems wants the inode to be -dropped. As before, generic_drop_inode() is still the default and it's been -updated appropriately. generic_delete_inode() is also alive and it consists +dropped. As before, inode_generic_drop() is still the default and it's been +updated appropriately. inode_just_drop() is also alive and it consists simply of return 1. Note that all actual eviction work is done by caller after ->drop_inode() returns. @@ -1285,3 +1285,27 @@ rather than a VMA, as the VMA at this stage is not yet valid. The vm_area_desc provides the minimum required information for a filesystem to initialise state upon memory mapping of a file-backed region, and output parameters for the file system to set this state. + +--- + +**mandatory** + +Several functions are renamed: + +- kern_path_locked -> start_removing_path +- kern_path_create -> start_creating_path +- user_path_create -> start_creating_user_path +- user_path_locked_at -> start_removing_user_path_at +- done_path_create -> end_creating_path + +--- + +**mandatory** + +Calling conventions for vfs_parse_fs_string() have changed; it does *not* +take length anymore (value ? strlen(value) : 0 is used). If you want +a different length, use + + vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len)) + +instead. 
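The porting.rst note above says vfs_parse_fs_string() no longer takes a length and uses `value ? strlen(value) : 0` instead. A minimal userspace sketch of that rule follows. It is not kernel code: `struct value_span`, `default_value_len()`, `span_whole()` and `span_prefix()` are hypothetical names introduced only for illustration, and the explicit-length helper merely stands in for the idea of passing a length alongside the value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a (pointer, length) string pair. */
struct value_span {
	const char *ptr;
	size_t len;
};

/* Mirrors the documented rule: a NULL value has length 0, otherwise strlen(). */
static size_t default_value_len(const char *value)
{
	return value ? strlen(value) : 0;
}

/* Whole-string case: the length is derived from the string itself. */
static struct value_span span_whole(const char *value)
{
	struct value_span s = { value, default_value_len(value) };
	return s;
}

/* Explicit-length case: the caller chooses how many bytes count as the value. */
static struct value_span span_prefix(const char *value, size_t len)
{
	/* A NULL pointer is only acceptable together with a zero length. */
	assert(value || len == 0);
	struct value_span s = { value, len };
	return s;
}
```

With these helpers, `span_whole("ext4,ro")` covers all seven bytes, while `span_prefix("ext4,ro", 4)` covers only `ext4`, which is the situation where the note recommends the length-carrying variant.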
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index ede654478dff..3002258c9c7f 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -270,8 +270,9 @@ It's slow but very precise. HugetlbPages size of hugetlb memory portions CoreDumping process's memory is currently being dumped (killing the process may lead to a corrupted core) - THP_enabled process is allowed to use THP (returns 0 when - PR_SET_THP_DISABLE is set on the process + THP_enabled process is allowed to use THP (returns 0 when + PR_SET_THP_DISABLE is set on the process to disable + THP completely, not just partially) Threads number of threads SigQ number of signals queued/max. number for queue SigPnd bitmap of pending signals for the thread @@ -987,6 +988,19 @@ number, module (if originates from a loadable module) and the function calling the allocation. The number of bytes allocated and number of calls at each location are reported. The first line indicates the version of the file, the second line is the header listing fields in the file. +If file version is 2.0 or higher then each line may contain additional +<key>:<value> pairs representing extra information about the call site. +For example if the counters are not accurate, the line will be appended with +"accurate:no" pair. + +Supported markers in v2: +accurate:no + + Absolute values of the counters in this line are not accurate + because of the failure to allocate memory to track some of the + allocations made at this location. Deltas in these counters are + accurate, therefore counters can be used to track allocation size + and count changes. Example output. @@ -2341,6 +2355,7 @@ The following mount options are supported: hidepid= Set /proc/<pid>/ access mode. gid= Set the group authorized to learn processes information. subset= Show only the specified subset of procfs. + pidns= Specify the namespace used by this procfs. 
========= ======================================================== hidepid=off or hidepid=0 means classic mode - everybody may access all @@ -2373,6 +2388,13 @@ information about processes information, just add identd to this group. subset=pid hides all top level files and directories in the procfs that are not related to tasks. +pidns= specifies a pid namespace (either as a string path to something like +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that +will be used by the procfs instance when translating pids. By default, procfs +will use the calling process's active pid namespace. Note that the pid +namespace of an existing procfs instance cannot be modified (attempting to do +so will give an `-EBUSY` error). + Chapter 5: Filesystem behavior ============================== diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst index 4db3b07c16c5..b7f35b07876a 100644 --- a/Documentation/filesystems/resctrl.rst +++ b/Documentation/filesystems/resctrl.rst @@ -26,6 +26,7 @@ MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" MBA (Memory Bandwidth Allocation) "mba" SMBA (Slow Memory Bandwidth Allocation) "" BMEC (Bandwidth Monitoring Event Configuration) "" +ABMC (Assignable Bandwidth Monitoring Counters) "" =============================================== ================================ Historically, new features were made visible by default in /proc/cpuinfo. This @@ -256,6 +257,144 @@ with the following files: # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 0=0x30;1=0x30;3=0x15;4=0x15 +"mbm_assign_mode": + The supported counter assignment modes. The enclosed brackets indicate which mode + is enabled. The MBM events associated with counters may reset when "mbm_assign_mode" + is changed. 
+ :: + + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + [mbm_event] + default + + "mbm_event": + + mbm_event mode allows users to assign a hardware counter to an RMID, event + pair and monitor the bandwidth usage as long as it is assigned. The hardware + continues to track the assigned counter until it is explicitly unassigned by + the user. Each event within a resctrl group can be assigned independently. + + In this mode, a monitoring event can only accumulate data while it is backed + by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON + group to specify which of the events should have a counter assigned. The number + of counters available is described in the "num_mbm_cntrs" file. Changing the + mode may cause all counters on the resource to reset. + + Moving to mbm_event counter assignment mode requires users to assign the counters + to the events. Otherwise, the MBM event counters will return 'Unassigned' when read. + + The mode is beneficial for AMD platforms that support more CTRL_MON + and MON groups than available hardware counters. By default, this + feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth + Monitoring Counters) capability, ensuring counters remain assigned even + when the corresponding RMID is not actively used by any processor. + + "default": + + In default mode, resctrl assumes there is a hardware counter for each + event within every CTRL_MON and MON group. On AMD platforms, it is + recommended to use the mbm_event mode, if supported, to prevent reset of MBM + events between reads resulting from hardware re-allocating counters. This can + result in misleading values or display "Unavailable" if no counter is assigned + to the event. 
+ + * To enable "mbm_event" counter assignment mode: + :: + + # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + + * To enable "default" monitoring mode: + :: + + # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + +"num_mbm_cntrs": + The maximum number of counters (total of available and assigned counters) in + each domain when the system supports mbm_event mode. + + For example, on a system with maximum of 32 memory bandwidth monitoring + counters in each of its L3 domains: + :: + + # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs + 0=32;1=32 + +"available_mbm_cntrs": + The number of counters available for assignment in each domain when mbm_event + mode is enabled on the system. + + For example, on a system with 30 available [hardware] assignable counters + in each of its L3 domains: + :: + + # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs + 0=30;1=30 + +"event_configs": + Directory that exists when "mbm_event" counter assignment mode is supported. + Contains a sub-directory for each MBM event that can be assigned to a counter. + + Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes. + Each MBM event's sub-directory contains a file named "event_filter" that is + used to view and modify which memory transactions the MBM event is configured + with. The file is accessible only when "mbm_event" counter assignment mode is + enabled. 
+ + List of memory transaction types supported: + + ========================== ======================================================== + Name Description + ========================== ======================================================== + dirty_victim_writes_all Dirty Victims from the QOS domain to all types of memory + remote_reads_slow_memory Reads to slow memory in the non-local NUMA domain + local_reads_slow_memory Reads to slow memory in the local NUMA domain + remote_non_temporal_writes Non-temporal writes to non-local NUMA domain + local_non_temporal_writes Non-temporal writes to local NUMA domain + remote_reads Reads to memory in the non-local NUMA domain + local_reads Reads to memory in the local NUMA domain + ========================== ======================================================== + + For example:: + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes, + local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + local_reads,local_non_temporal_writes,local_reads_slow_memory + + Modify the event configuration by writing to the "event_filter" file within + the "event_configs" directory. The read/write "event_filter" file contains the + configuration of the event that reflects which memory transactions are counted by it. + + For example:: + + # echo "local_reads, local_non_temporal_writes" > + /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + local_reads,local_non_temporal_writes + +"mbm_assign_on_mkdir": + Exists when "mbm_event" counter assignment mode is supported. Accessible + only when "mbm_event" counter assignment mode is enabled. 
+ + Determines if a counter will automatically be assigned to an RMID, MBM event + pair when its associated monitor group is created via mkdir. Enabled by default + on boot, also when switched from "default" mode to "mbm_event" counter assignment + mode. Users can disable this capability by writing to the interface. + + "0": + Auto assignment is disabled. + "1": + Auto assignment is enabled. + + Example:: + + # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir + 0 + "max_threshold_occupancy": Read/write file provides the largest value (in bytes) at which a previously used LLC_occupancy @@ -380,10 +519,77 @@ When monitoring is enabled all MON groups will also contain: for the L3 cache they occupy). These are named "mon_sub_L3_YY" where "YY" is the node number. + When the 'mbm_event' counter assignment mode is enabled, reading + an MBM event of a MON group returns 'Unassigned' if no hardware + counter is assigned to it. For CTRL_MON groups, 'Unassigned' is + returned if the MBM event does not have an assigned counter in the + CTRL_MON group nor in any of its associated MON groups. + "mon_hw_id": Available only with debug option. The identifier used by hardware for the monitor group. On x86 this is the RMID. +When monitoring is enabled all MON groups may also contain: + +"mbm_L3_assignments": + Exists when "mbm_event" counter assignment mode is supported and lists the + counter assignment states of the group. + + The assignment list is displayed in the following format: + + <Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state> + + Event: A valid MBM event in the + /sys/fs/resctrl/info/L3_MON/event_configs directory. + + Domain ID: A valid domain ID. When writing, '*' applies the changes + to all the domains. + + Assignment states: + + _ : No counter assigned. + + e : Counter assigned exclusively. + + Example: + + To display the counter assignment states for the default group. 
+ :: + + # cd /sys/fs/resctrl + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + + Assignments can be modified by writing to the interface. + + Examples: + + To unassign the counter associated with the mbm_total_bytes event on domain 0: + :: + + # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=e + mbm_local_bytes:0=e;1=e + + To unassign the counter associated with the mbm_total_bytes event on all the domains: + :: + + # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=_ + mbm_local_bytes:0=e;1=e + + To assign a counter associated with the mbm_total_bytes event on all domains in + exclusive mode: + :: + + # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + When the "mba_MBps" mount option is used all CTRL_MON groups will also contain: "mba_MBps_event": @@ -1429,6 +1635,125 @@ View the llc occupancy snapshot:: # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy 11234000 + +Examples on working with mbm_assign_mode +======================================== + +a. Check if MBM counter assignment mode is supported. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + [mbm_event] + default + +The "mbm_event" mode is detected and enabled. + +b. Check how many assignable counters are supported. +:: + + # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs + 0=32;1=32 + +c. Check how many assignable counters are available for assignment in each domain. +:: + + # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs + 0=30;1=30 + +d. To list the default group's assign states. +:: + + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + +e. 
To unassign the counter associated with the mbm_total_bytes event on domain 0. +:: + + # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=e + mbm_local_bytes:0=e;1=e + +f. To unassign the counter associated with the mbm_total_bytes event on all domains. +:: + + # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=_;1=_ + mbm_local_bytes:0=e;1=e + +g. To assign a counter associated with the mbm_total_bytes event on all domains in +exclusive mode. +:: + + # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments + # cat /sys/fs/resctrl/mbm_L3_assignments + mbm_total_bytes:0=e;1=e + mbm_local_bytes:0=e;1=e + +h. Read the events mbm_total_bytes and mbm_local_bytes of the default group. There is +no change in reading the events with the assignment. +:: + + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes + 779247936 + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes + 562324232 + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes + 212122123 + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes + 121212144 + +i. Check the event configurations. +:: + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter + local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes, + local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + local_reads,local_non_temporal_writes,local_reads_slow_memory + +j. Change the event configuration for mbm_local_bytes. 
+:: + + # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" > + /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + + # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter + local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads + +k. Now read the local events again. The first read may come back with "Unavailable" +status. The subsequent read of mbm_local_bytes will display the current value. +:: + + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes + Unavailable + # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes + 2252323 + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes + Unavailable + # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes + 1566565 + +l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be +done using the following command. Note that switching the mbm_assign_mode may reset all +the MBM counters (and thus all MBM events) of all the resctrl groups. +:: + + # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode + mbm_event + [default] + +m. Unmount the resctrl filesystem. +:: + + # umount /sys/fs/resctrl/ + Intel RDT Errata ================ diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst index 354c5fb310b4..2703c04af7d0 100644 --- a/Documentation/filesystems/sysfs.rst +++ b/Documentation/filesystems/sysfs.rst @@ -320,7 +320,7 @@ span multiple bus types). fs/ contains a directory for some filesystems. Currently each filesystem wanting to export attributes must create its own hierarchy -below fs/ (see ./fuse.rst for an example). +below fs/ (see fuse/fuse.rst for an example). module/ contains parameter values and state information for all loaded system modules, for both builtin and loadable modules. 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 486a91633474..4f13b01e42eb 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -209,31 +209,8 @@ method fills in is the "s_op" field. This is a pointer to a "struct super_operations" which describes the next level of the filesystem implementation. -Usually, a filesystem uses one of the generic mount() implementations -and provides a fill_super() callback instead. The generic variants are: - -``mount_bdev`` - mount a filesystem residing on a block device - -``mount_nodev`` - mount a filesystem that is not backed by a device - -``mount_single`` - mount a filesystem which shares the instance between all mounts - -A fill_super() callback implementation has the following arguments: - -``struct super_block *sb`` - the superblock structure. The callback must initialize this - properly. - -``void *data`` - arbitrary mount options, usually comes as an ASCII string (see - "Mount Options" section) - -``int silent`` - whether or not to be silent on error - +For more information on mounting (and the new mount API), see +Documentation/filesystems/mount_api.rst. The Superblock Object ===================== @@ -327,11 +304,11 @@ or bottom half). inode->i_lock spinlock held. This method should be either NULL (normal UNIX filesystem - semantics) or "generic_delete_inode" (for filesystems that do + semantics) or "inode_just_drop" (for filesystems that do not want to cache inodes - causing "delete_inode" to always be called regardless of the value of i_nlink) - The "generic_delete_inode()" behavior is equivalent to the old + The "inode_just_drop()" behavior is equivalent to the old practice of using "force_delete" in the put_inode() case, but does not have the races that the "force_delete()" approach had.