summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/sysfs-kernel-mm-damon7
-rw-r--r--Documentation/accounting/delay-accounting.rst91
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt2
-rw-r--r--Documentation/admin-guide/mm/damon/start.rst2
-rw-r--r--Documentation/admin-guide/mm/damon/usage.rst11
-rw-r--r--Documentation/admin-guide/mm/transhuge.rst42
-rw-r--r--Documentation/admin-guide/mm/zswap.rst33
-rw-r--r--Documentation/admin-guide/sysctl/kernel.rst2
-rw-r--r--Documentation/core-api/mm-api.rst1
-rw-r--r--Documentation/dev-tools/kasan.rst3
-rw-r--r--Documentation/dev-tools/kcov.rst7
-rw-r--r--Documentation/driver-api/crypto/iaa/iaa-crypto.rst2
-rw-r--r--Documentation/filesystems/proc.rst18
-rw-r--r--Documentation/mm/damon/design.rst18
-rw-r--r--Documentation/mm/damon/maintainer-profile.rst17
-rw-r--r--Documentation/mm/index.rst1
-rw-r--r--Documentation/mm/swap-table.rst69
17 files changed, 245 insertions, 81 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index 6791d879759e..b6b71db36ca7 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -77,6 +77,13 @@ Description: Writing a keyword for a monitoring operations set ('vaddr' for
Note that only the operations sets that listed in
'avail_operations' file are valid inputs.
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/addr_unit
+Date: Aug 2025
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing an integer to this file sets the 'address unit'
+ parameter of the given operations set of the context. Reading
+ the file returns the last-written 'address unit' value.
+
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/sample_us
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
index 8ccc5af5ea1e..86d7902a657f 100644
--- a/Documentation/accounting/delay-accounting.rst
+++ b/Documentation/accounting/delay-accounting.rst
@@ -134,47 +134,72 @@ The above command can be used with -v to get more debug information.
After the system starts, use `delaytop` to get the system-wide delay information,
which includes system-wide PSI information and Top-N high-latency tasks.
+Note: PSI support requires `CONFIG_PSI=y` and `psi=1` for full functionality.
-`delaytop` supports sorting by CPU latency in descending order by default,
-displays the top 20 high-latency tasks by default, and refreshes the latency
-data every 2 seconds by default.
+`delaytop` is an interactive tool for monitoring system pressure and task delays.
+It supports multiple sorting options, display modes, and real-time keyboard controls.
-Get PSI information and Top-N tasks delay, since system boot::
+Basic usage with default settings (sorts by CPU delay, shows top 20 tasks, refreshes every 2 seconds)::
bash# ./delaytop
- System Pressure Information: (avg10/avg60/avg300/total)
- CPU some: 0.0%/ 0.0%/ 0.0%/ 345(ms)
+ System Pressure Information: (avg10/avg60vg300/total)
+ CPU some: 0.0%/ 0.0%/ 0.0%/ 106137(ms)
CPU full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
Memory full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
Memory some: 0.0%/ 0.0%/ 0.0%/ 0(ms)
- IO full: 0.0%/ 0.0%/ 0.0%/ 65(ms)
- IO some: 0.0%/ 0.0%/ 0.0%/ 79(ms)
+ IO full: 0.0%/ 0.0%/ 0.0%/ 2240(ms)
+ IO some: 0.0%/ 0.0%/ 0.0%/ 2783(ms)
IRQ full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
- Top 20 processes (sorted by CPU delay):
- PID TGID COMMAND CPU(ms) IO(ms) SWAP(ms) RCL(ms) THR(ms) CMP(ms) WP(ms) IRQ(ms)
- ----------------------------------------------------------------------------------------------
- 161 161 zombie_memcg_re 1.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 130 130 blkcg_punt_bio 1.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 444 444 scsi_tmf_0 0.73 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 1280 1280 rsyslogd 0.53 0.04 0.00 0.00 0.00 0.00 0.00 0.00
- 12 12 ksoftirqd/0 0.47 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 1277 1277 nbd-server 0.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 308 308 kworker/2:2-sys 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 55 55 netns 0.36 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 1187 1187 acpid 0.31 0.03 0.00 0.00 0.00 0.00 0.00 0.00
- 6184 6184 kworker/1:2-sys 0.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 186 186 kaluad 0.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 18 18 ksoftirqd/1 0.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 185 185 kmpath_rdacd 0.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 190 190 kstrp 0.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 2759 2759 agetty 0.20 0.03 0.00 0.00 0.00 0.00 0.00 0.00
- 1190 1190 kworker/0:3-sys 0.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 1272 1272 sshd 0.15 0.04 0.00 0.00 0.00 0.00 0.00 0.00
- 1156 1156 license 0.15 0.11 0.00 0.00 0.00 0.00 0.00 0.00
- 134 134 md 0.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- 6142 6142 kworker/3:2-xfs 0.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-
-Dynamic interactive interface of delaytop::
+ [o]sort [M]memverbose [q]quit
+ Top 20 processes (sorted by cpu delay):
+ PID TGID COMMAND CPU(ms) IO(ms) IRQ(ms) MEM(ms)
+ ------------------------------------------------------------------------
+ 110 110 kworker/15:0H-s 27.91 0.00 0.00 0.00
+ 57 57 cpuhp/7 3.18 0.00 0.00 0.00
+ 99 99 cpuhp/14 2.97 0.00 0.00 0.00
+ 51 51 cpuhp/6 0.90 0.00 0.00 0.00
+ 44 44 kworker/4:0H-sy 0.80 0.00 0.00 0.00
+ 60 60 ksoftirqd/7 0.74 0.00 0.00 0.00
+ 76 76 idle_inject/10 0.31 0.00 0.00 0.00
+ 100 100 idle_inject/14 0.30 0.00 0.00 0.00
+ 1309 1309 systemsettings 0.29 0.00 0.00 0.00
+ 45 45 cpuhp/5 0.22 0.00 0.00 0.00
+ 63 63 cpuhp/8 0.20 0.00 0.00 0.00
+ 87 87 cpuhp/12 0.18 0.00 0.00 0.00
+ 93 93 cpuhp/13 0.17 0.00 0.00 0.00
+ 1265 1265 acpid 0.17 0.00 0.00 0.00
+ 1552 1552 sshd 0.17 0.00 0.00 0.00
+ 2584 2584 sddm-helper 0.16 0.00 0.00 0.00
+ 1284 1284 rtkit-daemon 0.15 0.00 0.00 0.00
+ 1326 1326 nde-netfilter 0.14 0.00 0.00 0.00
+ 27 27 cpuhp/2 0.13 0.00 0.00 0.00
+ 631 631 kworker/11:2-rc 0.11 0.00 0.00 0.00
+
+Interactive keyboard controls during runtime::
+
+ o - Select sort field (CPU, IO, IRQ, Memory, etc.)
+ M - Toggle display mode (Default/Memory Verbose)
+ q - Quit
+
+Available sort fields(use -s/--sort or interactive command)::
+
+ cpu(c) - CPU delay
+ blkio(i) - I/O delay
+ irq(q) - IRQ delay
+ mem(m) - Total memory delay
+ swapin(s) - Swapin delay (memory verbose mode only)
+ freepages(r) - Freepages reclaim delay (memory verbose mode only)
+ thrashing(t) - Thrashing delay (memory verbose mode only)
+ compact(p) - Compaction delay (memory verbose mode only)
+ wpcopy(w) - Write page copy delay (memory verbose mode only)
+
+Advanced usage examples::
+
+ # ./delaytop -s blkio
+ Sorted by IO delay
+
+ # ./delaytop -s mem -M
+ Sorted by memory delay in memory verbose mode
# ./delaytop -p pid
Print delayacct stats
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e019db1633fd..74ca438d2d6d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4603,7 +4603,7 @@
bit 2: print timer info
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
- bit 5: replay all messages on consoles at the end of panic
+ bit 5: replay all kernel messages on consoles at the end of panic
bit 6: print all CPUs backtrace (if available in the arch)
bit 7: print only tasks in uninterruptible (blocked) state
*Be aware* that this option may print a _lot_ of lines,
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
index ede14b679d02..ec8c34b2d32f 100644
--- a/Documentation/admin-guide/mm/damon/start.rst
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -175,4 +175,4 @@ Below command makes every memory region of size >=4K that has not accessed for
$ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \
--damos_age 60s max --damos_action pageout \
- <pid of your workload>
+ --target_pid <pid of your workload>
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index ff3a2dda1f02..2cae60b6f3ca 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -61,7 +61,7 @@ comma (",").
│ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
│ │ :ref:`0 <sysfs_kdamond>`/state,pid,refresh_ms
│ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
- │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
+ │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations,addr_unit
│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
│ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
@@ -188,9 +188,9 @@ details). At the moment, only one context per kdamond is supported, so only
contexts/<N>/
-------------
-In each context directory, two files (``avail_operations`` and ``operations``)
-and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
-exist.
+In each context directory, three files (``avail_operations``, ``operations``
+and ``addr_unit``) and three directories (``monitoring_attrs``, ``targets``,
+and ``schemes``) exist.
DAMON supports multiple types of :ref:`monitoring operations
<damon_design_configurable_operations_set>`, including those for virtual address
@@ -205,6 +205,9 @@ You can set and get what type of monitoring operations DAMON will use for the
context by writing one of the keywords listed in ``avail_operations`` file and
reading from the ``operations`` file.
+``addr_unit`` file is for setting and getting the :ref:`address unit
+<damon_design_addr_unit>` parameter of the operations set.
+
.. _sysfs_monitoring_attrs:
contexts/<N>/monitoring_attrs/
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 370fba113460..1654211cc6cf 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -225,6 +225,42 @@ to "always" or "madvise"), and it'll be automatically shutdown when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never")
+process THP controls
+--------------------
+
+A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
+and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using
+``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls
+support the following arguments::
+
+ prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
+ This will disable THPs completely for the process, irrespective
+ of global THP controls or madvise(..., MADV_COLLAPSE) being used.
+
+ prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
+ This will disable THPs for the process except when the usage of THPs is
+ advised. Consequently, THPs will only be used when:
+ - Global THP controls are set to "always" or "madvise" and
+ madvise(..., MADV_HUGEPAGE) or madvise(..., MADV_COLLAPSE) is used.
+ - Global THP controls are set to "never" and madvise(..., MADV_COLLAPSE)
+ is used. This is the same behavior as if THPs would not be disabled on
+ a process level.
+ Note that MADV_COLLAPSE is currently always rejected if
+ madvise(..., MADV_NOHUGEPAGE) is set on an area.
+
+ prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
+ This will re-enable THPs for the process, as if they were never disabled.
+ Whether THPs will actually be used depends on global THP controls and
+ madvise() calls.
+
+ prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
+ This returns a value whose bits indicate how THP-disable is configured:
+ Bits
+ 1 0 Value Description
+ |0|0| 0 No THP-disable behaviour specified.
+ |0|1| 1 THP is entirely disabled for this process.
+ |1|1| 3 THP-except-advised mode is set for this process.
+
Khugepaged controls
-------------------
@@ -383,6 +419,8 @@ option: ``huge=``. It can have following values:
always
Attempt to allocate huge pages every time we need a new page;
+ Always try PMD-sized huge pages first, and fall back to smaller-sized
+ huge pages if the PMD-sized huge page allocation fails;
never
Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)``
@@ -390,7 +428,9 @@ never
is specified everywhere;
within_size
- Only allocate huge page if it will be fully within i_size.
+ Only allocate huge page if it will be fully within i_size;
+ Always try PMD-sized huge pages first, and fall back to smaller-sized
+ huge pages if the PMD-sized huge page allocation fails;
Also respect madvise() hints;
advise
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index fd3370aa43fe..283d77217c6f 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -53,26 +53,17 @@ Zswap receives pages for compression from the swap subsystem and is able to
evict pages from its own compressed pool on an LRU basis and write them back to
the backing swap device in the case that the compressed pool is full.
-Zswap makes use of zpool for the managing the compressed memory pool. Each
-allocation in zpool is not directly accessible by address. Rather, a handle is
+Zswap makes use of zsmalloc for the managing the compressed memory pool. Each
+allocation in zsmalloc is not directly accessible by address. Rather, a handle is
returned by the allocation routine and that handle must be mapped before being
accessed. The compressed memory pool grows on demand and shrinks as compressed
-pages are freed. The pool is not preallocated. By default, a zpool
-of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
-but it can be overridden at boot time by setting the ``zpool`` attribute,
-e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs
-``zpool`` attribute, e.g.::
-
- echo zsmalloc > /sys/module/zswap/parameters/zpool
-
-The zsmalloc type zpool has a complex compressed page storage method, and it
-can achieve great storage densities.
+pages are freed. The pool is not preallocated.
When a swap page is passed from swapout to zswap, zswap maintains a mapping
-of the swap entry, a combination of the swap type and swap offset, to the zpool
-handle that references that compressed swap page. This mapping is achieved
-with a red-black tree per swap type. The swap offset is the search key for the
-tree nodes.
+of the swap entry, a combination of the swap type and swap offset, to the
+zsmalloc handle that references that compressed swap page. This mapping is
+achieved with a red-black tree per swap type. The swap offset is the search
+key for the tree nodes.
During a page fault on a PTE that is a swap entry, the swapin code calls the
zswap load function to decompress the page into the page allocated by the page
@@ -96,11 +87,11 @@ attribute, e.g.::
echo lzo > /sys/module/zswap/parameters/compressor
-When the zpool and/or compressor parameter is changed at runtime, any existing
-compressed pages are not modified; they are left in their own zpool. When a
-request is made for a page in an old zpool, it is uncompressed using its
-original compressor. Once all pages are removed from an old zpool, the zpool
-and its compressor are freed.
+When the compressor parameter is changed at runtime, any existing compressed
+pages are not modified; they are left in their own pool. When a request is
+made for a page in an old pool, it is uncompressed using its original
+compressor. Once all pages are removed from an old pool, the pool and its
+compressor are freed.
Some of the pages in zswap are same-value filled pages (i.e. contents of the
page have same value or repetitive pattern). These pages include zero-filled
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 8b49eab937d0..f3ee807b5d8b 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -890,7 +890,7 @@ bit 1 print system memory info
bit 2 print timer info
bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
bit 4 print ftrace buffer
-bit 5 replay all messages on consoles at the end of panic
+bit 5 replay all kernel messages on consoles at the end of panic
bit 6 print all CPUs backtrace (if available in the arch)
bit 7 print only tasks in uninterruptible (blocked) state
===== ============================================
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 5063179cfc70..68193a4cfcf5 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -118,7 +118,6 @@ More Memory Management Functions
.. kernel-doc:: mm/memremap.c
.. kernel-doc:: mm/hugetlb.c
.. kernel-doc:: mm/swap.c
-.. kernel-doc:: mm/zpool.c
.. kernel-doc:: mm/memcontrol.c
.. #kernel-doc:: mm/memory-tiers.c (build warnings)
.. kernel-doc:: mm/shmem.c
diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index 0a1418ab72fd..a034700da7c4 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -143,6 +143,9 @@ disabling KASAN altogether or controlling its features:
Asymmetric mode: a bad access is detected synchronously on reads and
asynchronously on writes.
+- ``kasan.write_only=off`` or ``kasan.write_only=on`` controls whether KASAN
+ checks the write (store) accesses only or all accesses (default: ``off``).
+
- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``).
diff --git a/Documentation/dev-tools/kcov.rst b/Documentation/dev-tools/kcov.rst
index 6611434e2dd2..8127849d40f5 100644
--- a/Documentation/dev-tools/kcov.rst
+++ b/Documentation/dev-tools/kcov.rst
@@ -361,7 +361,12 @@ local tasks spawned by the process and the global task that handles USB bus #1:
*/
sleep(2);
- n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
+ /*
+ * The load to the coverage count should be an acquire to pair with
+ * pair with the corresponding write memory barrier (smp_wmb()) on
+ * the kernel-side in kcov_move_area().
+ */
+ n = __atomic_load_n(&cover[0], __ATOMIC_ACQUIRE);
for (i = 0; i < n; i++)
printf("0x%lx\n", cover[i + 1]);
if (ioctl(fd, KCOV_DISABLE, 0))
diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 8e50b900d51c..f815d4fd8372 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -476,7 +476,6 @@ Use the following commands to enable zswap::
# echo 0 > /sys/module/zswap/parameters/enabled
# echo 50 > /sys/module/zswap/parameters/max_pool_percent
# echo deflate-iaa > /sys/module/zswap/parameters/compressor
- # echo zsmalloc > /sys/module/zswap/parameters/zpool
# echo 1 > /sys/module/zswap/parameters/enabled
# echo 100 > /proc/sys/vm/swappiness
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
@@ -625,7 +624,6 @@ the 'fixed' compression mode::
echo 0 > /sys/module/zswap/parameters/enabled
echo 50 > /sys/module/zswap/parameters/max_pool_percent
echo deflate-iaa > /sys/module/zswap/parameters/compressor
- echo zsmalloc > /sys/module/zswap/parameters/zpool
echo 1 > /sys/module/zswap/parameters/enabled
echo 100 > /proc/sys/vm/swappiness
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index b7e3147ba3d4..42f2fb9e3c8f 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
- THP_enabled process is allowed to use THP (returns 0 when
- PR_SET_THP_DISABLE is set on the process
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process to disable
+ THP completely, not just partially)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
@@ -1008,6 +1009,19 @@ number, module (if originates from a loadable module) and the function calling
the allocation. The number of bytes allocated and number of calls at each
location are reported. The first line indicates the version of the file, the
second line is the header listing fields in the file.
+If file version is 2.0 or higher then each line may contain additional
+<key>:<value> pairs representing extra information about the call site.
+For example if the counters are not accurate, the line will be appended with
+"accurate:no" pair.
+
+Supported markers in v2:
+accurate:no
+
+ Absolute values of the counters in this line are not accurate
+ because of the failure to allocate memory to track some of the
+ allocations made at this location. Deltas in these counters are
+ accurate, therefore counters can be used to track allocation size
+ and count changes.
Example output.
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 03f8137256f5..80354f4f42ba 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -67,7 +67,7 @@ processes, NUMA nodes, files, and backing memory devices would be supportable.
Also, if some architectures or devices support special optimized access check
features, those will be easily configurable.
-DAMON currently provides below three operation sets. Below two subsections
+DAMON currently provides below three operation sets. Below three subsections
describe how those work.
- vaddr: Monitor virtual address spaces of specific processes
@@ -135,6 +135,20 @@ the interference is the responsibility of sysadmins. However, it solves the
conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags,
as Idle page tracking does.
+.. _damon_design_addr_unit:
+
+Address Unit
+------------
+
+DAMON core layer uses ``unsinged long`` type for monitoring target address
+ranges. In some cases, the address space for a given operations set could be
+too large to be handled with the type. ARM (32-bit) with large physical
+address extension is an example. For such cases, a per-operations set
+parameter called ``address unit`` is provided. It represents the scale factor
+that need to be multiplied to the core layer's address for calculating real
+address on the given address space. Support of ``address unit`` parameter is
+up to each operations set implementation. ``paddr`` is the only operations set
+implementation that supports the parameter.
.. _damon_core_logic:
@@ -689,7 +703,7 @@ DAMOS accounts below statistics for each scheme, from the beginning of the
scheme's execution.
- ``nr_tried``: Total number of regions that the scheme is tried to be applied.
-- ``sz_trtied``: Total size of regions that the scheme is tried to be applied.
+- ``sz_tried``: Total size of regions that the scheme is tried to be applied.
- ``sz_ops_filter_passed``: Total bytes that passed operations set
layer-handled DAMOS filters.
- ``nr_applied``: Total number of regions that the scheme is applied.
diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst
index 5cd07905a193..58a3fb3c5762 100644
--- a/Documentation/mm/damon/maintainer-profile.rst
+++ b/Documentation/mm/damon/maintainer-profile.rst
@@ -89,18 +89,13 @@ the maintainer.
Community meetup
----------------
-DAMON community is maintaining two bi-weekly meetup series for community
-members who prefer synchronous conversations over mails.
+DAMON community has a bi-weekly meetup series for members who prefer
+synchronous conversations over mails. It is for discussions on specific topics
+between a group of members including the maintainer. The maintainer shares the
+available time slots, and attendees should reserve one of those at least 24
+hours before the time slot, by reaching out to the maintainer.
-The first one is for any discussion between every community member. No
-reservation is needed.
-
-The seconds one is for discussions on specific topics between restricted
-members including the maintainer. The maintainer shares the available time
-slots, and attendees should reserve one of those at least 24 hours before the
-time slot, by reaching out to the maintainer.
-
-Schedules and available reservation time slots are available at the Google `doc
+Schedules and reservation status are available at the Google `doc
<https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing>`_.
There is also a public Google `calendar
<https://calendar.google.com/calendar/u/0?cid=ZDIwOTA4YTMxNjc2MDQ3NTIyMmUzYTM5ZmQyM2U4NDA0ZGIwZjBiYmJlZGQxNDM0MmY4ZTRjOTE0NjdhZDRiY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t>`_
diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index fb45acba16ac..ba6a8872849b 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -20,6 +20,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
highmem
page_reclaim
swap
+ swap-table
page_cache
shmfs
oom
diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
new file mode 100644
index 000000000000..da10bb7a0dc3
--- /dev/null
+++ b/Documentation/mm/swap-table.rst
@@ -0,0 +1,69 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
+
+==========
+Swap Table
+==========
+
+Swap table implements swap cache as a per-cluster swap cache value array.
+
+Swap Entry
+----------
+
+A swap entry contains the information required to serve the anonymous page
+fault.
+
+Swap entry is encoded as two parts: swap type and swap offset.
+
+The swap type indicates which swap device to use.
+The swap offset is the offset of the swap file to read the page data from.
+
+Swap Cache
+----------
+
+Swap cache is a map to look up folios using swap entry as the key. The result
+value can have three possible types depending on which stage of this swap entry
+was in.
+
+1. NULL: This swap entry is not used.
+
+2. folio: A folio has been allocated and bound to this swap entry. This is
+ the transient state of swap out or swap in. The folio data can be in
+ the folio or swap file, or both.
+
+3. shadow: The shadow contains the working set information of the swapped
+ out folio. This is the normal state for a swapped out page.
+
+Swap Table Internals
+--------------------
+
+The previous swap cache is implemented by XArray. The XArray is a tree
+structure. Each lookup will go through multiple nodes. Can we do better?
+
+Notice that most of the time when we look up the swap cache, we are either
+in a swap in or swap out path. We should already have the swap cluster,
+which contains the swap entry.
+
+If we have a per-cluster array to store swap cache value in the cluster.
+Swap cache lookup within the cluster can be a very simple array lookup.
+
+We give such a per-cluster swap cache value array a name: the swap table.
+
+A swap table is an array of pointers. Each pointer is the same size as a
+PTE. The size of a swap table for one swap cluster typically matches a PTE
+page table, which is one page on modern 64-bit systems.
+
+With swap table, swap cache lookup can achieve great locality, simpler,
+and faster.
+
+Locking
+-------
+
+Swap table modification requires taking the cluster lock. If a folio
+is being added to or removed from the swap table, the folio must be
+locked prior to the cluster lock. After adding or removing is done, the
+folio shall be unlocked.
+
+Swap table lookup is protected by RCU and atomic read. If the lookup
+returns a folio, the user must lock the folio before use.