Diffstat (limited to 'Documentation/filesystems/gfs2'):

 -rw-r--r-- Documentation/filesystems/gfs2/glocks.rst  | 249
 -rw-r--r-- Documentation/filesystems/gfs2/index.rst   |  64
 -rw-r--r-- Documentation/filesystems/gfs2/uevents.rst | 112

 3 files changed, 425 insertions, 0 deletions
diff --git a/Documentation/filesystems/gfs2/glocks.rst b/Documentation/filesystems/gfs2/glocks.rst
new file mode 100644
index 000000000000..ce5ff08cbd59
--- /dev/null
+++ b/Documentation/filesystems/gfs2/glocks.rst
@@ -0,0 +1,249 @@

.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued.

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:

========== ======== ===================================================
Glock mode DLM mode Meaning
========== ======== ===================================================
UN         IV/NL    Unlocked (no DLM lock associated with glock) or NL
SH         PR       Protected read
DF         CW       Concurrent write
EX         EX       Exclusive
========== ======== ===================================================

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:

========== ============== ========== ========== ==============
Glock mode Cache metadata Cache data Dirty data Dirty metadata
========== ============== ========== ========== ==============
UN         No             No         No         No
DF         Yes            No         No         No
SH         Yes            Yes        No         No
EX         Yes            Yes        Yes        Yes
========== ============== ========== ========== ==============

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. Only inode glocks use the DF mode, for example.

Table of glock operations and per-type constants:

============== =============================================================
Field          Purpose
============== =============================================================
go_sync        Called before remote state change (e.g. to sync dirty data)
go_xmote_bh    Called after remote state change (e.g. to refill cache)
go_inval       Called if remote state change requires invalidating the cache
go_instantiate Called when a glock has been acquired
go_held        Called every time a glock holder is acquired
go_dump        Called to print the content of the object for the debugfs
               file, or on error to dump the glock to the log
go_callback    Called if the DLM sends a callback to drop this lock
go_unlocked    Called when a glock is unlocked (dlm_unlock())
go_type        The type of the glock, ``LM_TYPE_*``
go_flags       GLOF_ASPACE is set if the glock has an address space
               associated with it
============== =============================================================
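For illustration, a hedged sketch of how a glock type wires these
operations together as a constant table, loosely modelled on the inode
glock operations in fs/gfs2/glops.c; the callback names here are
examples only, and the exact set of fields varies between kernel
versions::

  /* Sketch of a per-type operations table; callback names are examples */
  static const struct gfs2_glock_operations example_inode_glops = {
          .go_sync  = inode_go_sync,  /* write back dirty data/metadata */
          .go_inval = inode_go_inval, /* invalidate cache on demotion */
          .go_dump  = inode_go_dump,  /* describe the inode in debugfs */
          .go_type  = LM_TYPE_INODE,
          .go_flags = GLOF_ASPACE,    /* this glock has an address space */
  };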
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.

Eventually, we hope to make the glock "EX" mode locally shared such that any
local locking will be done with the i_mutex as required rather than via the
glock.

Locking rules for glock operations:

============== ====================== =============================
Operation      GLF_LOCK bit lock held gl_lockref.lock spinlock held
============== ====================== =============================
go_sync        Yes                    No
go_xmote_bh    Yes                    No
go_inval       Yes                    No
go_instantiate No                     No
go_held        No                     No
go_dump        Sometimes              Yes
go_callback    Sometimes (N/A)        Yes
go_unlocked    Yes                    No
============== ====================== =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and do_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching up-to-date data.

Glock locking order within GFS2 (rule 3 is illustrated below):

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per-rgrp basis.
In general we prefer to lock local locks prior to cluster locks.
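A minimal sketch of rule 3, for illustration only: two same-level
inodes are locked in lock number order. The gfs2_glock_nq_init() and
gfs2_glock_dq_uninit() helpers are the real interfaces declared in
fs/gfs2/glock.h, but this function and its error handling are
hypothetical::

  /* Hypothetical helper: lock two sibling inodes in lock number order */
  static int example_lock_two_inodes(struct gfs2_inode *a,
                                     struct gfs2_inode *b)
  {
          struct gfs2_holder gh_a, gh_b;
          int error;

          /* Rule 3: same-level inodes are locked in lock number order */
          if (a->i_gl->gl_name.ln_number > b->i_gl->gl_name.ln_number)
                  swap(a, b);

          error = gfs2_glock_nq_init(a->i_gl, LM_ST_EXCLUSIVE, 0, &gh_a);
          if (error)
                  return error;
          error = gfs2_glock_nq_init(b->i_gl, LM_ST_EXCLUSIVE, 0, &gh_b);
          if (error) {
                  gfs2_glock_dq_uninit(&gh_a);
                  return error;
          }

          /* ... modify both inodes under their glocks ... */

          gfs2_glock_dq_uninit(&gh_b);
          gfs2_glock_dq_uninit(&gh_a);
          return 0;
  }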
Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per-cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

The same information is gathered for both the super block and
the individual glocks. The super block timing statistics are
used to provide default values for the glock timing statistics,
so that newly created glocks should have, as far as possible,
a sensible starting point. The per-glock counters are
initialised to zero when the glock is created. The per-glock
statistics are lost when the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to anyone used to calculating
round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.
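A minimal sketch of that style of update, assuming gains of 1/8 for the
mean and 1/4 for the variance estimate (which matches the 8-sample and
4-sample settling behaviour noted below); the function name is
hypothetical::

  /* Illustrative exponentially-smoothed mean/variance update (sketch) */
  static void example_update_stats(s64 *mean, s64 *var, s64 sample)
  {
          s64 delta = sample - *mean;

          *mean += delta >> 3;            /* mean += (sample - mean) / 8 */
          if (delta < 0)
                  delta = -delta;
          *var += (delta - *var) >> 2;    /* var += (|delta| - var) / 4 */
  }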
The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any requests when (a) the current state of
the lock is exclusive, i.e. a lock demotion, (b) the requested
state is either null or unlocked (again, a demotion) or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
counts the queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. the time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Per-sb stats can be found here::

  /sys/kernel/debug/gfs2/<fsname>/sbstats

Per-glock stats can be found here::

  /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and that
<fsname> is replaced with the name of the gfs2 filesystem in question.

The abbreviations used in the output are as follows:

========= ================================================================
srtt      Smoothed round trip time for non-blocking dlm requests
srttvar   Variance estimate for srtt
srttb     Smoothed round trip time for (potentially) blocking dlm requests
srttvarb  Variance estimate for srttb
sirt      Smoothed inter-request time (for dlm requests)
sirtvar   Variance estimate for sirt
dlm       Number of dlm requests made (dcnt in glstats file)
queue     Number of glock requests queued (qcnt in glstats file)
========= ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

====== =======================================
status The status of the dlm request
flags  The dlm request flags
tdiff  The time taken by this specific request
====== =======================================

(remaining fields as per the above list)
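For example, assuming the standard tracing interface is available under
/sys/kernel/debug/tracing, the tracepoint can be enabled and watched as
follows::

  $ echo 1 > /sys/kernel/debug/tracing/events/gfs2/gfs2_glock_lock_time/enable
  $ cat /sys/kernel/debug/tracing/trace_pipe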
diff --git a/Documentation/filesystems/gfs2/index.rst b/Documentation/filesystems/gfs2/index.rst
new file mode 100644
index 000000000000..e5e195403561
--- /dev/null
+++ b/Documentation/filesystems/gfs2/index.rst
@@ -0,0 +1,64 @@

.. SPDX-License-Identifier: GPL-2.0

====================
Global File System 2
====================

Overview
========

GFS2 is a cluster file system. It allows a cluster of computers to
simultaneously use a block device that is shared between them (with FC,
iSCSI, NBD, etc.). GFS2 reads and writes to the block device like a local
file system, but also uses a lock module to allow the computers to
coordinate their I/O so that file system consistency is maintained. One
of the nifty features of GFS2 is perfect consistency -- changes made to
the file system on one machine show up immediately on all other machines
in the cluster.

GFS2 uses interchangeable inter-node locking mechanisms; the currently
supported mechanisms are:

  lock_nolock
    - allows GFS2 to be used as a local file system

  lock_dlm
    - uses the distributed lock manager (dlm) for inter-node locking.
      The dlm is found at linux/fs/dlm/

lock_dlm depends on the user space cluster management systems available
from the gfs2-utils project (see below).

To use GFS2 as a local file system, no external clustering systems are
needed, simply::

  $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
  $ mount -t gfs2 /dev/block_device /dir
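By contrast, a clustered filesystem needs the lock_dlm protocol and a
lock table name. A hedged example follows; the cluster name, filesystem
name, journal count and device are placeholders::

  $ mkfs.gfs2 -p lock_dlm -t mycluster:myfs -j 3 /dev/shared_block_device
  $ mount -t gfs2 /dev/shared_block_device /dir

Here ``-t mycluster:myfs`` sets the lock table: the cluster name must
match the dlm/corosync cluster configuration, and ``-j 3`` creates one
journal per node for a three-node cluster.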
The gfs2-utils package is required on all cluster nodes and, for lock_dlm, you
will also need the dlm and corosync user space utilities configured as per the
documentation.

gfs2-utils can be found at https://pagure.io/gfs2-utils

GFS2 is not on-disk compatible with previous versions of GFS, but it
is pretty close.

The following man pages are available from gfs2-utils:

  ============ =============================================
  fsck.gfs2    to repair a filesystem
  gfs2_grow    to expand a filesystem online
  gfs2_jadd    to add journals to a filesystem online
  tunegfs2     to manipulate, examine and tune a filesystem
  gfs2_convert to convert a gfs filesystem to GFS2 in-place
  mkfs.gfs2    to make a filesystem
  ============ =============================================

Implementation Notes
====================

.. toctree::
   :maxdepth: 1

   glocks
   uevents

diff --git a/Documentation/filesystems/gfs2/uevents.rst b/Documentation/filesystems/gfs2/uevents.rst
new file mode 100644
index 000000000000..f162a2c76c69
--- /dev/null
+++ b/Documentation/filesystems/gfs2/uevents.rst
@@ -0,0 +1,112 @@

.. SPDX-License-Identifier: GPL-2.0

================
uevents and GFS2
================

During the lifetime of a GFS2 mount, a number of uevents are generated.
This document explains what the events are and what they are used
for (by gfs_controld in gfs2-utils).

A list of GFS2 uevents
======================

1. ADD
------

The ADD event occurs at mount time. It will always be the first
uevent generated by the newly created filesystem. If the mount
is successful, an ONLINE uevent will follow. If it is not successful
then a REMOVE uevent will follow.

The ADD uevent has two environment variables: SPECTATOR=[0|1]
and RDONLY=[0|1] that specify the spectator status (a read-only mount
with no journal assigned) and the read-only (with journal assigned)
status of the filesystem respectively.

2. ONLINE
---------

The ONLINE uevent is generated after a successful mount or remount. It
has the same environment variables as the ADD uevent. The ONLINE
uevent, along with the two environment variables for spectator and
RDONLY, is a relatively recent addition (2.6.32-rc+) and will not
be generated by older kernels.

3. CHANGE
---------

The CHANGE uevent is used in two places. One is when reporting the
successful mount of the filesystem by the first node (FIRSTMOUNT=Done).
This is used as a signal by gfs_controld that it is then ok for other
nodes in the cluster to mount the filesystem.

The other CHANGE uevent is used to inform of the completion
of journal recovery for one of the filesystem's journals. It has
two environment variables, JID= which specifies the journal id which
has just been recovered, and RECOVERY=[Done|Failed] to indicate the
success (or otherwise) of the operation. These uevents are generated
for every journal recovered, whether it is during the initial mount
process or as the result of gfs_controld requesting a specific journal
recovery via the /sys/fs/gfs2/<fsname>/lock_module/recovery file.

Because the CHANGE uevent was used (in early versions of gfs_controld)
without checking the environment variables to discover the state, we
cannot add any more functions to it without running the risk of
someone using an older version of the user tools and breaking their
cluster. For this reason the ONLINE uevent was used when adding a new
uevent for a successful mount or remount.

4. OFFLINE
----------

The OFFLINE uevent is only generated due to filesystem errors and is used
as part of the "withdraw" mechanism. Currently this doesn't give any
information about what the error is, which is something that needs to
be fixed.

5. REMOVE
---------

The REMOVE uevent is generated at the end of an unsuccessful mount
or at the end of a umount of the filesystem. All REMOVE uevents will
have been preceded by at least an ADD uevent for the same filesystem,
and unlike the other uevents it is generated automatically by the
kernel's kobject subsystem.


Information common to all GFS2 uevents (uevent environment variables)
=====================================================================

1. LOCKTABLE=
-------------

The LOCKTABLE is a string, as supplied on the mount command
line (locktable=) or via fstab. It is used as a filesystem label
as well as providing the information for a lock_dlm mount to be
able to join the cluster.

2. LOCKPROTO=
-------------

The LOCKPROTO is a string, and its value depends on what is set
on the mount command line, or via fstab. It will be either
lock_nolock or lock_dlm. In the future other lock managers
may be supported.

3. JOURNALID=
-------------

If a journal is in use by the filesystem (journals are not
assigned for spectator mounts) then this will give the
numeric journal id in all GFS2 uevents.

4. UUID=
--------

With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID
into the filesystem superblock. If it exists, this will
be included in every uevent relating to the filesystem.
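As a closing illustration, a minimal, hypothetical user space listener
for these uevents; it uses the standard NETLINK_KOBJECT_UEVENT socket
that udev also listens on, and simply prints the action and the
environment variables described above::

  /* Hypothetical sketch: print GFS2 uevents as they arrive */
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <linux/netlink.h>

  int main(void)
  {
          struct sockaddr_nl addr = {
                  .nl_family = AF_NETLINK,
                  .nl_groups = 1,         /* kernel uevent multicast group */
          };
          char buf[4096];
          int sock;

          sock = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
          if (sock < 0 || bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                  perror("netlink");
                  return 1;
          }

          for (;;) {
                  ssize_t len = recv(sock, buf, sizeof(buf) - 1, 0);
                  if (len <= 0)
                          break;
                  buf[len] = '\0';

                  /* GFS2 kobjects live under /fs/gfs2/<locktable>, so the
                   * leading "ACTION@devpath" string identifies our events */
                  if (!strstr(buf, "/fs/gfs2/"))
                          continue;

                  /* The rest of the payload is NUL-separated KEY=value
                   * strings, e.g. JID=0, RECOVERY=Done, LOCKPROTO=... */
                  for (char *p = buf; p < buf + len; p += strlen(p) + 1)
                          printf("%s\n", p);
                  printf("--\n");
          }
          return 0;
  }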
