author | Simon Riggs <simon@2ndQuadrant.com> | 2016-01-09 10:10:08 +0000 |
---|---|---|
committer | Simon Riggs <simon@2ndQuadrant.com> | 2016-01-09 10:10:08 +0000 |
commit | 687f2cd7a0150647794efe432ae0397cb41b60ff (patch) | |
tree | 0d6813a68bf31028de8170da5d740a397667cd49 /src/backend/access/nbtree/nbtxlog.c | |
parent | 463172116634423f8708ad9d7afb0f759a40cf2c (diff) |
Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.
This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.
The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. A later patch will
address this.
No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
Diffstat (limited to 'src/backend/access/nbtree/nbtxlog.c')
-rw-r--r-- | src/backend/access/nbtree/nbtxlog.c | 18 |
1 files changed, 16 insertions, 2 deletions
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index bba4840da05..0d094ca7faa 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -391,6 +391,19 @@ btree_xlog_vacuum(XLogReaderState *record)
 	BTPageOpaque opaque;
 
 	/*
+	 * If we are running non-MVCC scans using this index we need to do some
+	 * additional work to ensure correctness, which is known as a "pin scan"
+	 * described in more detail in next paragraphs. We used to do the extra
+	 * work in all cases, whereas we now avoid that work except when the index
+	 * is a toast index, since toast scans aren't fully MVCC compliant.
+	 * If lastBlockVacuumed is set to InvalidBlockNumber then we skip the
+	 * additional work required for the pin scan.
+	 *
+	 * Avoiding this extra work is important since it requires us to touch
+	 * every page in the index, so is an O(N) operation. Worse, it is an
+	 * operation performed in the foreground during redo, so it delays
+	 * replication directly.
+	 *
 	 * If queries might be active then we need to ensure every leaf page is
 	 * unpinned between the lastBlockVacuumed and the current block, if there
 	 * are any. This prevents replay of the VACUUM from reaching the stage of
@@ -412,7 +425,7 @@ btree_xlog_vacuum(XLogReaderState *record)
 	 * isn't yet consistent; so we need not fear reading still-corrupt blocks
 	 * here during crash recovery.
 	 */
-	if (HotStandbyActiveInReplay())
+	if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
 	{
 		RelFileNode thisrnode;
 		BlockNumber thisblkno;
@@ -433,7 +446,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 			 * XXX we don't actually need to read the block, we just need to
 			 * confirm it is unpinned. If we had a special call into the
 			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers we confirm it as unpinned.
+			 * not in shared_buffers we confirm it as unpinned. Optimizing
+			 * this is now moot, since in most cases we avoid the scan.
 			 */
 			buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
 											RBM_NORMAL_NO_LOG);
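
The effect of the new condition in btree_xlog_vacuum() can be modelled in isolation. The following is a minimal standalone sketch, not code from the PostgreSQL tree: `pin_scan_page_count` is a hypothetical helper, the `BlockNumber` definitions are local stand-ins written to mirror PostgreSQL's block.h, and the sketch assumes the pin scan covers the blocks strictly between lastBlockVacuumed and the current block, as the comment in the patch describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring PostgreSQL's block.h definitions. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define BlockNumberIsValid(blkno) ((BlockNumber) (blkno) != InvalidBlockNumber)

/*
 * Hypothetical helper (not part of PostgreSQL): how many index pages replay
 * would have to visit in the pin scan for one XLOG_BTREE_VACUUM record.
 */
uint32_t
pin_scan_page_count(bool hot_standby_active,
                    BlockNumber last_block_vacuumed,
                    BlockNumber this_blkno)
{
    /*
     * Gate added by the patch: outside Hot Standby, or when
     * lastBlockVacuumed is InvalidBlockNumber, no pin scan is done.
     */
    if (!hot_standby_active || !BlockNumberIsValid(last_block_vacuumed))
        return 0;

    /* Assumed range: blocks strictly between the two block numbers. */
    if (this_blkno <= last_block_vacuumed + 1)
        return 0;
    return this_blkno - last_block_vacuumed - 1;
}
```

Under these assumptions, a record with lastBlockVacuumed = 10 and a current block of 15 costs four page visits (blocks 11 to 14), while the same record with lastBlockVacuumed = InvalidBlockNumber costs none. This is the O(N) work that the patch removes from the redo path.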