author | Noah Misch <noah@leadboat.com> | 2020-03-22 09:24:09 -0700 |
---|---|---|
committer | Noah Misch <noah@leadboat.com> | 2020-03-22 09:24:15 -0700 |
commit | 348f15e22e9456bf53bba1a1ca4e2279fb3e507a (patch) | |
tree | db1e69dcebf5fc5131b908b3abfb386a9f768bd3 /src/backend/access/transam | |
parent | a653bd8aa76e7c734388cc39663f45f9c8483b0f (diff) |
Revert "Skip WAL for new relfilenodes, under wal_level=minimal."
This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f. Per
numerous buildfarm members, it was incompatible with parallel query, and
a test case assumed LP64. Back-patch to 9.5 (all supported versions).
Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
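
The commit message mentions a test case that assumed LP64, the data model in which `long` and pointers are 64 bits wide. 64-bit Windows uses LLP64, where `long` stays at 32 bits, so code or expected test output that depends on the width of `long` can differ there. The message does not say which test was affected or how. The following is only a minimal sketch of the general pitfall, and both helper names are hypothetical rather than anything from the PostgreSQL tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the PostgreSQL tree: reports whether the
 * compiler's data model is LP64 (64-bit long and 64-bit pointers). */
static int data_model_is_lp64(void)
{
    return sizeof(long) == 8 && sizeof(void *) == 8;
}

/* Hypothetical helper: converts KiB to bytes using a fixed-width type, so
 * the result does not depend on the width of long. */
static int64_t kib_to_bytes(int64_t kib)
{
    assert(kib >= 0 && kib <= INT64_MAX / 1024);
    return kib * INT64_C(1024);
}
```

Writing the arithmetic with `int64_t` instead of `long` gives the same result on LP64 and LLP64 platforms, which is the kind of change that would remove such an assumption.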
Diffstat (limited to 'src/backend/access/transam')
-rw-r--r-- | src/backend/access/transam/README | 45 |
-rw-r--r-- | src/backend/access/transam/xact.c | 15 |
-rw-r--r-- | src/backend/access/transam/xlogutils.c | 18 |
3 files changed, 14 insertions, 64 deletions
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 9973c4464ee..4ae4715339e 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,38 +717,6 @@ then restart recovery. This is part of the reason for not writing a WAL entry
 until we've successfully done the original action.


-Skipping WAL for New RelFileNode
---------------------------------
-
-Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
-would unlink, in-tree access methods write no WAL for that change. Code that
-writes WAL without calling RelationNeedsWAL() must check for this case. This
-skipping is mandatory. If a WAL-writing change preceded a WAL-skipping change
-for the same block, REDO could overwrite the WAL-skipping change. If a
-WAL-writing change followed a WAL-skipping change for the same block, a
-related problem would arise. When a WAL record contains no full-page image,
-REDO expects the page to match its contents from just before record insertion.
-A WAL-skipping change may not reach disk at all, violating REDO's expectation
-under full_page_writes=off. For any access method, CommitTransaction() writes
-and fsyncs affected blocks before recording the commit.
-
-Prefer to do the same in future access methods. However, two other approaches
-can work. First, an access method can irreversibly transition a given fork
-from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
-smgrimmedsync(). Second, an access method can opt to write WAL
-unconditionally for permanent relations. Under these approaches, the access
-method callbacks must not call functions that react to RelationNeedsWAL().
-
-This applies only to WAL records whose replay would modify bytes stored in the
-new relfilenode. It does not apply to other records about the relfilenode,
-such as XLOG_SMGR_CREATE. Because it operates at the level of individual
-relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
-Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
-ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while
-the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table
-to skip WAL, but that won't affect its indexes.
-
-
 Asynchronous Commit
 -------------------

@@ -852,12 +820,13 @@ Changes to a temp table are not WAL-logged, hence could reach disk in advance
 of T1's commit, but we don't care since temp table contents don't survive
 crashes anyway.

-Database writes that skip WAL for new relfilenodes are also safe. In these
-cases it's entirely possible for the data to reach disk before T1's commit,
-because T1 will fsync it down to disk without any sort of interlock. However,
-all these paths are designed to write data that no other transaction can see
-until after T1 commits. The situation is thus not different from ordinary
-WAL-logged updates.
+Database writes made via any of the paths we have introduced to avoid WAL
+overhead for bulk updates are also safe. In these cases it's entirely
+possible for the data to reach disk before T1's commit, because T1 will
+fsync it down to disk without any sort of interlock, as soon as it finishes
+the bulk update. However, all these paths are designed to write data that
+no other transaction can see until after T1 commits. The situation is thus
+not different from ordinary WAL-logged updates.

 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 14fe4ec36d8..02aadc0ed4e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2032,13 +2032,6 @@ CommitTransaction(void)
 	 */
 	PreCommit_on_commit_actions();

-	/*
-	 * Synchronize files that are created and not WAL-logged during this
-	 * transaction. This must happen before AtEOXact_RelationMap(), so that we
-	 * don't see committed-but-broken files after a crash.
-	 */
-	smgrDoPendingSyncs(true);
-
 	/* close large objects before lower-level cleanup */
 	AtEOXact_LargeObject(true);

@@ -2267,13 +2260,6 @@ PrepareTransaction(void)
 	 */
 	PreCommit_on_commit_actions();

-	/*
-	 * Synchronize files that are created and not WAL-logged during this
-	 * transaction. This must happen before EndPrepare(), so that we don't see
-	 * committed-but-broken files after a crash and COMMIT PREPARED.
-	 */
-	smgrDoPendingSyncs(true);
-
 	/* close large objects before lower-level cleanup */
 	AtEOXact_LargeObject(true);

@@ -2574,7 +2560,6 @@ AbortTransaction(void)
 	 */
 	AfterTriggerEndXact(false); /* 'false' means it's abort */
 	AtAbort_Portals();
-	smgrDoPendingSyncs(false);
 	AtEOXact_LargeObject(false);
 	AtAbort_Notify();
 	AtEOXact_RelationMap(false);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 8aee623cb41..a82ccd45697 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -542,8 +542,6 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
- * This is also used for syncing WAL-skipped files.
- *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,20 +550,18 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
 	FakeRelCacheEntry fakeentry;
 	Relation rel;

+	Assert(InRecovery);
+
 	/* Allocate the Relation struct and all related space in one block. */
 	fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
 	rel = (Relation) fakeentry;

 	rel->rd_rel = &fakeentry->pgc;
 	rel->rd_node = rnode;
-
-	/*
-	 * We will never be working with temp rels during recovery or while
-	 * syncing WAL-skipped files.
-	 */
+	/* We will never be working with temp rels during recovery */
 	rel->rd_backend = InvalidBackendId;

-	/* It must be a permanent table here */
+	/* It must be a permanent table if we're in recovery. */
 	rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;

 	/* We don't know the name of the relation; use relfilenode instead */
@@ -574,9 +570,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
 	/*
 	 * We set up the lockRelId in case anything tries to lock the dummy
 	 * relation. Note that this is fairly bogus since relNode may be
-	 * different from the relation's OID. It shouldn't really matter though.
-	 * In recovery, we are running by ourselves and can't have any lock
-	 * conflicts. While syncing, we already hold AccessExclusiveLock.
+	 * different from the relation's OID. It shouldn't really matter though,
+	 * since we are presumably running by ourselves and can't have any lock
+	 * conflicts ...
 	 */
 	rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
 	rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
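
The README section removed by this revert states that if a WAL-writing change preceded a WAL-skipping change for the same block, REDO could overwrite the WAL-skipping change. The sketch below is a toy model of that hazard under simplifying assumptions: one page, one WAL record holding a full-page image, and recovery that simply replays the image. All `Toy*` and `toy_*` names are hypothetical and none of this is PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy model: one on-disk page plus one WAL record with a full-page image. */
typedef struct ToyWalRecord
{
    bool valid;
    char image[TOY_PAGE_SIZE];
} ToyWalRecord;

typedef struct ToyStorage
{
    char disk_page[TOY_PAGE_SIZE];
    ToyWalRecord wal;
} ToyStorage;

/* WAL-writing change: log the new page image, then update the page. */
static void toy_change_with_wal(ToyStorage *s, const char *contents)
{
    assert(s != NULL && contents != NULL);
    snprintf(s->wal.image, TOY_PAGE_SIZE, "%s", contents);
    s->wal.valid = true;
    snprintf(s->disk_page, TOY_PAGE_SIZE, "%s", contents);
}

/* WAL-skipping change: update the page directly and write no WAL. */
static void toy_change_without_wal(ToyStorage *s, const char *contents)
{
    assert(s != NULL && contents != NULL);
    snprintf(s->disk_page, TOY_PAGE_SIZE, "%s", contents);
}

/* Crash recovery: replay the logged image over whatever is on disk. */
static void toy_redo(ToyStorage *s)
{
    assert(s != NULL);
    if (s->wal.valid)
        memcpy(s->disk_page, s->wal.image, TOY_PAGE_SIZE);
}
```

Calling `toy_change_with_wal(&s, "v1")`, then `toy_change_without_wal(&s, "v2")`, then `toy_redo(&s)` leaves the page holding `"v1"`: replay has overwritten the unlogged change, which is why the removed text called the skipping mandatory once a relfilenode is in that state.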