path: root/src/backend/access/transam
2024-11-11  Fix improper interactions between session_authorization and role.  (Tom Lane)

The SQL spec mandates that SET SESSION AUTHORIZATION implies SET ROLE NONE. We tried to implement that within the lowest-level functions that manipulate these settings, but that was a bad idea. In particular, guc.c assumes that it doesn't matter in what order it applies GUC variable updates, but that was not the case for these two variables. This problem, compounded by some hackish attempts to work around it, led to some security-grade issues:

* Rolling back a transaction that had done SET SESSION AUTHORIZATION would revert to SET ROLE NONE, even if that had not been the previous state, so that the effective user ID might now be different from what it had been.

* The same for SET SESSION AUTHORIZATION in a function SET clause.

* If a parallel worker inspected current_setting('role'), it saw "none" even when it should see something else.

Also, although the parallel worker startup code intended to cope with the current role's pg_authid row having disappeared, its implementation of that was incomplete so it would still fail.

Fix by fully separating the miscinit.c functions that assign session_authorization from those that assign role. To implement the spec's requirement, teach set_config_option itself to perform "SET ROLE NONE" when it sets session_authorization. (This is undoubtedly ugly, but the alternatives seem worse. In particular, there's no way to do it within assign_session_authorization without incompatible changes in the API for GUC assign hooks.) Also, improve ParallelWorkerMain to directly set all the relevant user-ID variables instead of relying on some of them to get set indirectly. That allows us to survive not finding the pg_authid row during worker startup.

In v16 and earlier, this includes back-patching 9987a7bf3 which fixed a violation of GUC coding rules: SetSessionAuthorization is not an appropriate place to be throwing errors from.

Security: CVE-2024-10978
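The first of those issues is easiest to picture with a short SQL sequence; this is only an illustrative sketch, and the role names alice and bob are hypothetical placeholders for pre-existing roles:

    -- connected as a superuser
    SET ROLE bob;                      -- effective role is now bob
    BEGIN;
    SET SESSION AUTHORIZATION alice;   -- per the SQL spec, also implies SET ROLE NONE
    ROLLBACK;
    SHOW role;                         -- before the fix, this could report "none"
                                       -- instead of the pre-transaction value (bob)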
2024-11-08  Improve fix for not entering parallel mode when holding interrupts.  (Tom Lane)

Commit ac04aa84a put the shutoff for this into the planner, which is not ideal because it doesn't prevent us from re-using a previously made parallel plan. Revert the planner change and instead put the shutoff into InitializeParallelDSM, modeling it on the existing code there for recovering from failure to allocate a DSM segment.

However, that code path is mostly untested, and testing a bit harder showed there's at least one bug: ExecHashJoinReInitializeDSM is not prepared for us to have skipped doing parallel DSM setup. I also thought the Assert in ReinitializeParallelWorkers is pretty ill-advised, and replaced it with a silent Min() operation.

The existing test case added by ac04aa84a serves fine to test this version of the fix, so no change needed there.

Patch by me, but thanks to Noah Misch for the core idea that we could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED. Back-patch to v12, as ac04aa84a was.

Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
2024-11-02Revert "For inplace update, send nontransactional invalidations."Noah Misch
This reverts commit 95c5acb3fc261067ab65ddc0b2dca8e162f09442 (v17) and counterparts in each other non-master branch. If released, that commit would have caused a worst-in-years minor release regression, via undetected LWLock self-deadlock. This commit and its self-deadlock fix warrant more bake time in the master branch. Reported by Alexander Lakhin. Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
2024-10-25  For inplace update, send nontransactional invalidations.  (Noah Misch)

The inplace update survives ROLLBACK. The inval didn't, so another backend's DDL could then update the row without incorporating the inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER TABLE resulted in a table with an index, yet relhasindex=f. That is a source of index corruption. Back-patch to v12 (all supported versions).

The back branch versions don't change WAL, because those branches just added end-of-recovery SIResetAll(). All branches change the ABI of extern function PrepareToInvalidateCacheTuple(). No PGXN extension calls that, and there's no apparent use case in extensions.

Reviewed by Nitin Motiani and (in earlier versions) Andres Freund.

Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
2024-10-25  At end of recovery, reset all sinval-managed caches.  (Noah Misch)

An inplace update's invalidation messages are part of its transaction's commit record. However, the update survives even if its transaction aborts or we stop recovery before replaying its transaction commit. After recovery, a backend that started in recovery could update the row without incorporating the inplace update. That could result in a table with an index, yet relhasindex=f. That is a source of index corruption.

This bulk invalidation avoids the functional consequences. A future change can fix the !RecoveryInProgress() scenario without changing the WAL format. Back-patch to v17 - v12 (all supported versions). v18 will instead add invalidations to WAL.

Discussion: https://postgr.es/m/20240618152349.7f.nmisch@google.com
2024-10-01  Fix race condition in COMMIT PREPARED causing orphaned 2PC files  (Michael Paquier)

COMMIT PREPARED removes on-disk 2PC files near its end, but the state used to check whether a file is on disk was read from shared memory while not holding the two-phase state lock. Because of that, there was a small window where a second backend doing a PREPARE TRANSACTION could reuse the GlobalTransaction put back into the 2PC free list by the COMMIT PREPARED, overwriting the "ondisk" flag read afterwards by the COMMIT PREPARED to decide if its on-disk two-phase state file should be removed, preventing the file deletion.

This commit fixes this issue so that the "ondisk" flag in the GlobalTransaction is read while holding the two-phase state lock, not from shared memory after its entry has been added to the free list.

Orphaned two-phase state files flushed to disk after a checkpoint are discarded at the beginning of recovery. However, a truncation of pg_xact/ would make the startup process issue a FATAL when it cannot read the SLRU page holding the state of the transaction whose 2PC file was orphaned, which is a necessary step to decide if the 2PC file should be removed or not. Manually removing the file would be necessary in this case.

Issue introduced by effe7d9552dd, so backpatch all the way down. Mea culpa.

Author: wuchengwen
Discussion: https://postgr.es/m/tencent_A7F059B5136A359625C7B2E4A386B3C3F007@qq.com
Backpatch-through: 12
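For context, the two-phase commands whose interleaving triggers the race look like this. This is only a sketch with hypothetical transaction identifiers and a hypothetical accounts table; the race itself depends on internal timing between the two backends:

    -- backend 1 (needs max_prepared_transactions > 0)
    BEGIN;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    PREPARE TRANSACTION 'tx_a';
    COMMIT PREPARED 'tx_a';      -- removes the on-disk 2PC file near its end

    -- backend 2, running concurrently
    BEGIN;
    UPDATE accounts SET balance = balance + 10 WHERE id = 2;
    PREPARE TRANSACTION 'tx_b';  -- may reuse the GlobalTransaction freed by backend 1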
2024-07-31Revert "Allow parallel workers to cope with a newly-created session user ID."Tom Lane
This reverts commit 48536305370acd75c6264fcefec7fae7af8c5440. Some buildfarm animals are failing with "cannot change "client_encoding" during a parallel operation". It looks like assign_client_encoding is unhappy at being asked to roll back a client_encoding setting after a parallel worker encounters a failure. There must be more to it though: why didn't I see this during local testing? In any case, it's clear that moving the RestoreGUCState() call is not as side-effect-free as I thought. Given that the bug f5f30c22e intended to fix has gone unreported for years, it's not something that's urgent to fix; I'm not willing to risk messing with it further with only days to our next release wrap.
2024-07-31  Allow parallel workers to cope with a newly-created session user ID.  (Tom Lane)

Parallel workers failed after a sequence like BEGIN; CREATE USER foo; SET SESSION AUTHORIZATION foo; because check_session_authorization could not see the uncommitted pg_authid row for "foo". This is because we ran RestoreGUCState() in a separate transaction using an ordinary just-created snapshot. The same disease afflicts any other GUC that requires catalog lookups and isn't forgiving about the lookups failing.

To fix, postpone RestoreGUCState() into the worker's main transaction after we've set up a snapshot duplicating the leader's. This affects check_transaction_isolation and check_transaction_deferrable, which think they should only run during transaction start. Make them act like check_transaction_read_only, which already knows it should silently accept the value when InitializingParallelWorker.

Per bug #18545 from Andrey Rachitskiy. Back-patch to all supported branches, because this has been wrong for a while.

Discussion: https://postgr.es/m/18545-feba138862f19aaa@postgresql.org
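Spelled out, the failing sequence from the report looks roughly like this. The table name big_table and the cost settings used to encourage a parallel plan are hypothetical additions; the first three statements are taken from the report:

    BEGIN;
    CREATE USER foo;
    SET SESSION AUTHORIZATION foo;
    -- any query that launches parallel workers could then fail, because the
    -- workers could not see the uncommitted pg_authid row for "foo"
    SET LOCAL parallel_setup_cost = 0;
    SET LOCAL parallel_tuple_cost = 0;
    SELECT count(*) FROM big_table;
    COMMIT;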
2024-06-27  Fix MVCC bug with prepared xact with subxacts on standby  (Heikki Linnakangas)

We did not recover the subtransaction IDs of prepared transactions when starting a hot standby from a shutdown checkpoint. As a result, such subtransactions were considered as aborted, rather than in-progress. That would lead to hint bits being set incorrectly, and the subtransactions suddenly becoming visible to old snapshots when the prepared transaction was committed.

To fix, update pg_subtrans with prepared transactions' subxids when starting hot standby from a shutdown checkpoint. The snapshots taken from that state need to be marked as "suboverflowed", so that we also check the pg_subtrans.

Backport to all supported versions.

Discussion: https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0@iki.fi
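A prepared transaction with subtransactions is simply one that used savepoints before being prepared; a minimal sketch of the state involved (the table and identifiers are hypothetical):

    -- on the primary (needs max_prepared_transactions > 0)
    BEGIN;
    INSERT INTO t VALUES (1);
    SAVEPOINT s1;                  -- starts a subtransaction
    INSERT INTO t VALUES (2);
    PREPARE TRANSACTION 'p1';

    -- a standby started from a later shutdown checkpoint must see both rows as
    -- in-progress; before the fix, the subtransaction could be treated as aborted
    -- until COMMIT PREPARED 'p1' suddenly made its row visible to old snapshots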
2024-06-26  Fix bugs in MultiXact truncation  (Heikki Linnakangas)

1. TruncateMultiXact() performs the SLRU truncations in a critical section. Deleting the SLRU segments calls ForwardSyncRequest(), which will try to compact the request queue if it's full (CompactCheckpointerRequestQueue()). That in turn allocates memory, which is not allowed in a critical section. Backtrace:

    TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
    postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
    postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
    postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
    postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
    postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
    postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
    postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
    postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
    postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
    postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
    postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
    postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
    postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
    postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
    /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
    postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]

To fix, bail out in CompactCheckpointerRequestQueue() without doing anything, if it's called in a critical section. That covers the above call path, as well as any other similar cases where RegisterSyncRequest might be called in a critical section.

2. After fixing that, another problem became apparent: the autovacuum process doing that truncation can deadlock with the checkpointer process. TruncateMultiXact() sets "MyProc->delayChkptFlags |= DELAY_CHKPT_START". If the sync request queue is full and cannot be compacted, the process will repeatedly sleep and retry, until there is room in the queue. However, if the checkpointer is trying to start a checkpoint at the same time, and is waiting for the DELAY_CHKPT_START processes to finish, the queue will never shrink.

More concretely, the autovacuum process is stuck here:

    #0  0x00007fc934926dc3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x000056220b24348b in WaitEventSetWaitBlock (set=0x56220c2e4b50, occurred_events=0x7ffe7856d040, nevents=1, cur_timeout=<optimized out>) at ../src/backend/storage/ipc/latch.c:1570
    #2  WaitEventSetWait (set=0x56220c2e4b50, timeout=timeout@entry=10, occurred_events=<optimized out>, occurred_events@entry=0x7ffe7856d040, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:1516
    #3  0x000056220b243224 in WaitLatch (latch=<optimized out>, latch@entry=0x0, wakeEvents=wakeEvents@entry=40, timeout=timeout@entry=10, wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:538
    #4  0x000056220b26cf46 in RegisterSyncRequest (ftag=ftag@entry=0x7ffe7856d0a0, type=type@entry=SYNC_FORGET_REQUEST, retryOnError=true) at ../src/backend/storage/sync/sync.c:614
    #5  0x000056220af9db0a in SlruInternalDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1495
    #6  0x000056220af9dab1 in SlruDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1566
    #7  0x000056220af98e1b in PerformMembersTruncation (oldestOffset=<optimized out>, newOldestOffset=<optimized out>) at ../src/backend/access/transam/multixact.c:3006
    #8  TruncateMultiXact (newOldestMulti=newOldestMulti@entry=3221225472, newOldestMultiDB=newOldestMultiDB@entry=4) at ../src/backend/access/transam/multixact.c:3201
    #9  0x000056220b098303 in vac_truncate_clog (frozenXID=749, minMulti=<optimized out>, lastSaneFrozenXid=749, lastSaneMinMulti=3221225472) at ../src/backend/commands/vacuum.c:1917
    #10 vac_update_datfrozenxid () at ../src/backend/commands/vacuum.c:1760
    #11 0x000056220b1c3f76 in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2550
    #12 0x000056220b1c2c3d in AutoVacWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/autovacuum.c:1569

and the checkpointer is stuck here:

    #0  0x00007fc9348ebf93 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x00007fc9348fe353 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    #2  0x000056220b40ecb4 in pg_usleep (microsec=microsec@entry=10000) at ../src/port/pgsleep.c:50
    #3  0x000056220afb43c3 in CreateCheckPoint (flags=flags@entry=108) at ../src/backend/access/transam/xlog.c:7098
    #4  0x000056220b1c6e86 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:464

To fix, add AbsorbSyncRequests() to the loops where the checkpointer waits for DELAY_CHKPT_START or DELAY_CHKPT_COMPLETE operations to finish.

Backpatch to v14. Before that, SLRU deletion didn't call RegisterSyncRequest, which avoided this failure. I'm not sure if there are other similar scenarios on older versions, but we haven't had any such reports.

Discussion: https://www.postgresql.org/message-id/ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi
2024-06-13  Clamp result of MultiXactMemberFreezeThreshold  (Heikki Linnakangas)

The purpose of the function is to reduce the effective autovacuum_multixact_freeze_max_age if the multixact members SLRU is approaching wraparound, to make multixid freezing more aggressive. The returned value should therefore never be greater than plain autovacuum_multixact_freeze_max_age.

Reviewed-by: Robert Haas
Discussion: https://www.postgresql.org/message-id/85fb354c-f89f-4d47-b3a2-3cbd461c90a3@iki.fi
Backpatch-through: 12, all supported versions
2024-01-29  Add more LOG messages when starting and ending recovery from a backup  (Michael Paquier)

Three LOG messages are added in the recovery code paths, providing information that can be useful to track corruption issues depending on the state of the cluster, telling that:

- Recovery has started from a backup_label.
- Recovery is restarting from a backup start LSN, without a backup_label.
- Recovery has completed from a backup.

This was originally applied on HEAD as of 1d35f705e191, and there is consensus that this can be useful for older versions. This applies cleanly down to 15, so do it down to this version for now (older versions have heavily refactored the WAL recovery paths, making the change less straightforward to do).

Author: Andres Freund
Reviewed-by: David Steele, Laurenz Albe, Michael Paquier
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp2pp@awork3.anarazel.de
Backpatch-through: 15
2024-01-18  lwlock: Fix quadratic behavior with very long wait lists  (Andres Freund)

Until now LWLockDequeueSelf() sequentially searched the list of waiters to see if the current proc is still on the list of waiters, or has already been removed. In extreme workloads, where the wait lists are very long, this leads to quadratic behavior: #backends iterating over a list #backends long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in the first place also increases with the increased length of the wait queue, as it becomes more likely that a lock is released while waiting for the wait list lock, which is held for longer during lock release.

Due to the exponential back-off in perform_spin_delay() this is surprisingly hard to detect. We should make that easier, e.g. by adding a wait event around the pg_usleep() - but that's a separate patch.

The fix is simple - track whether a proc is currently waiting in the wait list or already removed but waiting to be woken up in PGPROC->lwWaiting.

In some workloads with a lot of clients contending for a small number of lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.

This has been originally fixed for 16~ with a4adc31f6902 without a backpatch, and we have heard complaints from users impacted by this quadratic behavior in older versions as well.

Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
Backpatch-through: 12
2023-12-12  Prevent tuples from being marked as dead in subtransactions on standbys  (Michael Paquier)

Dead tuples are ignored and are not marked as dead during recovery, as it can lead to MVCC issues on a standby because its xmin may not match with the primary. This information is tracked by a field called "xactStartedInRecovery" in the transaction state data, switched on when starting a transaction in recovery.

Unfortunately, this information was not correctly tracked when starting a subtransaction, because the transaction state used for the subtransaction did not update "xactStartedInRecovery" based on the state of its parent. This would cause index scans done in subtransactions to return inconsistent data, depending on how the xmin of the primary and/or the standby evolved.

This is broken since the introduction of hot standby in efc16ea52067, so backpatch all the way down.

Author: Fei Changhong
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/tencent_C4D907A5093C071A029712E73B43C6512706@qq.com
Backpatch-through: 12
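The affected pattern is an index scan run inside a savepoint on a hot standby; a minimal sketch only (the table t and its index are hypothetical, and actually reproducing the inconsistency depends on how the xmin of the primary and standby evolved):

    -- on a hot standby
    BEGIN;
    SAVEPOINT s1;                     -- subtransaction missed the xactStartedInRecovery flag
    SELECT * FROM t WHERE id = 42;    -- an index scan here could return inconsistent data
    ROLLBACK;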
2023-12-08  Fix potential pointer overflow in xlogreader.c.  (Thomas Munro)

While checking if a record could fit in the circular WAL decoding buffer, the coding from commit 3f1ce973 used arithmetic that could overflow. 64 bit systems were unaffected for various technical reasons, which probably explains the lack of problem reports. Likewise for 32 bit systems running known 32 bit kernels. The systems at risk of problems appear to be 32 bit processes running on 64 bit kernels, with unlucky placement in memory.

Per complaint from GCC -fsanitize=undefined -m32, while testing variations of 039_end_of_wal.pl. Back-patch to 15.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKH0oRPOX7DhiQ_b51sM8HqcPp2J3WA-Oen%3DdXog%2BAGGQ%40mail.gmail.com
2023-12-06  Fix compilation on Windows with WAL_DEBUG  (Michael Paquier)

This has been broken since b060dbe0001a, which reworked the callback mechanism of XLogReader, most likely unnoticed because any form of development involving WAL happens on platforms where this compiles fine.

Author: Bharath Rupireddy
Discussion: https://postgr.es/m/CALj2ACVF14WKQMFwcJ=3okVDhiXpuK5f7YdT+BdYXbbypMHqWA@mail.gmail.com
Backpatch-through: 13
2023-10-16  Move extra code out of the Pre/PostRestoreCommand() section.  (Nathan Bossart)

If SIGTERM is received within this section, the startup process will immediately proc_exit() in the signal handler, so it is inadvisable to include any more code than is required there (as such code is unlikely to be compatible with doing proc_exit() in a signal handler). This commit moves the code recently added to this section (see 1b06d7bac9 and 7fed801135) to outside of the section. This ensures that the startup process only calls proc_exit() in its SIGTERM handler for the duration of the system() call, which is how this code worked from v8.4 to v14.

Reported-by: Michael Paquier, Thomas Munro
Analyzed-by: Andres Freund
Suggested-by: Tom Lane
Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz
Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13
Backpatch-through: 15
2023-10-10  Fix bug in GenericXLogFinish().  (Jeff Davis)

Mark the buffers dirty before writing WAL.

Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi
Reviewed-by: Heikki Linnakangas
Backpatch-through: 11
2023-10-03  Fail hard on out-of-memory failures in xlogreader.c  (Michael Paquier)

This commit changes the WAL reader routines so that a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in xlogreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in xlogreader.c, relying on plain palloc() calls.

The previous behavior could make WAL replay finish earlier than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported.

This relies on bae868caf222, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that 3f1ce973467a has been using a plain palloc() in this code for some time now.

Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from 0ffe11abd3a0, where the record length retrieved from its header was not entirely trustable.

Reported-by: Ethan Mertz
Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz
Backpatch-through: 12
2023-09-28  Fix typo in src/backend/access/transam/README.  (Etsuro Fujita)
2023-09-26  Fix edge-case for xl_tot_len broken by bae868ca.  (Thomas Munro)

bae868ca removed a check that was still needed. If you had an xl_tot_len at the end of a page that was too small for a record header, but not big enough to span onto the next page, we'd immediately perform the CRC check using a bogus large length. Because of arbitrary coding differences between the CRC implementations on different platforms, nothing very bad happened on common modern systems. On systems using the _sb8.c fallback we could segfault.

Restore that check, add a new assertion and supply a test for that case. Back-patch to 12, like bae868ca.

Tested-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com
2023-09-23  Don't trust unvalidated xl_tot_len.  (Thomas Munro)

xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse.

Since 70b4f82a4b5, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail.

We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic.

The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged.

This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but that wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk.

With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work.

Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl.

Back-patch to 12. We decided not to put this change into the final release of 11.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Michael Paquier <michael@paquier.xyz>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code)
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
2023-09-21  Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions.  (Tom Lane)

In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate the current transaction's properties to the new transaction if there was any open subtransaction (unreleased savepoint). Instead, some previous transaction's properties would be restored. This is because the "if (s->chain)" check in CommitTransactionCommand examined the wrong instance of the "chain" flag and falsely concluded that it didn't need to save transaction properties.

Our regression tests would have noticed this, except they used identical transaction properties for multiple tests in a row, so that the faulty behavior was not distinguishable from correct behavior.

Commit 12d768e70 fixed the problem in v15 and later, but only rather accidentally, because I removed the "if (s->chain)" test to avoid a compiler warning, while not realizing that the warning was flagging a real bug.

In v14 and before, remove the if-test and save transaction properties unconditionally; just as in the newer branches, that's not expensive enough to justify thinking harder.

Add the comment and extra regression test to v15 and later to forestall any future recurrence, but there's no live bug in those branches.

Patch by me, per bug #18118 from Liu Xiang. Back-patch to v12 where the AND CHAIN feature was added.

Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org
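The broken case can be shown with a short sketch; before the fix in the older branches, the chained transaction could inherit some earlier transaction's properties instead of the current ones:

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SAVEPOINT sp;                  -- leave the savepoint unreleased
    COMMIT AND CHAIN;              -- should start a new transaction with the same properties
    SHOW transaction_isolation;    -- should report "repeatable read"
    COMMIT;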
2023-07-18  Fix indentation in twophase.c  (Michael Paquier)

This was missed in cb0cca1, and noticed before buildfarm member koel was able to complain while poking at a different patch. Like the other commit, backpatch all the way down to limit the odds of merge conflicts.

Backpatch-through: 11
2023-07-18  Fix recovery of 2PC transaction during crash recovery  (Michael Paquier)

A crash in the middle of a checkpoint with some two-phase state data already flushed to disk by this checkpoint could cause a follow-up crash recovery to recover the same transaction twice, once from what has been found in pg_twophase/ at the beginning of recovery and a second time when replaying its corresponding record.

This would lead to FATAL failures in the startup process during recovery, where the same transaction would have a state recovered twice instead of once:

    LOG:  recovering prepared transaction 731 from shared memory
    LOG:  recovering prepared transaction 731 from shared memory
    FATAL:  lock ExclusiveLock on object 731/0/0 is already held

This issue is fixed by skipping the addition of any 2PC state coming from a record whose equivalent 2PC state file has already been loaded in TwoPhaseState at the beginning of recovery by restoreTwoPhaseData(), which is OK as long as the system has not reached a consistent state.

The timing required to reach this messed-up recovery state is very racy, and it would be very unlikely to happen. The thread that has reported the issue has demonstrated the bug using injection points to force a PANIC in the middle of a checkpoint.

Issue introduced in 728bd99, so backpatch all the way down.

Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: Michael Paquier
Discussion: https://postgr.es/m/109e6994-b971-48cb-84f6-829646f18b4c.mengjuan.cmj@alibaba-inc.com
Backpatch-through: 11
2023-07-03Silence "missing contrecord" error.Thomas Munro
Commit dd38ff28ad added a new error message "missing contrecord" when we fail to reassemble a record. Unfortunately that caused noisy messages to be logged by pg_waldump at end of segment, and by walsender when asked to shut down on a segment boundary. Remove the new error message, so that this condition signals end-of- WAL without a message. It's arguably a reportable condition that should not be silenced while performing crash recovery, but fixing that without introducing noise in the other cases will require more research. Back-patch to 15. Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/6a1df56e-4656-b3ce-4b7a-a9cb41df8189%40enterprisedb.com
2023-06-20  Enable archiving in recovery TAP test 009_twophase.pl  (Michael Paquier)

This is a follow-up of f663b00, which has been committed to v13 and v14, tweaking the TAP test for two-phase transactions so that it provides coverage for the bug that has been fixed. This change is done in its own commit for clarity, as v15 and HEAD did not show the problematic behavior, but still lacked coverage for it.

While on it, this adds a comment about the dependency of the last partial segment rename and RecoverPreparedTransactions() at the end of recovery, as that can be easy to miss.

Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
Backpatch-through: 13
2023-06-06  Initialize 'recordXtime' to silence compiler warning.  (Heikki Linnakangas)

In reality, recordXtime will always be set by the getRecordTimestamp call, but the compiler doesn't necessarily see that.

Back-patch to all supported versions.

Author: Tristan Partin
Discussion: https://www.postgresql.org/message-id/CT5MN8E11U0M.1NYNCHXYUHY41@gonk
2023-04-27  Prevent underflow in KeepLogSeg().  (Nathan Bossart)

The call to XLogGetReplicationSlotMinimumLSN() might return a greater LSN than the one given to the function. Subsequent segment number calculations might then underflow, which could result in unexpected behavior when removing or recycling WAL files. This was introduced with max_slot_wal_keep_size in c655077639. To fix, skip the block of code for replication slots if the LSN is greater.

Reported-by: Xu Xingwang
Author: Kyotaro Horiguchi
Reviewed-by: Junwang Zhao
Discussion: https://postgr.es/m/17903-4288d439dee856c6%40postgresql.org
Backpatch-through: 13
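For context, the slot-imposed WAL retention that max_slot_wal_keep_size bounds can be observed from SQL; this is only an illustrative monitoring query, not part of the fix:

    -- compare the current write position with what each slot still retains
    SELECT slot_name, restart_lsn, wal_status, safe_wal_size,
           pg_current_wal_lsn() AS current_lsn
    FROM pg_replication_slots;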
2023-04-26  Re-add tracking of wait event SLRUFlushSync  (Michael Paquier)

SLRUFlushSync was accidentally removed in dee663f, which moved the flush of the SLRU files to the checkpointer, so add it back. The issue has been noticed by Thomas when checking for orphaned wait events.

Author: Thomas Munro
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/CA+hUKGK6tqm59KuF1z+h5Y8fsWcu5v8+84kduSHwRzwjB2aa_A@mail.gmail.com
2023-04-17  Avoid trying to write an empty WAL record in log_newpage_range().  (Tom Lane)

If the last few pages in the specified range are empty (all zero), then log_newpage_range() could try to emit an empty WAL record containing no FPIs. This at least upsets an Assert in ReserveXLogInsertLocation, and might perhaps have bad real-world consequences in non-assert builds.

This has been broken since log_newpage_range() was introduced, but the case was hard if not impossible to hit before commit 3d6a98457 decided it was okay to leave VM and FSM pages intentionally zero. Nonetheless, it seems prudent to back-patch. log_newpage_range() was added in v12 but later back-patched, so this affects all supported branches.

Matthias van de Meent, per report from Justin Pryzby

Discussion: https://postgr.es/m/ZD1daibg4RF50IOj@telsasoft.com
2023-02-10Un-revert "Disable STARTUP_PROGRESS_TIMEOUT in standby mode."Robert Haas
This reverts commit 1eadfbdd7eb0679ba8d45787aa8b2f06e76de20a and thus reinstates commit 98e7234242a652497c99d4d0d6f2bf9a75d4e921. It's a better time to commit this now that the release is over. Discussion: http://postgr.es/m/3509384.1675878203@sss.pgh.pa.us
2023-02-06Revert "Disable STARTUP_PROGRESS_TIMEOUT in standby mode."Robert Haas
This reverts commit 98e7234242a652497c99d4d0d6f2bf9a75d4e921. I forgot that we're about to wrap a release, and this fix isn't critical enough to justify committing it right before we wrap a release. Discussion: http://postgr.es/m/2676424.1675700113@sss.pgh.pa.us
2023-02-06  Disable STARTUP_PROGRESS_TIMEOUT in standby mode.  (Robert Haas)

In standby mode, we don't actually report progress of recovery, but up until now, startup_progress_timeout_handler() nevertheless got called every log_startup_progress_interval seconds. That's an unnecessary expense, so avoid it.

Report by Thomas Munro. Patch by Bharath Rupireddy, reviewed by Simon Riggs, Thomas Munro, and me. Back-patch to v15, where the problem was introduced.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
2023-01-19  Log the correct ending timestamp in recovery_target_xid mode.  (Tom Lane)

When ending recovery based on recovery_target_xid matching with recovery_target_inclusive = off, we printed an incorrect timestamp (always 2000-01-01) in the "recovery stopping before ... transaction" log message. This is a consequence of sloppy refactoring in c945af80c: the code to fetch recordXtime out of the commit/abort record used to be executed unconditionally, but it was changed to get called only in the RECOVERY_TARGET_TIME case. We need only flip the order of operations to restore the intended behavior.

Per report from Torsten Förtsch. Back-patch to all supported branches.

Discussion: https://postgr.es/m/CAKkG4_kUevPqbmyOfLajx7opAQk6Cvwkvx0HRcFjSPfRPTXanA@mail.gmail.com
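The affected mode is point-in-time recovery driven by a transaction ID target; a sketch of the relevant settings as they would appear in postgresql.conf (the XID value is a placeholder):

    # stop recovery just before the transaction with this XID commits
    recovery_target_xid = '123456'
    recovery_target_inclusive = off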
2022-12-13  Rethink handling of [Prevent|Is]InTransactionBlock in pipeline mode.  (Tom Lane)

Commits f92944137 et al. made IsInTransactionBlock() set the XACT_FLAGS_NEEDIMMEDIATECOMMIT flag before returning "false", on the grounds that that kept its API promises equivalent to those of PreventInTransactionBlock(). This turns out to be a bad idea though, because it allows an ANALYZE in a pipelined series of commands to cause an immediate commit, which is unexpected.

Furthermore, if we return "false" then we have another issue, which is that ANALYZE will decide it's allowed to do internal commit-and-start-transaction sequences, thus possibly unexpectedly committing the effects of previous commands in the pipeline.

To fix the latter situation, invent another transaction state flag XACT_FLAGS_PIPELINING, which explicitly records the fact that we have executed some extended-protocol command and not yet seen a commit for it. Then, require that flag to not be set before allowing InTransactionBlock() to return "false". Having done that, we can remove its setting of NEEDIMMEDIATECOMMIT without fear of causing problems. This means that the API guarantees of IsInTransactionBlock now diverge from PreventInTransactionBlock, which is mildly annoying, but it seems OK given the very limited usage of IsInTransactionBlock. (In any case, a caller preferring the old behavior could always set NEEDIMMEDIATECOMMIT for itself.)

For consistency also require XACT_FLAGS_PIPELINING to not be set in PreventInTransactionBlock. This too is meant to prevent commands such as CREATE DATABASE from silently committing previous commands in a pipeline.

Per report from Peter Eisentraut. As before, back-patch to all supported branches (which sadly no longer includes v10).

Discussion: https://postgr.es/m/65a899dd-aebc-f667-1d0a-abb89ff3abf8@enterprisedb.com
2022-11-29  Improve heuristics for compressing the KnownAssignedXids array.  (Tom Lane)

Previously, we'd compress only when the active range of array entries reached Max(4 * PROCARRAY_MAXPROCS, 2 * pArray->numKnownAssignedXids). If max_connections is large, the first term could result in not compressing for a long time, resulting in much wastage of cycles in hot-standby backends scanning the array to take snapshots. Get rid of that term, and just bound it to 2 * pArray->numKnownAssignedXids.

That however creates the opposite risk, that we might spend too much effort compressing. Hence, consider compressing only once every 128 commit records. (This frequency was chosen by benchmarking. While we only tried one benchmark scenario, the results seem stable over a fairly wide range of frequencies.)

Also, force compression when processing RecoveryInfo WAL records (which should be infrequent); the old code could perform compression then, but would do so only after the same array-range check as for the transaction-commit path.

Also, opportunistically run compression if the startup process is about to wait for WAL, though not oftener than once a second. This should prevent cases where we waste lots of time by leaving the array not-compressed for long intervals due to low WAL traffic.

Lastly, add a simple check to keep us from uselessly compressing when the array storage is already compact.

Back-patch, as the performance problem is worse in pre-v14 branches than in HEAD.

Simon Riggs and Michail Nikolaev, with help from Tom Lane and Andres Freund.

Discussion: https://postgr.es/m/CALdSSPgahNUD_=pB_j=1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw@mail.gmail.com
2022-11-24  Make multixact error message more explicit  (Alvaro Herrera)

There are recent reports involving a very old error message that we have no history of hitting -- perhaps a recently introduced bug. Improve the error message in an attempt to improve our chances of investigating the bug.

Per reports from Dimos Stamatakis and Bob Krier. Backpatch to 11.

Discussion: https://postgr.es/m/CO2PR0801MB2310579F65529380A4E5EDC0E20A9@CO2PR0801MB2310.namprd08.prod.outlook.com
Discussion: https://postgr.es/m/17518-04e368df5ad7f2ee@postgresql.org
2022-11-10  Fix comment of SimpleLruInit() in slru.c  (Michael Paquier)

sync_handler was not mentioned in the comment block of the function. Oversight in dee663f.

Author: Aleksander Alekseev
Discussion: https://postgr.es/m/CAJ7c6TPUd9BwNY47TtMxaijLHSbyHNdhu=kvbGnvO_bi+oC6_Q@mail.gmail.com
Backpatch-through: 14
2022-11-09  Doc: add comments about PreventInTransactionBlock/IsInTransactionBlock.  (Tom Lane)

Add a little to the header comments for these functions to make it clearer what guarantees about commit behavior are provided to callers. (See commit f92944137 for context.)

Although this is only a comment change, it's really documentation aimed at authors of extensions, so it seems appropriate to back-patch.

Yugo Nagata and Tom Lane, per further discussion of bug #17434.

Discussion: https://postgr.es/m/17434-d9f7a064ce2a88a3@postgresql.org
2022-10-18  Rename SetSingleFuncCall() to InitMaterializedSRF()  (Michael Paquier)

Per discussion, the name of the existing routine used to initialize a SRF function with materialize mode is unpopular, so rename it. Equally, the flags of this function are renamed, as of:

- SRF_SINGLE_USE_EXPECTED -> MAT_SRF_USE_EXPECTED_DESC
- SRF_SINGLE_BLESS -> MAT_SRF_BLESS

The previous function and flags introduced in 9e98583 are kept around for compatibility purposes, so that any extension code already compiled with v15 continues to work as-is. The declarations introduced here for compatibility will be removed from HEAD in a follow-up commit.

The new names have been suggested by Andres Freund and Melanie Plageman.

Discussion: https://postgr.es/m/20221013194820.ciktb2sbbpw7cljm@awork3.anarazel.de
Backpatch-through: 15
2022-10-01  doc: Fix some grammar and typos  (Michael Paquier)

This fixes some areas related to logical replication and custom RMGRs.

Author: Ekaterina Kiryanova
Discussion: https://postgr.es/m/fa4773f1-1396-384a-bcd7-85b5e013f399@postgrespro.ru
Backpatch-through: 15
2022-09-24  Message style improvements  (Peter Eisentraut)
2022-09-22  Clear ps display of startup process at the end of recovery  (Michael Paquier)

If the ps display is not cleared at this point, the process could continue displaying "recovering NNN" even if handling end-of-recovery steps. df9274a has tackled that by providing some information with the end-of-recovery checkpoint, but 7ff23c6 has nullified the effect of the first commit.

Per a suggestion from Justin, just clear the ps display when we are done with recovery, so that no incorrect information is displayed. This may get extended in the future, but for now restore the pre-7ff23c6 behavior.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20220913223954.GU31833@telsasoft.com
Backpatch-through: 15
2022-09-20  Suppress variable-set-but-not-used warnings from clang 15.  (Tom Lane)

clang 15+ will issue a set-but-not-used warning when the only use of a variable is in autoincrements (e.g., "foo++;"). That's perfectly sensible, but it detects a few more cases that we'd not noticed before. Silence the warnings with our usual methods, such as PG_USED_FOR_ASSERTS_ONLY, or in one case by actually removing a useless variable.

One thing that we can't nicely get rid of is that with %pure-parser, Bison emits "yynerrs" as a local variable that falls foul of this warning. To silence those, I inserted "(void) yynerrs;" in the top-level productions of affected grammars.

Per recently-established project policy, this is a candidate for back-patching into out-of-support branches: it suppresses annoying compiler warnings but changes no behavior. Hence, back-patch to 9.5, which is as far as these patches go without issues. (A preliminary check shows that the prior branches need some other set-but-not-used cleanups too, so I'll leave them for another day.)

Discussion: https://postgr.es/m/514615.1663615243@sss.pgh.pa.us
2022-09-15  Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.  (Noah Misch)

After commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16 added this flag, failure to reset it caused assertion failures. In non-assert builds, it made the system fail to achieve the objectives listed in that commit; chiefly, we might emit a spurious log message. Back-patch to v15, where that commit first appeared.

Bharath Rupireddy and Kyotaro Horiguchi. Reviewed by Dilip Kumar, Nathan Bossart and Michael Paquier. Reported by Dilip Kumar.

Discussion: https://postgr.es/m/CAFiTN-sE3ry=ycMPVtC+Djw4Fd7gbUGVv_qqw6qfzp=JLvqT3g@mail.gmail.com
2022-09-09  Add more error context to RestoreBlockImage() and consume it  (Michael Paquier)

On failure in restoring a block image, no details were provided, even though it is possible to fail with an inconsistent record state, a failure in processing decompression, or a failure in decompression because a build does not support this option.

RestoreBlockImage() is used in two code paths in the backend code, during recovery and when checking a page consistency after applying masking, and both places are changed to consume the error message produced by the internal routine when it returns a false status. All the error messages are reported under ERRCODE_INTERNAL_ERROR, which gets used also when attempting to access a page compressed by a method not supported by the build attempting the decompression. This is something that can happen in core when doing physical replication with primary and standby using inconsistent build options, for example.

This routine is available since 2c03216d and it has never provided any context about the error happening when it failed. This change is justified even more after 57aa5b2, which introduced compression of FPWs in WAL.

Reported-by: Justin Pryzby
Author: Michael Paquier
Discussion: https://postgr.es/m/20220905002320.GD31833@telsasoft.com
Backpatch-through: 15
2022-09-08  Fix recovery_prefetch with low maintenance_io_concurrency.  (Thomas Munro)

We should process completed IOs *before* trying to start more, so that it is always possible to decode one more record when the decoded record queue is empty, even if maintenance_io_concurrency is set so low that a single earlier WAL record might have saturated the IO queue.

That bug was hidden because the effect of maintenance_io_concurrency was arbitrarily clamped to be at least 2. Fix the ordering, and also remove that clamp. We need a special case for 0, which is now treated the same as recovery_prefetch=off, but otherwise the number is used directly. This allows for testing with 1, which would have made the problem obvious in simple test scenarios.

Also add an explicit error message for missing contrecords. It was a bit strange that we didn't report an error already, and became a latent bug with prefetching, since the internal state that tracks aborted contrecords would not survive retrying, as revealed by 026_overwrite_contrecord.pl with this adjustment. Reporting an error prevents that.

Back-patch to 15.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
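The settings involved can be adjusted from SQL on a standby; this is only a sketch reflecting the testing scenario mentioned above:

    ALTER SYSTEM SET recovery_prefetch = 'on';
    ALTER SYSTEM SET maintenance_io_concurrency = 1;   -- 0 now behaves like recovery_prefetch=off
    SELECT pg_reload_conf();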
2022-09-03  Fix cache invalidation bug in recovery_prefetch.  (Thomas Munro)

XLogPageRead() can retry internally after a pread() system call has succeeded, in the case of short reads, and page validation failures while in standby mode (see commit 0668719801). Due to an oversight in commit 3f1ce973, these cases could leave stale data in the internal cache of xlogreader.c without marking it invalid. The main defense against stale cached data on failure to read a page was in the error handling path of the calling function ReadPageInternal(), but that wasn't quite enough for errors handled internally by XLogPageRead()'s retry loop if we then exited with XLREAD_WOULDBLOCK.

1. ReadPageInternal() now marks the cache invalid before calling the page_read callback, by setting state->readLen to 0. It'll be set to a non-zero value only after a successful read. It'll stay valid as long as the caller requests data in the cached range.

2. XLogPageRead() no longer performs internal retries while reading ahead. While such retries should work, the general philosophy is that we should give up prefetching if anything unusual happens so we can handle it when recovery catches up, to reduce the complexity of the system. Let's do that here too.

3. While here, a new function XLogReaderResetError() improves the separation between xlogrecovery.c and xlogreader.c, where the former previously clobbered the latter's internal error buffer directly. The new function makes this more explicit, and also clears a related flag, without which a standby would needlessly retry in the outer function.

Thanks to Noah Misch for tracking down the conditions required for a rare build farm failure in src/bin/pg_ctl/t/003_promote.pl, and providing a reproducer.

Back-patch to 15.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com
2022-08-29  Prevent WAL corruption after a standby promotion.  (Robert Haas)

When a PostgreSQL instance performing archive recovery but not using standby mode is promoted, and the last WAL segment that it attempted to read ended in a partial record, the previous code would create invalid WAL on the new timeline. The WAL from the previous timeline would be copied to the new timeline up until the end of the last valid record, but instead of beginning to write WAL immediately afterwards, the promoted server would write an overwrite contrecord at the beginning of the next segment. The end of the previous segment would be left as all-zeroes, resulting in failures if anything tried to read WAL from that file.

The root of the issue is that ReadRecord() decides whether to set abortedRecPtr and missingContrecPtr based on the value of StandbyMode, but ReadRecord() switches to a new timeline based on the value of ArchiveRecoveryRequested. We shouldn't try to write an overwrite contrecord if we're switching to a new timeline, so change the test in ReadRecord() to check ArchiveRecoveryRequested instead.

Code fix by Dilip Kumar. Comments by me incorporating suggested language from Álvaro Herrera. Further review from Kyotaro Horiguchi and Sami Imseih.

Discussion: http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
Discussion: http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com