summaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
AgeCommit message (Collapse)Author
2024-07-18Do not summarize WAL if generated with wal_level=minimal.Robert Haas
To do this, we must include the wal_level in the first WAL record covered by each summary file; so add wal_level to struct Checkpoint and the payload of XLOG_CHECKPOINT_REDO and XLOG_END_OF_RECOVERY. This, in turn, requires bumping XLOG_PAGE_MAGIC and, since the Checkpoint is also stored in the control file, also PG_CONTROL_VERSION. It's not great to do that so late in the release cycle, but the alternative seems to ship v17 without robust protections against this scenario, which could result in corrupted incremental backups. A side effect of this patch is that, when a server with wal_level=replica is started with summarize_wal=on for the first time, summarization will no longer begin with the oldest WAL that still exists in pg_wal, but rather from the first checkpoint after that. This change should be harmless, because a WAL summary for a partial checkpoint cycle can never make an incremental backup possible when it would otherwise not have been. Report by Fujii Masao. Patch by me. Review and/or testing by Jakub Wartak and Fujii Masao. Discussion: http://postgr.es/m/6e30082e-041b-4e31-9633-95a66de76f5d@oss.nttdata.com
2024-07-02Use safe string copy routineDaniel Gustafsson
Using memcpy with strlen as the size parameter will not take the NULL terminator into account, relying instead on the destination buffer being properly initialized. Replace with strlcpy which is a safer alternative, and more in line with how we handle copying strings elsewhere. Author: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CAEudQApAsbLsQ+gGiw-hT+JwGhgogFa_=5NUkgFO6kOPxyNidQ@mail.gmail.com
2024-06-28Prevent summarizer hang when summarize_wal turned off and back on.Robert Haas
Before this commit, when the WAL summarizer started up or recovered from an error, it would resume summarization from wherever it left off. That was OK normally, but wrong if summarize_wal=off had been turned off temporary, allowing some WAL to be removed, and then turned back on again. In such cases, the WAL summarizer would simply hang forever. This commit changes the reinitialization sequence for WAL summarizer to rederive the starting position in the way we were already doing at initial startup, fixing the problem. Per report from Israel Barth Rubio. Reviewed by Tom Lane. Discussion: http://postgr.es/m/CA+TgmoYN6x=YS+FoFOS6=nr6=qkXZFWhdiL7k0oatGwug2hcuA@mail.gmail.com
2024-06-27Fix MVCC bug with prepared xact with subxacts on standbyHeikki Linnakangas
We did not recover the subtransaction IDs of prepared transactions when starting a hot standby from a shutdown checkpoint. As a result, such subtransactions were considered as aborted, rather than in-progress. That would lead to hint bits being set incorrectly, and the subtransactions suddenly becoming visible to old snapshots when the prepared transaction was committed. To fix, update pg_subtrans with prepared transactions's subxids when starting hot standby from a shutdown checkpoint. The snapshots taken from that state need to be marked as "suboverflowed", so that we also check the pg_subtrans. Backport to all supported versions. Discussion: https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0@iki.fi
2024-06-26Fix bugs in MultiXact truncationHeikki Linnakangas
1. TruncateMultiXact() performs the SLRU truncations in a critical section. Deleting the SLRU segments calls ForwardSyncRequest(), which will try to compact the request queue if it's full (CompactCheckpointerRequestQueue()). That in turn allocates memory, which is not allowed in a critical section. Backtrace: TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981 postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e] postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d] postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e] postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb] postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a] postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1] postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b] postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3] postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66] postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d] postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead] postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e] postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb] postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e] /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45] postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31] To fix, bail out in CompactCheckpointerRequestQueue() without doing anything, if it's called in a critical section. That covers the above call path, as well as any other similar cases where RegisterSyncRequest might be called in a critical section. 2. After fixing that, another problem became apparent: Autovacuum process doing that truncation can deadlock with the checkpointer process. TruncateMultiXact() sets "MyProc->delayChkptFlags |= DELAY_CHKPT_START". If the sync request queue is full and cannot be compacted, the process will repeatedly sleep and retry, until there is room in the queue. However, if the checkpointer is trying to start a checkpoint at the same time, and is waiting for the DELAY_CHKPT_START processes to finish, the queue will never shrink. More concretely, the autovacuum process is stuck here: #0 0x00007fc934926dc3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x000056220b24348b in WaitEventSetWaitBlock (set=0x56220c2e4b50, occurred_events=0x7ffe7856d040, nevents=1, cur_timeout=<optimized out>) at ../src/backend/storage/ipc/latch.c:1570 #2 WaitEventSetWait (set=0x56220c2e4b50, timeout=timeout@entry=10, occurred_events=<optimized out>, occurred_events@entry=0x7ffe7856d040, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:1516 #3 0x000056220b243224 in WaitLatch (latch=<optimized out>, latch@entry=0x0, wakeEvents=wakeEvents@entry=40, timeout=timeout@entry=10, wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:538 #4 0x000056220b26cf46 in RegisterSyncRequest (ftag=ftag@entry=0x7ffe7856d0a0, type=type@entry=SYNC_FORGET_REQUEST, retryOnError=true) at ../src/backend/storage/sync/sync.c:614 #5 0x000056220af9db0a in SlruInternalDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1495 #6 0x000056220af9dab1 in SlruDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1566 #7 0x000056220af98e1b in PerformMembersTruncation (oldestOffset=<optimized out>, newOldestOffset=<optimized out>) at ../src/backend/access/transam/multixact.c:3006 #8 TruncateMultiXact (newOldestMulti=newOldestMulti@entry=3221225472, newOldestMultiDB=newOldestMultiDB@entry=4) at ../src/backend/access/transam/multixact.c:3201 #9 0x000056220b098303 in vac_truncate_clog (frozenXID=749, minMulti=<optimized out>, lastSaneFrozenXid=749, lastSaneMinMulti=3221225472) at ../src/backend/commands/vacuum.c:1917 #10 vac_update_datfrozenxid () at ../src/backend/commands/vacuum.c:1760 #11 0x000056220b1c3f76 in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2550 #12 0x000056220b1c2c3d in AutoVacWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/autovacuum.c:1569 and the checkpointer is stuck here: #0 0x00007fc9348ebf93 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fc9348fe353 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x000056220b40ecb4 in pg_usleep (microsec=microsec@entry=10000) at ../src/port/pgsleep.c:50 #3 0x000056220afb43c3 in CreateCheckPoint (flags=flags@entry=108) at ../src/backend/access/transam/xlog.c:7098 #4 0x000056220b1c6e86 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:464 To fix, add AbsorbSyncRequests() to the loops where the checkpointer waits for DELAY_CHKPT_START or DELAY_CHKPT_COMPLETE operations to finish. Backpatch to v14. Before that, SLRU deletion didn't call RegisterSyncRequest, which avoided this failure. I'm not sure if there are other similar scenarios on older versions, but we haven't had any such reports. Discussion: https://www.postgresql.org/message-id/ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi
2024-05-17Revise GUC names quoting in messages againPeter Eisentraut
After further review, we want to move in the direction of always quoting GUC names in error messages, rather than the previous (PG16) wildly mixed practice or the intermittent (mid-PG17) idea of doing this depending on how possibly confusing the GUC name is. This commit applies appropriate quotes to (almost?) all mentions of GUC names in error messages. It partially supersedes a243569bf65 and 8d9978a7176, which had moved things a bit in the opposite direction but which then were abandoned in a partial state. Author: Peter Smith <smithpb2250@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAHut%2BPv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w%40mail.gmail.com
2024-05-04Fix an assortment of typosDavid Rowley
Author: Alexander Lakhin Discussion: https://postgr.es/m/ae9f2fcb-4b24-5bb0-4240-efbbbd944ca1@gmail.com
2024-04-11Revert: Implement pg_wal_replay_wait() stored procedureAlexander Korotkov
This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2, ee79928441, and 74eaf66f98 per review by Heikki Linnakangas. Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
2024-04-07Add XLogCtl->logInsertResultAlvaro Herrera
This tracks the position of WAL that's been fully copied into WAL buffers by all processes emitting WAL. (For some reason we call that "WAL insertion"). This is updated using atomic monotonic advance during WaitXLogInsertionsToFinish, which is not when the insertions actually occur, but it's the only place where we know where have all the insertions have completed. This value is useful in WALReadFromBuffers, which can verify that callers don't try to read past what has been inserted. (However, more infrastructure is needed in order to actually use WAL after the flush point, since it could be lost.) The value is also useful in WaitXLogInsertionsToFinish() itself, since we can now exit quickly when all WAL has been already inserted, without even having to take any locks.
2024-04-05Operate XLogCtl->log{Write,Flush}Result with atomicsAlvaro Herrera
This removes the need to hold both the info_lck spinlock and WALWriteLock to update them. We use stock atomic write instead, with WALWriteLock held. Readers can use atomic read, without any locking. This allows for some code to be reordered: some places were a bit contorted to avoid repeated spinlock acquisition, but that's no longer a concern, so we can turn them to more natural coding. Some further changes are possible (maybe to performance wins), but in this commit I did rather minimal ones only, to avoid increasing the blast radius. Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
2024-04-03Split XLogCtl->LogwrtResult into separate struct membersAlvaro Herrera
After this change we have XLogCtl->logWriteResult and ->logFlushResult. There's no functional change, other than the fact that the assignment from shared memory to local is no longer done via struct assignment, but instead using a macro that copies each member separately. The current representation is inconvenient going forward; notably, we would like to add a new member "Copy" (to keep track of the last position copied into WAL buffers), so the symmetry between the values in shared memory vs. those in local would be lost. This also gives us freedom to later change the concurrency model for the values in shared memory: we can make them use atomics instead of relying on the info_lck spinlock. Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
2024-04-03Add error codes to some PANIC/FATAL errors reportsDaniel Gustafsson
This adds errcodes to a set of PANIC and FATAL errors in xlog.c and relcache.c, which previously had no errcode at all set, in order to make fleetwide analysis of errorlogs easier. There are many more ereport/elogs left which could benefit from having an errcode but this at least makes a dent in the issue. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
2024-04-02Implement pg_wal_replay_wait() stored procedureAlexander Korotkov
pg_wal_replay_wait() is to be used on standby and specifies waiting for the specific WAL location to be replayed before starting the transaction. This option is useful when the user makes some data changes on primary and needs a guarantee to see these changes on standby. The queue of waiters is stored in the shared memory array sorted by LSN. During replay of WAL waiters whose LSNs are already replayed are deleted from the shared memory array and woken up by setting of their latches. pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise, the snapshot could prevent the replay of WAL records implying a kind of self-deadlock. This is why it is only possible to implement pg_wal_replay_wait() as a procedure working in a non-atomic context, not a function. Catversion is bumped. Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru Author: Kartyshov Ivan, Alexander Korotkov Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
2024-03-04Remove unused #include's from backend .c filesPeter Eisentraut
as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
2024-03-04Add regression test for restart points during promotionMichael Paquier
This test serves as a way to demonstrate how to use the features introduced in 37b369dc67bc, while providing coverage for 7863ee4def65 that caused the startup process to throw "PANIC: could not locate a valid checkpoint record" when starting recovery. The test checks that a node is able to properly restart following a crash when a restart point was finishing across a promotion, with an injection point added in the middle of CreateRestartPoint() to stop the restartpoint in flight. Note that this test fails when 7863ee4def65 is reverted. Kyotaro Horiguchi is the original author of this test, that has been originally posted on the thread where 7863ee4def65 was discussed. I have just upgraded and polished it to rely on injection points, making it much cheaper to reproduce the failure. This test requires injection points to be enabled in the builds, hence meson and ./configure need an update to pass this knowledge down to the test. The name of the new injection point follows the same naming convention as 6a1ea02c491d. The Makefile's EXTRA_INSTALL of recovery TAP tests is updated to include modules/injection_points. Author: Kyotaro Horiguchi, Michael Paquier Reviewed-by: Andrey Borodin, Bertrand Drouvot Discussion: https://postgr.es/m/ZdLuxBk5hGpol91B@paquier.xyz
2024-02-29Convert unloggedLSN to an atomic variable.Nathan Bossart
Currently, this variable is an XLogRecPtr protected by a spinlock. By converting it to an atomic variable, we can remove the spinlock, which saves a small amount of shared memory space. Since this code is not performance-critical, we use atomic operations with full barrier semantics to make it easy to reason about correctness. Author: John Morris Reviewed-by: Michael Paquier, Robert Haas, Andres Freund, Stephen Frost, Bharath Rupireddy Discussion: https://postgr.es/m/BYAPR13MB26772534335255E50318C574A0409%40BYAPR13MB2677.namprd13.prod.outlook.com Discussion: https://postgr.es/m/MN2PR13MB2688FD8B757316CB5C54C8A2A0DDA%40MN2PR13MB2688.namprd13.prod.outlook.com
2024-02-22Remove superfluous 'pgprocno' field from PGPROCHeikki Linnakangas
It was always just the index of the PGPROC entry from the beginning of the proc array. Introduce a macro to compute it from the pointer instead. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
2024-02-16Add assert to WALReadFromBuffers().Jeff Davis
Per suggestion from Andres. Discussion: https://postgr.es/m/20240214025508.6mcblauossthvaw3@awork3.anarazel.de
2024-02-12Read WAL directly from WAL buffers.Jeff Davis
If available, read directly from WAL buffers, avoiding the need to go through the filesystem. Only for physical replication for now, but can be expanded to other callers. In preparation for replicating unflushed WAL data. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com Reviewed-by: Andres Freund, Alvaro Herrera, Nathan Bossart, Dilip Kumar, Nitin Jadhav, Melih Mutlu, Kyotaro Horiguchi
2024-01-03Update copyright for 2024Bruce Momjian
Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
2023-12-27Fix incorrect data type choices in some read and write calls.Tom Lane
Recently-introduced code in reconstruct.c was using "unsigned" to store the result of read(), pg_pread(), or write(). This is completely bogus: it breaks subsequent tests for the result being negative, as we're being reminded of by a chorus of buildfarm warnings. Switch to "int" as was doubtless intended. (There are several other uses of "unsigned" in this file that also look poorly chosen to me, but for now I'm just trying to clean up the buildfarm.) A larger problem is that "int" is not necessarily wide enough to hold the result: per POSIX, all these functions return ssize_t. In places where the requested read or write length clearly fits in int, that's academic. It may be academic anyway as long as we constrain individual data files to 1GB, since even a readv or writev-like operation would then not be responsible for transferring more than 1GB. Nonetheless it seems like trouble waiting to happen, so I made a pass over readv and writev calls and fixed the result variables where that seemed appropriate. We might want to think about changing some of the fd.c functions to return ssize_t too, for future-proofing; but I didn't tackle that here. Discussion: https://postgr.es/m/1672202.1703441340@sss.pgh.pa.us
2023-12-20Add a new WAL summarizer process.Robert Haas
When active, this process writes WAL summary files to $PGDATA/pg_wal/summaries. Each summary file contains information for a certain range of LSNs on a certain TLI. For each relation, it stores a "limit block" which is 0 if a relation is created or destroyed within a certain range of WAL records, or otherwise the shortest length to which the relation was truncated during that range of WAL records, or otherwise InvalidBlockNumber. In addition, it stores a list of blocks which have been modified during that range of WAL records, but excluding blocks which were removed by truncation after they were modified and never subsequently modified again. In other words, it tells us which blocks need to copied in case of an incremental backup covering that range of WAL records. But this doesn't yet add the capability to actually perform an incremental backup; the next patch will do that. A new parameter summarize_wal enables or disables this new background process. The background process also automatically deletes summary files that are older than wal_summarize_keep_time, if that parameter has a non-zero value and the summarizer is configured to run. Patch by me, with some design help from Dilip Kumar and Andres Freund. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
2023-12-19Additional write barrier in AdvanceXLInsertBuffer().Jeff Davis
First, mark the xlblocks member with InvalidXLogRecPtr, then issue a write barrier, then initialize it. That ensures that the xlblocks member doesn't appear valid while the contents are being initialized. In preparation for reading WAL buffer contents without a lock. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVfFMfqD5oLzZSQQZWfXiJqd-NdX0_317veP6FuB31QWA@mail.gmail.com Reviewed-by: Andres Freund
2023-12-19Use 64-bit atomics for xlblocks array elements.Jeff Davis
In preparation for reading the contents of WAL buffers without a lock. Also, avoids the previously-needed comment in GetXLogBuffer() explaining why it's safe from torn reads. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVfFMfqD5oLzZSQQZWfXiJqd-NdX0_317veP6FuB31QWA@mail.gmail.com Reviewed-by: Andres Freund
2023-12-11Remove trace_recovery_messagesMichael Paquier
This GUC was intended as a debugging help in the 9.0 area when hot standby and streaming replication were being developped, able to offer more information at LOG level rather than DEBUGn. There are more tools available these days that are able to offer rather equivalent information, like pg_waldump introduced in 9.3. It is not obvious how this facility is useful these days, so let's remove it. Author: Bharath Rupireddy Discussion: https://postgr.es/m/ZXEXEAUVFrvpquSd@paquier.xyz
2023-12-08Rename ShmemVariableCache to TransamVariablesHeikki Linnakangas
The old name was misleading: It's not a cache, the values kept in the struct are the authoritative source. Reviewed-by: Tristan Partin, Richard Guo Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
2023-12-06Fix compilation on Windows with WAL_DEBUGMichael Paquier
This has been broken since b060dbe0001a that has reworked the callback mechanism of XLogReader, most likely unnoticed because any form of development involving WAL happens on platforms where this compiles fine. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVF14WKQMFwcJ=3okVDhiXpuK5f7YdT+BdYXbbypMHqWA@mail.gmail.com Backpatch-through: 13
2023-11-30Apply quotes more consistently to GUC names in logsMichael Paquier
Quotes are applied to GUCs in a very inconsistent way across the code base, with a mix of double quotes or no quotes used. This commit removes double quotes around all the GUC names that are obviously referred to as parameters with non-English words (use of underscore, mixed case, etc). This is the result of a discussion with Álvaro Herrera, Nathan Bossart, Laurenz Albe, Peter Eisentraut, Tom Lane and Daniel Gustafsson. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w@mail.gmail.com
2023-11-27Reduce rate of walwriter wakeups due to async commits.Heikki Linnakangas
XLogSetAsyncXactLSN(), called at asynchronous commit, would wake up walwriter every time the LSN advances, but walwriter doesn't actually do anything unless it has at least 'wal_writer_flush_after' full blocks of WAL to write. Repeatedly waking up walwriter to do nothing is a waste of CPU cycles in both walwriter and the backends doing the wakeups. To fix, apply the same logic in XLogSetAsyncXactLSN() to decide whether to wake up walwriter, as walwriter uses to determine if it has any work to do. In the passing, rename misleadingly named 'flushbytes' local variable to 'flushblocks'. Author: Andres Freund, Heikki Linnakangas Discussion: https://www.postgresql.org/message-id/20231024230929.vsc342baqs7kmbte@awork3.anarazel.de
2023-11-25C comment: clarify that WAL files can be _recycled_ or removedBruce Momjian
Reported-by: Michael Paquier Discussion: https://postgr.es/m/CAB7nPqSDdF0heotQU3gsepgqx+9c+6KjLd3R6aNYH7KKfDd2ig@mail.gmail.com Author: Michael Paquier Backpatch-through: master
2023-11-10Prohibit max_slot_wal_keep_size to value other than -1 during upgrade.Amit Kapila
We don't want existing slots in the old cluster to get invalidated during the upgrade. During an upgrade, we set this variable to -1 via the command line in an attempt to prevent such invalidations, but users have ways to override it. This patch ensures the value is not overridden by the user. Author: Kyotaro Horiguchi Reviewed-by: Peter Smith, Alvaro Herrera Discussion: http://postgr.es/m/20231027.115759.2206827438943188717.horikyota.ntt@gmail.com
2023-10-30Introduce pg_stat_checkpointerMichael Paquier
Historically, the statistics of the checkpointer have been always part of pg_stat_bgwriter. This commit removes a few columns from pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent, renamed columns (plus a new one for the reset timestamp): - checkpoints_timed -> num_timed - checkpoints_req -> num_requested - checkpoint_write_time -> write_time - checkpoint_sync_time -> sync_time - buffers_checkpoint -> buffers_written The fields of PgStat_CheckpointerStats and its SQL functions are renamed to match with the new field names, for consistency. Note that background writer and checkpointer have been split into two different processes in commits 806a2aee3791 and bf405ba8e460. The pgstat structures were already split, making this change straight-forward. Bump catalog version. Author: Bharath Rupireddy Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
2023-10-23Change struct tablespaceinfo's oid member from 'char *' to 'Oid'Robert Haas
This shouldn't change behavior except in the unusual case where there are file in the tablespace directory that have entirely numeric names but are nevertheless not possible names for a tablespace directory, either because their names have leading zeroes that shouldn't be there, or the value is actually zero, or because the value is too large to represent as an OID. In those cases, the directory would previously have made it into the list of tablespaceinfo objects and no longer will. Thus, base backups will now ignore such directories, instead of treating them as legitimate tablespace directories. Similarly, if entries for such tablespaces occur in a tablespace_map file, they will now be rejected as erroneous, instead of being honored. This is infrastructure for future work that wants to be able to know the tablespace of each relation that is part of a backup *as an OID*. By strengthening the up-front validation, we don't have to worry about weird cases later, and can more easily avoid repeated string->integer conversions. Patch by me, reviewed by David Steele. Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
2023-10-19During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point.Robert Haas
This allows tools that read the WAL sequentially to identify (possible) redo points when they're reached, rather than only being able to detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly much later in the WAL stream. There are other possible applications as well; see the discussion links below. Any redo location that precedes the checkpoint location should now point to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this. While adjusting the code in CreateCheckPoint() for this patch, I made it call WALInsertLockAcquireExclusive a bit later than before, since there appears to be no need for it to be held while checking whether the system is idle, whether this is an end-of-recovery checkpoint, or what the current timeline is. Bump XLOG_PAGE_MAGIC. Patch by me, based in part on earlier work from Dilip Kumar. Review by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
2023-10-13Improve the naming in wal_sync_method code.Nathan Bossart
* sync_method is renamed to wal_sync_method. * sync_method_options[] is renamed to wal_sync_method_options[]. * assign_xlog_sync_method() is renamed to assign_wal_sync_method(). * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum. * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD. These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change. Author: Maxim Orlov Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
2023-10-13Add wait events for checkpoint delay mechanism.Thomas Munro
When MyProc->delayChkptFlags is set to temporarily block phase transitions in a concurrent checkpoint, the checkpointer enters a sleep-poll loop to wait for the flag to be cleared. We should show that as a wait event in the pg_stat_activity view. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
2023-10-12Unify two isLogSwitch tests in XLogInsertRecord.Robert Haas
An upcoming patch wants to introduce an additional special case in this function. To keep that as cheap as possible, minimize the amount of branching that we do based on whether this is an XLOG_SWITCH record. Additionally, and also in the interest of keeping the overhead of special-case code paths as low as possible, apply likely() to the non-XLOG_SWITCH case, since only a very tiny fraction of WAL records will be XLOG_SWITCH records. Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
2023-09-15Rename variable for code clarityDaniel Gustafsson
When tracking IO timing for WAL, the duration is what we calculate based on the start and end timestamps, it's not what the variable contains. Rename the timestamp variable to end to better communicate what it contains. Original patch by Krishnakumar with additional hacking to fix another occurrence by me. Author: Krishnakumar R <kksrcv001@gmail.com> Discussion: https://postgr.es/m/CAPMWgZ9f9o8awrQpjo8oxnNQ=bMDVPx00NE0QcDzvHD_ZrdLPw@mail.gmail.com
2023-09-14Quote filenames in error messagesDaniel Gustafsson
The majority of all filenames are quoted in user facing error and log messages, but a few were still printed without quotes. While these filenames do not risk causing any ambiguity as their format is strict, quote them anyways to be consistent across all logs. Also concatenate a message to keep it one line to make it easier to grep for in the code. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/080EEABE-6645-4A46-AB20-6285ADAC44FE@yesql.se
2023-09-14Flush logical slots to disk during a shutdown checkpoint if required.Amit Kapila
It's entirely possible for a logical slot to have a confirmed_flush LSN higher than the last value saved on disk while not being marked as dirty. Currently, it is not a major problem but a later patch adding support for the upgrade of slots relies on that value being properly flushed to disk. It can also help avoid processing the same transactions again in some boundary cases after the clean shutdown and restart. Say, we process some transactions for which we didn't send anything downstream (the changes got filtered) but the confirm_flush LSN is updated due to keepalives. As we don't flush the latest value of confirm_flush LSN, it may lead to processing the same changes again without this patch. The approach taken by this patch has been suggested by Ashutosh Bapat. Author: Vignesh C, Julien Rouhaud, Kuroda Hayato Reviewed-by: Amit Kapila, Dilip Kumar, Michael Paquier, Ashutosh Bapat, Peter Smith, Hou Zhijie Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
2023-08-28Make error messages about WAL segment size more consistentPeter Eisentraut
Make the primary messages more compact and make the detail messages uniform. In initdb.c and pg_resetwal.c, use the newish option_parse_int() to simplify some of the option parsing. For the backend GUC wal_segment_size, add a GUC check hook to do the verification instead of coding it in bootstrap.c. This might be overkill, but that way the check is in the right place and it becomes more self-documenting. In passing, make pg_controldata use the logging API for warning messages. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/9939aa8a-d7be-da2c-7715-0a0b5535a1f7@eisentraut.org
2023-07-26Document more assumptions of LWLock variable changes with WAL insertsMichael Paquier
This commit adds a few comments about what LWLockWaitForVar() relies on when a backend waits for a variable update on its LWLocks for WAL insertions up to an expected LSN. First, LWLockWaitForVar() does not include a memory barrier, relying on a spinlock taken at the beginning of WaitXLogInsertionsToFinish(). This was hidden behind two layers of routines in lwlock.c. This assumption is now documented at the top of LWLockWaitForVar(), and detailed at bit more within LWLockConflictsWithVar(). Second, document why WaitXLogInsertionsToFinish() does not include memory barriers, relying on a spinlock at its top, which is, per Andres' input, fine for two different reasons, both depending on the fact that the caller of WaitXLogInsertionsToFinish() is waiting for a LSN up to a certain value. This area's documentation and assumptions could be improved more in the future, but at least that's a beginning. Author: Bharath Rupireddy, Andres Freund Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
2023-07-25Optimize WAL insertion lock acquisition and release with some atomicsMichael Paquier
The WAL insertion lock variable insertingAt is currently being read and written with the help of the LWLock wait list lock to avoid any read of torn values. This wait list lock can become a point of contention on a highly concurrent write workloads. This commit switches insertingAt to a 64b atomic variable that provides torn-free reads/writes. On platforms without 64b atomic support, the fallback implementation uses spinlocks to provide the same guarantees for the values read. LWLockWaitForVar(), through LWLockConflictsWithVar(), reads the new value to check if it still needs to wait with a u64 atomic operation. LWLockUpdateVar() updates the variable before waking up the waiters with an exchange_u64 (full memory barrier). LWLockReleaseClearVar() now uses also an exchange_u64 to reset the variable. Before this commit, all these steps relied on LWLockWaitListLock() and LWLockWaitListUnlock(). This reduces contention on LWLock wait list lock and improves performance of highly-concurrent write workloads. Here are some numbers using pg_logical_emit_message() (HEAD at d6677b93) with various arbitrary record lengths and clients up to 1k on a rather-large machine (64 vCPUs, 512GB of RAM, 16 cores per sockets, 2 sockets), in terms of TPS numbers coming from pgbench: message_size_b | 16 | 64 | 256 | 1024 --------------------+--------+--------+--------+------- patch_4_clients | 83830 | 82929 | 80478 | 73131 patch_16_clients | 267655 | 264973 | 250566 | 213985 patch_64_clients | 380423 | 378318 | 356907 | 294248 patch_256_clients | 360915 | 354436 | 326209 | 263664 patch_512_clients | 332654 | 321199 | 287521 | 240128 patch_1024_clients | 288263 | 276614 | 258220 | 217063 patch_2048_clients | 252280 | 243558 | 230062 | 192429 patch_4096_clients | 212566 | 213654 | 205951 | 166955 head_4_clients | 83686 | 83766 | 81233 | 73749 head_16_clients | 266503 | 265546 | 249261 | 213645 head_64_clients | 366122 | 363462 | 341078 | 261707 head_256_clients | 132600 | 132573 | 134392 | 165799 head_512_clients | 118937 | 114332 | 116860 | 150672 head_1024_clients | 133546 | 115256 | 125236 | 151390 head_2048_clients | 137877 | 117802 | 120909 | 138165 head_4096_clients | 113440 | 115611 | 120635 | 114361 Bharath has been measuring similar improvements, where the limit of the WAL insertion lock begins to be felt when more than 256 concurrent clients are involved in this specific workload. An extra patch has been discussed to introduce a fast-exit path in LWLockUpdateVar() when there are no waiters, still this does not influence the write-heavy workload cases discussed as there are always waiters. This will be considered separately. Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
2023-06-20Enable archiving in recovery TAP test 009_twophase.plMichael Paquier
This is a follow-up of f663b00, that has been committed to v13 and v14, tweaking the TAP test for two-phase transactions so as it provides coverage for the bug that has been fixed. This change is done in its own commit for clarity, as v15 and HEAD did not show the problematic behavior, still missed coverage for it. While on it, this adds a comment about the dependency of the last partial segment rename and RecoverPreparedTransactions() at the end of recovery, as that can be easy to miss. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at Backpatch-through: 13
2023-05-19Pre-beta mechanical code beautification.Tom Lane
Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-04-27Prevent underflow in KeepLogSeg().Nathan Bossart
The call to XLogGetReplicationSlotMinimumLSN() might return a greater LSN than the one given to the function. Subsequent segment number calculations might then underflow, which could result in unexpected behavior when removing or recyling WAL files. This was introduced with max_slot_wal_keep_size in c655077639. To fix, skip the block of code for replication slots if the LSN is greater. Reported-by: Xu Xingwang Author: Kyotaro Horiguchi Reviewed-by: Junwang Zhao Discussion: https://postgr.es/m/17903-4288d439dee856c6%40postgresql.org Backpatch-through: 13
2023-04-19Fix various typos and incorrect/outdated name referencesDavid Rowley
Author: Alexander Lakhin Discussion: https://postgr.es/m/699beab4-a6ca-92c9-f152-f559caf6dc25@gmail.com
2023-04-18Fix pg_basebackup with in-place tablespaces some more.Robert Haas
Commit c6f2f01611d4f2c412e92eb7893f76fa590818e8 purported to make this work, but problems remained. In a plain-format backup, the files from an in-place tablespace got included in the tar file for the main tablespace, which is wrong but it's not clear that it has any user-visible consequences. In a tar-format backup, the TABLESPACE_MAP option is used, and so we never iterated over pg_tblspc and thus never backed up the in-place tablespaces anywhere at all. To fix this, reverse the changes in that commit, so that when we scan pg_tblspc during a backup, we create tablespaceinfo objects even for in-place tablespaces. We set the field that would normally contain the absolute pathname to the relative path pg_tblspc/${TSOID}, and that's good enough to make basebackup.c happy without any further changes. However, pg_basebackup needs a couple of adjustments to make it work. First, it needs to understand that a relative path for a tablespace means it's an in-place tablespace. Second, it needs to tolerate the situation where restoring the main tablespace tries to create pg_tblspc or a subdirectory and finds that it already exists, because we restore user-defined tablespaces before the main tablespace. Since in-place tablespaces are only intended for use in development and testing, no back-patch. Patch by me, reviewed by Thomas Munro and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmobwvbEp+fLq2PykMYzizcvuNv0a7gPMJtxOTMOuuRLMHg@mail.gmail.com
2023-04-08Allow logical decoding on standbysAndres Freund
Unsurprisingly, this requires wal_level = logical to be set on the primary and standby. The infrastructure added in 26669757b6a ensures that slots are invalidated if the primary's wal_level is lowered. Creating a slot on a standby waits for a xl_running_xact record to be processed. If the primary is idle (and thus not emitting xl_running_xact records), that can take a while. To make that faster, this commit also introduces the pg_log_standby_snapshot() function. By executing it on the primary, completion of slot creation on the standby can be accelerated. Note that logical decoding on a standby does not itself enforce that required catalog rows are not removed. The user has to use physical replication slots + hot_standby_feedback or other measures to prevent that. If catalog rows required for a slot are removed, the slot is invalidated. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion, for the addition of the pg_log_standby_snapshot() function. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> (in an older version) Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: FabrÌzio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com>
2023-04-08For cascading replication, wake physical and logical walsenders separatelyAndres Freund
Physical walsenders can't send data until it's been flushed; logical walsenders can't decode and send data until it's been applied. On the standby, the WAL is flushed first, which will only wake up physical walsenders; and then applied, which will only wake up logical walsenders. Previously, all walsenders were awakened when the WAL was flushed. That was fine for logical walsenders on the primary; but on the standby the flushed WAL would have been not applied yet, so logical walsenders were awakened too early. Per idea from Jeff Davis and Amit Kapila. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-By: Jeff Davis <pgsql@j-davis.com> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com