summaryrefslogtreecommitdiff
path: root/src/backend/access
AgeCommit message (Collapse)Author
2 daysSplit some long Makefile listsMichael Paquier
This change makes more readable code diffs when adding new items or removing old items, while ensuring that lines do not get excessively long. Some SUBDIRS, PROGRAMS and REGRESS lists are split. Note that there are a few more REGRESS lists that could be split, particularly in contrib/. Author: Jelte Fennema-Nio <postgres@jeltef.nl> Co-Authored-By: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Man Zeng <zengman@halodbtech.com> Discussion: https://postgr.es/m/DF6HDGB559U5.3MPRFCWPONEAE@jeltef.nl
7 daysToggle logical decoding dynamically based on logical slot presence.Masahiko Sawada
Previously logical decoding required wal_level to be set to 'logical' at server start. This meant that users had to incur the overhead of logical-level WAL logging even when no logical replication slots were in use. This commit adds functionality to automatically control logical decoding availability based on logical replication slot presence. The newly introduced module logicalctl.c allows logical decoding to be dynamically activated when needed when wal_level is set to 'replica'. When the first logical replication slot is created, the system automatically increases the effective WAL level to maintain logical-level WAL records. Conversely, after the last logical slot is dropped or invalidated, it decreases back to 'replica' WAL level. While activation occurs synchronously right after creating the first logical slot, deactivation happens asynchronously through the checkpointer process. This design avoids a race condition at the end of recovery; a concurrent deactivation could happen while the startup process enables logical decoding at the end of recovery, but WAL writes are still not permitted until recovery fully completes. The checkpointer will handle it after recovery is done. Asynchronous deactivation also avoids excessive toggling of the logical decoding status in workloads that repeatedly create and drop a single logical slot. On the other hand, this lazy approach can delay changes to effective_wal_level and the disabling logical decoding, especially when the checkpointer is busy with other tasks. We chose this lazy approach in all deactivation paths to keep the implementation simple, even though laziness is strictly required only for end-of-recovery cases. Future work might address this limitation either by using a dedicated worker instead of the checkpointer, or by implementing synchronous waiting during slot drops if workloads are significantly affected by the lazy deactivation of logical decoding. The effective WAL level, determined internally by XLogLogicalInfo, is allowed to change within a transaction until an XID is assigned. Once an XID is assigned, the value becomes fixed for the remainder of the transaction. This behavior ensures that the logging mode remains consistent within a writing transaction, similar to the behavior of GUC parameters. A new read-only GUC parameter effective_wal_level is introduced to monitor the actual WAL level in effect. This parameter reflects the current operational WAL level, which may differ from the configured wal_level setting. Bump PG_CONTROL_VERSION as it adds a new field to CheckPoint struct. Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com
7 daysFix bug in following update chain when locking a heap tupleHeikki Linnakangas
After waiting for a concurrent updater to finish, heap_lock_tuple() followed the update chain to lock all tuple versions. However, when stepping from the initial tuple to the next one, it failed to check that the next tuple's XMIN matches the initial tuple's XMAX. That's an important check whenever following an update chain, and the recursive part that follows the chain did it, but the initial step missed it. Without the check, if the updating transaction aborts, the updated tuple is vacuumed away and replaced by an unrelated tuple, the unrelated tuple might get incorrectly locked. Author: Jasper Smit <jasper.smit@servicenow.com> Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com Backpatch-through: 14
8 daysFix another typo in gininsert.cMichael Paquier
Reported-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAHewXNkRJ9DMFZMQKWQ32U+OTBR78KeGh2=9Wy5jEeWDxMVFcQ@mail.gmail.com
9 daysRemove obsolete name_ops index-only scan comments.Peter Geoghegan
nbtree index-only scans of an index that uses btree/name_ops as one of its index column's input opclasses are no longer at any risk of reading past the end of currTuples. We're no longer reliant on such scans being able to at least read from the start of markTuples storage (which uses space from the same allocation as currTuples) to avoid a segfault: StoreIndexTuple (from nodeIndexonlyscan.c) won't actually read past the end of a cstring datum from a name_ops index. In other words, we already have the "special-case treatment for name_ops" that the removed comment supposed we could avoid by relying on markTuples in this way. Oversight in commit a63224be49, which added special case handling of name_ops cstrings to StoreIndexTuple, but missed these comments.
11 daysheapam: Move logic to handle HEAP_MOVED into a helper functionAndres Freund
Before we dealt with this in 6 near identical and one very similar copy. The helper function errors out when encountering a HEAP_MOVED_IN/HEAP_MOVED_OUT tuple with xvac considered current or in-progress. It'd be preferrable to do that change separately, but otherwise it'd not be possible to deduplicate the handling in HeapTupleSatisfiesVacuum(). Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6 Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
11 daysFix typos in gininsert.cMichael Paquier
Introduced by 8492feb98f6d. Author: Xingbin She <xingbin.she@qq.com> Discussion: https://postgr.es/m/tencent_C254AE962588605F132DB4A6F87205D6A30A@qq.com
12 daysUse table/index_close() more consistentlyMichael Paquier
All the code paths updated here have been using relation_close() to close a relation that has already been opened with table_open() or index_open(), where a relkind check is enforced. table_close() and index_open() do the same thing as relation_close(), so there was no harm, but being inconsistent could lead to issues if the internals of these close() functions begin to introduce some logic specific to each relkind in the future. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aUKamYGiDKO6byp5@ip-10-97-1-34.eu-west-3.compute.internal
12 daysDo not emit WAL for unlogged BRIN indexesHeikki Linnakangas
Operations on unlogged relations should not be WAL-logged. The brin_initialize_empty_new_buffer() function didn't get the memo. The function is only called when a concurrent update to a brin page uses up space that we're just about to insert to, which makes it pretty hard to hit. If you do manage to hit it, a full-page WAL record is erroneously emitted for the unlogged index. If you then crash, crash recovery will fail on that record with an error like this: FATAL: could not create file "base/5/32819": File exists Author: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://www.postgresql.org/message-id/CALdSSPhpZXVFnWjwEBNcySx_vXtXHwB2g99gE6rK0uRJm-3GgQ@mail.gmail.com Backpatch-through: 14
13 daysChange pgstat_report_vacuum() to use RelationMichael Paquier
This change makes pgstat_report_vacuum() more consistent with pgstat_report_analyze(), that also uses a Relation. This enforces a policy that callers of this routine should open and lock the relation whose statistics are updated before calling this routine. We will unlikely have a lot of callers of this routine in the tree, but it seems like a good idea to imply this requirement in the long run. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aUEA6UZZkDCQFgSA@ip-10-97-1-34.eu-west-3.compute.internal
14 daysSwitch memory contexts in ReinitializeParallelDSM.Robert Haas
We already do this in CreateParallelContext, InitializeParallelDSM, and LaunchParallelWorkers. I suspect the reason why the matching logic was omitted from ReinitializeParallelDSM is that I failed to realize that any memory allocation was happening here -- but shm_mq_attach does allocate, which could result in a shm_mq_handle being allocated in a shorter-lived context than the ParallelContext which points to it. That could result in a crash if the shorter-lived context is freed before the parallel context is destroyed. As far as I am currently aware, there is no way to reach a crash using only code that is present in core PostgreSQL, but extensions could potentially trip over this. Fixing this in the back-branches appears low-risk, so back-patch to all supported versions. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Co-authored-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com> Backpatch-through: 14 Discussion: http://postgr.es/m/CAKZiRmwfVripa3FGo06=5D1EddpsLu9JY2iJOTgbsxUQ339ogQ@mail.gmail.com
14 daysAdd explanatory comment to prune_freeze_setup()Melanie Plageman
heap_page_prune_and_freeze() fills in PruneState->deadoffsets, the array of OffsetNumbers of dead tuples. It is returned to the caller in the PruneFreezeResult. To avoid having two copies of the array, the PruneState saves only a pointer to the array. This was a bit unusual and confusing, so add a clarifying comment. Author: Melanie Plageman <melanieplageman@gmail.com> Suggested-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2=jiD1nqch4JQN+odAxZSD7mRvdoHUGJYN2r6tQG_66yQ@mail.gmail.com
14 daysFix const qualification in prune_freeze_setup()Melanie Plageman
The const qualification of the presult argument to prune_freeze_setup() is later cast away, so it was not correct. Remove it and add a comment explaining that presult should not be modified. Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/fb97d0ae-a0bc-411d-8a87-f84e7e146488%40eisentraut.org
2025-12-16Add TAP test to check recovery when redo LSN is missingMichael Paquier
This commit provides test coverage for dc7c77f825d7, where the redo record and the checkpoint record finish on different WAL segments with the start of recovery able to detect that the redo record is missing. This test uses a wait injection point done in the critical section of a checkpoint, method that requires not one but actually two wait injection points to avoid any memory allocations within the critical section of the checkpoint: - Checkpoint run with a background psql. - One first wait point is run by the checkpointer before the critical section, allocating the shared memory required by the DSM registry for the wait machinery in the library injection_points. - First point is woken up. - Second wait point is loaded before the critical section, allocating the memory to build the path to the library loaded, then run in the critical section once the checkpoint redo record has been logged. - WAL segment is switched while waiting on the second point. - Checkpoint completes. - Stop cluster with immediate mode. - The segment that includes the redo record is removed. - Start, recovery fails as the redo record cannot be found. The error message introduced in dc7c77f825d7 is now reduced to a FATAL, meaning that the information is still provided while being able to use a test for it. Nitin has provided a basic version of the test, that I have enhanced to make it portable with two points. Without dc7c77f825d7, the cluster crashes in this test, not on a PANIC but due to the pointer dereference at the beginning of recovery, failure mentioned in the other commit. Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
2025-12-16Fail recovery when missing redo checkpoint record without backup_labelMichael Paquier
This commit adds an extra check at the beginning of recovery to ensure that the redo record of a checkpoint exists before attempting WAL replay, logging a PANIC if the redo record referenced by the checkpoint record could not be found. This is the same level of failure as when a checkpoint record is missing. This check is added when a cluster is started without a backup_label, after retrieving its checkpoint record. The redo LSN used for the check is retrieved from the checkpoint record successfully read. In the case where a backup_label exists, the startup process already fails if the redo record cannot be found after reading a checkpoint record at the beginning of recovery. Previously, the presence of the redo record was not checked. If the redo and checkpoint records were located on different WAL segments, it would be possible to miss a entire range of WAL records that should have been replayed but were just ignored. The consequences of missing the redo record depend on the version dealt with, these becoming worse the older the version used: - On HEAD, v18 and v17, recovery fails with a pointer dereference at the beginning of the redo loop, as the redo record is expected but cannot be found. These versions are good students, because we detect a failure before doing anything, even if the failure is misleading in the shape of a segmentation fault, giving no information that the redo record is missing. - In v16 and v15, problems show at the end of recovery within FinishWalRecovery(), the startup process using a buggy LSN to decide from where to start writing WAL. The cluster gets corrupted, still it is noisy about it. - v14 and older versions are worse: a cluster gets corrupted but it is entirely silent about the matter. The redo record missing causes the startup process to skip entirely recovery, because a missing record is the same as not redo being required at all. This leads to data loss, as everything is missed between the redo record and the checkpoint record. Note that I have tested that down to 9.4, reproducing the issue with a version of the author's reproducer slightly modified. The code is wrong since at least 9.2, but I did not look at the exact point of origin. This problem has been found by debugging a cluster where the WAL segment including the redo segment was missing due to an operator error, leading to a crash, based on an investigation in v15. Requesting archive recovery with the creation of a recovery.signal or a standby.signal even without a backup_label would mitigate the issue: if the record cannot be found in pg_wal/, the missing segment can be retrieved with a restore_command when checking that the redo record exists. This was already the case without this commit, where recovery would re-fetch the WAL segment that includes the redo record. The check introduced by this commit makes the segment to be retrieved earlier to make sure that the redo record can be found. On HEAD, the code will be slightly changed in a follow-up commit to not rely on a PANIC, to include a test able to emulate the original problem. This is a minimal backpatchable fix, kept separated for clarity. Reported-by: Andres Freund <andres@anarazel.de> Analyzed-by: Andres Freund <andres@anarazel.de> Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com> Discussion: https://postgr.es/m/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com Backpatch-through: 14
2025-12-15Revisit cosmetics of "For inplace update, send nontransactional invalidations."Noah Misch
This removes a never-used CacheInvalidateHeapTupleInplace() parameter. It adds README content about inplace update visibility in logical decoding. It rewrites other comments. Back-patch to v18, where commit 243e9b40f1b2dd09d6e5bf91ebf6e822a2cd3704 first appeared. Since this removes a CacheInvalidateHeapTupleInplace() parameter, expect a v18 ".abi-compliance-history" edit to follow. PGXN contains no calls to that function. Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reported-by: Ilyasov Ian <ianilyasov@outlook.com> Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Surya Poondla <s_poondla@apple.com> Discussion: https://postgr.es/m/CA+renyU+LGLvCqS0=fHit-N1J-2=2_mPK97AQxvcfKm+F-DxJA@mail.gmail.com Backpatch-through: 18
2025-12-15Correct comments of "Fix data loss at inplace update after heap_update()".Noah Misch
This corrects commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257. Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reported-by: Surya Poondla <s_poondla@apple.com> Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Discussion: https://postgr.es/m/CA+renyWCW+_2QvXERBQ+mna6ANwAVXXmHKCA-WzL04bZRsjoBA@mail.gmail.com
2025-12-15Add offnum range checks to suppress compile warnings with UBSAN.Tom Lane
Late-model gcc with -fsanitize=undefined enabled issues warnings about uses of PageGetItemId() when it can't prove that the offsetNumber is > 0. The call sites where this happens are checking that the offnum is <= PageGetMaxOffsetNumber(page), so it seems reasonable to add an explicit check that offnum >= 1 too. While at it, rearrange the code to be less contorted and avoid duplicate checks on PageGetMaxOffsetNumber. Maybe the compiler would optimize away the duplicate logic or maybe not, but the existing coding has little to recommend it anyway. There are multiple instances of this identical coding pattern in heapam.c and heapam_xlog.c. Current gcc only complains about two of them, but I fixed them all in the name of consistency. Potentially this could be back-patched in the name of silencing warnings; but I think enabling UBSAN is mainly something people would do on HEAD, so for now it seems not worth the trouble. Discussion: https://postgr.es/m/1699806.1765746897@sss.pgh.pa.us
2025-12-15Improve sanity checks on multixid members lengthHeikki Linnakangas
In the server, check explicitly for multixids with zero members. We used to have an assertion for it, but commit d4b7bde418 replaced it with more extensive runtime checks, but it missed the original case of zero members. In the upgrade code, a negative length never makes sense, so better check for it explicitly. Commit d4b7bde418 added a similar sanity check to the corresponding server code on master, and in backbranches, the 'length' is passed to palloc which would fail with "invalid memory alloc request size" error. Clarify the comments on what kind of invalid entries are tolerated by the upgrade code and which ones are reported as fatal errors. Coverity complained about 'length' in the upgrade code being tainted. That's bogus because we trust the data on disk at least to some extent, but hopefully this will silence the complaint. If not, I'll dismiss it manually. Discussion: https://www.postgresql.org/message-id/7b505284-c6e9-4c80-a7ee-816493170abc@iki.fi
2025-12-14Update typedefs.list to match what the buildfarm currently reports.Tom Lane
The current list from the buildfarm includes quite a few typedef names that it used to miss. The reason is a bit obscure, but it seems likely to have something to do with our recent increased use of palloc_object and palloc_array. In any case, this makes the relevant struct declarations be much more nicely formatted, so I'll take it. Install the current list and re-run pgindent to update affected code. Syncing with the current list also removes some obsolete typedef names and fixes some alphabetization errors. Discussion: https://postgr.es/m/1681301.1765742268@sss.pgh.pa.us
2025-12-14Refactor WaitLSNType enum to use a macro for type countAlexander Korotkov
Change WAIT_LSN_TYPE_COUNT from an enum sentinel to a macro definition, in a similar way to IOObject, IOContext, and BackendType enums. Remove explicit enum value assignments well. Author: Xuneng Zhou <xunengzhou@gmail.com>
2025-12-12Replace most StaticAssertStmt() with StaticAssertDecl()Peter Eisentraut
Similar to commit 75f49221c22, it is preferable to use StaticAssertDecl() instead of StaticAssertStmt() when possible. Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKGKvr0x_oGmQTUkx%3DODgSksT2EtgCA6LmGx_jQFG%3DsDUpg%40mail.gmail.com
2025-12-12Never store 0 as the nextMXactHeikki Linnakangas
Before this commit, when multixid wraparound happens, MultiXactState->nextMXact goes to 0, which is invalid. All the readers need to deal with that possibility and skip over the 0. That's error-prone and we've missed it a few times in the past. This commit changes the responsibility so that all the writers of MultiXactState->nextMXact skip over the zero already, and readers can trust that it's never 0. We were already doing that for MultiXactState->oldestMultiXactId; none of its writers would set it to 0. ReadMultiXactIdRange() was nevertheless checking for that possibility. For clarity, remove that check. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
2025-12-11Add runtime checks for bogus multixact offsetsHeikki Linnakangas
It's not far-fetched that we'd try to read a multixid with an invalid offset in case of bugs or corruption. Or if you call pg_get_multixact_members() after a crash that left behind invalid but unused multixids. Better to get a somewhat descriptive error message if that happens. Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
2025-12-10Clarify why _bt_killitems sorts its items array.Peter Geoghegan
Make it clear why _bt_killitems sorts the scan's so->killedItems[] array. Also add an assertion to the _bt_killitems loop (that iterates through this array) to verify it accesses tuples in leaf page order. Follow-up to commit bfb335df58. Author: Peter Geoghegan <pg@bowt.ie> Suggested-by: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAGnEboirgArezZDNeFrR8FOGvKF-Xok333s2iVwWi65gZf8MEA@mail.gmail.com
2025-12-10Fix MULTIXACT_DEBUG builds.Peter Geoghegan
Oversight in commit bd8d9c9b. Discussion: https://postgr.es/m/CAH2-WzmvwVKZ+0Z=RL_+g_aOku8QxWddDCXmtyLj02y+nYaD0g@mail.gmail.com
2025-12-10Return TIDs in desc order during backwards scans.Peter Geoghegan
Always return TIDs in descending order when returning groups of TIDs from an nbtree posting list tuple during nbtree backwards scans. This makes backwards scans tend to require fewer buffer hits, since the scan is less likely to repeatedly pin and unpin the same heap page/buffer (we'll get exactly as many buffer hits as we get with a similar forwards scan case). Commit 0d861bbb, which added nbtree deduplication, originally did things this way to avoid interfering with _bt_killitems's approach to setting LP_DEAD bits on posting list tuples. _bt_killitems makes a soft assumption that it can always iterate through posting lists in ascending TID order, finding corresponding killItems[]/so->currPos.items[] entries in that same order. This worked out because of the prior _bt_readpage backwards scan behavior. If we just changed the backwards scan posting list logic in _bt_readpage, without altering _bt_killitems itself, it would break its soft assumption. Avoid that problem by sorting the so->killedItems[] array at the start of _bt_killitems. That way the order that dead items are saved in from btgettuple can't matter; so->killedItems[] will always be in the same order as so->currPos.items[] in the end. Since so->currPos.items[] is now always in leaf page order, regardless of the scan direction used within _bt_readpage, and since so->killedItems[] is always in that same order, the _bt_killitems loop can continue to make a uniform assumption about everything being in page order. In fact, sorting like this makes the previous soft assumption about item order into a hard invariant. Also deduplicate the so->killedItems[] array after it is sorted. That way there's no risk of the _bt_killitems loop becoming confused by a duplicate dead item/TID. This was possible in cases that involved a scrollable cursor that encountered the same dead TID more than once (within the same leaf page/so->currPos context). This doesn't come up very much in practice, but it seems best to be as consistent as possible about how and when _bt_killitems will LP_DEAD-mark index tuples. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Wut2pKvbW-u3hJ_LXwsYeiXHiW8oN1GfbKPavcGo8Ow@mail.gmail.com
2025-12-10Add comment about keeping PD_ALL_VISIBLE and VM in syncMelanie Plageman
The comment above heap_xlog_visible() about the critical integrity requirement for PD_ALL_VISIBLE and the visibility map should also be in heap_xlog_prune_freeze() where we set PD_ALL_VISIBLE. Oversight in add323da40a6bf9e Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-12-10Simplify vacuum visibility assertionMelanie Plageman
Phase I vacuum gives the page a once-over after pruning and freezing to check that the values of all_visible and all_frozen agree with the result of heap_page_is_all_visible(). This is meant to keep the logic in phase I for determining visibility in sync with the logic in phase III. Rewrite the assertion to avoid an Assert(false). Suggested by Andres Freund. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/mhf4vkmh3j57zx7vuxp4jagtdzwhu3573pgfpmnjwqa6i6yj5y%40sy4ymcdtdklo
2025-12-10Use palloc_object() and palloc_array() in backend codeMichael Paquier
The idea is to encourage more the use of these new routines across the tree, as these offer stronger type safety guarantees than palloc(). This batch of changes includes most of the trivial changes suggested by the author for src/backend/. A total of 334 files are updated here. Among these files, 48 of them have their build change slightly; these are caused by line number changes as the new allocation formulas are simpler, shaving around 100 lines of code in total. Similar work has been done in 0c3c5c3b06a3 and 31d3847a37be. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
2025-12-09Add mode and started_by columns to pg_stat_progress_vacuum view.Masahiko Sawada
The new columns, mode and started_by, indicate the vacuum mode ('normal', 'aggressive', or 'failsafe') and the initiator of the vacuum ('manual', 'autovacuum', or 'autovacuum_wraparound'), respectively. This allows users and monitoring tools to better understand VACUUM behavior. Bump catalog version. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Yu Wang <wangyu_runtime@163.com> Discussion: https://postgr.es/m/CAOzEurQcOY-OBL_ouEVfEaFqe_md3vB5pXjR_m6L71Dcp1JKCQ@mail.gmail.com
2025-12-09Add wait event for the group commit delay before WAL flushHeikki Linnakangas
Author: Rafia Sabih <rafia.pghackers@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://www.postgresql.org/message-id/CA%2BFpmFf-hWXtrC0Q3Cr_Xo78zuP_M_VC5xgWPOYOkwqOD0T8eg@mail.gmail.com
2025-12-09Widen MultiXactOffset to 64 bitsHeikki Linnakangas
This eliminates MultiXactOffset wraparound and the 2^32 limit on the total number of multixid members. Multixids are still limited to 2^31, but this is a nice improvement because 'members' can grow much faster than the number of multixids. On such systems, you can now run longer before hitting hard limits or triggering anti-wraparound vacuums. Not having to deal with MultiXactOffset wraparound also simplifies the code and removes some gnarly corner cases. We no longer need to perform emergency anti-wraparound freezing because of running out of 'members' space, so the offset stop limit is gone. But you might still not want 'members' to consume huge amounts of disk space. For that reason, I kept the logic for lowering vacuum's multixid freezing cutoff if a large amount of 'members' space is used. The thresholds for that are roughly the same as the "safe" and "danger" thresholds used before, 2 billion transactions and 4 billion transactions. This keeps the behavior for the freeze cutoff roughly the same as before. It might make sense to make this smarter or configurable, now that the threshold is only needed to manage disk usage, but that's left for the future. Add code to pg_upgrade to convert multitransactions from the old to the new format, rewriting the pg_multixact SLRU files. Because pg_upgrade now rewrites the files, we can get rid of some hacks we had put in place to deal with old bugs and upgraded clusters. Bump catalog version for the pg_multixact/offsets format change. Author: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
2025-12-09Move pg_multixact SLRU page format definitions to a separate headerHeikki Linnakangas
This makes them accessible from pg_upgrade, needed by the next commit. I'm doing this mechanical move as a separate commit to make the next commit's changes to these definitions more obvious. Author: Maxim Orlov <orlovmg@gmail.com> Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com
2025-12-09Remove useless casts in format argumentsPeter Eisentraut
There were a number of useless casts in format arguments, either where the input to the cast was already in the right type, or seemingly uselessly casting between types instead of just using the right format placeholder to begin with. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09Remove unnecessary casts in printf format arguments (%zu/%zd)Peter Eisentraut
Many of these are probably left over from before use of %zu/%zd was portable. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09Doc: fix typo in hash index documentationDavid Rowley
Plus a similar fix to the README. Backpatch as far back as the sgml issue exists. The README issue does exist in v14, but that seems unlikely to harm anyone. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/ed3db7ea-55b4-4809-86af-81ad3bb2c7d3@gmail.com Backpatch-through: 15
2025-12-08Avoid pointer chasing in _bt_readpage inner loop.Peter Geoghegan
Make _bt_readpage pass down the current scan direction to various utility functions within its pstate variable. Also have _bt_readpage work off of a local copy of scan->ignore_killed_tuples within its per-tuple loop (rather than using scan->ignore_killed_tuples directly). Testing has shown that this significantly benefits large range scans, which are naturally able to take full advantage of the pstate.startikey optimization added by commit 8a510275. Running a pgbench script with a "SELECT abalance FROM pgbench_accounts WHERE aid BETWEEN ..." query shows an increase in transaction throughput of over 5%. There also appears to be a small performance benefit when running pgbench's built-in select-only script. Follow-up to commit 65d6acbc. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com
2025-12-08Relocate _bt_readpage and related functions.Peter Geoghegan
Quite a bit of code within nbtutils.c is only called by _bt_readpage. Move _bt_readpage and all of the nbtutils.c functions it depends on into a new .c file, nbtreadpage.c. Also reorder some of the functions within the new file for clarity. This commit has no functional impact. It is strictly mechanical. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com
2025-12-08Use PGAlignedXLogBlock for some code simplificationPeter Eisentraut
The code in BootStrapXLOG() and in pg_test_fsync.c tried to align WAL buffers in complicated ways. Also, they still used XLOG_BLCKSZ for the alignment, even though that should now be PG_IO_ALIGN_SIZE. This can now be simplified and made more consistent by using PGAlignedXLogBlock, either directly in BootStrapXLOG() and using alignas in pg_test_fsync.c. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/f462a175-b608-44a1-b428-bdf351e914f4%40eisentraut.org
2025-12-05Fix setting next multixid's offset at offset wraparoundHeikki Linnakangas
In commit 789d65364c, we started updating the next multixid's offset too when recording a multixid, so that it can always be used to calculate the number of members. I got it wrong at offset wraparound: we need to skip over offset 0. Fix that. Discussion: https://www.postgresql.org/message-id/d9996478-389a-4340-8735-bfad456b313c@iki.fi Backpatch-through: 14
2025-12-04Suppress spurious Coverity warning in prune freeze logicMelanie Plageman
Adjust the prune_freeze_setup() parameter types of new_relfrozen_xid and new_relmin_mxid to prevent misleading Coverity analysis. heap_page_prune_and_freeze() compared these values against NULL when passing them to prune_freeze_setup(), causing Coverity to assume they could be NULL and flag a possible null-pointer dereference later, even though it occurs inside a directly related conditional. Reported-by: Coverity Author: Melanie Plageman <melanieplageman@gmail.com>
2025-12-04Remove no longer needed casts from PointerPeter Eisentraut
These casts used to be required when Pointer was char *, but now it's void * (commit 1b2bb5077e9), so they are not needed anymore. Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-04Remove no longer needed casts to PointerPeter Eisentraut
These casts used to be required when Pointer was char *, but now it's void * (commit 1b2bb5077e9), so they are not needed anymore. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-04Fix incorrect assertion bound in WaitForLSN()Alexander Korotkov
The assertion checking MyProcNumber used MaxBackends as the upper bound, but the procInfos array is allocated with size MaxBackends + NUM_AUXILIARY_PROCS. This inconsistency would cause a false assertion failure if an auxiliary process calls WaitForLSN(). Author: Xuneng Zhou <xunengzhou@gmail.com>
2025-12-03Set next multixid's offset when creating a new multixidHeikki Linnakangas
With this commit, the next multixid's offset will always be set on the offsets page, by the time that a backend might try to read it, so we no longer need the waiting mechanism with the condition variable. In other words, this eliminates "corner case 2" mentioned in the comments. The waiting mechanism was broken in a few scenarios: - When nextMulti was advanced without WAL-logging the next multixid. For example, if a later multixid was already assigned and WAL-logged before the previous one was WAL-logged, and then the server crashed. In that case the next offset would never be set in the offsets SLRU, and a query trying to read it would get stuck waiting for it. Same thing could happen if pg_resetwal was used to forcibly advance nextMulti. - In hot standby mode, a deadlock could happen where one backend waits for the next multixid assignment record, but WAL replay is not advancing because of a recovery conflict with the waiting backend. The old TAP test used carefully placed injection points to exercise the old waiting code, but now that the waiting code is gone, much of the old test is no longer relevant. Rewrite the test to reproduce the IPC/MultixactCreation hang after crash recovery instead, and to verify that previously recorded multixids stay readable. Backpatch to all supported versions. In back-branches, we still need to be able to read WAL that was generated before this fix, so in the back-branches this includes a hack to initialize the next offsets page when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the WAL is not compatible. Author: Andrey Borodin <amborodin@acm.org> Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru Backpatch-through: 14
2025-12-03Use "foo(void)" for definitions of functions with no parameters.Nathan Bossart
Standard practice in PostgreSQL is to use "foo(void)" instead of "foo()", as the latter looks like an "old-style" function declaration. Similar changes were made in commits cdf4b9aff2, 0e72b9d440, 7069dbcc31, f1283ed6cc, 7b66e2c086, e95126cf04, and 9f7c527af3. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/aTBObQPg%2Bps5I7vl%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-03Don't rely on pointer arithmetic with Pointer typePeter Eisentraut
The comment for the Pointer type says 'XXX Pointer arithmetic is done with this, so it can't be void * under "true" ANSI compilers.'. This fixes that. Change from Pointer to use char * explicitly where pointer arithmetic is needed. This makes the meaning of the code clearer locally and removes a dependency on the actual definition of the Pointer type. (The definition of the Pointer type is not changed in this commit.) Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-03Remove useless casts to PointerPeter Eisentraut
in arguments of memcpy() and memmove() calls Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-02Add a test for half-dead pages in B-tree indexesHeikki Linnakangas
To increase our test coverage in general, and because I will use this in the next commit to test a bug we currently have in amcheck. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/33e39552-6a2a-46f3-8b34-3f9f8004451f@garret.ru