summaryrefslogtreecommitdiff
path: root/src/backend/access
AgeCommit message (Collapse)Author
2024-03-11Set all_visible_according_to_vm correctly with DISABLE_PAGE_SKIPPINGHeikki Linnakangas
It's important for 'all_visible_according_to_vm' to correctly reflect whether the VM bit is set or not, even when we are not trusting the VM to skip pages, because contrary to what the comment said, lazy_scan_prune() relies on it. If it's incorrectly set to 'false', when the VM bit is in fact set, lazy_scan_prune() will try to set the VM bit again and dirty the page unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all heap pages were dirtied, even if there were no changes. We would also fail to clear any VM bits that were set incorrectly. This was broken in commit 980ae17310, so backpatch to v16. Backpatch-through: 16 Reviewed-by: Melanie Plageman, Peter Geoghegan Discussion: https://www.postgresql.org/message-id/3df2b582-dc1c-46b6-99b6-38eddd1b2784@iki.fi
2024-02-25Promote assertion about !ReindexIsProcessingIndex to runtime error.Tom Lane
When this assertion was installed (in commit d2f60a3ab), I thought it was only for catching server logic errors that caused accesses to catalogs that were undergoing index rebuilds. However, it will also fire in case of a user-defined index expression that attempts to access its own table. We occasionally see reports of people trying to do that, and typically getting unintelligible low-level errors as a result. We can provide a more on-point message by making this a regular runtime check. While at it, adjust the similar error check in systable_beginscan_ordered to use the same message text. That one is (probably) not reachable without a coding bug, but we might as well use a translatable message if we have one. Per bug #18363 from Alexander Lakhin. Back-patch to all supported branches. Discussion: https://postgr.es/m/18363-e3598a5a572d0699@postgresql.org
2024-01-29Fix locking when fixing an incomplete split of a GIN internal pageHeikki Linnakangas
ginFinishSplit() expects the caller to hold an exclusive lock on the buffer, but when finishing an earlier "leftover" incomplete split of an internal page, the caller held a shared lock. That caused an assertion failure in MarkBufferDirty(). Without assertions, it could lead to corruption if two backends tried to complete the split at the same time. On master, add a test case using the new injection point facility. Report and analysis by Fei Changhong. Backpatch the fix to all supported versions. Reviewed-by: Fei Changhong, Michael Paquier Discussion: https://www.postgresql.org/message-id/tencent_A3CE810F59132D8E230475A5F0F7A08C8307@qq.com
2024-01-29Add more LOG messages when starting and ending recovery from a backupMichael Paquier
Three LOG messages are added in the recovery code paths, providing information that can be useful to track corruption issues depending on the state of the cluster, telling that: - Recovery has started from a backup_label. - Recovery is restarting from a backup start LSN, without a backup_label. - Recovery has completed from a backup. This was originally applied on HEAD as of 1d35f705e191, and there is consensus that this can be useful for older versions. This applies cleanly down to 15, so do it down to this version for now (older versions have heavily refactored the WAL recovery paths, making the change less straight-forward to do). Author: Andres Freund Reviewed-by: David Steele, Laurenz Albe, Michael Paquier Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp2pp@awork3.anarazel.de Backpatch-through: 15
2024-01-18Add try_index_open(), conditional variant of index_open()Michael Paquier
try_index_open() is able to open an index if its relkind fits, except that it would return NULL instead of generated an error if the relation does not exist. This new routine will be used by an upcoming patch to make REINDEX on partitioned relations more robust when an index in a partition tree is dropped. Extracted from a larger patch by the same author. Author: Fei Changhong Discussion: https://postgr.es/m/tencent_6A52106095ACDE55333E3AD33F304C0C3909@qq.com Backpatch-through: 14
2023-12-21Avoid trying to fetch metapage of an SPGist partitioned index.Tom Lane
This is necessary when spgcanreturn() is invoked on a partitioned index, and the failure might be reachable in other scenarios as well. The rest of what spgGetCache() does is perfectly sensible for a partitioned index, so we should allow it to go through. I think the main takeaway from this is that we lack sufficient test coverage for non-btree partitioned indexes. Therefore, I added simple test cases for brin and gin as well as spgist (hash and gist AMs were covered already in indexing.sql). Per bug #18256 from Alexander Lakhin. Although the known test case only fails since v16 (3c569049b), I've got no faith at all that there aren't other ways to reach this problem; so back-patch to all supported branches. Discussion: https://postgr.es/m/18256-0b0e1b6e4a620f1b@postgresql.org
2023-12-12Prevent tuples to be marked as dead in subtransactions on standbysMichael Paquier
Dead tuples are ignored and are not marked as dead during recovery, as it can lead to MVCC issues on a standby because its xmin may not match with the primary. This information is tracked by a field called "xactStartedInRecovery" in the transaction state data, switched on when starting a transaction in recovery. Unfortunately, this information was not correctly tracked when starting a subtransaction, because the transaction state used for the subtransaction did not update "xactStartedInRecovery" based on the state of its parent. This would cause index scans done in subtransactions to return inconsistent data, depending on how the xmin of the primary and/or the standby evolved. This is broken since the introduction of hot standby in efc16ea52067, so backpatch all the way down. Author: Fei Changhong Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/tencent_C4D907A5093C071A029712E73B43C6512706@qq.com Backpatch-through: 12
2023-12-08Fix potential pointer overflow in xlogreader.c.Thomas Munro
While checking if a record could fit in the circular WAL decoding buffer, the coding from commit 3f1ce973 used arithmetic that could overflow. 64 bit systems were unaffected for various technical reasons, which probably explains the lack of problem reports. Likewise for 32 bit systems running known 32 bit kernels. The systems at risk of problems appear to be 32 bit processes running on 64 bit kernels, with unlucky placement in memory. Per complaint from GCC -fsanitize=undefined -m32, while testing variations of 039_end_of_wal.pl. Back-patch to 15. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGKH0oRPOX7DhiQ_b51sM8HqcPp2J3WA-Oen%3DdXog%2BAGGQ%40mail.gmail.com
2023-12-06Fix compilation on Windows with WAL_DEBUGMichael Paquier
This has been broken since b060dbe0001a that has reworked the callback mechanism of XLogReader, most likely unnoticed because any form of development involving WAL happens on platforms where this compiles fine. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVF14WKQMFwcJ=3okVDhiXpuK5f7YdT+BdYXbbypMHqWA@mail.gmail.com Backpatch-through: 13
2023-11-28Fix assertions with RI triggers in heap_update and heap_delete.Heikki Linnakangas
If the tuple being updated is not visible to the crosscheck snapshot, we return TM_Updated but the assertions would not hold in that case. Move them to before the cross-check. Fixes bug #17893. Backpatch to all supported versions. Author: Alexander Lakhin Backpatch-through: 12 Discussion: https://www.postgresql.org/message-id/17893-35847009eec517b5%40postgresql.org
2023-11-13Don't release index root page pin in ginFindParents().Tom Lane
It's clearly stated in the comments that ginFindParents() must keep the pin on the index's root page that's associated with the topmost GinBtreeStack item. However, the code path for the case that the desired downlink has been pushed down to the next index level ignored this proviso, and would release the pin anyway if we were still examining the root level. That led to an assertion failure or "buffer NNNN is not owned by resource owner" error later, when we try to release the pin again at the end of the insertion. This is quite hard to reproduce, since it can only happen if an index root page split occurs concurrently with our own insertion. Thanks to Jeff Janes for finding a test case that triggers it often enough to allow investigation. This has been there since the beginning of GIN, so back-patch to all supported branches. Discussion: https://postgr.es/m/CAMkU=1yCAKtv86dMrD__Ja-7KzjE=uMeKX8y__cx5W-OEWy2ow@mail.gmail.com
2023-11-10Ensure we use the correct spelling of "ensure"David Rowley
We seem to have accidentally used "insure" in a few places. Correct that. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pv0biqrhA3pMhu40aDsj343mTsD75khKnHsLqR8P04f=Q@mail.gmail.com Backpatch-through: 12, oldest supported version
2023-11-08Enlarge assertion in bloom_init() for false_positive_rateMichael Paquier
false_positive_rate is a parameter that can be set with the bloom opclass in BRIN, and setting it to a value of exactly 0.25 would trigger an assertion in the first INSERT done on the index with value set. The assertion changed here relied on BLOOM_{MIN|MAX}_FALSE_POSITIVE_RATE that are somewhat arbitrary values, and specifying an out-of-range value would also trigger a failure when defining such an index. So, as-is, the assertion was just doubling on the min-max check of the reloption. This is now enlarged to check that it is a correct percentage value, instead, based on a suggestion by Tom Lane. Author: Alexander Lakhin Reviewed-by: Tom Lane, Shihao Zhong Discussion: https://postgr.es/m/17969-a6c54de48026d694@postgresql.org Backpatch-through: 14
2023-10-31doc: 1-byte varlena headers can be used for user PLAIN storageBruce Momjian
This also updates some C comments. Reported-by: suchithjn22@gmail.com Discussion: https://postgr.es/m/167336599095.2667301.15497893107226841625@wrigleys.postgresql.org Author: Laurenz Albe (doc patch) Backpatch-through: 11
2023-10-30Diagnose !indisvalid in more SQL functions.Noah Misch
pgstatindex failed with ERRCODE_DATA_CORRUPTED, of the "can't-happen" class XX. The other functions succeeded on an empty index; they might have malfunctioned if the failed index build left torn I/O or other complex state. Report an ERROR in statistics functions pgstatindex, pgstatginindex, pgstathashindex, and pgstattuple. Report DEBUG1 and skip all index I/O in maintenance functions brin_desummarize_range, brin_summarize_new_values, brin_summarize_range, and gin_clean_pending_list. Back-patch to v11 (all supported versions). Discussion: https://postgr.es/m/20231001195309.a3@google.com
2023-10-27Fix minmax-multi distance for extreme interval valuesTomas Vondra
When calculating distance for interval values, the code mostly mimicked interval_mi, i.e. it built a new interval value for the difference. That however does not work for sufficiently distant interval values, when the difference overflows the interval range. Instead, we can calculate the distance directly, without constructing the intermediate (and unnecessary) interval value. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Dean Rasheed Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27Fix minmax-multi on infinite date/timestamp valuesTomas Vondra
Make sure that infinite values in date/timestamp columns are treated as if in infinite distance. Infinite values should not be merged with other values, leaving them as outliers. The code however returned distance 0 in this case, so that infinite values were merged first. While this does not break the index (i.e. it still produces correct query results), it may make it much less efficient. We don't need explicit handling of infinite date/timestamp values when calculating distances, because those values are represented as extreme but regular values (e.g. INT64_MIN/MAX for the timestamp type). We don't need an exact distance, just a value that is much larger than distanced between regular values. With the added cast to double values, we can simply subtract the values. The regression test queries a value in the "gap" and checks the range was properly eliminated by the BRIN index. This only affects minmax-multi indexes on timestamp/date columns with infinite values, which is not very common in practice. The affected indexes may need to be rebuilt. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27Fix calculation in brin_minmax_multi_distance_dateTomas Vondra
When calculating the distance between date values, make sure to subtract them in the right order, i.e. (larger - smaller). The distance is used to determine which values to merge, and is expected to be a positive value. The code unfortunately did the subtraction in the opposite order, i.e. (smaller - larger), thus producing negative values and merging values the most distant values first. The resulting index is correct (i.e. produces correct results), but may be significantly less efficient. This affects all minmax-multi indexes on date columns. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27Fix overflow when calculating timestamp distance in BRINTomas Vondra
When calculating distances for timestamp values for BRIN minmax-multi indexes, we need to be careful about overflows for extreme values. If the value overflows into a negative value, the index may be inefficient. The new regression test checks this for the timestamp type by adding a table with enough values to force range compaction/merging. The values are close to min/max, which means a risk of overflow. Fixed by converting the int64 values to double first, before calculating the distance. This prevents the overflow. We may lose some precision, of course, but that's good enough. In the worst case we build a slightly less efficient index, but for large distances this won't matter. This only affects minmax-multi indexes on timestamp columns, with ranges containing values sufficiently distant to cause an overflow. That seems like a fairly rare case in practice. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-16Move extra code out of the Pre/PostRestoreCommand() section.Nathan Bossart
If SIGTERM is received within this section, the startup process will immediately proc_exit() in the signal handler, so it is inadvisable to include any more code than is required there (as such code is unlikely to be compatible with doing proc_exit() in a signal handler). This commit moves the code recently added to this section (see 1b06d7bac9 and 7fed801135) to outside of the section. This ensures that the startup process only calls proc_exit() in its SIGTERM handler for the duration of the system() call, which is how this code worked from v8.4 to v14. Reported-by: Michael Paquier, Thomas Munro Analyzed-by: Andres Freund Suggested-by: Tom Lane Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13 Backpatch-through: 15
2023-10-16Fix comment from commit 22655aa231.Thomas Munro
Per automated complaint from BF animal koel this needed to be re-indented, but there was also a typo. Back-patch to 16.
2023-10-13Fix bulk table extension when copying into multiple partitionsAndres Freund
When COPYing into a partitioned table that does now permit the use of table_multi_insert(), we could error out with ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes because BulkInsertState->next_free was not reset between partitions. This problem occurred only when not able to use table_multi_insert(), as a dedicated BulkInsertState for each partition is used in that case. The bug was introduced in 00d1e02be24, but it was hard to hit at that point, as commonly bulk relation extension is not used when not using table_multi_insert(). It became more likely after 82a4edabd27, which expanded the use of bulk extension. To fix the bug, reset the bulk relation extension state in BulkInsertState in ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fcf) to tackle a very similar issue. Obviously the name is not quite correct, but there might be external callers, and bulk insert state needs to be reset in precisely in the situations that ReleaseBulkInsertStatePin() already needed to be called. Medium term the better fix likely is to disallow reusing BulkInsertState across relations. Add a test that, without the fix, reproduces #18130 in most configurations. The test also catches the problem fixed in b1ecb9b3fcf when run with small shared_buffers. Reported-by: Ivan Kolombet <enderstd@gmail.com> Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us> Analyzed-by: Andres Freund <andres@anarazel.de> Bug: #18130 Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us Backpatch: 16-
2023-10-10Fix bug in GenericXLogFinish().Jeff Davis
Mark the buffers dirty before writing WAL. Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi Reviewed-by: Heikki Linnakangas Backpatch-through: 11
2023-10-03Fail hard on out-of-memory failures in xlogreader.cMichael Paquier
This commit changes the WAL reader routines so as a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in walreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c, relying on plain palloc() calls. The previous behavior could make WAL replay finish too early than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported. This relies on bae868caf222, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that 3f1ce973467a has been using a plain palloc() in this code for some time now. Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from 0ffe11abd3a0, where the record length retrieved from its header was not entirely trustable. Reported-by: Ethan Mertz Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz Backpatch-through: 12
2023-09-28Fix btmarkpos/btrestrpos array key wraparound bug.Peter Geoghegan
nbtree's mark/restore processing failed to correctly handle an edge case involving array key advancement and related search-type scan key state. Scans with ScalarArrayScalarArrayOpExpr quals requiring mark/restore processing (for a merge join) could incorrectly conclude that an affected array/scan key must not have advanced during the time between marking and restoring the scan's position. As a result of all this, array key handling within btrestrpos could skip a required call to _bt_preprocess_keys(). This confusion allowed later primitive index scans to overlook tuples matching the true current array keys. The scan's search-type scan keys would still have spurious values corresponding to the final array element(s) -- not values matching the first/now-current array element(s). To fix, remember that "array key wraparound" has taken place during the ongoing btrescan in a flag variable stored in the scan's state, and use that information at the point where btrestrpos decides if another call to _bt_preprocess_keys is required. Oversight in commit 70bc5833, which taught nbtree to handle array keys during mark/restore processing, but missed this subtlety. That commit was itself a bug fix for an issue in commit 9e8da0f7, which taught nbtree to handle ScalarArrayOpExpr quals natively. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com Backpatch: 11- (all supported branches).
2023-09-28Fix typo in src/backend/access/transam/README.Etsuro Fujita
2023-09-26Fix another bug in parent page splitting during GiST index build.Heikki Linnakangas
Yet another bug in the ilk of commits a7ee7c851 and 741b88435. In 741b88435, we took care to clear the memorized location of the downlink when we split the parent page, because splitting the parent page can move the downlink. But we missed that even *updating* a tuple on the parent can move it, because updating a tuple on a gist page is implemented as a delete+insert, so the updated tuple gets moved to the end of the page. This commit fixes the bug in two different ways (belt and suspenders): 1. Clear the downlink when we update a tuple on the parent page, even if it's not split. This the same approach as in commits a7ee7c851 and 741b88435. I also noticed that gistFindCorrectParent did not clear the 'downlinkoffnum' when it stepped to the right sibling. Fix that too, as it seems like a clear bug even though I haven't been able to find a test case to hit that. 2. Change gistFindCorrectParent so that it treats 'downlinkoffnum' merely as a hint. It now always first checks if the downlink is still at that location, and if not, it scans the page like before. That's more robust if there are still more cases where we fail to clear 'downlinkoffnum' that we haven't yet uncovered. With this, it's no longer necessary to meticulously clear 'downlinkoffnum', so this makes the previous fixes unnecessary, but I didn't revert them because it still seems nice to clear it when we know that the downlink has moved. Also add the test case using the same test data that Alexander posted. I tried to reduce it to a smaller test, and I also tried to reproduce this with different test data, but I was not able to, so let's just include what we have. Backpatch to v12, like the previous fixes. Reported-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/18129-caca016eaf0c3702@postgresql.org
2023-09-26Fix edge-case for xl_tot_len broken by bae868ca.Thomas Munro
bae868ca removed a check that was still needed. If you had an xl_tot_len at the end of a page that was too small for a record header, but not big enough to span onto the next page, we'd immediately perform the CRC check using a bogus large length. Because of arbitrary coding differences between the CRC implementations on different platforms, nothing very bad happened on common modern systems. On systems using the _sb8.c fallback we could segfault. Restore that check, add a new assertion and supply a test for that case. Back-patch to 12, like bae868ca. Tested-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com
2023-09-23Don't trust unvalidated xl_tot_len.Thomas Munro
xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse. Since 70b4f82a4b5, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail. We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic. The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged. This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but it wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk. With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work. Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl. Back-patch to 12. We decided not to put this change into the final release of 11. Author: Thomas Munro <thomas.munro@gmail.com> Author: Michael Paquier <michael@paquier.xyz> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code) Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
2023-09-21Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions.Tom Lane
In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate the current transaction's properties to the new transaction if there was any open subtransaction (unreleased savepoint). Instead, some previous transaction's properties would be restored. This is because the "if (s->chain)" check in CommitTransactionCommand examined the wrong instance of the "chain" flag and falsely concluded that it didn't need to save transaction properties. Our regression tests would have noticed this, except they used identical transaction properties for multiple tests in a row, so that the faulty behavior was not distinguishable from correct behavior. Commit 12d768e70 fixed the problem in v15 and later, but only rather accidentally, because I removed the "if (s->chain)" test to avoid a compiler warning, while not realizing that the warning was flagging a real bug. In v14 and before, remove the if-test and save transaction properties unconditionally; just as in the newer branches, that's not expensive enough to justify thinking harder. Add the comment and extra regression test to v15 and later to forestall any future recurrence, but there's no live bug in those branches. Patch by me, per bug #18118 from Liu Xiang. Back-patch to v12 where the AND CHAIN feature was added. Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org
2023-09-19Fix GiST README's explanation of the NSN cross-check.Heikki Linnakangas
The text got the condition backwards, it's "NSN > LSN", not "NSN < LSN". While we're at it, expand it a little for clarity. Reviewed-by: Daniel Gustafsson Discussion: https://www.postgresql.org/message-id/4cb46e18-e688-524a-0f73-b1f03ed5d6ee@iki.fi
2023-08-31Report syncscan position at end of scan.Heikki Linnakangas
The comment in heapgettup_advance_block() says that it reports the scan position before checking for end of scan, but that didn't match the code. The code was refactored in commit 7ae0ab0ad9, which inadvertently changed the order of the check and reporting. Change it back. This caused a few regression test failures with a small shared_buffers setting like 10 MB. The 'portals' and 'cluster' tests perform seqscans that are large enough that sync seqscans kick in. When the sync scan position is not updated at end of scan, the next seq scan doesn't start at the beginning of the table, and the test queries are sensitive to that. Reviewed-by: Melanie Plageman, David Rowley Discussion: https://www.postgresql.org/message-id/6f991389-ae22-d844-a9d8-9aceb7c01a9a@iki.fi Backpatch-through: 16
2023-08-23Fix _bt_allequalimage() call within critical section.Heikki Linnakangas
_bt_allequalimage() does complicated things, so it's not OK to call it in a critical section. Per buildfarm failure on 'prion', which uses -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi Backpatch-through: 16, like commit ccadf73163 that introduced this
2023-08-23Use the buffer cache when initializing an unlogged index.Heikki Linnakangas
Some of the ambuildempty functions used smgrwrite() directly, followed by smgrimmedsync(). A few small problems with that: Firstly, one is supposed to use smgrextend() when extending a relation, not smgrwrite(). It doesn't make much difference in production builds. smgrextend() updates the relation size cache, so you miss that, but that's harmless because we never use the cached relation size of an init fork. But if you compile with CHECK_WRITE_VS_EXTEND, you get an assertion failure. Secondly, the smgrwrite() calls were performed before WAL-logging, so the page image written to disk had 0/0 as the LSN, not the LSN of the WAL record. That's also harmless in practice, but seems sloppy. Thirdly, it's better to use the buffer cache, because then you don't need to smgrimmedsync() the relation to disk, which adds latency. Bypassing the cache makes sense for bulk operations like index creation, but not when you're just initializing an empty index. Creation of unlogged tables is hardly performance bottleneck in any real world applications, but nevertheless. Backpatch to v16, but no further. These issues should be harmless in practice, so better to not rock the boat in older branches. Reviewed-by: Robert Haas Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi
2023-08-23ExtendBufferedWhat -> BufferManagerRelation.Thomas Munro
Commit 31966b15 invented a way for functions dealing with relation extension to accept a Relation in online code and an SMgrRelation in recovery code. It seems highly likely that future bufmgr.c interfaces will face the same problem, and need to do something similar. Generalize the names so that each interface doesn't have to re-invent the wheel. Back-patch to 16. Since extension AM authors might start using the constructor macros once 16 ships, we agreed to do the rename in 16 rather than waiting for 17. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
2023-08-22Cache by-reference missing values in a long lived contextAndrew Dunstan
Attribute missing values might be needed past the lifetime of the tuple descriptors from which they are extracted. To avoid possibly using pointers for by-reference values which might thus be left dangling, we cache a datumCopy'd version of the datum in the TopMemoryContext. Since we first search for the value this only needs to be done once per session for any such value. Original complaint from Tom Lane, idea for mitigation by Andrew Dunstan, tweaked by Tom Lane. Backpatch to version 11 where missing values were introduced. Discussion: https://postgr.es/m/1306569.1687978174@sss.pgh.pa.us
2023-08-14hio: Take number of prior relation extensions into accountAndres Freund
The new relation extension logic, introduced in 00d1e02be24, could lead to slowdowns in some scenarios. E.g., when loading narrow rows into a table using COPY, the caller of RelationGetBufferForTuple() will only request a small number of pages. Without concurrency, we just extended using pwritev() in that case. However, if there is *some* concurrency, we switched between extending by a small number of pages and a larger number of pages, depending on the number of waiters for the relation extension logic. However, some filesystems, XFS in particular, do not perform well when switching between extending files using fallocate() and pwritev(). To avoid that issue, remember the number of prior relation extensions in BulkInsertState and extend more aggressively if there were prior relation extensions. That not just avoids the aforementioned slowdown, but also leads to noticeable performance gains in other situations, primarily due to extending more aggressively when there is no concurrency. I should have done it this way from the get go. Reported-by: Masahiko Sawada <sawada.mshk@gmail.com> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com Backpatch: 16-, where the new relation extension code was added
2023-08-12Fix off-by-one in XLogRecordMaxSize check.Noah Misch
pg_logical_emit_message(false, '_', repeat('x', 1069547465)) failed with self-contradictory message "WAL record would be 1069547520 bytes (of maximum 1069547520 bytes)". There's no particular benefit from allowing or denying one byte in either direction; XLogRecordMaxSize could rise a few megabytes without trouble. Hence, this is just for cleanliness. Back-patch to v16, where this check first appeared.
2023-07-18Fix indentation in twophase.cMichael Paquier
This has been missed in cb0cca1, noticed before buildfarm member koel has been able to complain while poking at a different patch. Like the other commit, backpatch all the way down to limit the odds of merge conflicts. Backpatch-through: 11
2023-07-18Fix recovery of 2PC transaction during crash recoveryMichael Paquier
A crash in the middle of a checkpoint with some two-phase state data already flushed to disk by this checkpoint could cause a follow-up crash recovery to recover twice the same transaction, once from what has been found in pg_twophase/ at the beginning of recovery and a second time when replaying its corresponding record. This would lead to FATAL failures in the startup process during recovery, where the same transaction would have a state recovered twice instead of once: LOG: recovering prepared transaction 731 from shared memory LOG: recovering prepared transaction 731 from shared memory FATAL: lock ExclusiveLock on object 731/0/0 is already held This issue is fixed by skipping the addition of any 2PC state coming from a record whose equivalent 2PC state file has already been loaded in TwoPhaseState at the beginning of recovery by restoreTwoPhaseData(), which is OK as long as the system has not reached a consistent state. The timing to get a messed up recovery processing is very racy, and would very unlikely happen. The thread that has reported the issue has demonstrated the bug using injection points to force a PANIC in the middle of a checkpoint. Issue introduced in 728bd99, so backpatch all the way down. Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: Michael Paquier Discussion: https://postgr.es/m/109e6994-b971-48cb-84f6-829646f18b4c.mengjuan.cmj@alibaba-inc.com Backpatch-through: 11
2023-07-10Message wording improvementsPeter Eisentraut
2023-07-07Document relaxed HOT for summarizing indexesTomas Vondra
Commit 19d8e2308b allowed a weaker check for HOT with summarizing indexes, but it did not update README.HOT. So do that now. Patch by Matthias van de Meent, minor changes by me. Backpatch to 16, where the optimization was introduced. Author: Matthias van de Meent Reviewed-by: Tomas Vondra Backpatch-through: 16 Discussion: https://postgr.es/m/CAEze2WiEOm8V+c9kUeYp2BPhbEc5s473fUf51xNeqvSFGv44Ew@mail.gmail.com
2023-07-04Fix race in SSI interaction with gin fast path.Thomas Munro
The ginfast.c code previously checked for conflicts in before locking the relevant buffer, leaving a window where a RW conflict could be missed. Re-order. There was also a place where buffer ID and block number were confused while trying to predicate-lock a page, noted by visual inspection. Back-patch to all supported releases. Fixes one more problem discovered with the reproducer from bug #17949, in this case when Dmitry tried other index types. Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com> Reported-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-04Fix race in SSI interaction with bitmap heap scan.Thomas Munro
When performing a bitmap heap scan, we don't want to miss concurrent writes that occurred after we observed the heap's rs_nblocks, but before we took predicate locks on index pages. Therefore, we can't skip fetching any heap tuples that are referenced by the index, because we need to test them all with CheckForSerializableConflictOut(). The old optimization that would ignore any references to blocks >= rs_nblocks gets in the way of that requirement, because it means that concurrent writes in that window are ignored. Removing that optimization shouldn't affect correctness at any isolation level, because any new tuples shouldn't be visible to an MVCC snapshot. There also shouldn't be any error-causing references to heap blocks past the end, because we should have held at least an AccessShareLock on the table before the index scan. It can't get smaller while our transaction is running. For now, though, we'll keep the optimization at lower levels to avoid making unnecessary changes in a bug fix. Back-patch to all supported releases. In release 11, the code is in a different place but not fundamentally different. Fixes one aspect of bug #17949. Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-04Fix race in SSI interaction with empty btrees.Thomas Munro
When predicate-locking btrees, we have a special case for completely empty btrees, since there is no page to lock. This was racy, because, without buffer lock held, a matching key could be inserted between the _bt_search() and the PredicateLockRelation() calls. Fix, by rechecking _bt_search() after taking the relation-level SIREAD lock, if using SERIALIZABLE isolation and an empty btree is discovered. Back-patch to all supported releases. Fixes one aspect of bug #17949. Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-03Silence "missing contrecord" error.Thomas Munro
Commit dd38ff28ad added a new error message "missing contrecord" when we fail to reassemble a record. Unfortunately that caused noisy messages to be logged by pg_waldump at end of segment, and by walsender when asked to shut down on a segment boundary. Remove the new error message, so that this condition signals end-of- WAL without a message. It's arguably a reportable condition that should not be silenced while performing crash recovery, but fixing that without introducing noise in the other cases will require more research. Back-patch to 15. Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/6a1df56e-4656-b3ce-4b7a-a9cb41df8189%40enterprisedb.com
2023-06-21nbtree VACUUM: cope with topparent inconsistencies.Peter Geoghegan
Avoid "right sibling %u of block %u is not next child" errors when vacuuming a corrupt nbtree index. Just LOG the issue and press on. That way VACUUM will have a decent chance of finishing off all required processing for the index (and for the table as a whole). This is similar to recent work from commit 5abff197, as well as work from commit 5b861baa (later backpatched as commit 43e409ce), which taught nbtree VACUUM to keep going when its "re-find" check fails. The hardening added by this commit takes place directly after the "re-find" check, right before the critical section for the first stage of page deletion. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=dayg0vjs4+er84TS9ami=csdzjpuiCGbEw=idhwqhzQ@mail.gmail.com Backpatch: 11- (all supported versions).
2023-06-20Enable archiving in recovery TAP test 009_twophase.plMichael Paquier
This is a follow-up of f663b00, that has been committed to v13 and v14, tweaking the TAP test for two-phase transactions so as it provides coverage for the bug that has been fixed. This change is done in its own commit for clarity, as v15 and HEAD did not show the problematic behavior, still missed coverage for it. While on it, this adds a comment about the dependency of the last partial segment rename and RecoverPreparedTransactions() at the end of recovery, as that can be easy to miss. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at Backpatch-through: 13
2023-06-10nbtree: Allocate new pages in separate function.Peter Geoghegan
Split nbtree's _bt_getbuf function is two: code that read locks or write locks existing pages remains in _bt_getbuf, while code that deals with allocating new pages is moved to a new, dedicated function called _bt_allocbuf. This simplifies most _bt_getbuf callers, since it is no longer necessary for them to pass a heaprel argument. Many of the changes to nbtree from commit 61b313e4 can be reverted. This minimizes the divergence between HEAD/PostgreSQL 16 and earlier release branches. _bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to _bt_getbuf. There are only 3 affected call sites, all of which continue to pass a heaprel for recovery conflict purposes. Note that nbtree's use of P_NEW was superficial; nbtree never actually relied on the P_NEW code paths in bufmgr.c, so this change is strictly mechanical. GiST already took the same approach; it has a dedicated function for allocating new pages called gistNewBuffer(). That factor allowed commit 61b313e4 to make much more targeted changes to GiST. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
2023-06-10Revert "Fix search_path to a safe value during maintenance operations."Jeff Davis
This reverts commit 05e17373517114167d002494e004fa0aa32d1fd1.