path: root/src/include/access
2025-02-02  Move CompareType to separate header file  (Peter Eisentraut)

We'll want to make use of it in more places, and we'd prefer to not have to include all of primnodes.h everywhere.

Author: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

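For context, a sketch of the kind of standalone comparison enum involved; the member list here is illustrative rather than a verbatim copy of the new header:

    /*
     * A small header holding only comparison semantics, so that callers
     * need not pull in all of primnodes.h.  Member set is illustrative.
     */
    typedef enum CompareType
    {
        COMPARE_LT,                 /* "<"  */
        COMPARE_LE,                 /* "<=" */
        COMPARE_EQ,                 /* "="  */
        COMPARE_GE,                 /* ">=" */
        COMPARE_GT,                 /* ">"  */
        COMPARE_NE,                 /* "<>" */
    } CompareType;
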
2025-02-01  Rename GistTranslateStratnum() to GistTranslateCompareType()  (Peter Eisentraut)

Follow-up to commit 630f9a43cec. The previous name had become confusing, because the function doesn't translate a strategy number; it translates a CompareType into a strategy number. We might add the inverse at some point, which would then probably be called something like GistTranslateStratnum.

Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

2025-01-25  At update of non-LP_NORMAL TID, fail instead of corrupting page header.  (Noah Misch)

The right mix of DDL and VACUUM could corrupt a catalog page header such that PageIsVerified() durably fails, requiring a restore from backup. This affects only catalogs that both have a syscache and have DDL code that uses syscache tuples to construct updates. One of the test permutations shows a variant not yet fixed.

This makes !TransactionIdIsValid(TM_FailureData.xmax) possible with TM_Deleted. I think core and PGXN are indifferent to that.

Per bug #17821 from Alexander Lakhin. Back-patch to v13 (all supported versions). The test case is v17+, since it uses INJECTION_POINT.

Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org

2025-01-25  Merge copies of converting an XID to a FullTransactionId.  (Noah Misch)

Assume twophase.c is the performance-sensitive caller, and preserve its choice of unlikely() branch hint. Add some retrospective rationale for that choice. Back-patch to v17, for the next commit to use it.

Reviewed (in earlier versions) by Michael Paquier.
Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com

2025-01-23  Convert macros to static inline functions (htup_details.h, itup.h)  (Peter Eisentraut)

Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517@enterprisedb.com

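An illustration of the conversion pattern, using HeapTupleHeaderGetNatts() from htup_details.h; the post-conversion form shown is a sketch of the pattern, not necessarily the committed signature:

    /* Before: a macro, with no type checking of its argument */
    #define HeapTupleHeaderGetNatts(tup) \
        ((tup)->t_infomask2 & HEAP_NATTS_MASK)

    /* After (sketch): a static inline with an explicit parameter type */
    static inline uint16
    HeapTupleHeaderGetNatts(const HeapTupleHeaderData *tup)
    {
        return tup->t_infomask2 & HEAP_NATTS_MASK;
    }
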
2025-01-23  Add some const decorations (htup.h)  (Peter Eisentraut)

Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517@enterprisedb.com

2025-01-20  Add some more use of Page/PageData rather than char *  (Peter Eisentraut)

Discussion: https://www.postgresql.org/message-id/flat/692ee0da-49da-4d32-8dca-da224cc2800e@eisentraut.org

2025-01-16  Add and use BitmapHeapScanDescData struct  (Melanie Plageman)

Move the members of HeapScanDescData that are specific to bitmap heap scans into a new struct, BitmapHeapScanDescData, which inherits from HeapScanDescData. This reduces the size of HeapScanDescData for other types of scans and will allow us to add more bitmap-heap-scan-specific members in the future without bloating HeapScanDescData.

Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me

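A hedged sketch of the struct-embedding ("inheritance") pattern used here; the members shown beyond the base are illustrative:

    typedef struct BitmapHeapScanDescData
    {
        HeapScanDescData rs_heap_base;  /* "base class"; must be first */

        /* members used only by bitmap heap scans */
        int           rs_cindex;    /* current tuple's index in vistuples */
        int           rs_ntuples;   /* number of visible tuples on page */
        OffsetNumber  rs_vistuples[MaxHeapTuplesPerPage];
    } BitmapHeapScanDescData;

    typedef struct BitmapHeapScanDescData *BitmapHeapScanDesc;

Because the base struct is the first member, a BitmapHeapScanDesc can be passed wherever a HeapScanDesc is expected.
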
2025-01-15  Fix cpluspluscheck for "Change gist stratnum function to use CompareType"  (Peter Eisentraut)

Commit 630f9a43cec introduced an enum forward declaration, which doesn't work in C++. To fix, just include the header file to get the type.

2025-01-15  Change gist stratnum function to use CompareType  (Peter Eisentraut)

This builds on commit 7406ab623fe: the gist strategy number mapping support function now takes the CompareType enum as input, instead of the "well-known" RT*StrategyNumber strategy numbers. This is a bit cleaner, since you are not dealing with two sets of strategy numbers. Also, this will enable us to subsume this system into a more general system of using CompareType to define operator semantics across index methods.

Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

2025-01-13  Add BTOPTIONS_PROC comments to nbtree.h.  (Peter Geoghegan)

Add comments explaining the purpose of B-Tree support function 5 to nbtree.h for consistency (all other support functions were already described by nearby comments).

This fixes what was arguably an oversight in commit 911e702077, or in follow-up doc commit 15cb2bd2 (which documented support function 5 in btree.sgml, but neglected to add anything to nbtree.h).

2025-01-13  Move nbtree preprocessing into new .c file.  (Peter Geoghegan)

Quite a bit of code within nbtutils.c is only called during nbtree preprocessing. Move that code into a new .c file, nbtpreprocesskeys.c. Also reorder some of the functions within the new file for clarity.

This commit has no functional impact. It is strictly mechanical.

Author: Peter Geoghegan <pg@bowt.ie>
Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-WznwNn1BDOpWxHBUK1f3Rdw8pO9UCenWXnvT=n9GO8GnLA@mail.gmail.com
Discussion: https://postgr.es/m/86930045-5df5-494a-b4f1-815bc3fbcce0%40iki.fi

2025-01-11  Add support for NOT ENFORCED in CHECK constraints  (Peter Eisentraut)

This adds support for the NOT ENFORCED/ENFORCED flag for constraints, with support for check constraints. The plan is to eventually support this for foreign key constraints, where it is typically more useful.

Note that CHECK constraints do not currently support ALTER operations, so changing the enforceability of an existing constraint isn't possible without dropping and recreating it. This could be added later.

Author: Amul Sul <amul.sul@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Tested-by: Triveni N <triveni.n@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com

2025-01-11  Make verify_compact_attribute available in non-assert builds  (David Rowley)

6f3820f37 adjusted the assert-enabled validation of the CompactAttribute to call a new external function to perform the validation. That commit made the function available only when building with USE_ASSERT_CHECKING, and because TupleDescCompactAttr() is a static inline function, the call to verify_compact_attribute() was compiled into any extension which uses TupleDescCompactAttr(). This caused issues for such extensions when an assert-enabled build of the extension was loaded into a PostgreSQL build without asserts enabled, since the function was then unavailable in core.

To fix this, make verify_compact_attribute() available unconditionally, but make it do nothing unless building with USE_ASSERT_CHECKING.

Author: Andrew Kane <andrew@ankane.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAOdR5yHfMEMW00XGo=v1zCVUS6Huq2UehXdvKnwtXPTcZwXhmg@mail.gmail.com

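The shape of the fix, sketched (the real checks are more involved): export the symbol unconditionally so that inlined callers always link, and compile the body only in assert-enabled builds:

    void
    verify_compact_attribute(TupleDesc tupdesc, int attnum)
    {
    #ifdef USE_ASSERT_CHECKING
        /* ... populate a temporary CompactAttribute and compare ... */
    #endif
        /* otherwise deliberately a no-op, but the symbol always exists */
    }
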
2025-01-09  Fix SLRU bank selection code  (Álvaro Herrera)

The originally submitted code (using bit masking) was correct when the number of slots was restricted to be a power of two -- but that limitation was removed during development that led to commit 53c2a97a9266, which made the bank selection code incorrect. This led to always using a smaller number of banks than available.

Change said code to use integer modulo instead, which works correctly with an arbitrary number of banks.

It's likely that we could improve on this to avoid runtime use of integer division. But with this change we're, at least, not wasting memory on unused banks, and more banks mean less contention, which is likely to have a much higher performance impact than a single instruction's latency.

Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru

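An illustration of why masking breaks for a non-power-of-two bank count (variable names are illustrative):

    /*
     * Bit masking is only correct when nbanks is a power of two.  With
     * nbanks = 6, the mask is 5 (binary 101), so only banks 0, 1, 4 and 5
     * can ever be selected -- banks 2 and 3 sit allocated but unused.
     */
    bankno = pageno & (nbanks - 1);

    /* Integer modulo is correct for any nbanks, at the cost of a division */
    bankno = pageno % nbanks;
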
2025-01-01  Update copyright for 2025  (Bruce Momjian)

Backpatch-through: 13

2024-12-24  Fix race condition in TupleDescCompactAttr assert code  (David Rowley)

5983a4cff added CompactAttribute as an abbreviated alternative to FormData_pg_attribute to allow more cache-friendly processing in tasks related to TupleDescs. That commit contained some assert-only code to check that the CompactAttribute had been populated correctly, however, the method used to do that checking caused the TupleDesc's CompactAttribute to be zeroed before it was repopulated and compared to the snapshot taken before the memset call.

This caused issues as the type cache caches TupleDescs in shared memory which can be used by multiple backend processes at the same time. There was a window of time between the zero and repopulation of the CompactAttribute where another process would mistakenly think that the CompactAttribute is invalid due to the memset.

To fix this, instead of taking a snapshot of the CompactAttribute and calling populate_compact_attribute() and comparing the snapshot to the freshly populated TupleDesc's CompactAttribute, refactor things so we can just populate a temporary CompactAttribute on the stack. This way we don't touch the TupleDesc's memory.

Reported-by: Alexander Lakhin, SQLsmith
Discussion: https://postgr.es/m/ca3a256a-5d12-42db-aabe-a75a030d9fb9@gmail.com

2024-12-21  Optimize alignment calculations in tuple form/deform  (David Rowley)

Here we convert CompactAttribute.attalign from a char, which is directly derived from pg_attribute.attalign, into a uint8, which stores the number of bytes to align the column's value by in the tuple.

This allows tuple deformation and tuple size calculations to move away from using the inefficient att_align_nominal() macro, which manually checks each TYPALIGN_* char to translate that into the alignment bytes for the given type. Effectively, this commit changes those to TYPEALIGN calls, which are branchless and only perform some simple arithmetic with some bit-twiddling.

The removed branches were often mispredicted by CPUs, especially so in real-world tables which often contain a mishmash of different types with different alignment requirements.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com

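The branchless macro in question is TYPEALIGN from c.h, which rounds an offset up to a power-of-two alignment:

    #define TYPEALIGN(ALIGNVAL,LEN)  \
        (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

With the alignment stored directly as a byte count (1, 2, 4 or 8), deformation can compute TYPEALIGN(alignby, off) in a few arithmetic instructions instead of branching over the TYPALIGN_* character codes (the field name alignby is assumed here for illustration).
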
2024-12-20  Introduce CompactAttribute array in TupleDesc, take 2  (David Rowley)

The new compact_attrs array stores a few select fields from FormData_pg_attribute in a more compact way, using only 16 bytes per column instead of the 104 bytes that FormData_pg_attribute uses. Using CompactAttribute allows performance-critical operations such as tuple deformation to be performed without looking at the FormData_pg_attribute element in TupleDesc which means fewer cacheline accesses.

For some workloads, tuple deformation can be the most CPU intensive part of processing the query. Some testing with 16 columns on a table where the first column is variable length showed around a 10% increase in transactions per second for an OLAP type query performing aggregation on the 16th column. However, in certain cases, the increases were much higher, up to ~25% on one AMD Zen4 machine.

This also makes pg_attribute.attcacheoff redundant. A follow-on commit will remove it, thus shrinking the FormData_pg_attribute struct by 4 bytes.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com

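A hedged sketch of what a 16-byte per-column struct of this kind looks like; the field list is illustrative, not the verbatim definition (see tupdesc.h for that):

    typedef struct CompactAttribute
    {
        int32   attcacheoff;    /* cached offset of attr within tuple */
        int16   attlen;         /* fixed length, or -1/-2 for varlena */
        bool    attbyval;       /* pass-by-value? */
        bool    attispackable;  /* can use 1-byte varlena header? */
        bool    atthasmissing;  /* has a stored "missing" value? */
        bool    attisdropped;   /* column has been dropped? */
        bool    attgenerated;   /* generated column? */
        bool    attnotnull;     /* NOT NULL constraint? */
        uint8   attalignby;     /* alignment requirement in bytes */
    } CompactAttribute;         /* padded to 16 bytes */
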
2024-12-18  Bitmap Table Scans use unified TBMIterator  (Melanie Plageman)

With the repurposing of TBMIterator as an interface for both parallel and serial iteration through TIDBitmaps in commit 7f9d4187e7bab10329cc, bitmap table scans may now use it.

Modify bitmap table scan code to use the TBMIterator. This requires moving around a bit of code, so a few variables are initialized elsewhere.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me

2024-12-18  Add common interface for TBMIterators  (Melanie Plageman)

Add and use TBMPrivateIterator, which replaces the current TBMIterator for serial use cases, and repurpose TBMIterator to be a unified interface for both the serial ("private") and parallel ("shared") TID Bitmap iterator interfaces. This encapsulation simplifies call sites for callers supporting both parallel and serial TID Bitmap access. TBMIterator is not yet used in this commit.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi

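A hedged sketch of the unified iterator type implied by the description (member names are illustrative):

    typedef struct TBMIterator
    {
        bool    shared;         /* which union member is valid */
        union
        {
            TBMPrivateIterator *private_iterator;   /* serial case */
            TBMSharedIterator  *shared_iterator;    /* parallel case */
        }       i;
    } TBMIterator;

Callers iterate through one entry point that dispatches on the flag, so code supporting both serial and parallel bitmap scans needs only a single code path.
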
2024-12-18  Make rs_cindex and rs_ntuples unsigned  (Melanie Plageman)

HeapScanDescData.rs_cindex and rs_ntuples can't be less than 0. All scan types using the heap scan descriptor expect these values to be >= 0. Make that expectation clear by making rs_cindex and rs_ntuples unsigned.

Also remove the test in heapam_scan_bitmap_next_tuple() that checks if rs_cindex < 0. This was never true, but now that rs_cindex is unsigned, it makes even less sense.

While we are at it, initialize both rs_cindex and rs_ntuples to 0 in initscan().

Author: Melanie Plageman
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/CAAKRu_ZxF8cDCM_BFi_L-t%3DRjdCZYP1usd1Gd45mjHfZxm0nZw%40mail.gmail.com

2024-12-17  Make visibilitymap_set() return previous state of vmbits  (Melanie Plageman)

It can be useful to know the state of a relation page's VM bits before visibilitymap_set(). visibilitymap_set() has the old value on hand, so returning it is simple. This commit does not use visibilitymap_set()'s new return value.

Author: Melanie Plageman
Reviewed-by: Masahiko Sawada, Andres Freund, Nitin Jadhav, Bilal Yavuz
Discussion: https://postgr.es/m/flat/CAAKRu_ZQe26xdvAqo4weHLR%3DivQ8J4xrSfDDD8uXnh-O-6P6Lg%40mail.gmail.com#6d8d2b4219394f774889509bf3bdc13d
Discussion: https://postgr.es/m/ctdjzroezaxmiyah3gwbwm67defsrwj2b5fpfs4ku6msfpxeia%40mwjyqlhwr2wu

2024-12-09  Remove remnants of "snapshot too old"  (Heikki Linnakangas)

Remove the 'whenTaken' and 'lsn' fields from SnapshotData. After the removal of the "snapshot too old" feature, they were never set to a non-zero value.

This largely reverts commit 3e2f3c2e423, which added the OldestActiveSnapshot tracking, and the init_toast_snapshot() function. That was only required for setting the 'whenTaken' and 'lsn' fields. SnapshotToast is now a constant again, like SnapshotSelf and SnapshotAny. I kept a thin get_toast_snapshot() wrapper around SnapshotToast though, to check that you have a registered or active snapshot. That's still a useful sanity check.

Reviewed-by: Nathan Bossart, Andres Freund, Tom Lane
Discussion: https://www.postgresql.org/message-id/cd4b4f8c-e63a-41c0-95f6-6e6cd9b83f6d@iki.fi

2024-12-07  Ensure that pg_amop/amproc entries depend on their lefttype/righttype.  (Tom Lane)

Usually an entry in pg_amop or pg_amproc does not need a dependency on its amoplefttype/amoprighttype/amproclefttype/amprocrighttype types, because there is an indirect dependency via the argument types of its referenced operator or procedure, or via the opclass it belongs to. However, for some support procedures in some index AMs, the argument types of the support procedure might not mention the column data type at all. Also, the amop/amproc entry might be treated as "loose" in the opfamily, in which case it lacks a dependency on any particular opclass; or it might be a cross-type entry having a reference to a datatype that is not its opclass' opcintype.

The upshot of all this is that there are cases where a datatype can be dropped while leaving behind amop/amproc entries that mention it, because there is no path in pg_depend showing that those entries depend on that type. Such entries are harmless in normal activity, because they won't get used, but they cause problems for maintenance actions such as dropping the operator family. They also cause pg_dump to produce bogus output. The previous commit put a band-aid on the DROP OPERATOR FAMILY failure, but a real fix is needed.

To fix, add pg_depend entries showing that a pg_amop/pg_amproc entry depends on its lefttype/righttype. To avoid bloating pg_depend too much, skip this if the referenced operator or function has that type as an input type. (I did not bother with considering the possible indirect dependency via the opclass' opcintype; at least in the reported case, that wouldn't help anyway.)

Probably, the reason this has escaped notice for so long is that add-on datatypes and relevant opclasses/opfamilies are usually packaged as extensions nowadays, so that there's no way to drop a type without dropping the referencing opclasses/opfamilies too. Still, in the absence of pg_depend entries there's nothing that constrains DROP EXTENSION to drop the opfamily entries before the datatype, so it seems possible for a DROP failure to occur anyway.

The specific case that was reported doesn't fail in v13, because v13 prefers to attach the support procedure to the opclass not the opfamily. But it's surely possible to construct other edge cases that do fail in v13, so patch that too.

Per report from Yoran Heling. Back-patch to all supported branches.

Discussion: https://postgr.es/m/Z1MVCOh1hprjK5Sf@gmai021

2024-12-03  Revert "Introduce CompactAttribute array in TupleDesc"  (David Rowley)

This reverts commit d28dff3f6cd6a7562fb2c211ac0fb74a33ffd032. Quite a large number of buildfarm members didn't like this commit and it's not yet clear why. Reverting this before too many animals turn red.

Discussion: https://postgr.es/m/CAApHDvr9i6T5=iAwQCxFDgMsthr_obVxgwBaEJkC8KUH6yM3Hw@mail.gmail.com

2024-12-03  Introduce CompactAttribute array in TupleDesc  (David Rowley)

The new compact_attrs array stores a few select fields from FormData_pg_attribute in a more compact way, using only 16 bytes per column instead of the 104 bytes that FormData_pg_attribute uses. Using CompactAttribute allows performance-critical operations such as tuple deformation to be performed without looking at the FormData_pg_attribute element in TupleDesc which means fewer cacheline accesses. With this change, NAMEDATALEN could be increased with a much smaller negative impact on performance.

For some workloads, tuple deformation can be the most CPU intensive part of processing the query. Some testing with 16 columns on a table where the first column is variable length showed around a 10% increase in transactions per second for an OLAP type query performing aggregation on the 16th column. However, in certain cases, the increases were much higher, up to ~25% on one AMD Zen4 machine.

This also makes pg_attribute.attcacheoff redundant. A follow-on commit will remove it, thus shrinking the FormData_pg_attribute struct by 4 bytes.

Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com

2024-11-04  Revert pg_wal_replay_wait() stored procedure  (Alexander Korotkov)

This commit reverts 3c5db1d6b0, and subsequent improvements and fixes including 8036d73ae3, 867d396ccd, 3ac3ec580c, 0868d7ae70, 85b98b8d5a, 2520226c95, 014f9f34d2, e658038772, e1555645d7, 5035172e4a, 6cfebfe88b, 73da6b8d1b, and e546989a26.

The reason for reverting is a set of remaining issues. Most notably, the stored procedure appears to need more effort than the utility statement to turn the backend into a "snapshot-less" state. This makes an approach to use stored procedures questionable.

Catversion is bumped.

Discussion: https://postgr.es/m/Zyhj2anOPRKtb0xW%40paquier.xyz

2024-11-01  Rename two functions that wake up other processes  (Heikki Linnakangas)

Instead of talking about setting latches, which is a pretty low-level mechanism, emphasize that they wake up other processes.

This is in preparation for replacing Latches with a new abstraction. That's still work in progress, but this seems a little tidier anyway, so let's get this refactoring out of the way already.

Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi

2024-11-01  Use ProcNumbers instead of direct Latch pointers to address other procs  (Heikki Linnakangas)

This is in preparation for replacing Latches with a new abstraction. That's still work in progress, but this seems a little tidier anyway, so let's get this refactoring out of the way already.

Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi

2024-10-30  Fix bug in nbtree array primitive scan scheduling.  (Peter Geoghegan)

A bug in nbtree's handling of primitive index scan scheduling could lead to wrong answers when a scrollable cursor was used with an index scan that had a SAOP index qual. Wrong answers were only possible when the scan direction changed after a primitive scan was scheduled, but before _bt_next was asked to fetch the next tuple in line (i.e. for things to break, _bt_next had to be denied the opportunity to step off the page in the same direction as the one used when the primscan was scheduled). Furthermore, the issue only occurred when the page in question happened to be the first page to be visited by the entire top-level scan; the issue hinged upon the cursor backing up to the absolute beginning of the key space that it returns tuples from (fetching in the opposite scan direction across a "primitive scan boundary" always worked correctly).

To fix, make _bt_next unset the "needs primitive index scan" flag when it detects that the current scan direction is not the one that was used by _bt_readpage back when the primitive scan in question was scheduled. This fixes the cases that are known to be faulty, and also seems like a good idea on general robustness grounds.

Affected scrollable cursor cases now avoid a spurious primitive index scan when they fetch backwards to the absolute start of the key space to be visited by their cursor. Fetching backwards now only returns those tuples at the start of the scan, as expected. It'll also be okay to once again fetch forwards from the start at that point, since the scan will be left in a state that's exactly consistent with the state it was in before any tuples were ever fetched, as expected.

Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.

2024-10-29  Unpin buffer before inplace update waits for an XID to end.  (Noah Misch)

Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed inplace updates to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. By keeping the pin during that wait, a sequence of autovacuum workers and an uncommitted GRANT starved one foreground LockBufferForCleanup() for six minutes, on buildfarm member sarus. Prevent, at the cost of a bit of complexity.

Back-patch to v12, like the earlier commit. That commit and heap_inplace_lock() have not yet appeared in any release.

Discussion: https://postgr.es/m/20241026184936.ae.nmisch@google.com

2024-10-25  Make table_scan_bitmap_next_block() async-friendly  (Melanie Plageman)

Move all responsibility for indicating a block is exhausted into table_scan_bitmap_next_tuple() and advance the main iterator in heap-specific code. This flow control makes more sense and is a step toward using the read stream API for bitmap heap scans.

Previously, table_scan_bitmap_next_block() returned false to indicate table_scan_bitmap_next_tuple() should not be called for the tuples on the page. This happened both when 1) there were no visible tuples on the page and 2) when the block returned by the iterator was past the end of the table. BitmapHeapNext() (generic bitmap table scan code) handled the case when the bitmap was exhausted.

It makes more sense for table_scan_bitmap_next_tuple() to return false when there are no visible tuples on the page and table_scan_bitmap_next_block() to return false when the bitmap is exhausted or there are no more blocks in the table.

As part of this new design, TBMIterateResults are no longer used as a flow control mechanism in BitmapHeapNext(), so we removed table_scan_bitmap_next_tuple's TBMIterateResult parameter.

Note that the prefetch iterator is still saved in the BitmapHeapScanState node and advanced in generic bitmap table scan code. This is because 1) it was not necessary to change the prefetch iterator location to change the flow control in BitmapHeapNext(), 2) modifying prefetch iterator management requires several more steps better split over multiple commits, and 3) the prefetch iterator will be removed once the read stream API is used.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas, Mark Dilger
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi

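For orientation, a hedged sketch of the resulting control flow in BitmapHeapNext(); the parameter lists are simplified placeholders, not the real signatures:

    /* sketch of the new division of labor in BitmapHeapNext() */
    while (table_scan_bitmap_next_block(scan, &recheck))
    {
        /* next_tuple() returns false when the page's visible tuples run out */
        while (table_scan_bitmap_next_tuple(scan, slot))
        {
            /* ... return the tuple to the executor ... */
        }
    }
    /* next_block() returned false: bitmap exhausted, or past end of table */
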
2024-10-25  Move EXPLAIN counter increment to heapam_scan_bitmap_next_block  (Melanie Plageman)

Increment the lossy and exact page counters for EXPLAIN of bitmap heap scans in heapam_scan_bitmap_next_block(). Note that other table AMs will need to do this as well.

Pushing the counters into heapam_scan_bitmap_next_block() is required to be able to use the read stream API for bitmap heap scans. The bitmap iterator must be advanced from inside the read stream callback, so TBMIterateResults cannot be used as a flow control mechanism in BitmapHeapNext().

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi

2024-10-25  For inplace update, send nontransactional invalidations.  (Noah Misch)

The inplace update survives ROLLBACK. The inval didn't, so another backend's DDL could then update the row without incorporating the inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER TABLE resulted in a table with an index, yet relhasindex=f. That is a source of index corruption.

Back-patch to v12 (all supported versions). The back branch versions don't change WAL, because those branches just added end-of-recovery SIResetAll(). All branches change the ABI of extern function PrepareToInvalidateCacheTuple(). No PGXN extension calls that, and there's no apparent use case in extensions.

Reviewed by Nitin Motiani and (in earlier versions) Andres Freund.
Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com

2024-10-24  Refactor WaitForLSNReplay() to return the result of waiting  (Alexander Korotkov)

Currently, WaitForLSNReplay() immediately throws an error if waiting for LSN replay is not successful. This commit teaches WaitForLSNReplay() to return the result of waiting, while making pg_wal_replay_wait() responsible for throwing an appropriate error. This is preparation for adding a 'no_error' argument to pg_wal_replay_wait() and a new function pg_wal_replay_wait_status(), which returns the last wait result status.

Additionally, we stop distinguishing the situations where we find our instance to be not in a recovery state before entering the waiting loop and inside the waiting loop. Standby promotion may happen at any moment, even between issuing a procedure call statement and pg_wal_replay_wait() doing a first check of recovery status. Thus, there is no point in distinguishing these situations.

Also, since we may exit the waiting loop and see our instance not in recovery without throwing an error, we need to call deleteLSNWaiter() in that case. We do this unconditionally for the sake of simplicity; even if the standby was already promoted after reaching the target LSN, the startup process has surely already deleted us.

Reported-by: Michael Paquier
Reviewed-by: Michael Paquier, Pavel Borisov
Discussion: https://postgr.es/m/ZtUF17gF0pNpwZDI%40paquier.xyz

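A hedged sketch of what returning "the result of waiting" can look like; the enum name and members here are illustrative, not necessarily the committed spelling:

    typedef enum
    {
        WAIT_LSN_RESULT_SUCCESS,        /* target LSN has been replayed */
        WAIT_LSN_RESULT_TIMEOUT,        /* gave up after the timeout expired */
        WAIT_LSN_RESULT_NOT_IN_RECOVERY /* standby was promoted while waiting */
    } WaitLSNResult;

With this shape, pg_wal_replay_wait() can map each result to an error or, once a 'no_error' argument exists, swallow it and record the status instead.
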
2024-10-24  Move LSN waiting declarations and definitions to better place  (Alexander Korotkov)

3c5db1d6b implemented the pg_wal_replay_wait() stored procedure. Due to the patch development history, the implementation resided in src/backend/commands/waitlsn.c (src/include/commands/waitlsn.h for headers). 014f9f34d moved pg_wal_replay_wait() itself to src/backend/access/transam/xlogfuncs.c near to the WAL-manipulation functions. But most of the implementation stayed in place.

The code in src/backend/commands/waitlsn.c has nothing to do with commands, but is related to WAL. So, this commit moves this code into src/backend/access/transam/xlogwait.c (src/include/access/xlogwait.h for headers).

Reported-by: Peter Eisentraut
Reviewed-by: Pavel Borisov
Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org

2024-10-23  Fix typo in tidstore.h.  (Masahiko Sawada)

An oversight in commit f6bef362c.

Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAD21AoB8MJ5OHtpUw1UEGf7spioFmP3PNH44KNx6Yb3FiZSwKA%40mail.gmail.com

2024-10-18  Optimize nbtree backwards scans.  (Peter Geoghegan)

Make nbtree backwards scans optimistically access the next page to be read to the left by following a prevPage block number that's now stashed in currPos when the leaf page is first read. This approach matches the one taken during forward scans, which follow a symmetric nextPage block number from currPos. We stash both a prevPage and a nextPage, since the scan direction might change (when fetching from a scrollable cursor).

Backwards scans will no longer need to lock the same page twice, except in rare cases where the scan detects a concurrent page split (or page deletion). Testing has shown this optimization to be particularly effective during parallel index-only backwards scans: ~12% reductions in query execution time are quite possible. We're much better off being optimistic; concurrent left sibling page splits are rare in general. It's possible that we'll need to lock more pages than the pessimistic approach would have, but only when there are _multiple_ concurrent splits of the left sibling page we now start at. If there's just a single concurrent left sibling page split, the new approach to scanning backwards will at least break even relative to the old one (we'll acquire the same number of leaf page locks as before).

The optimization from this commit has long been contemplated by comments added by commit 2ed5b87f96, which changed the rules for locking/pinning during nbtree index scans. The approach that that commit introduced to leaf level link traversal when scanning forwards is now more or less applied all the time, regardless of the direction we're scanning in. Following uniform conventions around sibling link traversal is simpler. The only real remaining difference between our forward and backwards handling is that our backwards handling must still detect and recover from any concurrent left sibling splits (and concurrent page deletions), as documented in the nbtree README. That is structured as a single, isolated extra step that takes place in _bt_readnextpage.

Also use this opportunity to further simplify the functions that deal with reading pages and traversing sibling links on the leaf level, and to document their preconditions and postconditions (with respect to things like buffer locks, buffer pins, and seizing the parallel scan).

This enhancement completely supersedes the one recently added by commit 3f44959f.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com

2024-10-16  Normalize nbtree truncated high key array behavior.  (Peter Geoghegan)

Commit 5bf748b8 taught nbtree ScalarArrayOp index scans to decide when and how to start the next primitive index scan based on physical index characteristics. This included rules for deciding whether to start a new primitive index scan (or whether to move onto the right sibling leaf page instead) that specifically consider truncated lower-order columns (-inf columns) from leaf page high keys. These omitted columns were treated as satisfying the scan's required scan keys, though only for scan keys marked required in the current scan direction (forward).

Scan keys that didn't get this behavior (those marked required in the backwards direction only) usually didn't give the scan reasonable cause to reposition itself to a later leaf page (via another descent of the index in _bt_first), but _bt_advance_array_keys would nevertheless always give up by forcing another call to _bt_first. _bt_advance_array_keys was unwilling to allow the scan to continue onto the next leaf page, to reconsider whether we really should start another primitive scan based on the details of the sibling page's tuples. This didn't match its behavior with similar cases involving keys required in the current scan direction (forward), which seems unprincipled. It led to an excessive number of primitive scans/index descents for queries with a higher-order = array scan key (with dense, contiguous values) mixed with a lower-order required > or >= scan key.

Bring > and >= strategy scan keys in line with other required scan key types: treat truncated -inf scan keys as having satisfied scan keys required in either scan direction (forwards and backwards alike) during array advancement. That way affected scans can continue to the right sibling leaf page. Advancement must now schedule an explicit recheck of the right sibling page's high key in cases involving > or >= scan keys. The recheck gives the scan a way to back out and start another primitive index scan (we can't just rely on _bt_checkkeys with > or >= scan keys).

This work can be considered a stand-alone optimization on top of the work from commit 5bf748b8. But it was written in preparation for an upcoming patch that will add skip scan to nbtree. In practice scans that use "skip arrays" will tend to be much more sensitive to any implementation deficiencies in this area.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com

2024-10-02  Fix inconsistent reporting of checkpointer stats.  (Fujii Masao)

Previously, the pg_stat_checkpointer view and the checkpoint completion log message could show different numbers for buffers written during checkpoints. The view only counted shared buffers, while the log message included both shared and SLRU buffers, causing inconsistencies.

This commit resolves the issue by updating both the view and the log message to separately report shared and SLRU buffers written during checkpoints. A new slru_written column is added to the pg_stat_checkpointer view to track SLRU buffers, while the existing buffers_written column now tracks only shared buffers. This change would help users distinguish between the two types of buffers, in the pg_stat_checkpointer view and the checkpoint complete log message, respectively.

Bump catalog version.

Author: Nitin Jadhav
Reviewed-by: Bharath Rupireddy, Michael Paquier, Kyotaro Horiguchi, Robert Haas
Reviewed-by: Andres Freund, vignesh C, Fujii Masao
Discussion: https://postgr.es/m/CAMm1aWb18EpT0whJrjG+-nyhNouXET6ZUw0pNYYAe+NezpvsAA@mail.gmail.com

2024-09-30  Add num_done counter to the pg_stat_checkpointer view.  (Fujii Masao)

Checkpoints can be skipped when the server is idle. The existing num_timed and num_requested counters in pg_stat_checkpointer track both completed and skipped checkpoints, but there was no way to count only the completed ones.

This commit introduces the num_done counter, which tracks only completed checkpoints, making it easier to see how many were actually performed.

Bump catalog version.

Author: Anton A. Melnikov
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/9ea77f40-818d-4841-9dee-158ac8f6e690@oss.nttdata.com

2024-09-24  Fix data loss at inplace update after heap_update().  (Noah Misch)

As previously-added tests demonstrated, heap_inplace_update() could instead update an unrelated tuple of the same catalog. It could lose the update. Losing relhasindex=t was a source of index corruption.

Inplace-updating commands like VACUUM will now wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a long-running GRANT already hurts VACUUM progress more just by keeping an XID running. The VACUUM will behave like a DELETE or UPDATE waiting for the uncommitted change.

For implementation details, start at the systable_inplace_update_begin() header comment and README.tuplock.

Back-patch to v12 (all supported versions). In back branches, retain a deprecated heap_inplace_update(), for extensions.

Reported by Smolkin Grigory. Reviewed by Nitin Motiani, (in earlier versions) Heikki Linnakangas, and (in earlier versions) Alexander Lakhin.
Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com

2024-09-20  Improve Asserts checking relation matching in parallel scans.  (Tom Lane)

table_beginscan_parallel and index_beginscan_parallel contain Asserts checking that the relation a worker will use in a parallel scan is the same one the leader intended. However, they were checking for relation OID match, which was not strong enough to detect the mismatch problem fixed in 126ec0bc7. What would be strong enough is to compare relfilenodes instead. Arguably, that's a saner definition anyway, since a scan surely operates on a physical relation not a logical one.

Hence, store and compare RelFileLocators not relation OIDs. Also ensure that index_beginscan_parallel checks the index identity not just the table identity.

Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us

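In essence the strengthened check becomes something like the following (a sketch; the parallel-scan field name is assumed from the commit description):

    /* OID equality is too weak; compare the physical relation identity */
    Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
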
2024-09-18  Extend PgStat_HashKey.objid from 4 to 8 bytes  (Michael Paquier)

This opens the possibility to define keys for more types of statistics kinds in PgStat_HashKey, the first case being 8-byte query IDs for statistics like pg_stat_statements.

This increases the size of PgStat_HashKey from 12 to 16 bytes, while PgStatShared_HashEntry, the entry stored in the dshash for pgstats, keeps the same size due to alignment. xl_xact_stats_item, which tracks the stats items to drop in commit WAL records, is increased from 12 to 16 bytes. Note that individual chunks in commit WAL records should be multiples of sizeof(int), hence 8-byte object IDs are stored as two uint32, based on a suggestion from Heikki Linnakangas.

While on it, the field of PgStat_HashKey is renamed from "objoid" to "objid", as for some stats kinds this field does not refer to OIDs but just IDs, like for replication slot stats.

This commit bumps the following format variables:
- PGSTAT_FILE_FORMAT_ID, as PgStat_HashKey is written to the stats file for non-serialized stats kinds in the dshash table.
- XLOG_PAGE_MAGIC for the changes in xl_xact_stats_item.
- Catalog version, for the SQL function pg_stat_have_stats().

Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/ZsvTS9EW79Up8I62@paquier.xyz

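Storing an 8-byte ID as two uint32 values, as suggested, is plain bit arithmetic (variable names here are illustrative):

    /* store: keep each WAL record chunk a multiple of sizeof(int) */
    uint32    objid_lo = (uint32) (objid & 0xFFFFFFFF);
    uint32    objid_hi = (uint32) (objid >> 32);

    /* load: reassemble the 64-bit object ID */
    uint64    objid_out = ((uint64) objid_hi << 32) | objid_lo;
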
2024-09-17  Add temporal PRIMARY KEY and UNIQUE constraints  (Peter Eisentraut)

Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints. These are backed by GiST indexes instead of B-tree indexes, since they are essentially exclusion constraints with = for the scalar parts of the key and && for the temporal part.

(previously committed as 46a0cd4cefb, reverted by 46a0cd4cefb; the new part is this:)

Because 'empty' && 'empty' is false, the temporal PK/UQ constraint allowed duplicates, which is confusing to users and breaks internal expectations. For instance, when GROUP BY checks functional dependencies on the PK, it allows selecting other columns from the table, but in the presence of duplicate keys you could get the value from any of their rows. So we need to forbid empties.

This all means that at the moment we can only support ranges and multiranges for temporal PK/UQs, unlike the original patch (above). Documentation and tests for this are added. But this could conceivably be extended by introducing some more general support for the notion of "empty" for other types.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com

2024-09-17  Add stratnum GiST support function  (Peter Eisentraut)

This is support function 12 for the GiST AM and translates "well-known" RT*StrategyNumber values into whatever strategy number is used by the opclass (since no particular numbers are actually required). We will use this to support temporal PRIMARY KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality.

This commit adds two implementations, one for internal GiST opclasses (just an identity function) and another for btree_gist opclasses. It updates btree_gist from 1.7 to 1.8, adding the support function for all its opclasses.

(previously committed as 6db4598fcb8, reverted by 8aee330af55; this is essentially unchanged from those)

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com

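For the internal GiST opclasses, an identity support function is only a few lines; a minimal sketch consistent with the description above:

    Datum
    gist_stratnum_identity(PG_FUNCTION_ARGS)
    {
        StrategyNumber strat = PG_GETARG_UINT16(0);

        /* internal opclasses already use the RT*StrategyNumber numbering */
        PG_RETURN_UINT16(strat);
    }

The btree_gist implementation instead maps each well-known number to the opclass's own strategy numbering.
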
2024-09-12  Move logic related to WAL replay of Heap/Heap2 into its own file  (Michael Paquier)

This brings more clarity to heapam.c by cleanly separating all the logic related to WAL replay from the rest of Heap and Heap2, similarly to other RMGRs like hash, btree, etc. The header reorganization is also nice in heapam.c, cutting half of the headers required.

Author: Li Yong
Reviewed-by: Sutou Kouhei, Michael Paquier
Discussion: https://postgr.es/m/EFE55E65-D7BD-4C6A-B630-91F43FD0771B@ebay.com

2024-09-10  Add amgettreeheight index AM API routine  (Peter Eisentraut)

The only current implementation is for btree, where it calls _bt_getrootheight(). Other index types can now also use this to pass information to their amcostestimate routine. Previously, btree was hardcoded and other index types could not hook into the optimizer at this point.

Author: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

2024-09-03  Add const qualifiers to XLogRegister*() functions  (Peter Eisentraut)

Add const qualifiers to XLogRegisterData() and XLogRegisterBufData(). Several unconstify() calls can be removed.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/dd889784-9ce7-436a-b4f1-52e4a5e577bd@eisentraut.org

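The essence of the change, sketched against the declarations in xloginsert.h (the exact pre-change parameter types are recalled from memory and may differ in detail):

    /* before */
    extern void XLogRegisterData(char *data, uint32 len);
    extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);

    /* after: callers can register pointers to const data without unconstify() */
    extern void XLogRegisterData(const char *data, uint32 len);
    extern void XLogRegisterBufData(uint8 block_id, const char *data, uint32 len);
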