Age | Commit message | Author |
|
_bt_first doesn't necessarily hold onto a buffer pin on success exit.
Fix header comments that claimed that we'll always hold onto a pin.
Oversight in commit 2ed5b87f96.
|
|
Remove a local variable that was used to avoid overwriting strat_total
with the = operator strategy when a >= operator strategy key was already
included in the initial positioning/insertion scan keys by _bt_first
(for backwards scans it would have to be a <= key that was included).
_bt_first's strat_total local variable now simply tracks the operator
strategy of the final scan key that was included in the scan's insertion
scan key (barring the case where the !used_all_subkeys row compare path
adjusts strat_total in its own way).
_bt_first already treated >= keys (or <= keys) as = keys for initial
positioning purposes. There is no good reason to remember that that was
what happened; no later _bt_first step cares about the distinction.
Note, in particular, that the insertion scan key's 'nextkey' and
'backward' fields will be initialized the same way regardless.
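To illustrate why nothing needs to be remembered, here is a minimal
standalone C sketch (the strategy enum and helper are simplified
stand-ins for illustration, not the actual nbtree code): for a forward
scan, a >= key and an = key both yield nextkey = false for the initial
binary search, while only a strict > key needs nextkey = true.

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified stand-ins for nbtree strategy numbers */
    typedef enum { STRAT_LT, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT } Strategy;

    /*
     * For a forward scan, decide whether the initial binary search should
     * land on the first tuple >= the insertion key (nextkey = false) or on
     * the first tuple strictly after it (nextkey = true).  A >= key and an
     * = key produce the same answer, which is why nothing downstream needs
     * to remember which of the two was actually used.
     */
    static bool
    initial_nextkey_forward(Strategy strat)
    {
        return (strat == STRAT_GT);    /* only strict > skips equal tuples */
    }

    int
    main(void)
    {
        printf("=  -> nextkey=%d\n", initial_nextkey_forward(STRAT_EQ));
        printf(">= -> nextkey=%d\n", initial_nextkey_forward(STRAT_GE));
        printf(">  -> nextkey=%d\n", initial_nextkey_forward(STRAT_GT));
        return 0;
    }
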
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com
|
|
_bt_endpoint is a helper function for _bt_first that's called whenever
no useful insertion scan key can be used, and we need to lock and read
either the leftmost or rightmost leaf page in the index. Simplify and
document its preconditions, relieving its _bt_first caller from having
to end the parallel scan when it returns false.
Also stop unnecessarily invalidating the current scan position in nearby
code in both _bt_first and _bt_endpoint. This seems to have been
copy-pasted from _bt_readnextpage, where invalidating the scan's current
position really is necessary.
Follow-up to the refactoring work in commit 1bd4bc85.
|
|
Oversight in commit 5bf748b8.
|
|
Follow-up to bugfix commit 763d65ae. Technically this new assertion is
redundant with the assertion recently added to _bt_readpage by that same
commit, but it seems like a good idea to have both.
The new assertion makes it clear that we expect to call _bt_readnextpage
when there's another primitive index scan scheduled, though only when
needed as the final step of ending the current primitive scan.
|
|
Strictly speaking, we only need to make sure to leave the scan's array
keys in their final positions (final for the current scan direction) to
handle SAOP array exhaustion because btgettuple might only return a
subset of the items for the final page (final for the current scan
direction), before the scan changes direction. While it's typical for
so->currPos to be invalidated shortly after the scan's arrays are first
exhausted, and while so->currPos invalidation does obviate the need to
leave the scan's arrays in any particular state, we can't rely on any of
that actually happening when handling array exhaustion. Adjust comments
to make all of that a lot clearer.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
|
|
A bug in nbtree's handling of primitive index scan scheduling could lead
to wrong answers when a scrollable cursor was used with an index scan
that had a SAOP index qual. Wrong answers were only possible when the
scan direction changed after a primitive scan was scheduled, but before
_bt_next was asked to fetch the next tuple in line (i.e. for things to
break, _bt_next had to be denied the opportunity to step off the page in
the same direction as the one used when the primscan was scheduled).
Furthermore, the issue only occurred when the page in question happened
to be the first page to be visited by the entire top-level scan; the
issue hinged upon the cursor backing up to the absolute beginning of the
key space that it returns tuples from (fetching in the opposite scan
direction across a "primitive scan boundary" always worked correctly).
To fix, make _bt_next unset the "needs primitive index scan" flag when
it detects that the current scan direction is not the one that was used
by _bt_readpage back when the primitive scan in question was scheduled.
This fixes the cases that are known to be faulty, and also seems like a
good idea on general robustness grounds.
Affected scrollable cursor cases now avoid a spurious primitive index
scan when they fetch backwards to the absolute start of the key space to
be visited by their cursor. Fetching backwards now only returns those
tuples at the start of the scan, as expected. It'll also be okay to
once again fetch forwards from the start at that point, since the scan
will be left in a state that's exactly consistent with the state it was
in before any tuples were ever fetched, as expected.
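As a rough illustration of the fix (the struct, field, and function
names below are hypothetical, not the actual nbtree data structures):
the request for another primitive index scan is only honored if the
scan is still moving in the direction that was in effect when the
request was recorded.

    #include <stdbool.h>

    /* Hypothetical, heavily simplified per-scan state */
    typedef enum { DIR_BACKWARD = -1, DIR_FORWARD = 1 } ScanDir;

    typedef struct ScanState
    {
        bool    needPrimScan;   /* another primitive index scan scheduled? */
        ScanDir primScanDir;    /* direction in effect when it was scheduled */
    } ScanState;

    /*
     * Sketch of the guard on the "fetch next tuple" path: if the caller has
     * changed direction since the primitive scan was scheduled, drop the
     * request rather than start a spurious primitive index scan.
     */
    static bool
    want_primitive_scan(ScanState *so, ScanDir dir)
    {
        if (so->needPrimScan && so->primScanDir != dir)
            so->needPrimScan = false;   /* stale request from other direction */
        return so->needPrimScan;
    }
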
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
|
|
Tweak some code comments for clarity, and relocate some local variable
declarations to the scope where they're actually used.
Follow-up to recent commit 1bd4bc85.
|
|
Oversight in commit d088ba5a.
|
|
as determined by IWYU
These are mostly issues that are new since commit dbbca2cf299.
Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
|
|
Make nbtree backwards scans optimistically access the next page to be
read to the left by following a prevPage block number that's now stashed
in currPos when the leaf page is first read. This approach matches the
one taken during forward scans, which follow a symmetric nextPage block
number from currPos. We stash both a prevPage and a nextPage, since the
scan direction might change (when fetching from a scrollable cursor).
Backwards scans will no longer need to lock the same page twice, except
in rare cases where the scan detects a concurrent page split (or page
deletion). Testing has shown this optimization to be particularly
effective during parallel index-only backwards scans: ~12% reductions in
query execution time are quite possible.
We're much better off being optimistic; concurrent left sibling page
splits are rare in general. It's possible that we'll need to lock more
pages than the pessimistic approach would have, but only when there are
_multiple_ concurrent splits of the left sibling page we now start at.
If there's just a single concurrent left sibling page split, the new
approach to scanning backwards will at least break even relative to the
old one (we'll acquire the same number of leaf page locks as before).
The optimization from this commit has long been contemplated by comments
added by commit 2ed5b87f96, which changed the rules for locking/pinning
during nbtree index scans. The approach that that commit introduced to
leaf level link traversal when scanning forwards is now more or less
applied all the time, regardless of the direction we're scanning in.
Following uniform conventions around sibling link traversal is simpler.
The only real remaining difference between our forward and backwards
handling is that our backwards handling must still detect and recover
from any concurrent left sibling splits (and concurrent page deletions),
as documented in the nbtree README. That is structured as a single,
isolated extra step that takes place in _bt_readnextpage.
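A simplified sketch of the idea (hypothetical struct and helper, not
the actual BTScanPosData): both sibling links are stashed when a leaf
page is read, so a later step in either direction can go straight to
the next block number without revisiting the page that was just read.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNum;      /* stand-in for BlockNumber */

    /* Hypothetical, much-simplified stand-in for the scan's current position */
    typedef struct ScanPos
    {
        BlockNum currPage;   /* leaf page whose items were just read */
        BlockNum prevPage;   /* left sibling link, stashed at read time */
        BlockNum nextPage;   /* right sibling link, stashed at read time */
    } ScanPos;

    /*
     * Pick the block to read next.  Both directions now follow a stashed
     * sibling link, so a backwards step no longer relocks currPage just to
     * reread its left link.  (The real code must still detect and recover
     * from concurrent splits/deletions after landing on prevPage; that
     * extra step is the remaining asymmetry and is omitted here.)
     */
    static BlockNum
    next_block_to_read(const ScanPos *pos, bool forward)
    {
        return forward ? pos->nextPage : pos->prevPage;
    }
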
Also use this opportunity to further simplify the functions that deal
with reading pages and traversing sibling links on the leaf level, and
to document their preconditions and postconditions (with respect to
things like buffer locks, buffer pins, and seizing the parallel scan).
This enhancement completely supersedes the one recently added by commit
3f44959f.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
|
|
Oversight in commit 79fa7b3b.
|
|
Commit 5bf748b8 taught nbtree ScalarArrayOp index scans to decide when
and how to start the next primitive index scan based on physical index
characteristics. This included rules for deciding whether to start a
new primitive index scan (or whether to move onto the right sibling leaf
page instead) that specifically consider truncated lower-order columns
(-inf columns) from leaf page high keys.
These omitted columns were treated as satisfying the scan's required
scan keys, though only for scan keys marked required in the current scan
direction (forward). Scan keys that didn't get this behavior (those
marked required in the backwards direction only) usually didn't give the
scan reasonable cause to reposition itself to a later leaf page (via
another descent of the index in _bt_first), but _bt_advance_array_keys
would nevertheless always give up by forcing another call to _bt_first.
_bt_advance_array_keys was unwilling to allow the scan to continue onto
the next leaf page, to reconsider whether we really should start another
primitive scan based on the details of the sibling page's tuples. This
didn't match its behavior with similar cases involving keys required in
the current scan direction (forward), which seems unprincipled. It led
to an excessive number of primitive scans/index descents for queries
with a higher-order = array scan key (with dense, contiguous values)
mixed with a lower-order required > or >= scan key.
Bring > and >= strategy scan keys in line with other required scan key
types: treat truncated -inf scan keys as having satisfied scan keys
required in either scan direction (forwards and backwards alike) during
array advancement. That way affected scans can continue to the right
sibling leaf page. Advancement must now schedule an explicit recheck of
the right sibling page's high key in cases involving > or >= scan keys.
The recheck gives the scan a way to back out and start another primitive
index scan (we can't just rely on _bt_checkkeys with > or >= scan keys).
This work can be considered a stand alone optimization on top of the
work from commit 5bf748b8. But it was written in preparation for an
upcoming patch that will add skip scan to nbtree. In practice scans
that use "skip arrays" will tend to be much more sensitive to any
implementation deficiencies in this area.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
|
|
All these code paths use their own entry point when starting parallel
workers, but failed to set a query ID, even if they set a text query.
Hence, this data would be missed in pg_stat_activity for the worker
processes. The main entry point for parallel query processing,
ParallelQueryMain(), is already doing that by saving its query ID in a
dummy PlannedStmt, but not the others. The code is changed so that the
query ID of these queries is set in their shared state, and reported
back once the parallel workers start.
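The shape of the fix, as a hedged sketch: the query ID is stored in
the shared state that the parallel operation already hands to its
workers, and each worker reports it as it starts up. The struct below
is hypothetical; pgstat_report_query_id() is the existing reporting
routine, with its prototype restated here only to keep the sketch
self-contained.

    #include <stdbool.h>
    #include <stdint.h>

    /* Existing backend routine; prototype restated only for this sketch */
    extern void pgstat_report_query_id(uint64_t query_id, bool force);

    /* Hypothetical shared state for a parallel index build / worker launch */
    typedef struct BuildShared
    {
        uint64_t queryid;   /* leader's query ID (the new piece of state) */
        /* ... existing fields: relation OIDs, scan state, and so on ... */
    } BuildShared;

    /* Leader: stash the active query ID before launching workers */
    static void
    leader_fill_shared(BuildShared *shared, uint64_t active_query_id)
    {
        shared->queryid = active_query_id;
    }

    /* Worker: report it so pg_stat_activity shows the right query ID */
    static void
    worker_attach(const BuildShared *shared)
    {
        pgstat_report_query_id(shared->queryid, false);
    }
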
Some tests are added to show how the failures can happen for btree and
BRIN with a parallel build enforced, which are able to trigger a failure
in an assertion added by 24f520594809 in the recovery TAP test
027_stream_regress.pl where pg_stat_statements is always loaded. In
this case, the executor path was taken because the index expression
needs to be flattened when building its IndexInfo.
Alexander Lakhin noticed the problem in btree, and I noticed that the
issue was more widespread. This is arguably a bug, but nobody has
complained about that until now, so no backpatch is done out of caution.
If folks would like to see a backpatch, well, let me know.
Reported-by: Alexander Lakhin
Reviewed-by: Sami Imseih
Discussion: https://postgr.es/m/cf3547c1-498a-6a61-7b01-819f902a251f@gmail.com
|
|
The array->scan_key references fixed up at the end of preprocessing
start out as offsets into the arrayKeyData[] array (the array returned
by _bt_preprocess_array_keys at the start of preprocessing that involves
array scan keys). Offsets into the arrayKeyData[] array are no longer
guaranteed to be valid offsets into our original scan->keyData[] input
scan key array, but comments describing the array->scan_key references
still talked about scan->keyData[]. Update those comments.
Oversight in commit b5249741.
|
|
Teach _bt_preprocess_array_keys to eliminate redundant array equality
scan keys directly, rather than just marking them as redundant. Its
_bt_preprocess_keys caller is no longer required to ignore input scan
keys that were marked redundant in this way. Oversights like the one
fixed by commit f22e17f7 are no longer possible.
The new scheme also makes it easier for _bt_preprocess_keys to output a
so->keyData[] scan key array with _more_ scan keys than it was passed in
its scan->keyData[] input scan key array. An upcoming patch that adds
skip scan optimizations to nbtree will take advantage of this.
In passing, remove and rename certain _bt_preprocess_keys variables to
make the difference between our input scan key array and our output scan
key array clearer.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
|
|
Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
parallel index scans work with the new design for arrays via explicit
scheduling of primitive index scans. Under this scheme a parallel index
scan with array keys will perform the same number of index descents as
an equivalent serial index scan (barring corner cases where an
individual parallel worker discovers that it can advance the scan's
array keys without anybody needing to perform another descent of the
index to get to the relevant page on the leaf level).
Despite all this, the pgstats accounting wasn't updated; it continued to
increment the total number of index scans for the rel once per _bt_first
call, no matter the details. As a result, the number of (primitive)
index scans could be over-counted during parallel scans.
To fix, delay incrementing the count of index scans until after we've
established that another descent of the index (using either _bt_search
or _bt_endpoint) is required. That way pg_stat_user_tables.idx_scan
always advances in the same way, regardless of whether or not the scan
makes use of parallelism.
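Schematically, the change looks like the sketch below (simplified
control flow with stand-in routine names, not the real _bt_first): the
counter is bumped only on the path that actually descends the index.

    #include <stdbool.h>

    /* Stand-in routines, declared only to keep the sketch self-contained */
    extern bool seize_scan(void *scan);        /* parallel-scan handshake */
    extern void descend_index(void *scan);     /* _bt_search/_bt_endpoint */
    extern void count_index_scan(void *rel);   /* pgstat counter bump */

    /*
     * Before: the counter was bumped once per _bt_first call, even when a
     * parallel worker ended up not descending the index at all.  After: it
     * is bumped only once another descent is known to be required, so
     * serial and parallel scans advance idx_scan identically.
     */
    static void
    first_tuple(void *scan, void *rel)
    {
        if (!seize_scan(scan))
            return;                 /* another worker already did the work */

        count_index_scan(rel);      /* moved: count the descent we now know about */
        descend_index(scan);
    }
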
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=E7XrkvscBN0U6V81NK3Q-dQOmivvbEsjG-zwEfDdFpg@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
|
|
Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
parallel index scans work with the new design for arrays via explicit
scheduling of primitive index scans. A backend that successfully
scheduled the scan's next primitive index scan saved its backend local
array keys in shared memory. Any backend could pick up the scheduled
primitive scan within _bt_first. This scheme decouples scheduling a
primitive scan from starting the scan (by performing another descent of
the index via a _bt_search call from _bt_first) to make things robust.
The scheme had a deadlock hazard, at least when the leader process
participated in the scan. _bt_parallel_seize had a code path that made
backends that were not in an immediate position to start a scheduled
primitive index scan wait for some other backend to do so instead.
Under the right circumstances, the leader process could wait here
forever: the leader would wait for any other backend to start the
primitive scan, while every worker was busy waiting on the leader to
consume tuples from the scan's tuple queue.
To fix, don't wait for a scheduled primitive index scan to be started by
some other eligible backend from within _bt_parallel_seize (when the
calling backend isn't in a position to do so itself). Return false
instead, while recording that the scan has a scheduled primitive index
scan in backend local state. This leaves the backend in the same state
as the existing case where a backend schedules (or tries to schedule)
another primitive index scan from within _bt_advance_array_keys, before
calling _bt_parallel_seize. _bt_parallel_seize already handles that
case by returning false without waiting, and without unsetting the
backend local state. Leaving the backend in this state enables it to
start a previously scheduled primitive index scan once it gets back to
_bt_first.
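A rough sketch of the changed seize logic (hypothetical names and a
heavily reduced model of the shared state; the real function also
hands off block numbers and manages the parallel scan's state
machine): rather than sleeping until some other backend starts the
scheduled primitive scan, the caller records the pending primitive
scan locally and returns false, so it can start the scan itself once
it reaches _bt_first.

    #include <stdbool.h>

    /* Hypothetical, heavily reduced model of the relevant state */
    typedef struct SharedScan { bool primScanScheduled; } SharedScan;
    typedef struct LocalScan  { bool needPrimScan; } LocalScan;

    /*
     * Returns true if the caller now owns the scan and may read the next
     * page.  When a primitive index scan is scheduled but this backend
     * cannot start it right away, we no longer wait for someone else to do
     * so (which could deadlock with a leader that alone drains the tuple
     * queues); instead we remember the pending primitive scan locally and
     * return false, so the caller starts it itself from _bt_first.
     */
    static bool
    seize_parallel_scan(SharedScan *shared, LocalScan *so,
                        bool can_start_prim_scan)
    {
        if (shared->primScanScheduled && !can_start_prim_scan)
        {
            so->needPrimScan = true;    /* start it ourselves, later */
            return false;               /* ...but do not block here */
        }
        return true;
    }
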
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Matthias van de Meent, with tweaks by me.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reported-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
|
|
The index access methods all had similar code that copied the
passed-in scan keys to local storage. They all used memmove() for
that, which is not wrong, but it seems confusing not to use memcpy()
when that would work. Presumably, this was all once copied from
ancient code and never adjusted.
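The pattern in question, as a small standalone sketch (ScanKeyData
reduced to a dummy struct): the destination is freshly allocated
scan-local storage, so the regions can never overlap, and memcpy()
states that more clearly than memmove().

    #include <stdlib.h>
    #include <string.h>

    /* Dummy stand-in for ScanKeyData */
    typedef struct ScanKeyData { int sk_attno; int sk_strategy; } ScanKeyData;

    /*
     * Copy the caller's scan keys into scan-local storage.  The destination
     * was just allocated, so it cannot overlap the source; plain memcpy()
     * is sufficient and says so more clearly than memmove().
     */
    static ScanKeyData *
    copy_scan_keys(const ScanKeyData *src, int nkeys)
    {
        ScanKeyData *dst = malloc(sizeof(ScanKeyData) * nkeys);

        if (dst != NULL && nkeys > 0)
            memcpy(dst, src, sizeof(ScanKeyData) * nkeys);
        return dst;
    }
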
Discussion: https://www.postgresql.org/message-id/flat/f8c739d9-f48d-4187-b214-df3391ba41ab@eisentraut.org
|
|
The only current implementation is for btree where it calls
_bt_getrootheight(). Other index types can now also use this to pass
information to their amcostestimate routine. Previously, btree was
hardcoded and other index types could not hook into the optimizer at
this point.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
Add bounds checking to nbtree's lookahead/skip-within-a-page mechanism.
Otherwise it's possible for cases with lots of before-array-keys tuples
to overflow an int16 variable, causing the mechanism to generate an out
of bounds page offset number.
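In spirit, the overflow hazard and its fix look like the sketch below
(hypothetical variable names): the offset-plus-distance sum is
computed in a wider type and clamped to the page's last offset before
it is used, instead of being accumulated in an int16 that can wrap.

    #include <stdint.h>

    /*
     * Sketch of a page-offset lookahead that cannot overflow.  With an
     * int16 accumulator, curoffset + lookahead can wrap on pages with very
     * many before-array-keys tuples, yielding an out-of-bounds offset.
     * Doing the arithmetic in int and clamping to the last item avoids it.
     */
    static int16_t
    clamped_lookahead(int16_t curoffset, int16_t lookahead, int16_t maxoff)
    {
        int target = (int) curoffset + (int) lookahead;   /* widen first */

        if (target > (int) maxoff)
            target = (int) maxoff;      /* never point past the last item */
        return (int16_t) target;
    }
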
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/6c68ac42-bbb5-8b24-103e-af0e279c536f@gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
|
|
Declare _bt_moveright() static. This is a minor modularity win; the
routine was already private to nbtsearch.c for all practical purposes.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAEze2WgWVzCNEXQB_op5MMZMDgJ3fg3AhVm6bq2iZPpJNXGhWw@mail.gmail.com
|
|
Teach nbtree backwards scans to avoid relocking a just-read leaf page to
read its current left sibling link when it isn't truly necessary. This
happened inside _bt_readnextpage whenever _bt_readpage had already
determined that there'll be no further matches to the left (or at least
none for the current primitive index scan, for a scan with array keys).
A new precheck inside _bt_readnextpage is all that we need to avoid
these useless lock acquisitions. Arguably, using a precheck like this
was a missed opportunity for commit 2ed5b87f96, which taught nbtree to
drop leaf page pins early to avoid blocking cleanup by VACUUM. Forwards
scans already managed to avoid relocking the page like this.
The optimization added by this commit is particularly helpful with
backwards scans that use array keys where the scan must perform multiple
primitive index scans. Such backwards scans will now avoid a useless
leaf page re-lock at the end of each primitive index scan.
Note that this commit does not attempt to avoid needlessly re-locking a
leaf page that was just read when the scan must follow the leaf page's
left link. That more ambitious optimization could work by stashing the
left link when the page is first read by a backwards scan, allowing the
subsequent _bt_readnextpage call to optimistically skip re-reading the
original page just to get a new copy of its left link. For now we only
address cases where we don't care about our original page's left link.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=xgs7PojG=EUvhgadwENzu_mY_riNh-w9wFPsaS717ew@mail.gmail.com
|
|
_bt_advance_array_keys didn't take sufficient care at the point where it
decides whether to start a new primitive index scan based on a call to
_bt_check_compare against finaltup (a call with the scan direction
flipped around). The final decision was conditioned on rules about how
the scan key offset sktrig that initially triggered array advancement
(passed to _bt_advance_array_keys from its _bt_checkkeys caller)
compared to the offset set by its own _bt_check_compare finaltup call.
This approach was faulty, in that it allowed _bt_advance_array_keys to
incorrectly start a new primitive index scan that landed on the same
leaf page (on assert-enabled builds it led to an assertion failure).
In general, scans with array keys are expected to never have to read the
same leaf page more than once (barring cases involving cursors, and
cases where the scan restores a marked position for the inner side of a
merge join). This principle was established by commit 5bf748b8.
To fix, make the final decision based on whether the scan key offset set
by the _bt_check_compare finaltup call is an offset to an inequality
strategy scan key. An unsatisfied required inequality strategy scan key
indicates that all of the scan's required equality strategy scan keys
must also be satisfied by finaltup (not just by caller's tuple), and
that there is a decent chance that _bt_first will be able to reposition
the scan to a position many leaf pages ahead of the current leaf page.
Oversight in commit 5bf748b8.
Discussion: https://postgr.es/m/CAH2-Wz=DyHbcg7o6zXqzyiin8WE8vzk4tvU8Lrnh-a=EAvO0TQ@mail.gmail.com
|
|
Certain cases involving the use of cursors had assertion failures within
_bt_preprocess_keys's recently added no-op return path. The assertion
in question made the faulty assumption that a second or third call to
_bt_preprocess_keys (within the same btrescan) could only happen when
another scheduled primitive index scan was just about to begin.
It would be possible to address the problem by only allowing scans that
have array keys to take the new no-op path, forcing affected cases to
perform redundant preprocessing work. It seems simpler to just remove
the assertion, and reframe the no-op path as a more general mechanism.
Take this simpler approach.
The important underlying principle is that we only need to perform
preprocessing once per btrescan (at most). This is expected regardless
of whether or not the scan happens to have array keys.
Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/ef0f7c8b-a6fa-362e-6fd6-054950f947ca@gmail.com
|
|
This led to spurious assertion failures in certain scenarios involving
pseudo types.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48f5rDOwxaT76Zd40m7n9iGZQcjEk7vG_5p3YWNh6oPfA@mail.gmail.com
|
|
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.
Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
|
|
Preprocessing for nbtree index scans allowed array "input" scan keys
already marked eliminated during array-specific preprocessing to be
"fixed up" during preprocessing proper. This allowed eliminated scan
keys on DESC index columns to spuriously have their strategy commuted,
causing assertion failures.
To fix, teach _bt_fix_scankey_strategy to ignore these scan keys. This
brings it in line with its only caller, _bt_preprocess_keys.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Donghang Lin <donghanglin@gmail.com>
Discussion: https://postgr.es/m/CAA=D8a2sHK6CAzZ=0CeafC-Y-MFXbYxnRSHvZTi=+JHu6kAa8Q@mail.gmail.com
|
|
Oversight in commit c9c0589fda.
|
|
One of the assertions was the subject of a false positive complaint from
Coverity, but none of the assertions added much, so get rid of them.
Reported-By: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us
|
|
nbtree index scans with SAOP inequalities (but no SAOP equalities)
performed extra ORDER proc lookups for any remaining equality strategy
scan keys. This could waste cycles, and caused assertion failures.
Keeping around a separate ORDER proc is only necessary for a scan's
non-array/non-SAOP equality scan keys when the scan has at least one
other SAOP equality strategy key (a SAOP inequality shouldn't count).
To fix, replace _bt_preprocess_array_keys_final's assertion with a test
that makes the function return early when the scan has no SAOP equality
scan keys.
Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com
|
|
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).
Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient. All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.
The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value. Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan. Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
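The array-advancement step described above is, at heart, a binary
search. A minimal standalone sketch (integer array elements and an
ascending scan direction assumed, instead of datums and per-opclass
comparators): find the smallest array element >= the next tuple's
attribute value, which is where a required array ratchets to.

    #include <stdbool.h>

    /*
     * Advance a required array key to the smallest element >= tupval,
     * returning false when the array is exhausted for the current (forward)
     * scan direction.  The real code works with datums and per-opclass
     * comparators; plain ints keep the sketch self-contained.
     */
    static bool
    advance_array_key(const int *elems, int nelems, int tupval, int *cur_elem)
    {
        int lo = 0, hi = nelems;        /* binary search over [lo, hi) */

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (elems[mid] < tupval)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (lo == nelems)
            return false;               /* no element can match anymore */
        *cur_elem = lo;                 /* ratchet forward, never backward */
        return true;
    }
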
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes. They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem). They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).
There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions. Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
|
|
It went against the intention of skipping the prechecking-keys
optimization on the first page of range queries, so as not to affect
point-query performance.
Reported-by: Anton Melnikov
Discussion: https://postgr.es/m/30cd7524-b9f1-4cf8-9c4a-223eb2e34441%40postgrespro.ru
Author: Pavel Borisov
|
|
Oversight in commit c2fe139c20.
|
|
as determined by include-what-you-use (IWYU)
While IWYU also suggests *adding* a bunch of #include's (which is its
main purpose), this patch does not do that. In some cases, a more
specific #include replaces another less specific one.
Some manual adjustments of the automatic result:
- IWYU currently doesn't know about includes that provide global
variable declarations (like -Wmissing-variable-declarations), so
those includes are being kept manually.
- All includes for port(ability) headers are being kept for now, to
play it safe.
- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
patch from exploding in size.
Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.
As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.
Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
|
|
The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.
The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.
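The behavioral difference for small relations, sketched with
hypothetical stand-in routines (deliberately not the actual smgr or
bulk-write function names): the old path synced the relation
immediately at the end of the build, while the new facility just
registers the sync to happen at the next checkpoint.

    #include <stdbool.h>

    /* Hypothetical stand-ins; deliberately not the real smgr entry points */
    extern void sync_relation_immediately(void *rel);        /* old behavior */
    extern void register_sync_at_next_checkpoint(void *rel); /* new behavior */

    /*
     * Finish a bulk load.  Deferring the fsync to the next checkpoint
     * removes the per-relation sync latency, which adds up when restoring
     * a schema-only dump with many small relations.  (WAL-logging several
     * pages per record, the other win noted above, is omitted here.)
     */
    static void
    finish_bulk_load(void *rel, bool use_new_facility)
    {
        if (use_new_facility)
            register_sync_at_next_checkpoint(rel);
        else
            sync_relation_immediately(rel);
    }
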
It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already; this
moves the buffering to the new facility.
The changes to the pageinspect GiST test need an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
|
|
Commit 6b80394781 introduced integer comparison functions designed
to be as efficient as possible while avoiding overflow. This
commit makes use of these functions in many of the in-tree qsort()
comparators to help ensure transitivity. Many of these comparator
functions should also see a small performance boost.
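The hazard being avoided, as a small standalone example: a
subtraction-based comparator can overflow and break transitivity,
whereas the two-comparison form cannot. The in-tree helpers introduced
by 6b80394781 wrap this pattern; the names below are just for the
sketch.

    #include <limits.h>
    #include <stdio.h>

    /*
     * Unsafe comparator: "return a - b" overflows (undefined behavior) for
     * inputs such as a = INT_MAX, b = -1, and can make the sort order
     * non-transitive.  Shown only as the pattern being replaced.
     */
    static int
    cmp_by_subtraction(int a, int b)
    {
        return a - b;
    }

    /* Safe form: always -1, 0, or 1, and it never overflows */
    static int
    cmp_int32(int a, int b)
    {
        return (a > b) - (a < b);
    }

    int
    main(void)
    {
        printf("cmp(INT_MAX, -1) = %d\n", cmp_int32(INT_MAX, -1));
        printf("cmp(-1, INT_MAX) = %d\n", cmp_int32(-1, INT_MAX));
        (void) cmp_by_subtraction;      /* hazard shown, not exercised */
        return 0;
    }
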
Author: Mats Kindahl
Reviewed-by: Andres Freund, Fabrízio de Royes Mello
Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
|
|
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
Backpatch-through: 12
|
|
Dagfinn Ilmari Mannsåker, reviewed by Shubham Khanna. Some subtractions
by me.
Discussion: http://postgr.es/m/87le9fmi01.fsf@wibble.ilmari.org
|
|
e0b1ee17dc introduced an optimization for matching B-tree scan keys
required for the directional scan. However, it incorrectly assumed that
all keys required for the opposite-direction scan are satisfied by
_bt_first(). It has been shown that with multiple scan keys over the same
column, a lesser one (according to the scan direction) could win, leaving
the other one unsatisfied.
Instead of relying on _bt_first(), this commit introduces code that
remembers whether there was at least one match on the page. If so, we
know that the keys required for the opposite-direction scan are satisfied
as long as the corresponding values are not NULL.
Also, this commit simplifies the description of the optimization for keys
required for the current-direction scan. The flag used for it is now
named continuescanPrechecked and means exactly that the *continuescan
flag is known to be true for the last item on the page.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzn0LeLcb1PdBnK0xisz8NpHkxRrMr3NWJ%2BKOK-WZ%2BQtTQ%40mail.gmail.com
Reviewed-by: Pavel Borisov
|
|
It's not necessary to keep the firstPage flag as a field of BTScanOpaqueData.
This commit makes it an argument of the _bt_readpage() function. We can easily
distinguish first-time and repeated calls (within the scan) of this function.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzk4SOsw%2BtHuTFiz8U9Jqj-R77rYPkhWKODCBb1mdHACXA%40mail.gmail.com
Reviewed-by: Pavel Borisov
|
|
Remove comments that supposed that holding a pin was a useful interlock
for _bt_walk_left(). There are times when _bt_walk_left() doesn't hold
either a lock or a pin on any page, so clearly this can't be true.
_bt_walk_left() is even prepared to deal with concurrent deletion of
both the original page and any pages to its left.
Oversight in commit 2ed5b87f96.
|
|
Teach _bt_binsrch (and related helper routines like _bt_search and
_bt_compare) about the initial positioning requirements of backward
scans. Routines like _bt_binsrch already know all about "nextkey"
searches, so it seems natural to teach them about "goback"/backward
searches, too. These concepts are closely related, and are much easier
to understand when discussed together.
Now that certain implementation details are hidden from _bt_first, it's
straightforward to add a new optimization: backward scans using the <
strategy now avoid extra leaf page accesses in certain "boundary cases".
Consider the following example, which uses the tenk1 table (and its
tenk1_hundred index) from the standard regression tests:
SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;
Before this commit, nbtree would scan two leaf pages, even though it was
only really necessary to scan one leaf page. We'll now descend straight
to the leaf page containing a (12, -inf) high key instead. The scan
will locate matching non-pivot tuples with "hundred" values starting
from the value 11. The scan won't waste a page access on the right
sibling leaf page, which cannot possibly contain any matching tuples.
You can think of the optimization added by this commit as disabling an
optimization (the _bt_compare "!pivotsearch" behavior that was added to
Postgres 12 in commit dd299df8) for a small subset of cases where it was
always counterproductive.
Equivalently, you can think of the new optimization as extending the
"pivotsearch" behavior that page deletion by VACUUM has long required
(since the aforementioned Postgres 12 commit went in) to other, similar
cases. Obviously, this isn't strictly necessary for these new cases
(unlike VACUUM, _bt_first is prepared to move the scan to the left once
on the leaf level), but the underlying principle is the same.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
|
|
Allow using multiple worker processes to build BRIN index, which until
now was supported only for BTREE indexes. For large tables this often
results in significant speedup when the build is CPU-bound.
The work is split in a simple way - each worker builds BRIN summaries on
a subset of the table, determined by the regular parallel scan used to
read the data, and feeds them into a shared tuplesort which sorts them
by blkno (start of the range). The leader then reads this sorted stream
of ranges, merges duplicates (which may happen if the parallel scan does
not align with BRIN pages_per_range), and adds the resulting ranges into
the index.
The number of duplicate results produced by workers (requiring merging
in the leader process) should be fairly small, thanks to how parallel
scans assign chunks to workers. The likelihood of duplicate results may
increase for higher pages_per_range values, but then there are fewer
page ranges in total. In any case, we expect the merging to be much
cheaper than summarization, so this should be a win.
Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for BRIN indexes
(e.g. uniqueness checks).
This also introduces a new index AM flag amcanbuildparallel, determining
whether to attempt to start parallel workers for the index build.
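The leader-side merge described above, as a simplified standalone
sketch (plain structs and stand-in I/O routines instead of BRIN tuples
and a shared tuplesort): consecutive entries in the blkno-sorted
stream that describe the same page range are unioned before the
result is inserted into the index.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-in for one worker-produced range summary */
    typedef struct RangeSummary
    {
        uint32_t blkno;    /* first block of the page range */
        int      minval;   /* toy summary: min/max of one int column */
        int      maxval;
    } RangeSummary;

    /* Stand-ins for reading the blkno-sorted stream and inserting results */
    extern bool next_sorted_summary(RangeSummary *out);
    extern void insert_range(const RangeSummary *summary);

    /*
     * Merge duplicates in the sorted stream of worker-produced summaries.
     * Two workers can summarize parts of the same range when the parallel
     * scan does not align with pages_per_range; union such entries before
     * inserting the combined range into the index.
     */
    static void
    leader_merge_loop(void)
    {
        RangeSummary cur, next;
        bool         have = next_sorted_summary(&cur);

        while (have)
        {
            while ((have = next_sorted_summary(&next)) && next.blkno == cur.blkno)
            {
                if (next.minval < cur.minval)
                    cur.minval = next.minval;
                if (next.maxval > cur.maxval)
                    cur.maxval = next.maxval;
            }
            insert_range(&cur);
            if (have)
                cur = next;
        }
    }
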
Original patch by me, with reviews and substantial reworks by Matthias
van de Meent, certainly enough to make him a co-author.
Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent
Discussion: https://postgr.es/m/c2ee7d69-ce17-43f2-d1a0-9811edbda6e6%40enterprisedb.com
|
|
The brininsert code used to initialize (and destroy) BrinDesc and
BrinRevmap for each tuple, which is not free. This patch initializes
these structures only once, and reuses them for all inserts in the same
command. The data is passed through indexInfo->ii_AmCache.
This also introduces an optional AM callback "aminsertcleanup" that
allows performing custom cleanup in case simply pfree-ing ii_AmCache is
not sufficient (which is the case when the cache contains TupleDesc,
Buffers, and so on).
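Schematically (the cache struct and helper routines below are
hypothetical; ii_AmCache is the existing per-IndexInfo cache pointer
mentioned above): the expensive per-relation state is built on the
first insert of a command and then reused, with an
aminsertcleanup-style callback releasing anything that a plain pfree
of ii_AmCache would not.

    #include <stddef.h>

    /* Simplified stand-ins for this sketch */
    typedef struct IndexInfoSketch { void *ii_AmCache; } IndexInfoSketch;
    typedef struct InsertCache
    {
        /* real code: BrinDesc, revmap access state, pinned buffers, ... */
        int dummy;
    } InsertCache;

    extern InsertCache *build_insert_cache(void);      /* expensive setup */
    extern void release_insert_cache(InsertCache *c);  /* buffers etc. */

    /* Per-tuple insert path: build the cache once, then keep reusing it */
    static InsertCache *
    get_insert_cache(IndexInfoSketch *indexInfo)
    {
        if (indexInfo->ii_AmCache == NULL)
            indexInfo->ii_AmCache = build_insert_cache();
        return (InsertCache *) indexInfo->ii_AmCache;
    }

    /* What an aminsertcleanup-style callback would do at end of command */
    static void
    insert_cleanup(IndexInfoSketch *indexInfo)
    {
        if (indexInfo->ii_AmCache != NULL)
        {
            release_insert_cache((InsertCache *) indexInfo->ii_AmCache);
            indexInfo->ii_AmCache = NULL;
        }
    }
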
Author: Soumyadeep Chakraborty
Reviewed-by: Alvaro Herrera, Matthias van de Meent, Tomas Vondra
Discussion: https://postgr.es/m/CAE-ML%2B9r2%3DaO1wwji1sBN9gvPz2xRAtFUGfnffpd0ZqyuzjamA%40mail.gmail.com
|
|
Since C99, there can be a trailing comma after the last value in an
enum definition. A lot of new code has been introducing this style on
the fly. Some new patches are now taking an inconsistent approach to
this. Some add the last comma on the fly if they add a new last
value, some are trying to preserve the existing style in each place,
some are even dropping the last comma if there was one. We could
nudge this all in a consistent direction if we just add the trailing
commas everywhere once.
I omitted a few places where there was a fixed "last" value that will
always stay last. I also skipped the header files of libpq and ecpg,
in case people want to use those with older compilers. There were
also a small number of cases where the enum type wasn't used anywhere
(but the enum values were), which ended up confusing pgindent a bit,
so I left those alone.
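For the avoidance of doubt, this is the style being standardized on
(valid since C99); the enum itself is purely illustrative:

    typedef enum ExampleLevel       /* purely illustrative enum */
    {
        EXAMPLE_MINIMAL,
        EXAMPLE_REPLICA,
        EXAMPLE_LOGICAL,            /* trailing comma: adding a new last
                                     * value later touches only its own line */
    } ExampleLevel;
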
Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
|
|
Additionally, add a missing "the" in a couple of places.
Author: Vignesh C, Dagfinn Ilmari Mannsåker
Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
|
|
Reported-by: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4_kHMJDak75y1kBTirv-drS1-knT-7Mpg5LprAjqRJDVA%40mail.gmail.com
|
|
Reported-by: Alexander Lakhin
|
|
Currently, B-tree code matches every scan key to every item on the page.
Imagine an ordered B-tree scan for a query like this:
SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;
The (col > 'a') scan key will always be matched once we find the location
to start the scan. The (col < 'b') scan key will match every item on the
page as long as it matches the last item on the page.
This patch implements prechecking of the scan keys required for the
directional scan at the beginning of each page scan. If the precheck
succeeds, we can skip those scan key checks for the individual items on
the page. That can lead to significant acceleration, especially if the
comparison operator is expensive.
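A simplified standalone sketch of the precheck (integer keys and a
forward scan assumed, with one required upper-bound key standing in
for something like col < 'b'): if the page's last item satisfies the
keys required for the scan direction, no earlier item on the page can
fail them, so their per-item checks can be skipped.

    #include <stdbool.h>

    /*
     * Toy model of a forward scan reading one page of sorted values, with
     * required_upper_bound standing in for a key like (col < 'b') that is
     * required for the scan direction: once the last (largest) item on the
     * page satisfies it, every item on the page does, so the per-item
     * check of that key can be skipped.
     */
    static int
    read_page(const int *items, int nitems, int required_upper_bound,
              int *matches, bool *continuescan)
    {
        int  nmatches = 0;
        bool precheck_ok = (nitems > 0 &&
                            items[nitems - 1] < required_upper_bound);

        for (int i = 0; i < nitems; i++)
        {
            if (!precheck_ok && !(items[i] < required_upper_bound))
            {
                *continuescan = false;  /* this and all later items fail */
                return nmatches;
            }
            matches[nmatches++] = items[i];  /* check skipped if prechecked */
        }
        *continuescan = true;
        return nmatches;
    }
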
Idea from patch by Konstantin Knizhnik.
Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
Reviewed-by: Peter Geoghegan, Pavel Borisov
|