path: root/src/backend/access/nbtree/nbtutils.c
2024-11-01  Clarify nbtree array preprocessing comment.  (Peter Geoghegan)
Oversight in commit 5bf748b8.
2024-10-30  Clarify nbtree array exhaustion comments.  (Peter Geoghegan)
Strictly speaking, we only need to make sure to leave the scan's array keys in their final positions (final for the current scan direction) to handle SAOP array exhaustion because btgettuple might only return a subset of the items for the final page (final for the current scan direction), before the scan changes direction. While it's typical for so->currPos to be invalidated shortly after the scan's arrays are first exhausted, and while so->currPos invalidation does obviate the need to leave the scan's arrays in any particular state, we can't rely on any of that actually happening when handling array exhaustion. Adjust comments to make all of that a lot clearer. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution.
2024-10-30  Fix bug in nbtree array primitive scan scheduling.  (Peter Geoghegan)
A bug in nbtree's handling of primitive index scan scheduling could lead to wrong answers when a scrollable cursor was used with an index scan that had a SAOP index qual. Wrong answers were only possible when the scan direction changed after a primitive scan was scheduled, but before _bt_next was asked to fetch the next tuple in line (i.e. for things to break, _bt_next had to be denied the opportunity to step off the page in the same direction as the one used when the primscan was scheduled). Furthermore, the issue only occurred when the page in question happened to be the first page to be visited by the entire top-level scan; the issue hinged upon the cursor backing up to the absolute beginning of the key space that it returns tuples from (fetching in the opposite scan direction across a "primitive scan boundary" always worked correctly). To fix, make _bt_next unset the "needs primitive index scan" flag when it detects that the current scan direction is not the one that was used by _bt_readpage back when the primitive scan in question was scheduled. This fixes the cases that are known to be faulty, and also seems like a good idea on general robustness grounds. Affected scrollable cursor cases now avoid a spurious primitive index scan when they fetch backwards to the absolute start of the key space to be visited by their cursor. Fetching backwards now only returns those tuples at the start of the scan, as expected. It'll also be okay to once again fetch forwards from the start at that point, since the scan will be left in a state that's exactly consistent with the state it was in before any tuples were ever fetched, as expected. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com Backpatch: 17-, where commit 5bf748b8 first appears.
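For illustration only (the table, index, and array values here are hypothetical, not taken from the commit), the affected pattern is a scrollable cursor over a SAOP index scan that changes direction right at the start of the key space:

    BEGIN;
    DECLARE c SCROLL CURSOR FOR
        SELECT * FROM t WHERE a = ANY ('{1, 3, 5}') ORDER BY a;
    FETCH FORWARD 10 FROM c;    -- may schedule a primitive scan on the first leaf page
    FETCH BACKWARD ALL FROM c;  -- back up to the absolute start of the key space
    FETCH FORWARD 10 FROM c;    -- must return the same tuples as the first fetch
    COMMIT;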
2024-10-18  Optimize nbtree backwards scans.  (Peter Geoghegan)
Make nbtree backwards scans optimistically access the next page to be read to the left by following a prevPage block number that's now stashed in currPos when the leaf page is first read. This approach matches the one taken during forward scans, which follow a symmetric nextPage block number from currPos. We stash both a prevPage and a nextPage, since the scan direction might change (when fetching from a scrollable cursor). Backwards scans will no longer need to lock the same page twice, except in rare cases where the scan detects a concurrent page split (or page deletion). Testing has shown this optimization to be particularly effective during parallel index-only backwards scans: ~12% reductions in query execution time are quite possible. We're much better off being optimistic; concurrent left sibling page splits are rare in general. It's possible that we'll need to lock more pages than the pessimistic approach would have, but only when there are _multiple_ concurrent splits of the left sibling page we now start at. If there's just a single concurrent left sibling page split, the new approach to scanning backwards will at least break even relative to the old one (we'll acquire the same number of leaf page locks as before). The optimization from this commit has long been contemplated by comments added by commit 2ed5b87f96, which changed the rules for locking/pinning during nbtree index scans. The approach that that commit introduced to leaf level link traversal when scanning forwards is now more or less applied all the time, regardless of the direction we're scanning in. Following uniform conventions around sibling link traversal is simpler. The only real remaining difference between our forward and backwards handling is that our backwards handling must still detect and recover from any concurrent left sibling splits (and concurrent page deletions), as documented in the nbtree README. That is structured as a single, isolated extra step that takes place in _bt_readnextpage. Also use this opportunity to further simplify the functions that deal with reading pages and traversing sibling links on the leaf level, and to document their preconditions and postconditions (with respect to things like buffer locks, buffer pins, and seizing the parallel scan). This enhancement completely supersedes the one recently added by commit 3f44959f. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
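A hypothetical query shape that benefits (table and index invented for illustration): a backwards index-only scan over an index on t(a), which now follows the stashed prevPage link instead of relocking the page it just read:

    SELECT a FROM t WHERE a BETWEEN 1 AND 1000000 ORDER BY a DESC;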
2024-10-16  Normalize nbtree truncated high key array behavior.  (Peter Geoghegan)
Commit 5bf748b8 taught nbtree ScalarArrayOp index scans to decide when and how to start the next primitive index scan based on physical index characteristics. This included rules for deciding whether to start a new primitive index scan (or whether to move onto the right sibling leaf page instead) that specifically consider truncated lower-order columns (-inf columns) from leaf page high keys. These omitted columns were treated as satisfying the scan's required scan keys, though only for scan keys marked required in the current scan direction (forward). Scan keys that didn't get this behavior (those marked required in the backwards direction only) usually didn't give the scan reasonable cause to reposition itself to a later leaf page (via another descent of the index in _bt_first), but _bt_advance_array_keys would nevertheless always give up by forcing another call to _bt_first. _bt_advance_array_keys was unwilling to allow the scan to continue onto the next leaf page, to reconsider whether we really should start another primitive scan based on the details of the sibling page's tuples. This didn't match its behavior with similar cases involving keys required in the current scan direction (forward), which seems unprincipled. It led to an excessive number of primitive scans/index descents for queries with a higher-order = array scan key (with dense, contiguous values) mixed with a lower-order required > or >= scan key. Bring > and >= strategy scan keys in line with other required scan key types: treat truncated -inf scan keys as having satisfied scan keys required in either scan direction (forwards and backwards alike) during array advancement. That way affected scans can continue to the right sibling leaf page. Advancement must now schedule an explicit recheck of the right sibling page's high key in cases involving > or >= scan keys. The recheck gives the scan a way to back out and start another primitive index scan (we can't just rely on _bt_checkkeys with > or >= scan keys). This work can be considered a stand alone optimization on top of the work from commit 5bf748b8. But it was written in preparation for an upcoming patch that will add skip scan to nbtree. In practice scans that use "skip arrays" will tend to be much more sensitive to any implementation deficiencies in this area. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
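An illustrative query of the affected shape (the table, its composite index on (a, b), and the values are hypothetical): a higher-order = array scan key with dense, contiguous values combined with a lower-order required >= scan key:

    SELECT * FROM t WHERE a = ANY ('{1, 2, 3, 4, 5}') AND b >= 100;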
2024-09-24  Update obsolete nbtree array preprocessing comments.  (Peter Geoghegan)
The array->scan_key references fixed up at the end of preprocessing start out as offsets into the arrayKeyData[] array (the array returned by _bt_preprocess_array_keys at the start of preprocessing that involves array scan keys). Offsets into the arrayKeyData[] array are no longer guaranteed to be valid offsets into our original scan->keyData[] input scan key array, but comments describing the array->scan_key references still talked about scan->keyData[]. Update those comments. Oversight in commit b5249741.
2024-09-21  Refactor handling of nbtree array redundancies.  (Peter Geoghegan)
Teach _bt_preprocess_array_keys to eliminate redundant array equality scan keys directly, rather than just marking them as redundant. Its _bt_preprocess_keys caller is no longer required to ignore input scan keys that were marked redundant in this way. Oversights like the one fixed by commit f22e17f7 are no longer possible. The new scheme also makes it easier for _bt_preprocess_keys to output a so.keyData[] scan key array with _more_ scan keys than it was passed in its scan.keyData[] input scan key array. An upcoming patch that adds skip scan optimizations to nbtree will take advantage of this. In passing, remove and rename certain _bt_preprocess_keys variables to make the difference between our input scan key array and our output scan key array clearer. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
2024-08-26  Fix nbtree lookahead overflow bug.  (Peter Geoghegan)
Add bounds checking to nbtree's lookahead/skip-within-a-page mechanism. Otherwise it's possible for cases with lots of before-array-keys tuples to overflow an int16 variable, causing the mechanism to generate an out of bounds page offset number. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/6c68ac42-bbb5-8b24-103e-af0e279c536f@gmail.com Backpatch: 17-, where nbtree SAOP execution was enhanced.
2024-06-26  Fix nbtree array unsatisfied inequality check.  (Peter Geoghegan)
_bt_advance_array_keys didn't take sufficient care at the point where it decides whether to start a new primitive index scan based on a call to _bt_check_compare against finaltup (a call with the scan direction flipped around). The final decision was conditioned on rules about how the scan key offset sktrig that initially triggered array advancement (passed to _bt_advance_array_keys from its _bt_checkkeys caller) compared to the offset set by its own _bt_check_compare finaltup call. This approach was faulty, in that it allowed _bt_advance_array_keys to incorrectly start a new primitive index scan, that landed on the same leaf page (on assert-enabled builds it led to an assertion failure). In general, scans with array keys are expected to never have to read the same leaf page more than once (barring cases involving cursors, and cases where the scan restores a marked position for the inner side of a merge join). This principle was established by commit 5bf748b8. To fix, make the final decision based on whether the scan key offset set by the _bt_check_compare finaltup call is an offset to an inequality strategy scan key. An unsatisfied required inequality strategy scan key indicates that all of the scan's required equality strategy scan keys must also be satisfied by finaltup (not just by caller's tuple), and that there is a decent chance that _bt_first will be able to reposition the scan to a position many leaf pages ahead of the current leaf page. Oversight in commit 5bf748b8. Discussion: https://postgr.es/m/CAH2-Wz=DyHbcg7o6zXqzyiin8WE8vzk4tvU8Lrnh-a=EAvO0TQ@mail.gmail.com
2024-04-22  Remove unneeded nbtree array preprocessing assert.  (Peter Geoghegan)
Certain cases involving the use of cursors had assertion failures within _bt_preprocess_keys's recently added no-op return path. The assertion in question made the faulty assumption that a second or third call to _bt_preprocess_keys (within the same btrescan) could only happen when another scheduled primitive index scan was just about to begin. It would be possible to address the problem by only allowing scans that have array keys to take the new no-op path, forcing affected cases to perform redundant preprocessing work. It seems simpler to just remove the assertion, and reframe the no-op path as a more general mechanism. Take this simpler approach. The important underlying principle is that we only need to perform preprocessing once per btrescan (at most). This is expected regardless of whether or not the scan happens to have array keys. Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp execution. Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/ef0f7c8b-a6fa-362e-6fd6-054950f947ca@gmail.com
2024-04-21  Remove overzealous array element type assertion.  (Peter Geoghegan)
This led to spurious assertion failures in certain scenarios involving pseudo types. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Reported-By: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs48f5rDOwxaT76Zd40m7n9iGZQcjEk7vG_5p3YWNh6oPfA@mail.gmail.com
2024-04-18  Fix typos and duplicate words  (Daniel Gustafsson)
This fixes various typos, duplicated words, and tiny bits of whitespace mainly in code comments but also in docs. Author: Daniel Gustafsson <daniel@yesql.se> Author: Heikki Linnakangas <hlinnaka@iki.fi> Author: Alexander Lakhin <exclusion@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
2024-04-18  Don't try to fix eliminated nbtree array scan keys.  (Peter Geoghegan)
Preprocessing for nbtree index scans allowed array "input" scan keys already marked eliminated during array-specific preprocessing to be "fixed up" during preprocessing proper. This allowed eliminated scan keys on DESC index columns to spuriously have their strategy commuted, causing assertion failures. To fix, teach _bt_fix_scankey_strategy to ignore these scan keys. This brings it in line with its only caller, _bt_preprocess_keys. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution.
Reported-By: Donghang Lin <donghanglin@gmail.com>
Discussion: https://postgr.es/m/CAA=D8a2sHK6CAzZ=0CeafC-Y-MFXbYxnRSHvZTi=+JHu6kAa8Q@mail.gmail.com
2024-04-07  Remove redundant nbtree preprocessing assertions.  (Peter Geoghegan)
One of the assertions was the subject of a false positive complaint from Coverity, but none of the assertions added much, so get rid of them. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us
2024-04-07  Avoid extra lookups with nbtree array inequalities.  (Peter Geoghegan)
nbtree index scans with SAOP inequalities (but no SAOP equalities) performed extra ORDER proc lookups for any remaining equality strategy scan keys. This could waste cycles, and caused assertion failures. Keeping around a separate ORDER proc is only necessary for a scan's non-array/non-SAOP equality scan keys when the scan has at least one other SAOP equality strategy key (a SAOP inequality shouldn't count). To fix, replace _bt_preprocess_array_keys_final's assertion with a test that makes the function return early when the scan has no SAOP equality scan keys. Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp execution. Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com
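A hypothetical query of the affected shape (table, index on (a, b), and values invented for illustration): a SAOP inequality, no SAOP equality, and a remaining non-array equality scan key:

    SELECT * FROM t WHERE a = 42 AND b < ANY ('{10, 20}');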
2024-04-06  Enhance nbtree ScalarArrayOp execution.  (Peter Geoghegan)
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing down the full context (the array keys) to the nbtree index AM, enabling it to execute multiple primitive index scans that the planner treats as one continuous index scan/index path. This earlier enhancement enabled nbtree ScalarArrayOp index-only scans. It also allowed scans with ScalarArrayOp quals to return ordered results (with some notable restrictions, described further down). Take this general approach a lot further: teach nbtree SAOP index scans to decide how to execute ScalarArrayOp scans (when and where to start the next primitive index scan) based on physical index characteristics. This can be far more efficient. All SAOP scans will now reliably avoid duplicative leaf page accesses (just like any other nbtree index scan). SAOP scans whose array keys are naturally clustered together now require far fewer index descents, since we'll reliably avoid starting a new primitive scan just to get to a later offset from the same leaf page. The scan's arrays now advance using binary searches for the array element that best matches the next tuple's attribute value. Required scan key arrays (i.e. arrays from scan keys that can terminate the scan) ratchet forward in lockstep with the index scan. Non-required arrays (i.e. arrays from scan keys that can only exclude non-matching tuples) "advance" without the process ever rolling over to a higher-order array. Naturally, only required SAOP scan keys trigger skipping over leaf pages (non-required arrays cannot safely end or start primitive index scans). Consequently, even index scans of a composite index with a high-order inequality scan key (which we'll mark required) and a low-order SAOP scan key (which we won't mark required) now avoid repeating leaf page accesses -- that benefit isn't limited to simpler equality-only cases. In general, all nbtree index scans now output tuples as if they were one continuous index scan -- even scans that mix a high-order inequality with lower-order SAOP equalities reliably output tuples in index order. This allows us to remove a couple of special cases that were applied when building index paths with SAOP clauses during planning. Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now all safe, so we go back to generating path keys without regard for the presence of SAOP clauses (just like with any other clause type). Affected queries can now exploit scan output order in all the usual ways (e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early). Also undo changes from follow-up bugfix commit a4523c5a, which taught the planner to produce alternative index paths, with path keys, but without low-order SAOP index quals (filter quals were used instead). We'll no longer generate these alternative paths, since they can no longer offer any meaningful advantages over standard index qual paths. Affected queries thereby avoid all of the disadvantages that come from using filter quals within index scan nodes. They can avoid extra heap page accesses from using filter quals to exclude non-matching tuples (index quals will never have that problem). They can also skip over irrelevant sections of the index in more cases (though only when nbtree determines that starting another primitive scan actually makes sense). 
There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea. Author: Peter Geoghegan <pg@bowt.ie> Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
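As a hypothetical illustration of the planner-side change (the table, its index on (a, b), and the values are invented here), a high-order inequality mixed with a lower-order SAOP equality can now use the index's sort order, so an ORDER BY ... LIMIT query of this shape can terminate early:

    SELECT * FROM t WHERE a > 10 AND b = ANY ('{1, 2, 3}') ORDER BY a, b LIMIT 10;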
2024-03-04  Remove unused #include's from backend .c files  (Peter Eisentraut)
as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
2024-01-03  Update copyright for 2024  (Bruce Momjian)
Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
2023-12-27  Improvements and fixes for e0b1ee17dc  (Alexander Korotkov)
e0b1ee17dc introduced an optimization for matching B-tree scan keys required for the directional scan. However, it incorrectly assumed that all keys required for the opposite-direction scan are satisfied by _bt_first(). It has been shown that with multiple scan keys over the same column, a lesser one (according to the scan direction) could win, leaving the other one unsatisfied. Instead of relying on _bt_first(), this commit introduces code that remembers whether there was at least one match on the page. If so, we know that the keys required for the opposite-direction scan are satisfied as long as the corresponding values are not NULL. Also, this commit simplifies the description of the optimization for keys required for the current-direction scan. The flag used for this is now named continuescanPrechecked and means exactly that the *continuescan flag is known to be true for the last item on the page.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzn0LeLcb1PdBnK0xisz8NpHkxRrMr3NWJ%2BKOK-WZ%2BQtTQ%40mail.gmail.com
Reviewed-by: Pavel Borisov
2023-12-08  Optimize nbtree backward scan boundary cases.  (Peter Geoghegan)
Teach _bt_binsrch (and related helper routines like _bt_search and _bt_compare) about the initial positioning requirements of backward scans. Routines like _bt_binsrch already know all about "nextkey" searches, so it seems natural to teach them about "goback"/backward searches, too. These concepts are closely related, and are much easier to understand when discussed together. Now that certain implementation details are hidden from _bt_first, it's straightforward to add a new optimization: backward scans using the < strategy now avoid extra leaf page accesses in certain "boundary cases". Consider the following example, which uses the tenk1 table (and its tenk1_hundred index) from the standard regression tests: SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1; Before this commit, nbtree would scan two leaf pages, even though it was only really necessary to scan one leaf page. We'll now descend straight to the leaf page containing a (12, -inf) high key instead. The scan will locate matching non-pivot tuples with "hundred" values starting from the value 11. The scan won't waste a page access on the right sibling leaf page, which cannot possibly contain any matching tuples. You can think of the optimization added by this commit as disabling an optimization (the _bt_compare "!pivotsearch" behavior that was added to Postgres 12 in commit dd299df8) for a small subset of cases where it was always counterproductive. Equivalently, you can think of the new optimization as extending the "pivotsearch" behavior that page deletion by VACUUM has long required (since the aforementioned Postgres 12 commit went in) to other, similar cases. Obviously, this isn't strictly necessary for these new cases (unlike VACUUM, _bt_first is prepared to move the scan to the left once on the leaf level), but the underlying principle is the same. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
2023-10-07  Fix typos in e0b1ee17dc  (Alexander Korotkov)
Reported-by: Alexander Lakhin
2023-10-06  Skip checking of scan keys required for directional scan in B-tree  (Alexander Korotkov)
Currently, the B-tree code matches every scan key to every item on the page. Imagine an ordered B-tree scan for a query like this:

    SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;

The (col > 'a') scan key will always be matched once we find the location to start the scan. The (col < 'b') scan key will match every item on the page as long as it matches the last item on the page. This patch implements a precheck of the scan keys required for the directional scan at the beginning of each page scan. If the precheck succeeds, we can skip checking those scan keys for the items on the page. That can lead to significant acceleration, especially if the comparison operator is expensive. Idea from a patch by Konstantin Knizhnik.
Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
Reviewed-by: Peter Geoghegan, Pavel Borisov
2023-09-28  Fix btmarkpos/btrestrpos array key wraparound bug.  (Peter Geoghegan)
nbtree's mark/restore processing failed to correctly handle an edge case involving array key advancement and related search-type scan key state. Scans with ScalarArrayOpExpr quals requiring mark/restore processing (for a merge join) could incorrectly conclude that an affected array/scan key must not have advanced during the time between marking and restoring the scan's position. As a result of all this, array key handling within btrestrpos could skip a required call to _bt_preprocess_keys(). This confusion allowed later primitive index scans to overlook tuples matching the true current array keys. The scan's search-type scan keys would still have spurious values corresponding to the final array element(s) -- not values matching the first/now-current array element(s). To fix, remember that "array key wraparound" has taken place during the ongoing btrescan in a flag variable stored in the scan's state, and use that information at the point where btrestrpos decides if another call to _bt_preprocess_keys is required. Oversight in commit 70bc5833, which taught nbtree to handle array keys during mark/restore processing, but missed this subtlety. That commit was itself a bug fix for an issue in commit 9e8da0f7, which taught nbtree to handle ScalarArrayOpExpr quals natively.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
Backpatch: 11- (all supported branches).
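A sketch of the affected plan shape, with hypothetical tables and values: mark/restore is only used for the inner side of a merge join, so an affected query might look like this (assuming the planner chooses a merge join whose inner side is an index scan with the SAOP qual):

    SET enable_hashjoin = off;
    SET enable_nestloop = off;
    SELECT * FROM t1 JOIN t2 ON t1.a = t2.a WHERE t2.a = ANY ('{1, 2, 3}');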
2023-06-10  nbtree: Allocate new pages in separate function.  (Peter Geoghegan)
Split nbtree's _bt_getbuf function in two: code that read locks or write locks existing pages remains in _bt_getbuf, while code that deals with allocating new pages is moved to a new, dedicated function called _bt_allocbuf. This simplifies most _bt_getbuf callers, since it is no longer necessary for them to pass a heaprel argument. Many of the changes to nbtree from commit 61b313e4 can be reverted. This minimizes the divergence between HEAD/PostgreSQL 16 and earlier release branches. _bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to _bt_getbuf. There are only 3 affected call sites, all of which continue to pass a heaprel for recovery conflict purposes. Note that nbtree's use of P_NEW was superficial; nbtree never actually relied on the P_NEW code paths in bufmgr.c, so this change is strictly mechanical. GiST already took the same approach; it has a dedicated function for allocating new pages called gistNewBuffer(). That factor allowed commit 61b313e4 to make much more targeted changes to GiST.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
2023-04-01  Pass down table relation into more index relation functions  (Andres Freund)
This is done in preparation for logical decoding on standby, which needs to include whether visibility affecting WAL records are about a (user) catalog table. Which is only known for the table, not the indexes. It's also nice to be able to pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder(). Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-02-07  Remove useless casts to (void *) in arguments of some system functions  (Peter Eisentraut)
The affected functions are: bsearch, memcmp, memcpy, memset, memmove, qsort, repalloc Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
2023-01-02  Update copyright for 2023  (Bruce Momjian)
Backpatch-through: 11
2022-12-15  Static assertions cleanup  (Peter Eisentraut)
Because we added StaticAssertStmt() first before StaticAssertDecl(), some uses as well as the instructions in c.h are now a bit backwards from the "native" way static assertions are meant to be used in C. This updates the guidance and moves some static assertions to better places. Specifically, since the addition of StaticAssertDecl(), we can put static assertions at the file level. This moves a number of static assertions out of function bodies, where they might have been stuck out of necessity, to perhaps better places at the file level or in header files. Also, when the static assertion appears in a position where a declaration is allowed, then using StaticAssertDecl() is more native than StaticAssertStmt(). Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
2022-04-13  Remove extraneous blank lines before block-closing braces  (Alvaro Herrera)
These are useless and distracting. We wouldn't have written the code with them to begin with, so there's no reason to keep them. Author: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com Discussion: https://postgr.es/m/attachment/133167/0016-Extraneous-blank-lines.patch
2022-04-12  Revert the addition of GetMaxBackends() and related stuff.  (Robert Haas)
This reverts commits 0147fc7, 4567596, aa64f23, and 5ecd018. There is no longer agreement that introducing this function was the right way to address the problem. The consensus now seems to favor trying to make a correct value for MaxBackends available to modules executing their _PG_init() functions.
Nathan Bossart
Discussion: http://postgr.es/m/20220323045229.i23skfscdbvrsuxa@jrouhaud
2022-04-01  Add macros in hash and btree AMs to get the special area of their pages  (Michael Paquier)
This makes the code more consistent with SpGiST, GiST and GIN, which already use this style, and the idea is to make it easier to introduce more sanity checks for each of these AM-specific macros. BRIN uses a different set of macros to get a page's type and flags, so it has no need for something similar.
Author: Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
2022-02-08  Remove MaxBackends variable in favor of GetMaxBackends() function.  (Robert Haas)
Previously, it was really easy to write code that accessed MaxBackends before we'd actually initialized it, especially when coding up an extension. To make this less error-prone, introduce a new function GetMaxBackends() which should be used to obtain the correct value. This will ERROR if called too early. Demote the global variable to a file-level static, so that nobody can peek at it directly.
Nathan Bossart. Idea by Andres Freund. Review by Greg Sabino Mullane, by Michael Paquier (who had doubts about the approach), and by me.
Discussion: http://postgr.es/m/20210802224204.bckcikl45uezv5e4@alap3.anarazel.de
2022-02-03  Add UNIQUE null treatment option  (Peter Eisentraut)
The SQL standard has been ambiguous about whether null values in unique constraints should be considered equal or not. Different implementations have different behaviors. In the SQL:202x draft, this has been formalized by making this implementation-defined and adding an option on unique constraint definitions UNIQUE [ NULLS [NOT] DISTINCT ] to choose a behavior explicitly. This patch adds this option to PostgreSQL. The default behavior remains UNIQUE NULLS DISTINCT. Making this happen in the btree code is pretty easy; most of the patch is just to carry the flag around to all the places that need it. The CREATE UNIQUE INDEX syntax extension is not from the standard, it's my own invention. I named all the internal flags, catalog columns, etc. in the negative ("nulls not distinct") so that the default PostgreSQL behavior is the default if the flag is false. Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78@enterprisedb.com
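For example (table and column names are illustrative), the new option can be spelled either as a table constraint or in CREATE UNIQUE INDEX; the default behavior remains NULLS DISTINCT:

    CREATE TABLE t (a int, b int, UNIQUE NULLS NOT DISTINCT (a, b));
    CREATE UNIQUE INDEX t_a_idx ON t (a) NULLS NOT DISTINCT;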
2022-01-07  Update copyright for 2022  (Bruce Momjian)
Backpatch-through: 10
2021-10-15  Remove obsolete nbtree deduplication comments.  (Peter Geoghegan)
Follow up to commit 2903f140.
2021-10-02  Enable deduplication in system catalog indexes.  (Peter Geoghegan)
The "equality implies image equality" opclass infrastructure disallowed deduplication in system catalog indexes and TOAST indexes before now. That seemed like the right approach back when the infrastructure was added by commit 612a1ab7, since ALTER INDEX cannot set deduplicate_items to 'off' (due to an old implementation restriction). But that decision now seems arbitrary at best. Remove special case handling implementing this policy. No catversion bump, since existing catalog indexes will still work. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=rYQHFaJ3WYBdK=xgwxKzaiGMSSrh-ZCREa-pS-7Zjew@mail.gmail.com
2021-03-11  Add back vacuum_cleanup_index_scale_factor parameter.  (Peter Geoghegan)
Commit 9f3665fb removed the vacuum_cleanup_index_scale_factor storage parameter. However, that creates dump/reload hazards when moving across major versions. Add back the vacuum_cleanup_index_scale_factor parameter (though not the GUC of the same name) purely to avoid problems when using tools like pg_upgrade. The parameter remains disabled and undocumented. No backpatch to Postgres 13, since vacuum_cleanup_index_scale_factor was only disabled by REL_13_STABLE's version of master branch commit 9f3665fb in the first place -- the parameter already looks like this on REL_13_STABLE. Discussion: https://postgr.es/m/YEm/a3Ko3nKnBuVq@paquier.xyz
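In other words, an index definition dumped from an older major version, such as the following hypothetical one, should restore without error even though the parameter no longer does anything:

    CREATE INDEX t_a_idx ON t (a) WITH (vacuum_cleanup_index_scale_factor = 0.1);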
2021-03-10  Don't consider newly inserted tuples in nbtree VACUUM.  (Peter Geoghegan)
Remove the entire idea of "stale stats" within nbtree VACUUM (stop caring about stats involving the number of inserted tuples). Also remove the vacuum_cleanup_index_scale_factor GUC/param on the master branch (though just disable them on postgres 13). The vacuum_cleanup_index_scale_factor/stats interface made the nbtree AM partially responsible for deciding when pg_class.reltuples stats needed to be updated. This seems contrary to the spirit of the index AM API, though -- it is not actually necessary for an index AM's bulk delete and cleanup callbacks to provide accurate stats when it happens to be inconvenient. The core code owns that. (Index AMs have the authority to perform or not perform certain kinds of deferred cleanup based on their own considerations, such as page deletion and recycling, but that has little to do with pg_class.reltuples/num_index_tuples.) This issue was fairly harmless until the introduction of the autovacuum_vacuum_insert_threshold feature by commit b07642db, which had an undesirable interaction with the vacuum_cleanup_index_scale_factor mechanism: it made insert-driven autovacuums perform full index scans, even though there is no real benefit to doing so. This has been tied to a regression with an append-only insert benchmark [1]. Also have remaining cases that perform a full scan of an index during a cleanup-only nbtree VACUUM indicate that the final tuple count is only an estimate. This prevents vacuumlazy.c from setting the index's pg_class.reltuples in those cases (it will now only update pg_class when vacuumlazy.c had TIDs for nbtree to bulk delete). This arguably fixes an oversight in deduplication-related bugfix commit 48e12913. [1] https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com Backpatch: 13-, where autovacuum_vacuum_insert_threshold was added.
2021-01-02  Update copyright for 2021  (Bruce Momjian)
Backpatch-through: 9.5
2020-11-17  Deprecate nbtree's BTP_HAS_GARBAGE flag.  (Peter Geoghegan)
Streamline handling of the various strategies that we have to avoid a page split in nbtinsert.c. When it looks like a leaf page is about to overflow, we now perform LP_DEAD item deletion and deduplication in one central place. This greatly simplifies _bt_findinsertloc().

This has an independently useful consequence: nbtree no longer relies on the BTP_HAS_GARBAGE page level flag/hint for anything important. We still set and unset the flag in the same way as before, but it's no longer treated as a gating condition when considering if we should check for already-set LP_DEAD bits. This happens at the point where the page looks like it might have to be split anyway, so simply checking the LP_DEAD bits in passing is practically free. This avoids missing LP_DEAD bits just because the page-level hint is unset, which is probably reasonably common (e.g. it happens when VACUUM unsets the page-level flag without actually removing index tuples whose LP_DEAD-bit was set recently, after the VACUUM operation began but before it reached the leaf page in question).

Note that this isn't a big behavioral change compared to PostgreSQL 13. We were already checking for set LP_DEAD bits regardless of whether the BTP_HAS_GARBAGE page level flag was set before we considered doing a deduplication pass. This commit only goes slightly further by doing the same check for all indexes, even indexes where deduplication won't be performed.

We don't completely remove the BTP_HAS_GARBAGE flag. We still rely on it as a gating condition with pg_upgrade'd indexes from before B-tree version 4/PostgreSQL 12. That makes sense because we sometimes have to make a choice among pages full of duplicates when inserting a tuple with pre version 4 indexes. It probably still pays to avoid accessing the line pointer array of a page there, since it won't yet be clear whether we'll insert on to the page in question at all, let alone split it as a result.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz%3DYpc1PDdk8OVJDChGJBjT06%3DA0Mbv9HyTLCsOknGcUFg%40mail.gmail.com
2020-07-21  Add nbtree Valgrind buffer lock checks.  (Peter Geoghegan)
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page provides very weak guarantees, especially compared to heapam, where it's often safe to read a page while only holding a buffer pin. This commit has Valgrind enforce the following rule: it is never okay to access an nbtree buffer without holding both a pin and a lock on the buffer. A draft version of this patch detected questionable code that was cleaned up by commits fa7ff642 and 7154aa16. The code in question used to access an nbtree buffer page's special/opaque area with no buffer lock (only a buffer pin). This practice (which isn't obviously unsafe) is hereby formally disallowed in nbtree. There doesn't seem to be any reason to allow it, and banning it keeps things simple for Valgrind. The new checks are implemented by adding custom nbtree client requests (located in LockBuffer() wrapper functions); these requests are "superimposed" on top of the generic bufmgr.c Valgrind client requests added by commit 1e0dfd16. No custom resource management cleanup code is needed to undo the effects of marking buffers as non-accessible under this scheme. Author: Peter Geoghegan Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-05-16  Final pgindent run with pg_bsd_indent version 2.1.  (Tom Lane)
This is just to provide a clean basis for comparison of the results of the new version. I did fix a typo that crept into 242dfcbaf. Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
2020-05-15  Avoid killing btree items that are already dead  (Alvaro Herrera)
_bt_killitems marks btree items dead when a scan leaves the page where they live, but it does so with only share lock (to improve concurrency). This was historically okay, since killing a dead item has no consequences. However, with the advent of data checksums and wal_log_hints, this action incurs a WAL full-page-image record of the page. Multiple concurrent processes would write the same page several times, leading to WAL bloat. The probability of this happening can be reduced by only killing items if they're not already dead, so change the code to do that. The problem could be eliminated completely by having _bt_killitems upgrade to exclusive lock upon seeing a killable item, but that would reduce concurrency so it's considered a cure worse than the disease. Backpatch all the way back to 9.5, since wal_log_hints was introduced in 9.4.
Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
2020-04-29  Remove redundant _bt_killitems() buffer check.  (Peter Geoghegan)
_bt_getbuf() cannot return an invalid buffer. Oversight in commit 2ed5b87f96d.
2020-04-13  Harmonize nbtree page split point code.  (Peter Geoghegan)
An nbtree split point can be thought of as a point between two adjoining tuples from an imaginary version of the page being split that includes the incoming/new item (in addition to the items that really are on the page). These adjoining tuples are called the lastleft and firstright tuples. The variables that represent split points contained a field called firstright, which is an offset number of the first data item from the original page that goes on the new right page. The corresponding tuple from origpage was usually the same thing as the actual firstright tuple, but not always: the firstright tuple is sometimes the new/incoming item instead. This situation seems unnecessarily confusing. Make things clearer by renaming the origpage offset returned by _bt_findsplitloc() to "firstrightoff". We now have a firstright tuple and a firstrightoff offset number which are comparable to the newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers respectively. Also make sure that we are consistent about how we describe nbtree page split point state. Push the responsibility for dealing with pg_upgrade'd !heapkeyspace indexes down to lower level code, relieving _bt_split() from dealing with it directly. This means that we always have a palloc'd left page high key on the leaf level, no matter what. This enables simplifying some of the code (and code comments) within _bt_split(). Finally, restructure the page split code to make it clearer why suffix truncation (which only takes place during leaf page splits) is completely different to the first data item truncation that takes place during internal page splits. Tuples are marked as having fewer attributes stored in both cases, and the firstright tuple is truncated in both cases, so it's easy to imagine somebody missing the distinction.
2020-04-07  Remove nbtree BTreeTupleSetAltHeapTID() function.  (Peter Geoghegan)
Since heap TID is supposed to be just another key attribute to the implementation, it doesn't make much sense to have separate BTreeTupleSetNAtts() and BTreeTupleSetAltHeapTID() functions. Merge the two functions together. This slightly simplifies _bt_truncate().
2020-04-06  Fix nbtree kill_prior_tuple posting list assert.  (Peter Geoghegan)
An assertion added by commit 0d861bbb checked that _bt_killitems() only processes a BTScanPosItem whose heap TID is contained in a posting list tuple when its page offset number still matches what is on the page (i.e. when it matches the posting list tuple's current offset number). This was only correct in the common case where the page can't have changed since we first read it. It was not correct in cases where we don't drop the buffer pin (and don't need to verify the page hasn't changed using its LSN). The latter category includes scans involving unlogged tables, and scans that use a non-MVCC snapshot, per the logic originally introduced by commit 2ed5b87f. The assertion still seems helpful. Fix it by taking cases where the page may have been concurrently modified into account. Reported-By: Anastasia Lubennikova, Alexander Lakhin Discussion: https://postgr.es/m/c4e38e9a-0f9c-8e53-e639-adf343f94472@postgrespro.ru
2020-03-30  Further simplify nbtree high key truncation.  (Peter Geoghegan)
Commit 7c2dbc69 reorganized _bt_truncate() in a way that enables a further simplification that I (pgeoghegan) missed: Since we mark the tuple that is returned to the caller as a pivot tuple before the point where its heap TID is set as of 7c2dbc69, it is possible to use the high level BTreeTupleGetHeapTID() inline function to get an item pointer. Do it that way now. This approach is clearer and more maintainable.
2020-03-30  Refactor nbtree high key truncation.  (Peter Geoghegan)
Simplify _bt_truncate(), the routine that generates truncated leaf page high keys. Remove a micro-optimization that avoided a second palloc0() call (this was used when a heap TID was needed in the final pivot tuple, though only when the index happened to not be an INCLUDE index). Removing this dubious micro-optimization allows _bt_truncate() to use the index_truncate_tuple() indextuple.c utility routine in all cases. This was already the common case. This commit is a HEAD-only follow up to bugfix commit 4b42a899.
2020-03-30  Consistently truncate non-key suffix columns.  (Peter Geoghegan)
INCLUDE indexes failed to have their non-key attributes physically truncated away in certain rare cases. This led to physically larger pivot tuples that contained useless non-key attribute values. The impact on users should be negligible, but this is still clearly a regression (Postgres 11 supports INCLUDE indexes, and yet was not affected). The bug appeared in commit dd299df8, which introduced "true" suffix truncation of key attributes. Discussion: https://postgr.es/m/CAH2-Wz=E8pkV9ivRSFHtv812H5ckf8s1-yhx61_WrJbKccGcrQ@mail.gmail.com Backpatch: 12-, where "true" suffix truncation was introduced.