path: root/src/backend/access/nbtree/nbtutils.c
Age  Commit message  Author
2023-04-01  Pass down table relation into more index relation functions  (Andres Freund)
This is done in preparation for logical decoding on standby, which needs to include whether visibility-affecting WAL records are about a (user) catalog table; that is only known for the table, not the indexes. It's also nice to be able to pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder(). Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-02-07  Remove useless casts to (void *) in arguments of some system functions  (Peter Eisentraut)
The affected functions are: bsearch, memcmp, memcpy, memset, memmove, qsort, repalloc Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
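For illustration, a minimal before-and-after sketch of the style change (the struct and function here are hypothetical, not from the commit):

#include <string.h>

typedef struct Pair
{
    int         a;
    int         b;
} Pair;

void
copy_pair(Pair *dst, const Pair *src)
{
    /* Before: memcpy((void *) dst, (const void *) src, sizeof(Pair)); */
    /* After: drop the casts; any object pointer converts to void * implicitly. */
    memcpy(dst, src, sizeof(Pair));
}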
2023-01-02  Update copyright for 2023  (Bruce Momjian)
Backpatch-through: 11
2022-12-15  Static assertions cleanup  (Peter Eisentraut)
Because we added StaticAssertStmt() first before StaticAssertDecl(), some uses as well as the instructions in c.h are now a bit backwards from the "native" way static assertions are meant to be used in C. This updates the guidance and moves some static assertions to better places. Specifically, since the addition of StaticAssertDecl(), we can put static assertions at the file level. This moves a number of static assertions out of function bodies, where they might have been stuck out of necessity, to perhaps better places at the file level or in header files. Also, when the static assertion appears in a position where a declaration is allowed, then using StaticAssertDecl() is more native than StaticAssertStmt(). Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
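A minimal sketch of the guidance described above, assuming PostgreSQL's StaticAssertDecl() and StaticAssertStmt() macros from c.h (the struct and the specific assertions are made up for illustration):

#include "postgres.h"

typedef struct MyEntry
{
    uint32      key;
    uint32      value;
} MyEntry;

/*
 * With StaticAssertDecl() available, file-level placement is the "native"
 * way to state an invariant, rather than burying it inside a function body.
 */
StaticAssertDecl(sizeof(MyEntry) == 8, "MyEntry must pack into 8 bytes");

static void
check_entry_assumptions(void)
{
    /* StaticAssertStmt() remains available where only a statement fits. */
    StaticAssertStmt(sizeof(uint32) == 4, "uint32 must be 4 bytes");
}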
2022-04-13  Remove extraneous blank lines before block-closing braces  (Alvaro Herrera)
These are useless and distracting. We wouldn't have written the code with them to begin with, so there's no reason to keep them. Author: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com Discussion: https://postgr.es/m/attachment/133167/0016-Extraneous-blank-lines.patch
2022-04-12  Revert the addition of GetMaxBackends() and related stuff.  (Robert Haas)
This reverts commits 0147fc7, 4567596, aa64f23, and 5ecd018. There is no longer agreement that introducing this function was the right way to address the problem. The consensus now seems to favor trying to make a correct value for MaxBackends available to modules executing their _PG_init() functions. Nathan Bossart Discussion: http://postgr.es/m/20220323045229.i23skfscdbvrsuxa@jrouhaud
2022-04-01  Add macros in hash and btree AMs to get the special area of their pages  (Michael Paquier)
This makes the code more consistent with SpGiST, GiST and GIN, that already use this style, and the idea is to make easier the introduction of more sanity checks for each of these AM-specific macros. BRIN uses a different set of macros to get a page's type and flags, so it has no need for something similar. Author: Matthias van de Meent Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
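A sketch of the resulting access pattern; the macro name BTPageGetOpaque() is an assumption based on the commit subject (the commit text above does not spell it out), and the helper function is made up:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

static bool
page_is_btree_leaf(Buffer buf)
{
    Page        page = BufferGetPage(buf);
    BTPageOpaque opaque;

    /* Previously open-coded as: (BTPageOpaque) PageGetSpecialPointer(page) */
    opaque = BTPageGetOpaque(page);

    return P_ISLEAF(opaque);
}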
2022-02-08  Remove MaxBackends variable in favor of GetMaxBackends() function.  (Robert Haas)
Previously, it was really easy to write code that accessed MaxBackends before we'd actually initialized it, especially when coding up an extension. To make this less error-prone, introduce a new function GetMaxBackends() which should be used to obtain the correct value. This will ERROR if called too early. Demote the global variable to a file-level static, so that nobody can peek at it directly. Nathan Bossart. Idea by Andres Freund. Review by Greg Sabino Mullane, by Michael Paquier (who had doubts about the approach), and by me. Discussion: http://postgr.es/m/20210802224204.bckcikl45uezv5e4@alap3.anarazel.de
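A hedged sketch of the intended usage pattern from an extension's shared-memory sizing code (note that the 2022-04-12 entry above reverts this; the array being sized here is made up):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/shmem.h"

static Size
my_extension_shmem_size(void)
{
    /*
     * GetMaxBackends() returns the initialized value, and would ERROR if
     * called too early (e.g. directly from _PG_init()), rather than silently
     * reading a not-yet-initialized MaxBackends.
     */
    return mul_size(GetMaxBackends(), sizeof(int));
}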
2022-02-03  Add UNIQUE null treatment option  (Peter Eisentraut)
The SQL standard has been ambiguous about whether null values in unique constraints should be considered equal or not. Different implementations have different behaviors. In the SQL:202x draft, this has been formalized by making this implementation-defined and adding an option on unique constraint definitions UNIQUE [ NULLS [NOT] DISTINCT ] to choose a behavior explicitly. This patch adds this option to PostgreSQL. The default behavior remains UNIQUE NULLS DISTINCT. Making this happen in the btree code is pretty easy; most of the patch is just to carry the flag around to all the places that need it. The CREATE UNIQUE INDEX syntax extension is not from the standard, it's my own invention. I named all the internal flags, catalog columns, etc. in the negative ("nulls not distinct") so that the default PostgreSQL behavior is the default if the flag is false. Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78@enterprisedb.com
2022-01-07  Update copyright for 2022  (Bruce Momjian)
Backpatch-through: 10
2021-10-15  Remove obsolete nbtree deduplication comments.  (Peter Geoghegan)
Follow up to commit 2903f140.
2021-10-02  Enable deduplication in system catalog indexes.  (Peter Geoghegan)
The "equality implies image equality" opclass infrastructure disallowed deduplication in system catalog indexes and TOAST indexes before now. That seemed like the right approach back when the infrastructure was added by commit 612a1ab7, since ALTER INDEX cannot set deduplicate_items to 'off' (due to an old implementation restriction). But that decision now seems arbitrary at best. Remove special case handling implementing this policy. No catversion bump, since existing catalog indexes will still work. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=rYQHFaJ3WYBdK=xgwxKzaiGMSSrh-ZCREa-pS-7Zjew@mail.gmail.com
2021-03-11  Add back vacuum_cleanup_index_scale_factor parameter.  (Peter Geoghegan)
Commit 9f3665fb removed the vacuum_cleanup_index_scale_factor storage parameter. However, that creates dump/reload hazards when moving across major versions. Add back the vacuum_cleanup_index_scale_factor parameter (though not the GUC of the same name) purely to avoid problems when using tools like pg_upgrade. The parameter remains disabled and undocumented. No backpatch to Postgres 13, since vacuum_cleanup_index_scale_factor was only disabled by REL_13_STABLE's version of master branch commit 9f3665fb in the first place -- the parameter already looks like this on REL_13_STABLE. Discussion: https://postgr.es/m/YEm/a3Ko3nKnBuVq@paquier.xyz
2021-03-10  Don't consider newly inserted tuples in nbtree VACUUM.  (Peter Geoghegan)
Remove the entire idea of "stale stats" within nbtree VACUUM (stop caring about stats involving the number of inserted tuples). Also remove the vacuum_cleanup_index_scale_factor GUC/param on the master branch (though just disable them on postgres 13). The vacuum_cleanup_index_scale_factor/stats interface made the nbtree AM partially responsible for deciding when pg_class.reltuples stats needed to be updated. This seems contrary to the spirit of the index AM API, though -- it is not actually necessary for an index AM's bulk delete and cleanup callbacks to provide accurate stats when it happens to be inconvenient. The core code owns that. (Index AMs have the authority to perform or not perform certain kinds of deferred cleanup based on their own considerations, such as page deletion and recycling, but that has little to do with pg_class.reltuples/num_index_tuples.) This issue was fairly harmless until the introduction of the autovacuum_vacuum_insert_threshold feature by commit b07642db, which had an undesirable interaction with the vacuum_cleanup_index_scale_factor mechanism: it made insert-driven autovacuums perform full index scans, even though there is no real benefit to doing so. This has been tied to a regression with an append-only insert benchmark [1]. Also have remaining cases that perform a full scan of an index during a cleanup-only nbtree VACUUM indicate that the final tuple count is only an estimate. This prevents vacuumlazy.c from setting the index's pg_class.reltuples in those cases (it will now only update pg_class when vacuumlazy.c had TIDs for nbtree to bulk delete). This arguably fixes an oversight in deduplication-related bugfix commit 48e12913. [1] https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com Backpatch: 13-, where autovacuum_vacuum_insert_threshold was added.
2021-01-02  Update copyright for 2021  (Bruce Momjian)
Backpatch-through: 9.5
2020-11-17  Deprecate nbtree's BTP_HAS_GARBAGE flag.  (Peter Geoghegan)
Streamline handling of the various strategies that we have to avoid a page split in nbtinsert.c. When it looks like a leaf page is about to overflow, we now perform LP_DEAD item deletion and deduplication in one central place. This greatly simplifies _bt_findinsertloc(). This has an independently useful consequence: nbtree no longer relies on the BTP_HAS_GARBAGE page level flag/hint for anything important. We still set and unset the flag in the same way as before, but it's no longer treated as a gating condition when considering if we should check for already-set LP_DEAD bits. This happens at the point where the page looks like it might have to be split anyway, so simply checking the LP_DEAD bits in passing is practically free. This avoids missing LP_DEAD bits just because the page-level hint is unset, which is probably reasonably common (e.g. it happens when VACUUM unsets the page-level flag without actually removing index tuples whose LP_DEAD-bit was set recently, after the VACUUM operation began but before it reached the leaf page in question). Note that this isn't a big behavioral change compared to PostgreSQL 13. We were already checking for set LP_DEAD bits regardless of whether the BTP_HAS_GARBAGE page level flag was set before we considered doing a deduplication pass. This commit only goes slightly further by doing the same check for all indexes, even indexes where deduplication won't be performed. We don't completely remove the BTP_HAS_GARBAGE flag. We still rely on it as a gating condition with pg_upgrade'd indexes from before B-tree version 4/PostgreSQL 12. That makes sense because we sometimes have to make a choice among pages full of duplicates when inserting a tuple with pre version 4 indexes. It probably still pays to avoid accessing the line pointer array of a page there, since it won't yet be clear whether we'll insert onto the page in question at all, let alone split it as a result. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz%3DYpc1PDdk8OVJDChGJBjT06%3DA0Mbv9HyTLCsOknGcUFg%40mail.gmail.com
2020-07-21  Add nbtree Valgrind buffer lock checks.  (Peter Geoghegan)
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page provides very weak guarantees, especially compared to heapam, where it's often safe to read a page while only holding a buffer pin. This commit has Valgrind enforce the following rule: it is never okay to access an nbtree buffer without holding both a pin and a lock on the buffer. A draft version of this patch detected questionable code that was cleaned up by commits fa7ff642 and 7154aa16. The code in question used to access an nbtree buffer page's special/opaque area with no buffer lock (only a buffer pin). This practice (which isn't obviously unsafe) is hereby formally disallowed in nbtree. There doesn't seem to be any reason to allow it, and banning it keeps things simple for Valgrind. The new checks are implemented by adding custom nbtree client requests (located in LockBuffer() wrapper functions); these requests are "superimposed" on top of the generic bufmgr.c Valgrind client requests added by commit 1e0dfd16. No custom resource management cleanup code is needed to undo the effects of marking buffers as non-accessible under this scheme. Author: Peter Geoghegan Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
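A simplified sketch of the wrapper scheme described above, assuming the Valgrind client-request macros from utils/memdebug.h; the wrapper names here are illustrative, not the actual nbtree functions:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/memdebug.h"

static void
nbt_lockbuf_sketch(Buffer buf, int access)
{
    LockBuffer(buf, access);

    /* With both pin and lock held, the page may legally be accessed. */
    VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
}

static void
nbt_unlockbuf_sketch(Buffer buf)
{
    /*
     * Mark the page inaccessible again before releasing the lock, so that
     * Valgrind flags any later access made while holding only a pin.
     */
    VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}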
2020-05-16  Final pgindent run with pg_bsd_indent version 2.1.  (Tom Lane)
This is just to provide a clean basis for comparison of the results of the new version. I did fix a typo that crept into 242dfcbaf. Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
2020-05-15  Avoid killing btree items that are already dead  (Alvaro Herrera)
_bt_killitems marks btree items dead when a scan leaves the page where they live, but it does so with only share lock (to improve concurrency). This was historically okay, since killing a dead item has no consequences. However, with the advent of data checksums and wal_log_hints, this action incurs a WAL full-page-image record of the page. Multiple concurrent processes would write the same page several times, leading to WAL bloat. The probability of this happening can be reduced by only killing items if they're not already dead, so change the code to do that. The problem could be eliminated completely by having _bt_killitems upgrade to exclusive lock upon seeing a killable item, but that would reduce concurrency so it's considered a cure worse than the disease. Backpatch all the way back to 9.5, since wal_log_hints was introduced in 9.4. Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
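A minimal sketch of the changed behavior, assuming the standard item pointer macros from storage/itemid.h; the helper function is hypothetical, the real logic lives inside _bt_killitems():

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

static bool
kill_item_if_needed(Page page, OffsetNumber offnum)
{
    ItemId      iid = PageGetItemId(page, offnum);

    /* Skip items some other backend already marked dead; avoid re-dirtying. */
    if (ItemIdIsDead(iid))
        return false;

    ItemIdMarkDead(iid);
    return true;                /* caller dirties the page / sets the hint */
}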
2020-04-29  Remove redundant _bt_killitems() buffer check.  (Peter Geoghegan)
_bt_getbuf() cannot return an invalid buffer. Oversight in commit 2ed5b87f96d.
2020-04-13  Harmonize nbtree page split point code.  (Peter Geoghegan)
An nbtree split point can be thought of as a point between two adjoining tuples from an imaginary version of the page being split that includes the incoming/new item (in addition to the items that really are on the page). These adjoining tuples are called the lastleft and firstright tuples. The variables that represent split points contained a field called firstright, which is an offset number of the first data item from the original page that goes on the new right page. The corresponding tuple from origpage was usually the same thing as the actual firstright tuple, but not always: the firstright tuple is sometimes the new/incoming item instead. This situation seems unnecessarily confusing. Make things clearer by renaming the origpage offset returned by _bt_findsplitloc() to "firstrightoff". We now have a firstright tuple and a firstrightoff offset number which are comparable to the newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers respectively. Also make sure that we are consistent about how we describe nbtree page split point state. Push the responsibility for dealing with pg_upgrade'd !heapkeyspace indexes down to lower level code, relieving _bt_split() from dealing with it directly. This means that we always have a palloc'd left page high key on the leaf level, no matter what. This enables simplifying some of the code (and code comments) within _bt_split(). Finally, restructure the page split code to make it clearer why suffix truncation (which only takes place during leaf page splits) is completely different to the first data item truncation that takes place during internal page splits. Tuples are marked as having fewer attributes stored in both cases, and the firstright tuple is truncated in both cases, so it's easy to imagine somebody missing the distinction.
2020-04-07  Remove nbtree BTreeTupleSetAltHeapTID() function.  (Peter Geoghegan)
Since heap TID is supposed to be just another key attribute to the implementation, it doesn't make much sense to have separate BTreeTupleSetNAtts() and BTreeTupleSetAltHeapTID() functions. Merge the two functions together. This slightly simplifies _bt_truncate().
2020-04-06  Fix nbtree kill_prior_tuple posting list assert.  (Peter Geoghegan)
An assertion added by commit 0d861bbb checked that _bt_killitems() only processes a BTScanPosItem whose heap TID is contained in a posting list tuple when its page offset number still matches what is on the page (i.e. when it matches the posting list tuple's current offset number). This was only correct in the common case where the page can't have changed since we first read it. It was not correct in cases where we don't drop the buffer pin (and don't need to verify the page hasn't changed using its LSN). The latter category includes scans involving unlogged tables, and scans that use a non-MVCC snapshot, per the logic originally introduced by commit 2ed5b87f. The assertion still seems helpful. Fix it by taking cases where the page may have been concurrently modified into account. Reported-By: Anastasia Lubennikova, Alexander Lakhin Discussion: https://postgr.es/m/c4e38e9a-0f9c-8e53-e639-adf343f94472@postgrespro.ru
2020-03-30  Further simplify nbtree high key truncation.  (Peter Geoghegan)
Commit 7c2dbc69 reorganized _bt_truncate() in a way that enables a further simplification that I (pgeoghegan) missed: Since we mark the tuple that is returned to the caller as a pivot tuple before the point where its heap TID is set as of 7c2dbc69, it is possible to use the high level BTreeTupleGetHeapTID() inline function to get an item pointer. Do it that way now. This approach is clearer and more maintainable.
2020-03-30  Refactor nbtree high key truncation.  (Peter Geoghegan)
Simplify _bt_truncate(), the routine that generates truncated leaf page high keys. Remove a micro-optimization that avoided a second palloc0() call (this was used when a heap TID was needed in the final pivot tuple, though only when the index happened to not be an INCLUDE index). Removing this dubious micro-optimization allows _bt_truncate() to use the index_truncate_tuple() indextuple.c utility routine in all cases. This was already the common case. This commit is a HEAD-only follow up to bugfix commit 4b42a899.
2020-03-30  Consistently truncate non-key suffix columns.  (Peter Geoghegan)
INCLUDE indexes failed to have their non-key attributes physically truncated away in certain rare cases. This led to physically larger pivot tuples that contained useless non-key attribute values. The impact on users should be negligible, but this is still clearly a regression (Postgres 11 supports INCLUDE indexes, and yet was not affected). The bug appeared in commit dd299df8, which introduced "true" suffix truncation of key attributes. Discussion: https://postgr.es/m/CAH2-Wz=E8pkV9ivRSFHtv812H5ckf8s1-yhx61_WrJbKccGcrQ@mail.gmail.com Backpatch: 12-, where "true" suffix truncation was introduced.
2020-02-26  Add deduplication to nbtree.  (Peter Geoghegan)
Deduplication reduces the storage overhead of duplicates in indexes that use the standard nbtree index access method. The deduplication process is applied lazily, after the point where opportunistic deletion of LP_DEAD-marked index tuples occurs. Deduplication is only applied at the point where a leaf page split would otherwise be required. New posting list tuples are formed by merging together existing duplicate tuples. The physical representation of the items on an nbtree leaf page is made more space efficient by deduplication, but the logical contents of the page are not changed. Even unique indexes make use of deduplication as a way of controlling bloat from duplicates whose TIDs point to different versions of the same logical table row. The lazy approach taken by nbtree has significant advantages over a GIN style eager approach. Most individual inserts of index tuples have exactly the same overhead as before. The extra overhead of deduplication is amortized across insertions, just like the overhead of page splits. The key space of indexes works in the same way as it has since commit dd299df8 (the commit that made heap TID a tiebreaker column). Testing has shown that nbtree deduplication can generally make indexes with about 10 or 15 tuples for each distinct key value about 2.5X - 4X smaller, even with single column integer indexes (e.g., an index on a referencing column that accompanies a foreign key). The final size of single column nbtree indexes comes close to the final size of a similar contrib/btree_gin index, at least in cases where GIN's posting list compression isn't very effective. This can significantly improve transaction throughput, and significantly reduce the cost of vacuuming indexes. A new index storage parameter (deduplicate_items) controls the use of deduplication. The default setting is 'on', so all new B-Tree indexes automatically use deduplication where possible. This decision will be reviewed at the end of the Postgres 13 beta period. There is a regression of approximately 2% of transaction throughput with synthetic workloads that consist of append-only inserts into a table with several non-unique indexes, where all indexes have few or no repeated values. The underlying issue is that cycles are wasted on unsuccessful attempts at deduplicating items in non-unique indexes. There doesn't seem to be a way around it short of disabling deduplication entirely. Note that deduplication of items in unique indexes is fairly well targeted in general, which avoids the problem there (we can use a special heuristic to trigger deduplication passes in unique indexes, since we're specifically targeting "version bloat"). Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed. No bump in BTREE_VERSION, since the representation of posting list tuples works in a way that's backwards compatible with version 4 indexes (i.e. indexes built on PostgreSQL 12). However, users must still REINDEX a pg_upgrade'd index to use deduplication, regardless of the Postgres version they've upgraded from. This is the only way to set the new nbtree metapage flag indicating that deduplication is generally safe. Author: Anastasia Lubennikova, Peter Geoghegan Reviewed-By: Peter Geoghegan, Heikki Linnakangas Discussion: https://postgr.es/m/55E4051B.7020209@postgrespro.ru https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26  Add equalimage B-Tree support functions.  (Peter Geoghegan)
Invent the concept of a B-Tree equalimage ("equality implies image equality") support function, registered as support function 4. This indicates whether it is safe (or not safe) to apply optimizations that assume that any two datums considered equal by an operator class's order method must be interchangeable without any loss of semantic information. This is static information about an operator class and a collation. Register an equalimage routine for almost all of the existing B-Tree opclasses. We only need two trivial routines for all of the opclasses that are included with the core distribution. There is one routine for opclasses that index non-collatable types (which returns 'true' unconditionally), plus another routine for collatable types (which returns 'true' when the collation is a deterministic collation). This patch is infrastructure for an upcoming patch that adds B-Tree deduplication. Author: Peter Geoghegan, Anastasia Lubennikova Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
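A simplified sketch of the two trivial routines the commit describes; the function names carry a _sketch suffix to make clear they are illustrations rather than the actual registered support functions, and get_collation_isdeterministic() from utils/lsyscache.h is assumed:

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/* Non-collatable types: equality always implies image equality. */
Datum
btequalimage_sketch(PG_FUNCTION_ARGS)
{
    PG_RETURN_BOOL(true);
}

/* Collatable types: only safe when the collation is deterministic. */
Datum
btvarstrequalimage_sketch(PG_FUNCTION_ARGS)
{
    Oid         collid = PG_GET_COLLATION();

    PG_RETURN_BOOL(get_collation_isdeterministic(collid));
}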
2020-01-01  Update copyrights for 2020  (Bruce Momjian)
Backpatch-through: update all files in master, backpatch legal files through 9.4
2019-11-25  Refactor reloption handling for index AMs in-core  (Michael Paquier)
This reworks the reloption parsing and build of a couple of index AMs by creating new structures for each index AM's options. This split was already done for BRIN, GIN and GiST (which actually has a fillfactor parameter), but not for hash, B-tree and SPGiST which relied on StdRdOptions due to an overlap with the default option set. This saves a couple of bytes for rd_options in each relcache entry with indexes making use of relation options, and brings more consistency between all index AMs. While on it, add a couple of AssertMacro() calls to make sure that utility macros to grab values of reloptions are used with the expected index AM. Author: Nikolay Shaplov Reviewed-by: Amit Langote, Michael Paquier, Álvaro Herrera, Dent John Discussion: https://postgr.es/m/4127670.gFlpRb6XCm@x200m
2019-11-12  Make _bt_keep_natts_fast() use datum_image_eq().  (Peter Geoghegan)
An upcoming patch that adds deduplication to the nbtree AM will rely on _bt_keep_natts_fast() understanding that differences in TOAST input state can never affect its answer. In particular, two opclass-equal datums (with opclasses deemed safe for deduplication) should never be treated as unequal by _bt_keep_natts_fast() due to TOAST input differences. This also seems like a good idea on general principle. nbtsplitloc.c will now occasionally make better decisions about where to split a leaf page. The behavior of _bt_keep_natts_fast() is now somewhat closer to the behavior of _bt_keep_natts(). Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
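A small sketch of the binary "image" comparison that _bt_keep_natts_fast() now relies on, assuming datum_image_eq() as declared in utils/datum.h; the wrapper function itself is hypothetical:

#include "postgres.h"
#include "utils/datum.h"

static bool
attrs_image_equal(Datum d1, Datum d2, bool typbyval, int typlen)
{
    /*
     * True only when the two datums are bitwise identical (varlena values
     * are detoasted first), so TOAST input state cannot change the answer.
     */
    return datum_image_eq(d1, d2, typbyval, typlen);
}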
2019-11-07  Add reusable routine for making arrays unique.  (Thomas Munro)
Introduce qunique() and qunique_arg(), which can be used after qsort() and qsort_arg() respectively to remove duplicate values. Use it where appropriate. Author: Thomas Munro Reviewed-by: Tom Lane (in an earlier version) Discussion: https://postgr.es/m/CAEepm%3D2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA%40mail.gmail.com
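A usage sketch, assuming qunique() from lib/qunique.h as the commit describes; the OID array and comparator are made up:

#include "postgres.h"
#include "lib/qunique.h"

static int
cmp_oid(const void *a, const void *b)
{
    Oid         oa = *(const Oid *) a;
    Oid         ob = *(const Oid *) b;

    if (oa < ob)
        return -1;
    if (oa > ob)
        return 1;
    return 0;
}

static size_t
sort_and_unique_oids(Oid *oids, size_t n)
{
    /* qunique() requires sorted input; it returns the deduplicated count. */
    qsort(oids, n, sizeof(Oid), cmp_oid);
    return qunique(oids, n, sizeof(Oid), cmp_oid);
}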
2019-09-12  Remove redundant _bt_truncate() comment paragraph.  (Peter Geoghegan)
2019-08-27  Remove obsolete nbtree page deletion comment.  (Peter Geoghegan)
Commit efada2b8e92, which made the nbtree page deletion algorithm more robust, removed the concept of a half-dead internal page. Remove a comment about half dead parent pages that was overlooked.
2019-08-13  Fix inconsistencies and typos in the tree, take 10  (Michael Paquier)
This addresses some issues with unnecessary code comments, fixes various typos in docs and comments, and removes some orphaned structures and definitions. Author: Alexander Lakhin Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-05-26  Fix typos.  (Amit Kapila)
Reported-by: Alexander Lakhin Author: Alexander Lakhin Reviewed-by: Amit Kapila and Tom Lane Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
2019-05-22  Phase 2 pgindent run for v12.  (Tom Lane)
Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is. Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22  Initial pgindent run for v12.  (Tom Lane)
This is still using the 2.0 version of pg_bsd_indent. I thought it would be good to commit this separately, so as to document the differences between 2.0 and 2.1 behavior. Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-04-23  Prevent O(N^2) unique index insertion edge case.  (Peter Geoghegan)
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column, establishing the principle that there is only one correct location (page and page offset number) for every index tuple, no matter what. Insertions of tuples into non-unique indexes proceed as if heap TID (scan key's scantid) is just another user-attribute value, but insertions into unique indexes are more delicate. The TID value in scantid must initially be omitted to ensure that the unique index insertion visits every leaf page that duplicates could be on. The scantid is set once again after unique checking finishes successfully, which can force _bt_findinsertloc() to step right one or more times, to locate the leaf page that the new tuple must be inserted on. Stepping right within _bt_findinsertloc() was assumed to occur no more frequently than stepping right within _bt_check_unique(), but there was one important case where that assumption was incorrect: inserting a "duplicate" with NULL values. Since _bt_check_unique() didn't do any real work in this case, it wasn't appropriate for _bt_findinsertloc() to behave as if it was finishing off a conventional unique insertion, where any existing physical duplicate must be dead or recently dead. _bt_findinsertloc() might have to grovel through a substantial portion of all of the leaf pages in the index to insert a single tuple, even when there were no dead tuples. To fix, treat insertions of tuples with NULLs into a unique index as if they were insertions into a non-unique index: never unset scantid before calling _bt_search() to descend the tree, and bypass _bt_check_unique() entirely. _bt_check_unique() is no longer responsible for incoming tuples with NULL values. Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-02  Report progress of CREATE INDEX operations  (Alvaro Herrera)
This uses the progress reporting infrastructure added by c16dc1aca5e0, adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY. There are two pieces to this: one is index-AM-agnostic, and the other is AM-specific. The latter is fairly elaborate for btrees, including reportage for parallel index builds and the separate phases that btree index creation uses; other index AMs, which are much simpler in their building procedures, have simplistic reporting only, but that seems sufficient, at least for non-concurrent builds. The index-AM-agnostic part is fairly complete, providing insight into the CONCURRENTLY wait phases as well as block-based progress during the index validation table scan. (The index validation index scan requires patching each AM, which has not been included here.) Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
2019-03-31  Fix nbtree high key "continuescan" row compare bug.  (Peter Geoghegan)
Commit 29b64d1d mishandled skipping over truncated high key attributes during row comparisons. The row comparison key matching loop would loop forever when a truncated attribute was encountered for a row compare subkey. Fix by following the example of other code in the loop: advance the current subkey, or break out of the loop when the last subkey is reached. Add test coverage for the relevant _bt_check_rowcompare() code path. The new test case is somewhat tied to nbtree implementation details, which isn't ideal, but seems unavoidable.
2019-03-23  Add nbtree high key "continuescan" optimization.  (Peter Geoghegan)
Teach nbtree forward index scans to check the high key before moving to the right sibling page in the hope of finding that it isn't actually necessary to do so. The new check may indicate that the scan definitely cannot find matching tuples to the right, ending the scan immediately. We already opportunistically force a similar "continuescan orientated" key check of the final non-pivot tuple when it's clear that it cannot be returned to the scan due to being dead-to-all. The new high key check is complementary. The new approach for forward scans is more effective than checking the final non-pivot tuple, especially with composite indexes and non-unique indexes. The improvements to the logic for picking a split point added by commit fab25024 make it likely that relatively dissimilar high keys will appear on a page. A distinguishing key value that can only appear on non-pivot tuples on the right sibling page will often be present in leaf page high keys. Since forcing the final item to be key checked no longer makes any difference in the case of forward scans, the existing extra key check is now only used for backwards scans. Backward scans continue to opportunistically check the final non-pivot tuple, which is actually the first non-pivot tuple on the page (not the last). Note that even pg_upgrade'd v3 indexes make use of this optimization. Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
2019-03-20  Consider secondary factors during nbtree splits.  (Peter Geoghegan)
Teach nbtree to give some consideration to how "distinguishing" candidate leaf page split points are. This should not noticeably affect the balance of free space within each half of the split, while still making suffix truncation truncate away significantly more attributes on average. The logic for choosing a leaf split point now uses a fallback mode in the case where the page is full of duplicates and it isn't possible to find even a minimally distinguishing split point. When the page is full of duplicates, the split should pack the left half very tightly, while leaving the right half mostly empty. Our assumption is that logical duplicates will almost always be inserted in ascending heap TID order with v4 indexes. This strategy leaves most of the free space on the half of the split that will likely be where future logical duplicates of the same value need to be placed. The number of cycles added is not very noticeable. This is important because deciding on a split point takes place while at least one exclusive buffer lock is held. We avoid using authoritative insertion scankey comparisons to save cycles, unlike suffix truncation proper. We use a faster binary comparison instead. Note that even pg_upgrade'd v3 indexes make use of these optimizations. Benchmarking has shown that even v3 indexes benefit, despite the fact that suffix truncation will only truncate non-key attributes in INCLUDE indexes. Grouping relatively similar tuples together is beneficial in and of itself, since it reduces the number of leaf pages that must be accessed by subsequent index scans. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
2019-03-20  Make heap TID a tiebreaker nbtree index column.  (Peter Geoghegan)
Make nbtree treat all index tuples as having a heap TID attribute. Index searches can distinguish duplicates by heap TID, since heap TID is always guaranteed to be unique. This general approach has numerous benefits for performance, and is prerequisite to teaching VACUUM to perform "retail index tuple deletion". Naively adding a new attribute to every pivot tuple has unacceptable overhead (it bloats internal pages), so suffix truncation of pivot tuples is added. This will usually truncate away the "extra" heap TID attribute from pivot tuples during a leaf page split, and may also truncate away additional user attributes. This can increase fan-out, especially in a multi-column index. Truncation can only occur at the attribute granularity, which isn't particularly effective, but works well enough for now. A future patch may add support for truncating "within" text attributes by generating truncated key values using new opclass infrastructure. Only new indexes (BTREE_VERSION 4 indexes) will have insertions that treat heap TID as a tiebreaker attribute, or will have pivot tuples undergo suffix truncation during a leaf page split (on-disk compatibility with versions 2 and 3 is preserved). Upgrades to version 4 cannot be performed on-the-fly, unlike upgrades from version 2 to version 3. contrib/amcheck continues to work with version 2 and 3 indexes, while also enforcing stricter invariants when verifying version 4 indexes. These stricter invariants are the same invariants described by "3.1.12 Sequencing" from the Lehman and Yao paper. A later patch will enhance the logic used by nbtree to pick a split point. This patch is likely to negatively impact performance without smarter choices around the precise point to split leaf pages at. Making these two mostly-distinct sets of enhancements into distinct commits seems like it might clarify their design, even though neither commit is particularly useful on its own. The maximum allowed size of new tuples is reduced by an amount equal to the space required to store an extra MAXALIGN()'d TID in a new high key during leaf page splits. The user-facing definition of the "1/3 of a page" restriction is already imprecise, and so does not need to be revised. However, there should be a compatibility note in the v12 release notes. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas, Alexander Korotkov Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20  Refactor nbtree insertion scankeys.  (Peter Geoghegan)
Use dedicated struct to represent nbtree insertion scan keys. Having a dedicated struct makes the difference between search type scankeys and insertion scankeys a lot clearer, and simplifies the signature of several related functions. This is based on a suggestion by Andrey Lepikhov. Streamline how unique index insertions cache binary search progress. Cache the state of in-progress binary searches within _bt_check_unique() for later instead of having callers avoid repeating the binary search in an ad-hoc manner. This makes it easy to add a new optimization: _bt_check_unique() now falls out of its loop immediately in the common case where it's already clear that there couldn't possibly be a duplicate. The new _bt_check_unique() scheme makes it a lot easier to manage cached binary search effort afterwards, from within _bt_findinsertloc(). This is needed for the upcoming patch to make nbtree tuples unique by treating heap TID as a final tiebreaker column. Unique key binary searches need to restore lower and upper bounds. They cannot simply continue to use the >= lower bound as the offset to insert at, because the heap TID tiebreaker column must be used in comparisons for the restored binary search (unlike the original _bt_check_unique() binary search, where scankey's heap TID column must be omitted). Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas, Andrey Lepikhov Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-01-02  Update copyright for 2019  (Bruce Momjian)
Backpatch-through: certain files through 9.4
2018-10-05  Allow btree comparison functions to return INT_MIN.  (Tom Lane)
Historically we forbade datatype-specific comparison functions from returning INT_MIN, so that it would be safe to invert the sort order just by negating the comparison result. However, this was never really safe for comparison functions that directly return the result of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction on those library functions. Buildfarm results show that at least on recent Linux on s390x, memcmp() actually does return INT_MIN sometimes, causing sort failures. The agreed-on answer is to remove this restriction and fix relevant call sites to not make such an assumption; code such as "res = -res" should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed in a few places that just directly negated the result of memcmp or strcmp. To help find places having this problem, I've also added a compile option to nbtcompare.c that causes some of the commonly used comparators to return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be a good idea to have at least one buildfarm member running with "-DSTRESS_SORT_INT_MIN". That's far from a complete test of course, but it should help to prevent fresh introductions of such bugs. This is a longstanding portability hazard, so back-patch to all supported branches. Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
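A sketch of the call-site pattern described above, using the INVERT_COMPARE_RESULT() macro the commit introduces (the comparator itself is made up):

#include "postgres.h"

static int
cmp_cstrings_desc(const char *a, const char *b)
{
    int         res = strcmp(a, b);

    /* Not "return -res": that misbehaves if strcmp() ever returns INT_MIN. */
    INVERT_COMPARE_RESULT(res);
    return res;
}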
2018-06-30  pgindent run prior to branching  (Andrew Dunstan)
2018-04-26  Post-feature-freeze pgindent run.  (Tom Lane)
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-19  Adjust INCLUDE index truncation comments and code.  (Teodor Sigaev)
Add several assertions that ensure that we're dealing with a pivot tuple without non-key attributes where that's expected. Also, remove the assertion within _bt_isequal(), restoring the v10 function signature. A similar check will be performed for the page highkey within _bt_moveright() in most cases. Also avoid dropping all objects within regression tests, to increase pg_dump test coverage for INCLUDE indexes. Rather than using infrastructure that's generally intended to be used with reference counted heap tuple descriptors during truncation, use the same function that was introduced to store flat TupleDescs in shared memory (we use a temp palloc'd buffer). This isn't strictly necessary, but seems more future-proof than the old approach. It also lets us avoid including rel.h within indextuple.c, which was arguably a modularity violation. Also, we now call index_deform_tuple() with the truncated TupleDesc, not the source TupleDesc, since that's more robust, and saves a few cycles. In passing, fix a memory leak by pfree'ing truncated pivot tuple memory during CREATE INDEX. Also pfree during a page split, just to be consistent. Refactor _bt_check_natts() to be more readable. Author: Peter Geoghegan with some editorization by me Reviewed by: Alexander Korotkov, Teodor Sigaev Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com