diff options
author | Peter Geoghegan <pg@bowt.ie> | 2025-04-04 12:27:04 -0400 |
---|---|---|
committer | Peter Geoghegan <pg@bowt.ie> | 2025-04-04 12:27:04 -0400 |
commit | 92fe23d93aa3bbbc40fca669cabc4a4d7975e327 (patch) | |
tree | e79d024c849f0a0b89378ff8c16b6d6b2d0cc777 /src/backend/access/nbtree/nbtree.c | |
parent | 3ba2cdaa45416196fadc7163998cda7b4e07e7d7 (diff) |
Add nbtree skip scan optimization.
Teach nbtree multi-column index scans to opportunistically skip over
irrelevant sections of the index given a query with no "=" conditions on
one or more prefix index columns. When nbtree is passed input scan keys
derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
That is, preprocessing generates a "skip array" (and an output scan key)
for the omitted prefix column "a", which makes it safe to mark the scan
key on "b" as required to continue the scan. The scan is therefore able
to repeatedly reposition itself by applying both the "a" and "b" keys.
A skip array has "elements" that are generated procedurally and on
demand, but otherwise works just like a regular ScalarArrayOp array.
Preprocessing can freely add a skip array before or after any input
ScalarArrayOp arrays. Index scans with a skip array decide when and
where to reposition the scan using the same approach as any other scan
with array keys. This design builds on the design for array advancement
and primitive scan scheduling added to Postgres 17 by commit 5bf748b8.
Testing has shown that skip scans of an index with a low cardinality
skipped prefix column can be multiple orders of magnitude faster than an
equivalent full index scan (or sequential scan). In general, the
cardinality of the scan's skipped column(s) limits the number of leaf
pages that can be skipped over.
The core B-Tree operator classes on most discrete types generate their
array elements with the help of their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the next
possible indexable value frequently turns out to be the next value
stored in the index. Opclasses that lack a skip support routine fall
back on having nbtree "increment" (or "decrement") a skip array's
current element by setting the NEXT (or PRIOR) scan key flag, without
directly changing the scan key's sk_argument. These sentinel values
behave just like any other value from an array -- though they can never
locate equal index tuples (they can only locate the next group of index
tuples containing the next set of non-sentinel values that the scan's
arrays need to advance to).
A skip array's range is constrained by "contradictory" inequality keys.
For example, a skip array on "x" will only generate the values 1 and 2
given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip
array qual usually has near-identical performance characteristics to a
comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However,
improved performance isn't guaranteed. Much depends on physical index
characteristics.
B-Tree preprocessing is optimistic about skipping working out: it
applies static, generic rules when determining where to generate skip
arrays, which assumes that the runtime overhead of maintaining skip
arrays will pay for itself -- or lead to only a modest performance loss.
As things stand, these assumptions are much too optimistic: skip array
maintenance will lead to unacceptable regressions with unsympathetic
queries (queries whose scan can't skip over many irrelevant leaf pages).
An upcoming commit will address the problems in this area by enhancing
_bt_readpage's approach to saving cycles on scan key evaluation, making
it work in a way that directly considers the needs of = array keys
(particularly = skip array keys).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
Diffstat (limited to 'src/backend/access/nbtree/nbtree.c')
-rw-r--r-- | src/backend/access/nbtree/nbtree.c | 196 |
1 files changed, 179 insertions, 17 deletions
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index a99667eb2bd..77781cb9002 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -31,6 +31,7 @@ #include "storage/ipc.h" #include "storage/lmgr.h" #include "storage/read_stream.h" +#include "utils/datum.h" #include "utils/fmgrprotos.h" #include "utils/index_selfuncs.h" #include "utils/memutils.h" @@ -76,14 +77,26 @@ typedef struct BTParallelScanDescData /* * btps_arrElems is used when scans need to schedule another primitive - * index scan. Holds BTArrayKeyInfo.cur_elem offsets for scan keys. + * index scan with one or more SAOP arrays. Holds BTArrayKeyInfo.cur_elem + * offsets for each = scan key associated with a ScalarArrayOp array. */ int btps_arrElems[FLEXIBLE_ARRAY_MEMBER]; + + /* + * Additional space (at the end of the struct) is used when scans need to + * schedule another primitive index scan with one or more skip arrays. + * Holds a flattened datum representation for each = scan key associated + * with a skip array. + */ } BTParallelScanDescData; typedef struct BTParallelScanDescData *BTParallelScanDesc; +static void _bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btscan, + BTScanOpaque so); +static void _bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan, + BTScanOpaque so); static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback, void *callback_state, BTCycleId cycleid); @@ -541,10 +554,167 @@ btrestrpos(IndexScanDesc scan) * btestimateparallelscan -- estimate storage for BTParallelScanDescData */ Size -btestimateparallelscan(int nkeys, int norderbys) +btestimateparallelscan(Relation rel, int nkeys, int norderbys) +{ + int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + Size estnbtreeshared, + genericattrspace; + + /* + * Pessimistically assume that every input scan key will be output with + * its own SAOP array + */ + estnbtreeshared = offsetof(BTParallelScanDescData, btps_arrElems) + + sizeof(int) * nkeys; + + /* Single column indexes cannot possibly use a skip array */ + if (nkeyatts == 1) + return estnbtreeshared; + + /* + * Pessimistically assume that all attributes prior to the least + * significant attribute require a skip array (and an associated key) + */ + genericattrspace = datumEstimateSpace((Datum) 0, false, true, + sizeof(Datum)); + for (int attnum = 1; attnum < nkeyatts; attnum++) + { + CompactAttribute *attr; + + /* + * We make the conservative assumption that every index column will + * also require a skip array. + * + * Every skip array must have space to store its scan key's sk_flags. + */ + estnbtreeshared = add_size(estnbtreeshared, sizeof(int)); + + /* Consider space required to store a datum of opclass input type */ + attr = TupleDescCompactAttr(rel->rd_att, attnum - 1); + if (attr->attbyval) + { + /* This index attribute stores pass-by-value datums */ + Size estfixed = datumEstimateSpace((Datum) 0, false, + true, attr->attlen); + + estnbtreeshared = add_size(estnbtreeshared, estfixed); + continue; + } + + /* + * This index attribute stores pass-by-reference datums. + * + * Assume that serializing this array will use just as much space as a + * pass-by-value datum, in addition to space for the largest possible + * whole index tuple (this is not just a per-datum portion of the + * largest possible tuple because that'd be almost as large anyway). + * + * This is quite conservative, but it's not clear how we could do much + * better. The executor requires an up-front storage request size + * that reliably covers the scan's high watermark memory usage. We + * can't be sure of the real high watermark until the scan is over. + */ + estnbtreeshared = add_size(estnbtreeshared, genericattrspace); + estnbtreeshared = add_size(estnbtreeshared, BTMaxItemSize); + } + + return estnbtreeshared; +} + +/* + * _bt_parallel_serialize_arrays() -- Serialize parallel array state. + * + * Caller must have exclusively locked btscan->btps_lock when called. + */ +static void +_bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btscan, + BTScanOpaque so) +{ + char *datumshared; + + /* Space for serialized datums begins immediately after btps_arrElems[] */ + datumshared = ((char *) &btscan->btps_arrElems[so->numArrayKeys]); + for (int i = 0; i < so->numArrayKeys; i++) + { + BTArrayKeyInfo *array = &so->arrayKeys[i]; + ScanKey skey = &so->keyData[array->scan_key]; + + if (array->num_elems != -1) + { + /* Save SAOP array's cur_elem (no need to copy key/datum) */ + Assert(!(skey->sk_flags & SK_BT_SKIP)); + btscan->btps_arrElems[i] = array->cur_elem; + continue; + } + + /* Save all mutable state associated with skip array's key */ + Assert(skey->sk_flags & SK_BT_SKIP); + memcpy(datumshared, &skey->sk_flags, sizeof(int)); + datumshared += sizeof(int); + + if (skey->sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)) + { + /* No sk_argument datum to serialize */ + Assert(skey->sk_argument == 0); + continue; + } + + datumSerialize(skey->sk_argument, (skey->sk_flags & SK_ISNULL) != 0, + array->attbyval, array->attlen, &datumshared); + } +} + +/* + * _bt_parallel_restore_arrays() -- Restore serialized parallel array state. + * + * Caller must have exclusively locked btscan->btps_lock when called. + */ +static void +_bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan, + BTScanOpaque so) { - /* Pessimistically assume all input scankeys will be output with arrays */ - return offsetof(BTParallelScanDescData, btps_arrElems) + sizeof(int) * nkeys; + char *datumshared; + + /* Space for serialized datums begins immediately after btps_arrElems[] */ + datumshared = ((char *) &btscan->btps_arrElems[so->numArrayKeys]); + for (int i = 0; i < so->numArrayKeys; i++) + { + BTArrayKeyInfo *array = &so->arrayKeys[i]; + ScanKey skey = &so->keyData[array->scan_key]; + bool isnull; + + if (array->num_elems != -1) + { + /* Restore SAOP array using its saved cur_elem */ + Assert(!(skey->sk_flags & SK_BT_SKIP)); + array->cur_elem = btscan->btps_arrElems[i]; + skey->sk_argument = array->elem_values[array->cur_elem]; + continue; + } + + /* Restore skip array by restoring its key directly */ + if (!array->attbyval && skey->sk_argument) + pfree(DatumGetPointer(skey->sk_argument)); + skey->sk_argument = (Datum) 0; + memcpy(&skey->sk_flags, datumshared, sizeof(int)); + datumshared += sizeof(int); + + Assert(skey->sk_flags & SK_BT_SKIP); + + if (skey->sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)) + { + /* No sk_argument datum to restore */ + continue; + } + + skey->sk_argument = datumRestore(&datumshared, &isnull); + if (isnull) + { + Assert(skey->sk_argument == 0); + Assert(skey->sk_flags & SK_SEARCHNULL); + Assert(skey->sk_flags & SK_ISNULL); + } + } } /* @@ -613,6 +783,7 @@ bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page, BlockNumber *last_curr_page, bool first) { + Relation rel = scan->indexRelation; BTScanOpaque so = (BTScanOpaque) scan->opaque; bool exit_loop = false, status = true, @@ -679,14 +850,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page, { /* Can start scheduled primitive scan right away, so do so */ btscan->btps_pageStatus = BTPARALLEL_ADVANCING; - for (int i = 0; i < so->numArrayKeys; i++) - { - BTArrayKeyInfo *array = &so->arrayKeys[i]; - ScanKey skey = &so->keyData[array->scan_key]; - array->cur_elem = btscan->btps_arrElems[i]; - skey->sk_argument = array->elem_values[array->cur_elem]; - } + /* Restore scan's array keys from serialized values */ + _bt_parallel_restore_arrays(rel, btscan, so); exit_loop = true; } else @@ -831,6 +997,7 @@ _bt_parallel_done(IndexScanDesc scan) void _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page) { + Relation rel = scan->indexRelation; BTScanOpaque so = (BTScanOpaque) scan->opaque; ParallelIndexScanDesc parallel_scan = scan->parallel_scan; BTParallelScanDesc btscan; @@ -849,12 +1016,7 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page) btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN; /* Serialize scan's current array keys */ - for (int i = 0; i < so->numArrayKeys; i++) - { - BTArrayKeyInfo *array = &so->arrayKeys[i]; - - btscan->btps_arrElems[i] = array->cur_elem; - } + _bt_parallel_serialize_arrays(rel, btscan, so); } LWLockRelease(&btscan->btps_lock); } |