| Age | Commit message (Collapse) | Author |
|
We'll want to make use of it in more places, and we'd prefer to not
have to include all of primnodes.h everywhere.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
Follow up to commit 630f9a43cec. The previous name had become
confusing, because it doesn't actually translate a strategy number but
a CompareType into a strategy number. We might add the inverse at
some point, which would then probably be called something like
GistTranslateStratnum.
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
The right mix of DDL and VACUUM could corrupt a catalog page header such
that PageIsVerified() durably fails, requiring a restore from backup.
This affects only catalogs that both have a syscache and have DDL code
that uses syscache tuples to construct updates. One of the test
permutations shows a variant not yet fixed.
This makes !TransactionIdIsValid(TM_FailureData.xmax) possible with
TM_Deleted. I think core and PGXN are indifferent to that.
Per bug #17821 from Alexander Lakhin. Back-patch to v13 (all supported
versions). The test case is v17+, since it uses INJECTION_POINT.
Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
|
|
Assume twophase.c is the performance-sensitive caller, and preserve its
choice of unlikely() branch hint. Add some retrospective rationale for
that choice. Back-patch to v17, for the next commit to use it.
Reviewed (in earlier versions) by Michael Paquier.
Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com
|
|
Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517@enterprisedb.com
|
|
Discussion: https://www.postgresql.org/message-id/flat/5b558da8-99fb-0a99-83dd-f72f05388517@enterprisedb.com
|
|
Discussion: https://www.postgresql.org/message-id/flat/692ee0da-49da-4d32-8dca-da224cc2800e@eisentraut.org
|
|
Move the several members of HeapScanDescData which are specific to
Bitmap Heap Scans into a new struct, BitmapHeapScanDescData, which
inherits from HeapScanDescData.
This reduces the size of the HeapScanDescData for other types of scans
and will allow us to add additional bitmap heap scan-specific members in
the future without fear of bloating the HeapScanDescData.
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me
|
|
Commit 630f9a43cec introduced an enum forward declaration, which
doesn't work in C++. To fix, just include the header file to get the
type.
|
|
This changes commit 7406ab623fe in that the gist strategy number
mapping support function is changed to use the CompareType enum as
input, instead of the "well-known" RT*StrategyNumber strategy numbers.
This is a bit cleaner, since you are not dealing with two sets of
strategy numbers. Also, this will enable us to subsume this system
into a more general system of using CompareType to define operator
semantics across index methods.
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
Add comments explaining the purpose of B-Tree support function 5 to
nbtree.h for consistency (all other support functions were already
described by nearby comments).
This fixes what was arguably an oversight in commit 911e702077, or in
follow-up doc commit 15cb2bd2 (which documented support function 5 in
btree.sgml, but neglected to add anything to nbtree.h).
|
|
Quite a bit of code within nbtutils.c is only called during nbtree
preprocessing. Move that code into a new .c file, nbtpreprocesskeys.c.
Also reorder some of the functions within the new file for clarity.
This commit has no functional impact. It is strictly mechanical.
Author: Peter Geoghegan <pg@bowt.ie>
Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-WznwNn1BDOpWxHBUK1f3Rdw8pO9UCenWXnvT=n9GO8GnLA@mail.gmail.com
Discussion: https://postgr.es/m/86930045-5df5-494a-b4f1-815bc3fbcce0%40iki.fi
|
|
This adds support for the NOT ENFORCED/ENFORCED flag for constraints,
with support for check constraints.
The plan is to eventually support this for foreign key constraints,
where it is typically more useful.
Note that CHECK constraints do not currently support ALTER operations,
so changing the enforceability of an existing constraint isn't
possible without dropping and recreating it. This could be added
later.
Author: Amul Sul <amul.sul@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Tested-by: Triveni N <triveni.n@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com
|
|
6f3820f37 adjusted the assert-enabled validation of the CompactAttribute
to call a new external function to perform the validation. That commit
made it so the function was only available when building with
USE_ASSERT_CHECKING, and because TupleDescCompactAttr() is a static
inline function, the call to verify_compact_attribute() was compiled
into any extension which uses TupleDescCompactAttr(). This caused issues
for such extensions when loading the assert-enabled extension into
PostgreSQL versions without asserts enabled due to that function being
unavailable in core.
To fix this, make verify_compact_attribute() available unconditionally,
but make it do nothing unless building with USE_ASSERT_CHECKING.
Author: Andrew Kane <andrew@ankane.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAOdR5yHfMEMW00XGo=v1zCVUS6Huq2UehXdvKnwtXPTcZwXhmg@mail.gmail.com
|
|
The originally submitted code (using bit masking) was correct when the
number of slots was restricted to be a power of two -- but that
limitation was removed during development that led to commit
53c2a97a9266, which made the bank selection code incorrect. This led to
always using a smaller number of banks than available. Change said code
to use integer modulo instead, which works correctly with an arbitrary
number of banks.
It's likely that we could improve on this to avoid runtime use of
integer division. But with this change we're, at least, not wasting
memory on unused banks, and more banks mean less contention, which is
likely to have a much higher performance impact than a single
instruction's latency.
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru
|
|
Backpatch-through: 13
|
|
5983a4cff added CompactAttribute as an abbreviated alternative to
FormData_pg_attribute to allow more cache-friendly processing in tasks
related to TupleDescs. That commit contained some assert-only code to
check that the CompactAttribute had been populated correctly, however,
the method used to do that checking caused the TupleDesc's
CompactAttribute to be zeroed before it was repopulated and compared to
the snapshot taken before the memset call. This caused issues as the type
cache caches TupleDescs in shared memory which can be used by multiple
backend processes at the same time. There was a window of time between
the zero and repopulation of the CompactAttribute where another process
would mistakenly think that the CompactAttribute is invalid due to the
memset.
To fix this, instead of taking a snapshot of the CompactAttribute and
calling populate_compact_attribute() and comparing the snapshot to the
freshly populated TupleDesc's CompactAttribute, refactor things so we
can just populate a temporary CompactAttribute on the stack. This way
we don't touch the TupleDesc's memory.
Reported-by: Alexander Lakhin, SQLsmith
Discussion: https://postgr.es/m/ca3a256a-5d12-42db-aabe-a75a030d9fb9@gmail.com
|
|
Here we convert CompactAttribute.attalign from a char, which is directly
derived from pg_attribute.attalign into a uint8, which stores the number
of bytes to align the column's value by in the tuple.
This allows tuple deformation and tuple size calculations to move away
from using the inefficient att_align_nominal() macro, which manually
checks each TYPALIGN_* char to translate that into the alignment bytes
for the given type. Effectively, this commit changes those to TYPEALIGN
calls, which are branchless and only perform some simple arithmetic with
some bit-twiddling.
The removed branches were often mispredicted by CPUs, especially so in
real-world tables which often contain a mishmash of different types
with different alignment requirements.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
|
|
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses. Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses.
For some workloads, tuple deformation can be the most CPU intensive part
of processing the query. Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column. However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.
This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
|
|
With the repurposing of TBMIterator as an interface for both parallel
and serial iteration through TIDBitmaps in commit 7f9d4187e7bab10329cc,
bitmap table scans may now use it.
Modify bitmap table scan code to use the TBMIterator. This requires
moving around a bit of code, so a few variables are initialized
elsewhere.
Author: Melanie Plageman
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me
|
|
Add and use TBMPrivateIterator, which replaces the current TBMIterator
for serial use cases, and repurpose TBMIterator to be a unified
interface for both the serial ("private") and parallel ("shared") TID
Bitmap iterator interfaces. This encapsulation simplifies call sites for
callers supporting both parallel and serial TID Bitmap access.
TBMIterator is not yet used in this commit.
Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
|
|
HeapScanDescData.rs_cindex and rs_ntuples can't be less than 0. All scan
types using the heap scan descriptor expect these values to be >= 0.
Make that expectation clear by making rs_cindex and rs_ntuples unsigned.
Also remove the test in heapam_scan_bitmap_next_tuple() that checks if
rs_cindex < 0. This was never true, but now that rs_cindex is unsigned,
it makes even less sense.
While we are at it, initialize both rs_cindex and rs_ntuples to 0 in
initscan().
Author: Melanie Plageman
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/CAAKRu_ZxF8cDCM_BFi_L-t%3DRjdCZYP1usd1Gd45mjHfZxm0nZw%40mail.gmail.com
|
|
It can be useful to know the state of a relation page's VM bits before
visibilitymap_set(). visibilitymap_set() has the old value on hand, so
returning it is simple. This commit does not use visibilitymap_set()'s
new return value.
Author: Melanie Plageman
Reviewed-by: Masahiko Sawada, Andres Freund, Nitin Jadhav, Bilal Yavuz
Discussion: https://postgr.es/m/flat/CAAKRu_ZQe26xdvAqo4weHLR%3DivQ8J4xrSfDDD8uXnh-O-6P6Lg%40mail.gmail.com#6d8d2b4219394f774889509bf3bdc13d,
https://postgr.es/m/ctdjzroezaxmiyah3gwbwm67defsrwj2b5fpfs4ku6msfpxeia%40mwjyqlhwr2wu
|
|
Remove the 'whenTaken' and 'lsn' fields from SnapshotData. After the
removal of the "snapshot too old" feature, they were never set to a
non-zero value.
This largely reverts commit 3e2f3c2e423, which added the
OldestActiveSnapshot tracking, and the init_toast_snapshot()
function. That was only required for setting the 'whenTaken' and 'lsn'
fields. SnapshotToast is now a constant again, like SnapshotSelf and
SnapshotAny. I kept a thin get_toast_snapshot() wrapper around
SnapshotToast though, to check that you have a registered or active
snapshot. That's still a useful sanity check.
Reviewed-by: Nathan Bossart, Andres Freund, Tom Lane
Discussion: https://www.postgresql.org/message-id/cd4b4f8c-e63a-41c0-95f6-6e6cd9b83f6d@iki.fi
|
|
Usually an entry in pg_amop or pg_amproc does not need a dependency on
its amoplefttype/amoprighttype/amproclefttype/amprocrighttype types,
because there is an indirect dependency via the argument types of its
referenced operator or procedure, or via the opclass it belongs to.
However, for some support procedures in some index AMs, the argument
types of the support procedure might not mention the column data type
at all. Also, the amop/amproc entry might be treated as "loose" in
the opfamily, in which case it lacks a dependency on any particular
opclass; or it might be a cross-type entry having a reference to a
datatype that is not its opclass' opcintype.
The upshot of all this is that there are cases where a datatype can
be dropped while leaving behind amop/amproc entries that mention it,
because there is no path in pg_depend showing that those entries
depend on that type. Such entries are harmless in normal activity,
because they won't get used, but they cause problems for maintenance
actions such as dropping the operator family. They also cause pg_dump
to produce bogus output. The previous commit put a band-aid on the
DROP OPERATOR FAMILY failure, but a real fix is needed.
To fix, add pg_depend entries showing that a pg_amop/pg_amproc entry
depends on its lefttype/righttype. To avoid bloating pg_depend too
much, skip this if the referenced operator or function has that type
as an input type. (I did not bother with considering the possible
indirect dependency via the opclass' opcintype; at least in the
reported case, that wouldn't help anyway.)
Probably, the reason this has escaped notice for so long is that
add-on datatypes and relevant opclasses/opfamilies are usually
packaged as extensions nowadays, so that there's no way to drop
a type without dropping the referencing opclasses/opfamilies too.
Still, in the absence of pg_depend entries there's nothing that
constrains DROP EXTENSION to drop the opfamily entries before the
datatype, so it seems possible for a DROP failure to occur anyway.
The specific case that was reported doesn't fail in v13, because
v13 prefers to attach the support procedure to the opclass not the
opfamily. But it's surely possible to construct other edge cases
that do fail in v13, so patch that too.
Per report from Yoran Heling. Back-patch to all supported branches.
Discussion: https://postgr.es/m/Z1MVCOh1hprjK5Sf@gmai021
|
|
This reverts commit d28dff3f6cd6a7562fb2c211ac0fb74a33ffd032.
Quite a large number of buildfarm members didn't like this commit and
it's not yet clear why. Reverting this before too many animals turn
red.
Discussion: https://postgr.es/m/CAApHDvr9i6T5=iAwQCxFDgMsthr_obVxgwBaEJkC8KUH6yM3Hw@mail.gmail.com
|
|
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses. Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses. With this
change, NAMEDATALEN could be increased with a much smaller negative impact
on performance.
For some workloads, tuple deformation can be the most CPU intensive part
of processing the query. Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column. However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.
This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.
Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
Reviewed-by: Andres Freund, Victor Yegorov
|
|
This commit reverts 3c5db1d6b0, and subsequent improvements and fixes
including 8036d73ae3, 867d396ccd, 3ac3ec580c, 0868d7ae70, 85b98b8d5a,
2520226c95, 014f9f34d2, e658038772, e1555645d7, 5035172e4a, 6cfebfe88b,
73da6b8d1b, and e546989a26.
The reason for reverting is a set of remaining issues. Most notably, the
stored procedure appears to need more effort than the utility statement
to turn the backend into a "snapshot-less" state. This makes an approach
to use stored procedures questionable.
Catversion is bumped.
Discussion: https://postgr.es/m/Zyhj2anOPRKtb0xW%40paquier.xyz
|
|
Instead of talking about setting latches, which is a pretty low-level
mechanism, emphasize that they wake up other processes.
This is in preparation for replacing Latches with a new abstraction.
That's still work in progress, but this seems a little tidier anyway,
so let's get this refactoring out of the way already.
Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi
|
|
This is in preparation for replacing Latches with a new abstraction.
That's still work in progress, but this seems a little tidier anyway,
so let's get this refactoring out of the way already.
Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi
|
|
A bug in nbtree's handling of primitive index scan scheduling could lead
to wrong answers when a scrollable cursor was used with an index scan
that had a SAOP index qual. Wrong answers were only possible when the
scan direction changed after a primitive scan was scheduled, but before
_bt_next was asked to fetch the next tuple in line (i.e. for things to
break, _bt_next had to be denied the opportunity to step off the page in
the same direction as the one used when the primscan was scheduled).
Furthermore, the issue only occurred when the page in question happened
to be the first page to be visited by the entire top-level scan; the
issue hinged upon the cursor backing up to the absolute beginning of the
key space that it returns tuples from (fetching in the opposite scan
direction across a "primitive scan boundary" always worked correctly).
To fix, make _bt_next unset the "needs primitive index scan" flag when
it detects that the current scan direction is not the one that was used
by _bt_readpage back when the primitive scan in question was scheduled.
This fixes the cases that are known to be faulty, and also seems like a
good idea on general robustness grounds.
Affected scrollable cursor cases now avoid a spurious primitive index
scan when they fetch backwards to the absolute start of the key space to
be visited by their cursor. Fetching backwards now only returns those
tuples at the start of the scan, as expected. It'll also be okay to
once again fetch forwards from the start at that point, since the scan
will be left in a state that's exactly consistent with the state it was
in before any tuples were ever fetched, as expected.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
|
|
Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed inplace updates
to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE.
By keeping the pin during that wait, a sequence of autovacuum workers
and an uncommitted GRANT starved one foreground LockBufferForCleanup()
for six minutes, on buildfarm member sarus. Prevent, at the cost of a
bit of complexity. Back-patch to v12, like the earlier commit. That
commit and heap_inplace_lock() have not yet appeared in any release.
Discussion: https://postgr.es/m/20241026184936.ae.nmisch@google.com
|
|
Move all responsibility for indicating a block is exhuasted into
table_scan_bitmap_next_tuple() and advance the main iterator in
heap-specific code. This flow control makes more sense and is a step
toward using the read stream API for bitmap heap scans.
Previously, table_scan_bitmap_next_block() returned false to indicate
table_scan_bitmap_next_tuple() should not be called for the tuples on
the page. This happened both when 1) there were no visible tuples on the
page and 2) when the block returned by the iterator was past the end of
the table. BitmapHeapNext() (generic bitmap table scan code) handled the
case when the bitmap was exhausted.
It makes more sense for table_scan_bitmap_next_tuple() to return false
when there are no visible tuples on the page and
table_scan_bitmap_next_block() to return false when the bitmap is
exhausted or there are no more blocks in the table.
As part of this new design, TBMIterateResults are no longer used as a
flow control mechanism in BitmapHeapNext(), so we removed
table_scan_bitmap_next_tuple's TBMIterateResult parameter.
Note that the prefetch iterator is still saved in the
BitmapHeapScanState node and advanced in generic bitmap table scan code.
This is because 1) it was not necessary to change the prefetch iterator
location to change the flow control in BitmapHeapNext() 2) modifying
prefetch iterator management requires several more steps better split
over multiple commits and 3) the prefetch iterator will be removed once
the read stream API is used.
Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas, Mark Dilger
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
|
|
Increment the lossy and exact page counters for EXPLAIN of bitmap heap
scans in heapam_scan_bitmap_next_block(). Note that other table AMs will
need to do this as well
Pushing the counters into heapam_scan_bitmap_next_block() is required to
be able to use the read stream API for bitmap heap scans. The bitmap
iterator must be advanced from inside the read stream callback, so
TBMIterateResults cannot be used as a flow control mechanism in
BitmapHeapNext().
Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
|
|
The inplace update survives ROLLBACK. The inval didn't, so another
backend's DDL could then update the row without incorporating the
inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER
TABLE resulted in a table with an index, yet relhasindex=f. That is a
source of index corruption. Back-patch to v12 (all supported versions).
The back branch versions don't change WAL, because those branches just
added end-of-recovery SIResetAll(). All branches change the ABI of
extern function PrepareToInvalidateCacheTuple(). No PGXN extension
calls that, and there's no apparent use case in extensions.
Reviewed by Nitin Motiani and (in earlier versions) Andres Freund.
Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
|
|
Currently, WaitForLSNReplay() immediately throws an error if waiting for LSN
replay is not successful. This commit teaches WaitForLSNReplay() to return
the result of waiting, while making pg_wal_replay_wait() responsible for
throwing an appropriate error.
This is preparation to adding 'no_error' argument to pg_wal_replay_wait() and
new function pg_wal_replay_wait_status(), which returns the last wait result
status.
Additionally, we stop distinguishing situations when we find our instance to
be not in a recovery state before entering the waiting loop and inside
the waiting loop. Standby promotion may happen at any moment, even between
issuing a procedure call statement and pg_wal_replay_wait() doing a first
check of recovery status. Thus, there is no pointing distinguishing these
situations.
Also, since we may exit the waiting loop and see our instance not in recovery
without throwing an error, we need to deleteLSNWaiter() in that case. We do
this unconditionally for the sake of simplicity, even if standby was already
promoted after reaching the target LSN, the startup process surely already
deleted us.
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZtUF17gF0pNpwZDI%40paquier.xyz
Reviewed-by: Michael Paquier, Pavel Borisov
|
|
3c5db1d6b implemented the pg_wal_replay_wait() stored procedure. Due to
the patch development history, the implementation resided in
src/backend/commands/waitlsn.c (src/include/commands/waitlsn.h for headers).
014f9f34d moved pg_wal_replay_wait() itself to
src/backend/access/transam/xlogfuncs.c near to the WAL-manipulation functions.
But most of the implementation stayed in place.
The code in src/backend/commands/waitlsn.c has nothing to do with commands,
but is related to WAL. So, this commit moves this code into
src/backend/access/transam/xlogwait.c (src/include/access/xlogwait.h for
headers).
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org
Reviewed-by: Pavel Borisov
|
|
An oversight in commit f6bef362c.
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAD21AoB8MJ5OHtpUw1UEGf7spioFmP3PNH44KNx6Yb3FiZSwKA%40mail.gmail.com
|
|
Make nbtree backwards scans optimistically access the next page to be
read to the left by following a prevPage block number that's now stashed
in currPos when the leaf page is first read. This approach matches the
one taken during forward scans, which follow a symmetric nextPage block
number from currPos. We stash both a prevPage and a nextPage, since the
scan direction might change (when fetching from a scrollable cursor).
Backwards scans will no longer need to lock the same page twice, except
in rare cases where the scan detects a concurrent page split (or page
deletion). Testing has shown this optimization to be particularly
effective during parallel index-only backwards scans: ~12% reductions in
query execution time are quite possible.
We're much better off being optimistic; concurrent left sibling page
splits are rare in general. It's possible that we'll need to lock more
pages than the pessimistic approach would have, but only when there are
_multiple_ concurrent splits of the left sibling page we now start at.
If there's just a single concurrent left sibling page split, the new
approach to scanning backwards will at least break even relative to the
old one (we'll acquire the same number of leaf page locks as before).
The optimization from this commit has long been contemplated by comments
added by commit 2ed5b87f96, which changed the rules for locking/pinning
during nbtree index scans. The approach that that commit introduced to
leaf level link traversal when scanning forwards is now more or less
applied all the time, regardless of the direction we're scanning in.
Following uniform conventions around sibling link traversal is simpler.
The only real remaining difference between our forward and backwards
handling is that our backwards handling must still detect and recover
from any concurrent left sibling splits (and concurrent page deletions),
as documented in the nbtree README. That is structured as a single,
isolated extra step that takes place in _bt_readnextpage.
Also use this opportunity to further simplify the functions that deal
with reading pages and traversing sibling links on the leaf level, and
to document their preconditions and postconditions (with respect to
things like buffer locks, buffer pins, and seizing the parallel scan).
This enhancement completely supersedes the one recently added by commit
3f44959f.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
|
|
Commit 5bf748b8 taught nbtree ScalarArrayOp index scans to decide when
and how to start the next primitive index scan based on physical index
characteristics. This included rules for deciding whether to start a
new primitive index scan (or whether to move onto the right sibling leaf
page instead) that specifically consider truncated lower-order columns
(-inf columns) from leaf page high keys.
These omitted columns were treated as satisfying the scan's required
scan keys, though only for scan keys marked required in the current scan
direction (forward). Scan keys that didn't get this behavior (those
marked required in the backwards direction only) usually didn't give the
scan reasonable cause to reposition itself to a later leaf page (via
another descent of the index in _bt_first), but _bt_advance_array_keys
would nevertheless always give up by forcing another call to _bt_first.
_bt_advance_array_keys was unwilling to allow the scan to continue onto
the next leaf page, to reconsider whether we really should start another
primitive scan based on the details of the sibling page's tuples. This
didn't match its behavior with similar cases involving keys required in
the current scan direction (forward), which seems unprincipled. It led
to an excessive number of primitive scans/index descents for queries
with a higher-order = array scan key (with dense, contiguous values)
mixed with a lower-order required > or >= scan key.
Bring > and >= strategy scan keys in line with other required scan key
types: treat truncated -inf scan keys as having satisfied scan keys
required in either scan direction (forwards and backwards alike) during
array advancement. That way affected scans can continue to the right
sibling leaf page. Advancement must now schedule an explicit recheck of
the right sibling page's high key in cases involving > or >= scan keys.
The recheck gives the scan a way to back out and start another primitive
index scan (we can't just rely on _bt_checkkeys with > or >= scan keys).
This work can be considered a stand alone optimization on top of the
work from commit 5bf748b8. But it was written in preparation for an
upcoming patch that will add skip scan to nbtree. In practice scans
that use "skip arrays" will tend to be much more sensitive to any
implementation deficiencies in this area.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
|
|
Previously, the pg_stat_checkpointer view and the checkpoint completion
log message could show different numbers for buffers written
during checkpoints. The view only counted shared buffers,
while the log message included both shared and SLRU buffers,
causing inconsistencies.
This commit resolves the issue by updating both the view and the log message
to separately report shared and SLRU buffers written during checkpoints.
A new slru_written column is added to the pg_stat_checkpointer view
to track SLRU buffers, while the existing buffers_written column now
tracks only shared buffers. This change would help users distinguish
between the two types of buffers, in the pg_stat_checkpointer view and
the checkpoint complete log message, respectively.
Bump catalog version.
Author: Nitin Jadhav
Reviewed-by: Bharath Rupireddy, Michael Paquier, Kyotaro Horiguchi, Robert Haas
Reviewed-by: Andres Freund, vignesh C, Fujii Masao
Discussion: https://postgr.es/m/CAMm1aWb18EpT0whJrjG+-nyhNouXET6ZUw0pNYYAe+NezpvsAA@mail.gmail.com
|
|
Checkpoints can be skipped when the server is idle. The existing num_timed and
num_requested counters in pg_stat_checkpointer track both completed and
skipped checkpoints, but there was no way to count only the completed ones.
This commit introduces the num_done counter, which tracks only completed
checkpoints, making it easier to see how many were actually performed.
Bump catalog version.
Author: Anton A. Melnikov
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/9ea77f40-818d-4841-9dee-158ac8f6e690@oss.nttdata.com
|
|
As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog. It could lose
the update. Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running. The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.
For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock. Back-patch to v12 (all supported
versions). In back branches, retain a deprecated heap_inplace_update(),
for extensions.
Reported by Smolkin Grigory. Reviewed by Nitin Motiani, (in earlier
versions) Heikki Linnakangas, and (in earlier versions) Alexander
Lakhin.
Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
|
|
table_beginscan_parallel and index_beginscan_parallel contain
Asserts checking that the relation a worker will use in
a parallel scan is the same one the leader intended. However,
they were checking for relation OID match, which was not strong
enough to detect the mismatch problem fixed in 126ec0bc7.
What would be strong enough is to compare relfilenodes instead.
Arguably, that's a saner definition anyway, since a scan surely
operates on a physical relation not a logical one. Hence,
store and compare RelFileLocators not relation OIDs. Also
ensure that index_beginscan_parallel checks the index identity
not just the table identity.
Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us
|
|
This opens the possibility to define keys for more types of statistics
kinds in PgStat_HashKey, the first case being 8-byte query IDs for
statistics like pg_stat_statements.
This increases the size of PgStat_HashKey from 12 to 16 bytes, while
PgStatShared_HashEntry, entry stored in the dshash for pgstats, keeps
the same size due to alignment.
xl_xact_stats_item, that tracks the stats items to drop in commit WAL
records, is increased from 12 to 16 bytes. Note that individual chunks
in commit WAL records should be multiples of sizeof(int), hence 8-byte
object IDs are stored as two uint32, based on a suggestion from Heikki
Linnakangas.
While on it, the field of PgStat_HashKey is renamed from "objoid" to
"objid", as for some stats kinds this field does not refer to OIDs but
just IDs, like for replication slot stats.
This commit bumps the following format variables:
- PGSTAT_FILE_FORMAT_ID, as PgStat_HashKey is written to the stats file
for non-serialized stats kinds in the dshash table.
- XLOG_PAGE_MAGIC for the changes in xl_xact_stats_item.
- Catalog version, for the SQL function pg_stat_have_stats().
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/ZsvTS9EW79Up8I62@paquier.xyz
|
|
Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints.
These are backed by GiST indexes instead of B-tree indexes, since they
are essentially exclusion constraints with = for the scalar parts of
the key and && for the temporal part.
(previously committed as 46a0cd4cefb, reverted by 46a0cd4cefb; the new
part is this:)
Because 'empty' && 'empty' is false, the temporal PK/UQ constraint
allowed duplicates, which is confusing to users and breaks internal
expectations. For instance, when GROUP BY checks functional
dependencies on the PK, it allows selecting other columns from the
table, but in the presence of duplicate keys you could get the value
from any of their rows. So we need to forbid empties.
This all means that at the moment we can only support ranges and
multiranges for temporal PK/UQs, unlike the original patch (above).
Documentation and tests for this are added. But this could
conceivably be extended by introducing some more general support for
the notion of "empty" for other types.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
|
|
This is support function 12 for the GiST AM and translates
"well-known" RT*StrategyNumber values into whatever strategy number is
used by the opclass (since no particular numbers are actually
required). We will use this to support temporal PRIMARY
KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality.
This commit adds two implementations, one for internal GiST opclasses
(just an identity function) and another for btree_gist opclasses. It
updates btree_gist from 1.7 to 1.8, adding the support function for
all its opclasses.
(previously committed as 6db4598fcb8, reverted by 8aee330af55; this is
essentially unchanged from those)
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
|
|
This brings more clarity to heapam.c, by cleanly separating all the
logic related to WAL replay and the rest of Heap and Heap2, similarly
to other RMGRs like hash, btree, etc.
The header reorganization is also nice in heapam.c, cutting half of the
headers required.
Author: Li Yong
Reviewed-by: Sutou Kouhei, Michael Paquier
Discussion: https://postgr.es/m/EFE55E65-D7BD-4C6A-B630-91F43FD0771B@ebay.com
|
|
The only current implementation is for btree where it calls
_bt_getrootheight(). Other index types can now also use this to pass
information to their amcostestimate routine. Previously, btree was
hardcoded and other index types could not hook into the optimizer at
this point.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
Add const qualifiers to XLogRegisterData() and XLogRegisterBufData().
Several unconstify() calls can be removed.
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/dd889784-9ce7-436a-b4f1-52e4a5e577bd@eisentraut.org
|