path: root/src/backend/access/heap
Age | Commit message | Author
47 hours | Fix reuse-after-free hazard in dead_items_reset | John Naylor
In a similar vein to commit ccc8194e427, a reset instance of a shared memory TID store happened to occupy the same private memory as the old one for the entry point, since the chunk freed after the last round of index vacuuming was put on the context's freelist. The failure to update the vacrel->dead_items pointer became evident when nudging the system to allocate memory in a different area. This was not discovered at the time of the earlier commit since our regression tests didn't cover multiple index passes with parallel vacuum. Backpatch to v17, when TidStore came in. Author: Kevin Oommen Anish <kevin.o@zohocorp.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Tested-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/199a07cbdfc.7a1c4aac25838.1675074408277594551%40zohocorp.com Backpatch-through: 17
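A minimal sketch of the fixed pattern, with shapes assumed from the commit message rather than copied from the patch (the TidStore API calls are real; the surrounding code is abridged):

    static void
    dead_items_reset(LVRelState *vacrel)
    {
        /* The new store may be built at the very address the old one
         * occupied, because the freed chunk sits on the memory context's
         * freelist -- so a stale vacrel->dead_items can look "unchanged"
         * while actually dangling.  Always write the new pointer back. */
        TidStoreDestroy(vacrel->dead_items);
        vacrel->dead_items =
            TidStoreCreateLocal(vacrel->dead_items_info->max_bytes, true);
    }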
11 days | Correct prune WAL record opcode name in comment | Melanie Plageman
f83d709760d8 incorrectly refers to an XLOG_HEAP2_PRUNE_FREEZE WAL record opcode. No such opcode exists. The relevant opcodes are XLOG_HEAP2_PRUNE_ON_ACCESS, XLOG_HEAP2_PRUNE_VACUUM_SCAN, and XLOG_HEAP2_PRUNE_VACUUM_CLEANUP. Correct it. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7
11 days | Fix incorrect and inconsistent comments in tableam.h and heapam.c. | Fujii Masao
This commit corrects several issues in function comments:
* The parameter "rel" was incorrectly referred to as "relation" in the comments for table_tuple_delete(), table_tuple_update(), and table_tuple_lock().
* In table_tuple_delete(), "changingPart" was listed as an output parameter in the comments but is actually input.
* In table_tuple_update(), "slot" was listed as an input parameter in the comments but is actually output.
* The comment for "update_indexes" in table_tuple_update() was mis-indented.
* The comments for heap_lock_tuple() incorrectly referenced a non-existent "tid" parameter.
Author: Chao Li <lic@highgo.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAEoWx2nB6Ay8g=KEn7L3qbYX_4+sLk9XOMkV0XZqHR4cTY8ZvQ@mail.gmail.com
2025-09-08 | Add error codes when vacuum discovers VM corruption | Melanie Plageman
Commit fd6ec93bf890314a and other previous work established the principle that when an error is potentially reachable in case of on-disk corruption but is not expected to be reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows log monitoring software to search for evidence of corruption by filtering on the error code. Enhance the existing log messages emitted when the heap page is found to be inconsistent with the VM by adding this error code. Suggested-by: Andrey Borodin <x4mmm@yandex-team.ru> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/87DD95AA-274F-4F4F-BAD9-7738E5B1F905%40yandex-team.ru
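The shape of the enhanced messages, approximately (the wording here is illustrative, not the exact vacuumlazy.c text):

    ereport(WARNING,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg("page %u of relation \"%s\" is marked all-visible in visibility map, but page contains dead tuples",
                    blkno, vacrel->relname)));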
2025-09-08 | Remove unneeded VM pin from VM replay | Melanie Plageman
Previously, heap_xlog_visible() called visibilitymap_pin() even after getting a buffer from XLogReadBufferForRedoExtended() -- which returns a pinned buffer containing the specified block of the visibility map. This would just have resulted in visibilitymap_pin() returning early since the specified page was already present and pinned, but it was confusing extraneous code, so remove it. It doesn't seem worth backporting, though. It appears to be an oversight in 2c03216. While we are at it, remove two VM-related redundant asserts in the COPY FREEZE code path. visibilitymap_set() already asserts that PD_ALL_VISIBLE is set on the heap page and checks that the vmbuffer contains the bits corresponding to the specified heap block, so callers do not also need to check this. Author: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CALdSSPhu7WZd%2BEfQDha1nz%3DDC93OtY1%3DUFEdWwSZsASka_2eRQ%40mail.gmail.com
2025-09-05 | Add assert and log message to visibilitymap_set | Melanie Plageman
Add an assert to visibilitymap_set() that the provided heap buffer is exclusively locked, which is expected. Also, enhance the debug logging message to specify which VM flags were set. Based on a related suggestion by Kirill Reshke on an in-progress patchset. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CALdSSPhAU56g1gGVT0%2BwG8RrSWE6qW8TOfNJS1HNAWX6wPgbFA%40mail.gmail.com
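The new assertion, approximately (BufferIsExclusiveLocked() is an existing bufmgr helper; the exact form in the patch may differ):

    /* caller must hold an exclusive content lock on the heap page */
    Assert(BufferIsExclusiveLocked(heapBuf));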
2025-08-29 | Remove unneeded casts of BufferGetPage() result | Peter Eisentraut
BufferGetPage() already returns type Page, so casting its result to Page doesn't achieve anything. A sizable number of call sites do this casting; remove it. This was already done inconsistently in the code in the first import in 1996 (but didn't exist in the pre-1995 code), and it was then apparently just copied around. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/CALdSSPgFhc5=vLqHdk-zCcnztC0zEY3EU_Q6a9vPEaw7FkE9Vw@mail.gmail.com
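The change is purely mechanical; a representative before/after:

    Page page = (Page) BufferGetPage(buf);  /* before: the cast is a no-op */
    Page page = BufferGetPage(buf);         /* after */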
2025-08-28 | Avoid including commands/dbcommands.h in so many places | Álvaro Herrera
This has been done historically because of get_database_name (which since commit cb98e6fb8fd4 belongs in lsyscache.c/h, so let's move it there) and get_database_oid (which is in the right place, but whose declaration should appear in pg_database.h rather than dbcommands.h). Clean this up. Also, xlogreader.h and stringinfo.h are no longer needed by dbcommands.h since commit f1fd515b393a, so remove them. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/202508191031.5ipojyuaswzt@alvherre.pgsql
2025-08-22 | Use ereport() rather than elog() | Heikki Linnakangas
Noah pointed this out before I committed 50f770c3d9, but I accidentally pushed the old version with elog() anyway. Oops. Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/20250820003756.31.nmisch@google.com
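For context, a generic illustration of the difference (not the specific message from 50f770c3d9): elog() is meant for internal errors and carries no SQLSTATE, while ereport() lets user-facing errors carry an error code and a translatable message:

    elog(ERROR, "unexpected state");               /* internal-only style */

    ereport(ERROR,
            errcode(ERRCODE_DATA_CORRUPTED),
            errmsg("unexpected state detected"));  /* user-facing style */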
2025-08-22 | Revert GetTransactionSnapshot() to return historic snapshot during LR | Heikki Linnakangas
Commit 1585ff7387 changed GetTransactionSnapshot() to throw an error if it's called during logical decoding, instead of returning the historic snapshot. I made that change for extra protection, because a historic snapshot can only be used to access catalog tables while GetTransactionSnapshot() is usually called when you're executing arbitrary queries. You might get very subtle visibility problems if you tried to use the historic snapshot for arbitrary queries. There's no built-in code in PostgreSQL that calls GetTransactionSnapshot() during logical decoding, but it turns out that the pglogical extension does just that, to evaluate row filter expressions. You would get weird results if the row filter runs arbitrary queries, but it is sane as long as you don't access any non-catalog tables. Even though there are no checks to enforce that in pglogical, a typical row filter expression does not access any tables and works fine. Accessing tables marked with the user_catalog_table = true option is also OK. To fix pglogical with row filters, and any other extensions that might do similar things, revert GetTransactionSnapshot() to return a historic snapshot during logical decoding. To try to still catch the unsafe usage of historic snapshots, add checks in heap_beginscan() and index_beginscan() to complain if you try to use a historic snapshot to scan a non-catalog table. We're very close to the version 18 release however, so add those new checks only in master. Backpatch-through: 18 Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/20250809222338.cc.nmisch@google.com
2025-07-24 | Fix return value of visibilitymap_get_status(). | Nathan Bossart
This function is declared as returning a uint8, but it returns a bool in one code path. To fix, return (uint8) 0 instead of false there. This should behave exactly the same as before, but it might prevent future compiler complaints. Oversight in commit a892234f83. Author: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://postgr.es/m/aIHluT2isN58jqHV%40jrouhaud
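The fix itself, per the description above:

    /* before: bool silently converted to the declared uint8 result */
    return false;

    /* after: the result is a bitmask of VISIBILITYMAP_ALL_VISIBLE and
     * VISIBILITYMAP_ALL_FROZEN, so return an explicit zero of that type */
    return (uint8) 0;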
2025-07-01 | Fix missing FSM vacuum opportunities on tables without indexes. | Masahiko Sawada
Commit c120550edb86 optimized the vacuuming of relations without indexes (a.k.a. one-pass strategy) by directly marking dead item IDs as LP_UNUSED. However, the periodic FSM vacuum was still checking if dead item IDs had been marked as LP_DEAD when attempting to vacuum the FSM every VACUUM_FSM_EVERY_PAGES blocks. This condition was never met due to the optimization, resulting in missed FSM vacuum opportunities. This commit modifies the periodic FSM vacuum condition to use the number of tuples deleted during HOT pruning. This count includes items marked as either LP_UNUSED or LP_REDIRECT, both of which are expected to result in new free space to report. Back-patch to v17 where the vacuum optimization for tables with no indexes was introduced. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAD21AoBL8m6B9GSzQfYxVaEgvD7-Kr3AJaS-hJPHC+avm-29zw@mail.gmail.com Backpatch-through: 17
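A sketch of the corrected trigger, with variable names assumed from the commit message (the actual vacuumlazy.c code may differ in detail):

    if (vacrel->nindexes == 0 &&
        vacrel->tuples_deleted > tuples_already_deleted &&
        blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
    {
        /* HOT pruning freed space (LP_UNUSED/LP_REDIRECT items), so
         * propagate it up the FSM tree now, rather than waiting on the
         * LP_DEAD condition that the one-pass strategy never satisfies */
        FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
        next_fsm_block_to_vacuum = blkno;
    }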
2025-06-30 | Rationalize handling of VacuumParams | Michael Paquier
This commit refactors the vacuum routines that rely on VacuumParams, adding const markers where necessary to force a new policy in the code. This structure should not be passed around as a pointer, as it may be used across multiple relations and its contents should never be updated. vacuum_rel() stands as an exception, as it touches the "index_cleanup" and "truncate" options. VacuumParams was introduced in 0d831389749a, and 661643dedad9 fixed a bug impacting VACUUM operating on multiple relations. The changes done in tableam.h break ABI compatibility, so this commit can only happen on HEAD. Author: Shihao Zhong <zhong950419@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com
2025-06-28 | Message style improvements | Peter Eisentraut
2025-06-26 | Remove unused check in heap_xlog_insert() | Melanie Plageman
8e03eb92e9a reverted commit 39b66a91bd, which allowed freezing in the heap_insert() code path, but forgot to remove the corresponding check in heap_xlog_insert(). This code is extraneous but not harmful. However, cleaning it up makes it very clear that, as of now, we do not support any freezing of pages in the heap_insert() path. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_Zp4Pi-t51OFWm1YZ-cctDfBhHCMZ%3DEx6PKxv0o8y2GvA%40mail.gmail.com Backpatch-through: 14
2025-06-26 | Simplify vacuum VM update logging counters | Melanie Plageman
We can simplify the VM counters added in dc6acfd910b8 to lazy_vacuum_heap_page() and lazy_scan_new_or_empty(). We won't invoke lazy_vacuum_heap_page() unless there are dead line pointers, so we know the page can't be all-visible. In lazy_scan_new_or_empty(), we only update the VM if the page-level hint PD_ALL_VISIBLE is clear, and the VM bit cannot be set if the page level bit is clear because a subsequent page update would fail to clear the visibility map bit. Simplify the logic for determining which log counters to increment based on this knowledge. Doing so is worthwhile because the old logic was confusing and misguided. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_a9w_n2mwY%3DG4LjfWTvRTJtjbfvnYAKi4WjO8QXHHrA0g%40mail.gmail.com
2025-06-03 | Rename log_lock_failure GUC to log_lock_failures for consistency. | Fujii Masao
This commit renames the GUC log_lock_failure to log_lock_failures to align with the existing similar setting log_lock_waits, which uses the plural form. This improves naming consistency across related GUCs. Suggested-by: Peter Eisentraut <peter@eisentraut.org> Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/7a8198b6-d5b8-4910-b41e-8d3efcbb015d@eisentraut.org
2025-06-02 | Correct heap vacuum boundary state setup ordering | Melanie Plageman
052026c9b9 mistakenly reordered setup steps in heap_vacuum_rel(), incorrectly moving RelationGetNumberOfBlocks() before vacuum_get_cutoffs(). OldestXmin must be determined before RelationGetNumberOfBlocks() calculates the number of blocks in the relation that will be vacuumed. Otherwise tuples older than OldestXmin may be inserted into the end of the relation into blocks that are not vacuumed. If additional tuples newer than those inserted into unscanned blocks but older than OldestXmin are inserted into free space earlier in the relation, the result could be advancing pg_class.relfrozenxid to a newer value than an unfrozen XID in one of the unscanned heap pages. Assigning an incorrect relfrozenxid can lead to data loss, so it is imperative that it correctly reflect the oldest unfrozen xid. Reported-by: Peter Geoghegan <pg@bowt.ie> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzntqvVEdbbpqG5JqSZGuLWmy4PBfUO-OswfivKchr2gvw%40mail.gmail.com
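The required ordering, as a simplified sketch of heap_vacuum_rel():

    /* First establish the cutoffs, including OldestXmin ... */
    vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);

    /* ... and only then snapshot the relation size.  Blocks appended
     * after this point can hold only tuples newer than OldestXmin, so
     * leaving them unscanned cannot strand an unfrozen XID older than
     * the relfrozenxid we later advance to. */
    vacrel->rel_pages = RelationGetNumberOfBlocks(rel);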
2025-05-30 | Ensure we have a snapshot when updating various system catalogs. | Nathan Bossart
A few places that access system catalogs don't set up an active snapshot before potentially accessing their TOAST tables. To fix, push an active snapshot just before each section of code that might require accessing one of these TOAST tables, and pop it shortly afterwards. While at it, this commit adds some rather strict assertions in an attempt to prevent such issues in the future. Commit 16bf24e0e4 recently removed pg_replication_origin's TOAST table in order to fix the same problem for that catalog. On the back-branches, those bugs are left in place. We cannot easily remove a catalog's TOAST table on released major versions, and only replication origins with extremely long names are affected. Given the low severity of the issue, fixing older versions doesn't seem worth the trouble of significantly modifying the patch. Also, on v13 and v14, the aforementioned strict assertions have been omitted because commit 2776922201, which added HaveRegisteredOrActiveSnapshot(), was not back-patched. While we could probably back-patch it now, I've opted against it because it seems unlikely that new TOAST snapshot issues will be introduced in the oldest supported versions. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/18127-fe54b6a667f29658%40postgresql.org Discussion: https://postgr.es/m/18309-c0bf914950c46692%40postgresql.org Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan Backpatch-through: 13
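The pattern applied around each affected code section (generic sketch; the real sites wrap specific catalog updates):

    PushActiveSnapshot(GetTransactionSnapshot());

    /* ... update a catalog row whose attributes may be TOASTed;
     * detoasting requires an active snapshot to read the TOAST table ... */

    PopActiveSnapshot();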
2025-05-27 | Fix assertion when decrementing eager scanning success and failure counters. | Masahiko Sawada
Previously, we asserted that the eager scan's success and failure counters were positive before decrementing them. However, this assumption was incorrect, as it's possible that some blocks have already been eagerly scanned by the time eager scanning is disabled. This commit replaces the assertions with guards to handle this scenario gracefully. With this change, we continue to allow read-ahead operations by the read stream that exceed the success and failure caps. While there is a possibility that overruns will trigger eager scans of additional pages, this does not pose a practical concern as the overruns will not be substantial and remain within an acceptable range. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAD21AoConf6tkVCv-=JhQJj56kYsDwo4jG5+WqgT+ukSkYomSQ@mail.gmail.com
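The shape of the change, with a field name assumed from the eager-scanning design rather than quoted from the patch:

    /* before: assumed the counter could never already be exhausted */
    Assert(vacrel->eager_scan_remaining_successes > 0);
    vacrel->eager_scan_remaining_successes--;

    /* after: read-stream read-ahead may overrun the cap, so guard */
    if (vacrel->eager_scan_remaining_successes > 0)
        vacrel->eager_scan_remaining_successes--;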
2025-05-10 | Add support for runtime arguments in injection points | Michael Paquier
The macros INJECTION_POINT() and INJECTION_POINT_CACHED() are extended with an optional argument that can be passed down to the callback attached when an injection point is run, giving callbacks the possibility to manipulate a stack state given by the caller. The existing callbacks in modules injection_points and test_aio have their declarations adjusted based on that. da7226993fd4 (core AIO infrastructure) and 93bc3d75d8e1 (test_aio) had been relying on a set of workarounds where a static variable called pgaio_inj_cur_handle was used as runtime argument in the injection point callbacks used by the AIO tests, in combination with a TRY/CATCH block to reset the argument value. The infrastructure introduced in this commit will be reused for the AIO tests, simplifying them. Reviewed-by: Greg Burd <greg@burd.me> Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz
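With the new argument, a call site can hand its state directly to the attached callback instead of routing it through a static variable (the point name here is invented for illustration):

    /* "my_state" arrives as a parameter of the injection point callback */
    INJECTION_POINT("heap-example-point", (void *) my_state);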
2025-04-20 | Comment on need to MarkBufferDirty() if omitting DELAY_CHKPT_START. | Noah Misch
Blocking checkpoint phase 2 requires MarkBufferDirty() and BUFFER_LOCK_EXCLUSIVE; neither suffices by itself. transam/README documents this, citing SyncOneBuffer(). Update the DELAY_CHKPT_START documentation to say this. Expand the heap_inplace_update_and_unlock() comment that cites XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock() could have opted not to use DELAY_CHKPT_START. Commit 8e7e672cdaa6bfec85d4d5dd9be84159df23bb41 added DELAY_CHKPT_START to heap_inplace_update_and_unlock(). Since commit bc6bad88572501aecaa2ac5d4bc900ac0fd457d5 reverted it in non-master branches, no back-patch. Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com
2025-04-03 | Use AIO batchmode for bitmap heap scans | Melanie Plageman
Previously bitmap heap scan was not AIO batchmode safe because of the visibility map reads potentially done for the "skip fetch" optimization (which skipped fetching tuples from the heap if the pages were all visible and none of the columns were used in the query). The skip fetch optimization implementation was found to have bugs and was removed in 459e7bf8e2f8, so we can safely enable batchmode for bitmap heap scans.
2025-04-02 | Remove HeapBitmapScan's skip_fetch optimization | Andres Freund
The optimization does not take the removal of TIDs by a concurrent vacuum into account. The concurrent vacuum can remove dead TIDs and make pages ALL_VISIBLE while those dead TIDs are referenced in the bitmap. This can lead to a skip_fetch scan returning too many tuples. It likely would be possible to implement this optimization safely, but we don't have the necessary infrastructure in place. Nor is it clear that it's worth building that infrastructure, given how limited the skip_fetch optimization is. In the backbranches we just disable the optimization by always passing need_tuples=true to table_beginscan_bm(). We can't perform API/ABI changes in the backbranches and we want to make the change as minimal as possible. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reported-By: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/CAEze2Wg3gXXZTr6_rwC+s4-o2ZVFB5F985uUSgJTsECx6AmGcQ@mail.gmail.com Backpatch-through: 13
2025-04-01 | heapam: Only set tuple's block once per page in pagemode | Heikki Linnakangas
Due to splitting the block id into two 16 bit integers, BlockIdSet() is more expensive than one might think. Doing it once per returned tuple shows up as a small but reliably reproducible cost. It's simple enough to set the block number just once per block in pagemode, so do so. Author: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6
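A simplified sketch of the idea (the itemptr.h macros are real; the loop structure is abridged):

    /* once per page: the costly half -- BlockIdSet() splits the block
     * number into BlockIdData's two 16-bit fields */
    ItemPointerSetBlockNumber(&tuple->t_self, block);

    /* per returned tuple: only the 16-bit offset needs updating */
    ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);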
2025-03-30 | read_stream: Introduce and use optional batchmode support | Andres Freund
Submitting IO in larger batches can be more efficient than doing so one-by-one, particularly for many small reads. It does, however, require the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
a) block without first calling pgaio_submit_staged(), unless a to-be-waited-on lock cannot be part of a deadlock, e.g. because it is never held while waiting for IO.
b) directly or indirectly start another batch via pgaio_enter_batchmode().
As this requires care and is nontrivial in some cases, batching is only used with explicit opt-in. This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and uses it where appropriate. There are two cases where batching would likely be beneficial, but where we aren't using it yet:
1) bitmap heap scans, because the callback reads the VM. This should soon be solved, because we are planning to remove the use of the VM, due to that not being sound.
2) the first phase of heap vacuum. This could be made to support batchmode, but would require some care.
Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
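Opting a stream in looks roughly like this (the flag and function are from read_stream.h; the callback name is hypothetical):

    stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE |
                                        READ_STREAM_USE_BATCHING,
                                        bstrategy, rel, MAIN_FORKNUM,
                                        my_block_cb,   /* must follow the
                                                        * batching rules above */
                                        callback_private_data,
                                        0);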
2025-03-29 | Use PRI?64 instead of "ll?" in format strings (continued). | Peter Eisentraut
Continuation of work started in commit 15a79c73, after initial trial. Author: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/b936d2fb-590d-49c3-a615-92c3a88c6c19%40eisentraut.org
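A representative conversion (the message text is illustrative):

    /* before: "ll" length modifier plus a cast to keep the types honest */
    elog(DEBUG2, "%lld tuples frozen", (long long) nfrozen);

    /* after: PRId64 matches int64 directly, no cast needed */
    elog(DEBUG2, "%" PRId64 " tuples frozen", nfrozen);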
2025-03-24 | Fix bitmapheapscan incorrect recheck of NULL tuples | Melanie Plageman
The bitmap heap scan skip fetch optimization skips fetching the heap block when a page is set all-visible in the visibility map and no columns from the table are needed to satisfy the query. 2b73a8cd33b and c3953226a07 changed the control flow of bitmap heap scan to use the read stream API. The read stream API returns buffers containing blocks to the user. To make this work with the skip fetch optimization, we keep a count of the empty tuples we need to emit for all the blocks skipped and only emit the empty tuples after processing the next block fetched from the heap or at the end of the scan. It's incorrect to recheck NULL tuples, so we must set `recheck` to false before yielding control back to BitmapHeapNext(). This was done before emitting any remaining empty tuples at the end of the scan but not for empty tuples emitted during the scan. This meant that if a page fetched from the heap did require recheck and set `recheck` to true and then we emitted empty tuples for subsequent blocks, we would get wrong results. Fix this by always setting `recheck` to false before emitting empty tuples. Reported-by: Alexander Lakhin <exclusion@gmail.com> Tested-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/496f7acd-881c-4df3-9bd3-8f8534dfec26%40gmail.com
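A sketch of the corrected emit path (field names taken from the v17-era heap scan code as the message describes it; treat this as an outline, not the literal patch):

    if (scan->rs_empty_tuples_pending > 0)
    {
        /* all-null tuples stand in for skipped all-visible blocks and
         * must never be rechecked, whatever the last fetched heap page
         * happened to require */
        *recheck = false;
        ExecStoreAllNullTuple(slot);
        scan->rs_empty_tuples_pending--;
        return true;
    }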
2025-03-18 | Fix assertion failure in parallel vacuum with minimal maintenance_work_mem setting. | Masahiko Sawada
bbf668d66fbf lowered the minimum value of maintenance_work_mem to 64kB. However, in parallel vacuum cases, since the initial underlying DSA size is 256kB, it attempts to perform a cycle of index vacuuming and table vacuuming with an empty TID store, resulting in an assertion failure. This commit ensures that at least one page is processed before index vacuuming and table vacuuming begins. Backpatch to 17, where the minimum maintenance_work_mem value was lowered. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAD21AoCEAmbkkXSKbj4dB+5pJDRL4ZHxrCiLBgES_g_g8mVi1Q@mail.gmail.com Backpatch-through: 17
2025-03-18 | Fix typo. | Amit Kapila
Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CALDaNm1KqJ0VFfDJRPbfYi9Shz6LHFEE-Ckn+eqsePfKhebv9w@mail.gmail.com
2025-03-15 | Remove table AM callback scan_bitmap_next_block | Melanie Plageman
After pushing the bitmap iterator into table-AM specific code (as part of making bitmap heap scan use the read stream API in 2b73a8cd33b7), scan_bitmap_next_block() no longer returns the current block number. Since scan_bitmap_next_block() isn't returning any relevant information to bitmap table scan code, it makes more sense to get rid of it. Now, bitmap table scan code only calls table_scan_bitmap_next_tuple(), and the heap AM implementation of scan_bitmap_next_block() is a local helper in heapam_handler.c. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
2025-03-15 | BitmapHeapScan uses the read stream API | Melanie Plageman
Make Bitmap Heap Scan use the read stream API instead of invoking ReadBuffer() for each block indicated by the bitmap. The read stream API handles prefetching, so remove all of the explicit prefetching from bitmap heap scan code. Now, heap table AM implements a read stream callback which uses the bitmap iterator to return the next required block to the read stream code. Tomas Vondra conducted extensive regression testing of this feature. Andres Freund, Thomas Munro, and I analyzed regressions and Thomas Munro patched the read stream API. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Tested-by: Tomas Vondra <tomas@vondra.me> Tested-by: Andres Freund <andres@anarazel.de> Tested-by: Thomas Munro <thomas.munro@gmail.com> Tested-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
2025-03-15 | Separate TBM[Shared|Private]Iterator and TBMIterateResult | Melanie Plageman
Remove the TBMIterateResult member from the TBMPrivateIterator and TBMSharedIterator and make tbm_[shared|private_]iterate() take a TBMIterateResult as a parameter. This allows tidbitmap API users to manage multiple TBMIterateResults per scan. This is required for bitmap heap scan to use the read stream API, with which there may be multiple I/Os in flight at once, each one with a TBMIterateResult. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/d4bb26c9-fe07-439e-ac53-c0e244387e01%40vondra.me
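After the change, the caller owns the result storage (sketch; the helper name process_block is invented):

    TBMIterateResult tbmres;

    while (tbm_private_iterate(iterator, &tbmres))
    {
        /* tbmres is caller-owned, so a read stream can keep one
         * TBMIterateResult alive per in-flight I/O */
        process_block(tbmres.blockno);
    }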
2025-03-14 | Add GUC option to log lock acquisition failures. | Fujii Masao
This commit introduces a new GUC, log_lock_failure, which controls whether a detailed log message is produced when a lock acquisition fails. Currently, it only supports logging lock failures caused by SELECT ... NOWAIT. The log message includes information about all processes holding or waiting for the lock that couldn't be acquired, helping users analyze and diagnose the causes of lock failures. This option does not log failures from SELECT ... SKIP LOCKED, as that could generate excessive log messages if many locks are skipped, causing unnecessary noise. This mechanism can be extended in the future to support logging lock failures from other commands, such as LOCK TABLE ... NOWAIT. Author: Yuki Seino <seinoyu@oss.nttdata.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://postgr.es/m/411280a186cc26ef7034e0f2dfe54131@oss.nttdata.com
2025-03-11 | Assert that a snapshot is active or registered before it's used | Heikki Linnakangas
The comment in GetTransactionSnapshot() said that you "should call RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be used very long". That felt too unclear to me. Make the comment more strongly worded. To enforce that rule and to catch potential bugs where a snapshot might get invalidated while it's still in use, add an assertion to HeapTupleSatisfiesMVCC() to check that the snapshot is registered or pushed to active stack. No new bugs were found by this, but it seems like good future-proofing. It's not a great place for the check; HeapTupleSatisfiesMVCC() is in fact safe to call with an unregistered snapshot, and the assertion won't catch other unsafe uses. But it goes a long way in practice. Fix a few cases that were playing fast and loose with that and just assumed that the snapshot cannot be invalidated during a scan. Those assumptions were not wrong, but they're not performance critical, so let's drop the excuses and just register the snapshot. These were false positives found by the new assertion. Discussion: https://www.postgresql.org/message-id/7c56f180-b9e1-481e-8c1d-efa63de3ecbb@iki.fi
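The rule the new assertion enforces, in pattern form:

    Snapshot snap = RegisterSnapshot(GetTransactionSnapshot());

    /* ... any longer-lived use: scans, repeated visibility checks ... */

    UnregisterSnapshot(snap);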
2025-03-11 | Show index search count in EXPLAIN ANALYZE, take 2. | Peter Geoghegan
Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan/index-only scan/bitmap index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index searches can be unpredictable due to implementation details that interact with physical index characteristics (at least with nbtree SAOP scans, since Postgres 17 commit 5bf748b8). The information shown also provides useful context when EXPLAIN ANALYZE runs a plan with an index scan node that successfully applied the skip scan optimization (set to be added to nbtree by an upcoming patch). The instrumentation works by teaching all index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs already increment the pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). Parallel queries have workers copy their local counter struct into shared memory when an index scan node ends -- even when it isn't a parallel aware scan node. An earlier version of this patch that only worked with parallel aware scans became commit 5ead85fb (though that was quickly reverted by commit d00107cd following "debug_parallel_query=regress" buildfarm failures). Our approach doesn't match the approach used when tracking other index scan related costs (e.g., "Rows Removed by Filter:"). It is comparable to the approach used in similar cases involving costs that are only readily accessible inside an access method, not from the executor proper (e.g., "Heap Blocks:" output for a Bitmap Heap Scan, which was recently enhanced to show per-worker costs by commit 5a1e6df3, using essentially the same scheme as the one used here). It is necessary for index AMs to have direct responsibility for maintaining the new counter, since the counter might need to be incremented multiple times per amgettuple call (or per amgetbitmap call). But it is also necessary for the executor proper to manage the shared memory now used to transfer each worker's counter struct to the leader. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com
2025-03-03 | Add relallfrozen to pg_class | Melanie Plageman
Add relallfrozen, an estimate of the number of pages marked all-frozen in the visibility map. pg_class already has relallvisible, an estimate of the number of pages in the relation marked all-visible in the visibility map. This is used primarily for planning. relallfrozen, together with relallvisible, is useful for estimating the outstanding number of all-visible but not all-frozen pages in the relation for the purposes of scheduling manual VACUUMs and tuning vacuum freeze parameters. A future commit will use relallfrozen to trigger more frequent vacuums on insert-focused workloads with significant volume of frozen data. Bump catalog version Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_aj-P7YyBz_cPNwztz6ohP%2BvWis%3Diz3YcomkB3NpYA--w%40mail.gmail.com
2025-02-24 | Delay extraction of TIDBitmap per page offsets | Melanie Plageman
Pages from the bitmap created by the TIDBitmap API can be exact or lossy. The TIDBitmap API extracts the tuple offsets from exact pages into an array for the convenience of the caller. This was done in tbm_private|shared_iterate() right after advancing the iterator. However, as long as tbm_private|shared_iterate() set a reference to the PagetableEntry in the TBMIterateResult, the offset extraction can be done later. Waiting to extract the tuple offsets has a few benefits. For the shared iterator case, it allows us to extract the offsets after dropping the shared iterator state lock, reducing time spent holding a contended lock. Separating the iteration step and extracting the offsets later also allows us to avoid extracting the offsets for prefetched blocks. Those offsets were never used, so the overhead of extracting and storing them was wasted. The real motivation for this change, however, is that future commits will make bitmap heap scan use the read stream API. This requires a TBMIterateResult per issued block. By removing the array of tuple offsets from the TBMIterateResult and only extracting the offsets when they are used, we reduce the memory required for per buffer data substantially. Suggested-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLHbKP3jwJ6_%2BhnGi37Pw3BD5j2amjV3oSk7j-KyCnY7Q%40mail.gmail.com
2025-02-24 | Add lossy indicator to TBMIterateResult | Melanie Plageman
TBMIterateResult->ntuples is -1 when the page in the bitmap is lossy. Add an explicit lossy indicator so that we can move ntuples out of the TBMIterateResult in a future commit. Discussion: https://postgr.es/m/CA%2BhUKGLHbKP3jwJ6_%2BhnGi37Pw3BD5j2amjV3oSk7j-KyCnY7Q%40mail.gmail.com
2025-02-18 | Reduce scope of heap vacuum per_buffer_data | Melanie Plageman
Move lazy_scan_heap()'s per_buffer_data variable into a tighter scope. In lazy_scan_heap()'s phase I heap vacuuming, the read stream API returns a pointer to the next block number to vacuum. As long as read_stream_next_buffer() returns a valid buffer, per_buffer_data should always be valid. Move per_buffer_data into a tighter scope and make sure it is reset to NULL on each iteration so that we get a core dump instead of bogus data from a previous block if something goes wrong in the read stream API. Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/626104.1739729538%40sss.pgh.pa.us
2025-02-17 | Add information about WAL buffers full to VACUUM/ANALYZE (VERBOSE) | Michael Paquier
This commit adds the information about the number of times WAL buffers have been full to the logs generated by VACUUM/ANALYZE (VERBOSE) and in the logs generated by autovacuum, complementing the existing information stored by WalUsage. This is the last part of the backend code where the value of wal_buffers_full can be reported, similarly to all the other fields of WalUsage. 320545bfcfee and ce5bcc4a9f26 have done the same for EXPLAIN and pgss. Author: Bertrand Drouvot Reviewed-by: Ilia Evdokimov Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal
2025-02-14 | Add delay time to VACUUM/ANALYZE (VERBOSE) and autovacuum logs. | Nathan Bossart
Commit bb8dff9995 added this information to the pg_stat_progress_vacuum and pg_stat_progress_analyze system views. This commit adds the same information to the output of VACUUM and ANALYZE with the VERBOSE option and to the autovacuum logs. Suggested-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
2025-02-14 | Use streaming read I/O in VACUUM's third phase | Melanie Plageman
Make vacuum's third phase (its second pass over the heap), which reaps dead items collected in the first phase and marks them as reusable, use the read stream API. This commit adds a new read stream callback, vacuum_reap_lp_read_stream_next(), that looks ahead in the TidStore and returns the next block number to read for vacuum. Author: Melanie Plageman <melanieplageman@gmail.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGKN3oy0bN_3yv8hd78a4%2BM1tJC9z7mD8%2Bf%2ByA%2BGeoFUwQ%40mail.gmail.com
2025-02-14 | Use streaming read I/O in VACUUM's first phase | Melanie Plageman
Make vacuum's first phase, which prunes and freezes tuples and records dead TIDs, use the read stream API by converting heap_vac_scan_next_block() to a read stream callback. Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CAAKRu_aLwANZpxHc0tC-6OT0OQT4TftDGkKAO5yigMUOv_Tcsw%40mail.gmail.com
2025-02-14 | Convert heap_vac_scan_next_block() boolean parameters to flags | Melanie Plageman
The read stream API only allows one piece of extra per block state to be passed back to the API user (per_buffer_data). lazy_scan_heap() needs two pieces of per-buffer data: whether or not the block was all-visible in the visibility map and whether or not it was eagerly scanned. Convert these two pieces of information to flags so that they can be populated by heap_vac_scan_next_block() and returned to lazy_scan_heap(). A future commit will turn heap_vac_scan_next_block() into the read stream callback for heap phase I vacuuming. Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CAAKRu_bmx33jTqATP5GKNFYwAg02a9dDtk4U_ciEjgBHZSVkOQ%40mail.gmail.com
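A sketch of the encoding (flag names assumed from the eager-scanning work, not quoted from the patch):

    #define VAC_BLK_WAS_EAGER_SCANNED           (1 << 0)
    #define VAC_BLK_ALL_VISIBLE_ACCORDING_TO_VM (1 << 1)

    uint8 blk_info = 0;

    if (all_visible_according_to_vm)
        blk_info |= VAC_BLK_ALL_VISIBLE_ACCORDING_TO_VM;
    if (was_eager_scanned)
        blk_info |= VAC_BLK_WAS_EAGER_SCANNED;
    /* the single byte travels back through the stream's per_buffer_data */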
2025-02-13 | Remove unnecessary (char *) casts [xlog] | Peter Eisentraut
Remove (char *) casts no longer needed after XLogRegisterData() and XLogRegisterBufData() argument type change. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
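A representative before/after (the record type is chosen for illustration):

    XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);  /* before */
    XLogRegisterData(&xlrec, SizeOfHeapDelete);           /* after */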
2025-02-12 | Remove unnecessary (char *) casts [mem] | Peter Eisentraut
Remove (char *) casts around memory functions such as memcmp(), memcpy(), or memset() where the cast is useless. Since these functions don't take char * arguments anyway, these casts are at best complicated casts to (void *), about which see commit 7f798aca1d5. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
2025-02-11 | Add is_analyze parameter to vacuum_delay_point(). | Nathan Bossart
This function is used in both vacuum and analyze code paths, and a follow-up commit will require distinguishing between the two. This commit forces callers to specify whether they are in a vacuum or analyze path, but it does not use that information for anything yet. Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
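Call sites now look like this (sketch):

    vacuum_delay_point(false);  /* from a VACUUM code path */
    vacuum_delay_point(true);   /* from an ANALYZE code path */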
2025-02-11 | Eagerly scan all-visible pages to amortize aggressive vacuum | Melanie Plageman
Aggressive vacuums must scan every unfrozen tuple in order to advance the relfrozenxid/relminmxid. Because data is often vacuumed before it is old enough to require freezing, relations may build up a large backlog of pages that are set all-visible but not all-frozen in the visibility map. When an aggressive vacuum is triggered, all of these pages must be scanned. These pages have often been evicted from shared buffers and even from the kernel buffer cache. Thus, aggressive vacuums often incur large amounts of extra I/O at the expense of foreground workloads. To amortize the cost of aggressive vacuums, eagerly scan some all-visible but not all-frozen pages during normal vacuums. All-visible pages that are eagerly scanned and set all-frozen in the visibility map are counted as successful eager freezes and those not frozen are counted as failed eager freezes. If too many eager scans fail in a row, eager scanning is temporarily suspended until a later portion of the relation. The number of failures tolerated is configurable globally and per table. To effectively amortize aggressive vacuums, we cap the number of successes as well. Capping eager freeze successes also limits the amount of potentially wasted work if these pages are modified again before the next aggressive vacuum. Once we reach the maximum number of blocks successfully eager frozen, eager scanning is disabled for the remainder of the vacuum of the relation. Original design idea from Robert Haas, with enhancements from Andres Freund, Tomas Vondra, and me Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com
2025-02-07 | Virtual generated columns | Peter Eisentraut
This adds a new variant of generated columns that are computed on read (like a view, unlike the existing stored generated columns, which are computed on write, like a materialized view). The syntax for the column definition is ... GENERATED ALWAYS AS (...) VIRTUAL and VIRTUAL is also optional. VIRTUAL is the default rather than STORED to match various other SQL products. (The SQL standard makes no specification about this, but it also doesn't know about VIRTUAL or STORED.) (Also, virtual views are the default, rather than materialized views.)
Virtual generated columns are stored in tuples as null values. (A very early version of this patch had the ambition to not store them at all. But so much stuff breaks or gets confused if you have tuples where a column in the middle is completely missing. This is a compromise, and it still saves space over being forced to use stored generated columns. If we ever find a way to improve this, a bit of pg_upgrade cleverness could allow for upgrades to a newer scheme.)
The capabilities and restrictions of virtual generated columns are mostly the same as for stored generated columns. In some cases, this patch keeps virtual generated columns more restricted than they might technically need to be, to keep the two kinds consistent. Some of that could maybe be relaxed later after separate careful considerations.
Some functionality that is currently not supported, but could possibly be added as incremental features, some easier than others:
- index on or using a virtual column - hence also no unique constraints on virtual columns
- extended statistics on virtual columns
- foreign-key constraints on virtual columns
- not-null constraints on virtual columns (check constraints are supported)
- ALTER TABLE / DROP EXPRESSION
- virtual column cannot have domain type
- virtual columns are not supported in logical replication
The tests in generated_virtual.sql have been copied over from generated_stored.sql with the keyword replaced. This way we can make sure the behavior is mostly aligned, and the differences can be visible. Some tests for currently not supported features are currently commented out. Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b0a6@eisentraut.org