path: root/src/backend
Age | Commit message | Author
20 hours | Ensure sanity of hash-join costing when there are no MCV statistics. (HEAD, origin/master, origin/HEAD, master) | Tom Lane
estimate_hash_bucket_stats is defined to return zero to *mcv_freq if it cannot obtain a value for the frequency of the most common value. Its sole caller final_cost_hashjoin ignored this provision and would blindly believe the zero value, resulting in computing zero for the largest bucket size. In consequence, the safety check intended to prevent the largest bucket from exceeding get_hash_memory_limit() was ineffective, allowing very silly plans to be chosen if statistics were missing.

After fixing final_cost_hashjoin to disregard zero results for mcv_freq, a second problem appeared: some cases that should use hash joins failed to. This is because estimate_hash_bucket_stats was unaware of the fact that ANALYZE won't store MCV statistics if it doesn't find any multiply-occurring values. Thus the lack of an MCV stats entry doesn't necessarily mean that we know nothing; we may well know that the column is unique. The former coding returned zero for *mcv_freq in this case, which was pretty close to correct, but now final_cost_hashjoin doesn't believe it and disables the hash join. So check to see if there is a HISTOGRAM stats entry; if so, ANALYZE has in fact run for this column and must have found it to be unique. In that case report the MCV frequency as 1 / rows, instead of claiming ignorance.

Reporting a more accurate *mcv_freq in this case can also affect the bucket-size skew adjustment further down in estimate_hash_bucket_stats, causing hash-join cost estimates to change slightly. This affects some plan choices in the core regression tests. The first diff in join.out corresponds to a case where we have no stats and should not risk a hash join, but the remaining changes are caused by producing a better bucket-size estimate for unique join columns. Those are all harmless changes so far as I can tell.

The existing behavior was introduced in commit 4867d7f62 in v11. It appears from the commit log that disabling the bucket-size safety check in the absence of statistics was intentional; but we've now seen a case where the ensuing behavior is bad enough to make that seem like a poor decision. In any case the lack of other problems with that safety check after several years helps to justify enforcing it more strictly. However, we won't risk back-patching this, in case any applications are depending on the existing behavior.

Bug: #19363
Reported-by: Jinhui Lai <jinhui.lai@qq.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2380165.1766871097@sss.pgh.pa.us
Discussion: https://postgr.es/m/19363-8dd32fc7600a1153@postgresql.org
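To make the two changes concrete, here is a minimal C sketch of the logic described above; variable and helper names such as have_histogram and mark_path_as_too_risky are assumptions for illustration, not the committed code (only get_hash_memory_limit() is the real API):

```c
/* Sketch of the estimate_hash_bucket_stats() side: no MCV slot found. */
if (!have_mcvs)
{
    /*
     * A HISTOGRAM slot proves ANALYZE ran; with no MCV list stored, it
     * found no multiply-occurring values, so the column is effectively
     * unique and the most common value shows up about once per "rows".
     */
    if (have_histogram)                    /* assumed helper result */
        *mcv_freq = 1.0 / Max(rows, 1.0);
    else
        *mcv_freq = 0.0;                   /* truly no information */
}

/* Sketch of the final_cost_hashjoin() side: trust only nonzero values. */
if (inner_mcv_freq > 0.0 &&
    inner_mcv_freq * inner_path_rows * inner_bucket_width >
    (double) get_hash_memory_limit())
    mark_path_as_too_risky();              /* hypothetical penalty hook */
```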
35 hours | Ignore PlaceHolderVars when looking up statistics | Richard Guo
When looking up statistical data about an expression, we failed to look through PlaceHolderVar nodes, treating them as opaque. This could prevent us from matching an expression to base columns, index expressions, or extended statistics, as examine_variable() relies on strict structural matching. As a result, queries involving PlaceHolderVar nodes often fell back to default selectivity estimates, potentially leading to poor plan choices.

This patch updates examine_variable() to strip PlaceHolderVars before analysis. This is safe during estimation because PlaceHolderVars are transparent for the purpose of statistics lookup: they do not alter the value distribution of the underlying expression. To minimize performance overhead on this hot path, a lightweight walker first checks for the presence of PlaceHolderVars. The more expensive mutator is invoked only when necessary.

There is one ensuing plan change in the regression tests, which is expected and demonstrates the fix: the rowcount estimate becomes much more accurate with this patch.

Back-patch to v18. Although this issue exists in earlier versions as well, changes in v18 made it common enough to notice. Given the lack of field reports for older versions, I am not back-patching further.

Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
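The walker-then-mutator pattern might look roughly like this sketch (the function names here are illustrative assumptions; expression_tree_walker()/expression_tree_mutator() and the PlaceHolderVar node with its phexpr field are the real APIs):

```c
/* Cheap check: is any PlaceHolderVar present at all? */
static bool
contains_phv_walker(Node *node, void *context)
{
    if (node == NULL)
        return false;
    if (IsA(node, PlaceHolderVar))
        return true;
    return expression_tree_walker(node, contains_phv_walker, context);
}

/* Expensive pass: copy the tree, replacing each PHV with its phexpr. */
static Node *
strip_phv_mutator(Node *node, void *context)
{
    if (node == NULL)
        return NULL;
    if (IsA(node, PlaceHolderVar))
        return strip_phv_mutator((Node *) ((PlaceHolderVar *) node)->phexpr,
                                 context);
    return expression_tree_mutator(node, strip_phv_mutator, context);
}

/* Only pay for the copying mutator when a PHV is actually present. */
if (contains_phv_walker(node, NULL))
    node = strip_phv_mutator(node, NULL);
```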
35 hours | Strip PlaceHolderVars from index operands | Richard Guo
When pulling up a subquery, we may need to wrap its targetlist items in PlaceHolderVars to enforce separate identity or as a result of outer joins. However, this causes any upper-level WHERE clauses referencing these outputs to contain PlaceHolderVars, which prevents indxpath.c from recognizing that they could be matched to index columns or index expressions, potentially affecting the planner's ability to use indexes.

To fix, explicitly strip PlaceHolderVars from index operands. A PlaceHolderVar appearing in a relation-scan-level expression is effectively a no-op. Nevertheless, to play it safe, we strip only PlaceHolderVars that are not marked nullable. The stripping is performed recursively to handle cases where PlaceHolderVars are nested or interleaved with other node types.

To minimize performance impact, we first use a lightweight walker to check for the presence of strippable PlaceHolderVars. The expensive mutator is invoked only if a candidate is found, avoiding unnecessary memory allocation and tree copying in the common case where no PlaceHolderVars are present.

Back-patch to v18. Although this issue exists in earlier versions as well, changes in v18 made it common enough to notice. Given the lack of field reports for older versions, I am not back-patching further.

Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
2 days | Change some Datum to void * for opaque pass-through pointer | Peter Eisentraut
Here, Datum was used to pass around an opaque pointer between a group of functions. But one might as well use void * for that; the use of Datum doesn't achieve anything here and is just distracting.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/1c5d23cb-288b-4154-b1cd-191fe2301707%40eisentraut.org
3 days | Split some long Makefile lists | Michael Paquier
This change makes code diffs more readable when adding new items or removing old ones, while ensuring that lines do not get excessively long. Some SUBDIRS, PROGRAMS and REGRESS lists are split. Note that there are a few more REGRESS lists that could be split, particularly in contrib/.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Co-Authored-By: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/DF6HDGB559U5.3MPRFCWPONEAE@jeltef.nl
3 days | Fix pg_stat_get_backend_activity() to use multi-byte truncated result | Michael Paquier
pg_stat_get_backend_activity() calls pgstat_clip_activity() to ensure that the reported query string is correctly truncated when it ends with an incomplete multi-byte sequence. However, the result returned by the function was not what pgstat_clip_activity() generated, but the original, non-truncated contents of PgBackendStatus.st_activity_raw.

Oversight in 54b6cd589ac2, so backpatch all the way down.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2mDzwc48q2EK9tSXS6iJMJ35wvxNQnHX+rXjy5VgLvJQw@mail.gmail.com
Backpatch-through: 14
5 days | Upgrade BufFile to use int64 for byte positions | Michael Paquier
This change has the advantage of removing some weird type casts, caused by offset calculations based on pgoff_t but saved as int (on older branches we use off_t, which could be 4 or 8 bytes depending on the environment). These are safe currently because they are capped by MAX_PHYSICAL_FILESIZE, but we would run into problems if MAX_PHYSICAL_FILESIZE were made larger, or if callers of these routines were allowed to use a larger physical max size on demand.

While on it, this improves BufFileDumpBuffer() so that we do not use an offset for "availbytes". It is not a file offset per se, but a number of available bytes.

This change should lead to no functional changes.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
5 days | Fix typo in stat_utils.c | Michael Paquier
Introduced by 213a1b895270.

Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNku-jz-FPKeJVk25fZ1pV2buYh5vpeqGDOB=bFQhKxXhw@mail.gmail.com
5 days | Move attribute statistics functions to stat_utils.c | Michael Paquier
Many of the operations done for attribute stats in attribute_stats.c share the same logic as extended stats, as done by a patch under discussion to add support for extended stats import and export. All the pieces necessary for extended statistics are moved to stat_utils.c, which is the file where common facilities for stats are shared. The following renames are done:

* get_attr_stat_type() -> statatt_get_type()
* init_empty_stats_tuple() -> statatt_init_empty_tuple()
* set_stats_slot() -> statatt_set_slot()
* get_elem_stat_type() -> statatt_get_elem_type()

While on it, this commit adds more documentation for all these functions, describing their internals in more detail and the dependencies implied for attribute statistics. The same concepts apply to extended statistics, to some degree.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
5 days | Fix planner error with SRFs and grouping sets | Richard Guo
If there are any SRFs in a PathTarget, we must separate it into SRF-computing and SRF-free targets. This is because the executor can only handle SRFs that appear at the top level of the targetlist of a ProjectSet plan node. If we find a subexpression that matches an expression already computed in the previous plan level, we should treat it like a Var and should not split it again. setrefs.c will later replace the expression with a Var referencing the subplan output.

However, when processing the grouping target for grouping sets, the planner can fail to recognize that an expression is already computed in the scan/join phase. The root cause is a mismatch in the nullingrels bits. Expressions in the grouping target carry the grouping nulling bit in their nullingrels to indicate that they can be nulled by the grouping step. However, the corresponding expressions in the scan/join target do not have these bits. As a result, the exact match check in list_member() fails, leading the planner to incorrectly believe that the expression needs to be re-evaluated from its arguments, which are often not available in the subplan. This can lead to planner errors such as "variable not found in subplan target list".

To fix, ignore the grouping nulling bit when checking whether an expression from the grouping target is available in the pre-grouping input target. This aligns with the matching logic in setrefs.c.

Backpatch to v18, where this issue was introduced.

Bug: #19353
Reported-by: Marian MULLER REBEYROL <marian.muller@serli.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/19353-aaa179bba986a19b@postgresql.org
Backpatch-through: 18
6 days | Fix CREATE SUBSCRIPTION failure when the publisher runs on pre-PG19. | Fujii Masao
CREATE SUBSCRIPTION with copy_data=true and origin='none' previously failed when the publisher was running a version earlier than PostgreSQL 19, even though this combination should be supported. The failure occurred because the command issued a query calling the pg_get_publication_sequences() function on the publisher. That function does not exist before PG19, and the query is only needed for logical replication sequence synchronization, which is supported starting in PG19.

This commit fixes this issue by skipping that query when the publisher runs a version earlier than PG19.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com
6 days | Fix version check for retain_dead_tuples subscription option. | Fujii Masao
The retain_dead_tuples subscription option is supported only when the publisher runs PostgreSQL 19 or later. However, it could previously be enabled even when the publisher was running an earlier version. This was caused by check_pub_dead_tuple_retention() comparing the publisher server version against 19000 instead of 190000. Fix this typo so that the version check correctly enforces the PG19+ requirement.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com
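Since PostgreSQL 10, server version numbers are encoded as major * 10000 + minor, so the first PG19 release reports 190000, and any modern server is far above 19000. A standalone illustration of why the typo let every publisher pass the check:

```c
#include <stdio.h>

int
main(void)
{
    int server_version_num = 160004;    /* e.g. a PostgreSQL 16.4 publisher */

    if (server_version_num >= 19000)    /* buggy check: always true today */
        printf("buggy check: pre-19 publisher wrongly accepted\n");

    if (server_version_num >= 190000)   /* fixed check: PG19 == 190000 */
        printf("fixed check: accepted\n");
    else
        printf("fixed check: pre-19 publisher correctly rejected\n");
    return 0;
}
```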
6 days | Teach expr_is_nonnullable() to handle more expression types | Richard Guo
Currently, the function expr_is_nonnullable() checks only Const and Var expressions to determine if an expression is non-nullable. This patch extends the detection logic to handle more expression types. This can enable several downstream optimizations, such as reducing NullTest quals to constant truth values (e.g., "COALESCE(var, 1) IS NULL" becomes FALSE) and converting "COUNT(expr)" to the more efficient "COUNT(*)" when the expression is proven non-nullable.

This breaks a test case in test_predtest.sql, since we now simplify "ARRAY[] IS NULL" to constant FALSE, preventing it from weakly refuting a strict ScalarArrayOpExpr ("x = any(ARRAY[])"). To ensure the refutation logic is still exercised as intended, wrap the array argument in opaque_array().

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
6 days | Optimize ROW(...) IS [NOT] NULL using non-nullable fields | Richard Guo
We break ROW(...) IS [NOT] NULL into separate tests on its component fields. During this breakdown, we can improve efficiency by utilizing expr_is_nonnullable() to detect fields that are provably non-nullable.

If a component field is proven non-nullable, it affects the outcome based on the test type. For an IS NULL test, a single non-nullable field refutes the whole NullTest, reducing it to constant FALSE. For an IS NOT NULL test, the check for that specific field is guaranteed to succeed, so we can discard it from the list of component tests.

This extends the existing optimization logic, which previously only handled Const fields, to support any expression that can be proven non-nullable.

In passing, update the existing constant folding of NullTests to use expr_is_nonnullable() instead of var_is_nonnullable(), enabling it to benefit from future improvements to that function.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
6 days | Simplify COALESCE expressions using non-nullable arguments | Richard Guo
The COALESCE function returns the first of its arguments that is not null. When an argument is proven non-null, if it is the first non-null-constant argument, the entire COALESCE expression can be replaced by that argument. If it is a subsequent argument, all following arguments can be dropped, since they will never be reached.

Currently, we perform this simplification only for Const arguments. This patch extends the simplification to support any expression that can be proven non-nullable. This can help avoid the overhead of evaluating unreachable arguments. It can also lead to better plans when the first argument is proven non-nullable and replaces the expression, as the planner no longer has to treat the expression as non-strict, and can also leverage index scans on the resulting expression.

There is an ensuing plan change in generated_virtual.out, and we have to modify the test to ensure that it continues to test what it is intended to.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
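A hedged sketch of the argument-dropping loop, using PostgreSQL's List API; expr_is_nonnullable() is named in the commit text (its exact signature is assumed here), and is_null_constant() is a hypothetical helper:

```c
/* Sketch of the extended COALESCE simplification; not the committed code. */
List       *newargs = NIL;
ListCell   *lc;

foreach(lc, coalesceexpr->args)
{
    Node       *arg = (Node *) lfirst(lc);

    if (is_null_constant(arg))          /* assumed helper */
        continue;                       /* NULL constants contribute nothing */

    newargs = lappend(newargs, arg);

    if (expr_is_nonnullable(root, (Expr *) arg))
        break;                          /* later arguments are unreachable */
}

/*
 * COALESCE(x) is just x, so a single surviving argument replaces the
 * whole expression; an empty list would fold to a NULL constant.
 */
if (list_length(newargs) == 1)
    return (Node *) linitial(newargs);
```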
6 days | Improve comment in pgstatfuncs.c | Michael Paquier
Author: Zizhen Qiao <zizhen_qiao@163.com>
Discussion: https://postgr.es/m/5ee635f9.49f7.19b4ed9e803.Coremail.zizhen_qiao@163.com
6 days | Don't advance origin during apply failure. | Amit Kapila
The logical replication parallel apply worker could incorrectly advance the origin progress during an error or failed apply. This behavior risks transaction loss because such transactions will not be resent by the server.

Commit 3f28b2fcac addressed a similar issue for both the apply worker and the table sync worker by registering a before_shmem_exit callback to reset origin information. This prevents the worker from advancing the origin during transaction abortion on shutdown. This patch applies the same fix to the parallel apply worker, ensuring consistent behavior across all worker types.

As with 3f28b2fcac, we are backpatching through version 16, since parallel apply mode was introduced there and the issue only occurs when changes are applied before the transaction end record (COMMIT or ABORT) is received.

Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16
Discussion: https://postgr.es/m/TY4PR01MB169078771FB31B395AB496A6B94B4A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com
7 days | Toggle logical decoding dynamically based on logical slot presence. | Masahiko Sawada
Previously, logical decoding required wal_level to be set to 'logical' at server start. This meant that users had to incur the overhead of logical-level WAL logging even when no logical replication slots were in use.

This commit adds functionality to automatically control logical decoding availability based on logical replication slot presence. The newly introduced module logicalctl.c allows logical decoding to be activated on demand when wal_level is set to 'replica'. When the first logical replication slot is created, the system automatically increases the effective WAL level to maintain logical-level WAL records. Conversely, after the last logical slot is dropped or invalidated, it decreases back to the 'replica' WAL level.

While activation occurs synchronously right after creating the first logical slot, deactivation happens asynchronously through the checkpointer process. This design avoids a race condition at the end of recovery; a concurrent deactivation could happen while the startup process enables logical decoding at the end of recovery, but WAL writes are still not permitted until recovery fully completes. The checkpointer will handle it after recovery is done. Asynchronous deactivation also avoids excessive toggling of the logical decoding status in workloads that repeatedly create and drop a single logical slot. On the other hand, this lazy approach can delay changes to effective_wal_level and the disabling of logical decoding, especially when the checkpointer is busy with other tasks. We chose this lazy approach in all deactivation paths to keep the implementation simple, even though laziness is strictly required only for end-of-recovery cases. Future work might address this limitation either by using a dedicated worker instead of the checkpointer, or by implementing synchronous waiting during slot drops if workloads are significantly affected by the lazy deactivation of logical decoding.

The effective WAL level, determined internally by XLogLogicalInfo, is allowed to change within a transaction until an XID is assigned. Once an XID is assigned, the value becomes fixed for the remainder of the transaction. This behavior ensures that the logging mode remains consistent within a writing transaction, similar to the behavior of GUC parameters.

A new read-only GUC parameter effective_wal_level is introduced to monitor the actual WAL level in effect. This parameter reflects the current operational WAL level, which may differ from the configured wal_level setting.

Bump PG_CONTROL_VERSION as this adds a new field to the CheckPoint struct.

Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com
7 days | Fix bug in following update chain when locking a heap tuple | Heikki Linnakangas
After waiting for a concurrent updater to finish, heap_lock_tuple() followed the update chain to lock all tuple versions. However, when stepping from the initial tuple to the next one, it failed to check that the next tuple's XMIN matches the initial tuple's XMAX. That's an important check whenever following an update chain, and the recursive part that follows the chain did it, but the initial step missed it. Without the check, if the updating transaction aborts and the updated tuple is vacuumed away and replaced by an unrelated tuple, that unrelated tuple might get incorrectly locked.

Author: Jasper Smit <jasper.smit@servicenow.com>
Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com
Backpatch-through: 14
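A hedged sketch of the missing check; the macros (TransactionIdEquals, HeapTupleHeaderGetXmin, HeapTupleHeaderGetUpdateXid) are PostgreSQL's real ones, while the variable names and exit path are simplified assumptions:

```c
/*
 * Before locking the next version of the row, verify that it really
 * continues this update chain: its XMIN must equal the XMAX (the
 * updater's XID) of the tuple we just waited on.  If the updater
 * aborted and the tuple's slot was reused, the match fails.
 */
if (!TransactionIdEquals(HeapTupleHeaderGetXmin(nexttup->t_data),
                         HeapTupleHeaderGetUpdateXid(tuple->t_data)))
{
    /* chain broken: the successor tuple is unrelated, do not lock it */
    result = TM_Ok;     /* assumption: treat as nothing left to lock */
    goto done;          /* hypothetical exit label */
}
```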
7 days | Fix orphaned origin in shared memory after DROP SUBSCRIPTION | Michael Paquier
Since ce0fdbfe9722, a replication slot and an origin are created by each tablesync worker, whose information is stored in both a catalog and shared memory (once the origin is set up in the latter case). The transaction where the origin is created is the same as the one that runs the initial COPY, with the catalog state of the origin becoming visible to other sessions only once the COPY transaction has committed. The catalog state is coupled with a state in shared memory, initialized at the same time as the origin created in the catalogs. Note that the transaction doing the initial data sync can take a long time, time that depends on the amount of data to transfer from a publication node to its subscriber node.

Now, when a DROP SUBSCRIPTION is executed, all its workers are stopped with the origins removed. The removal of each origin relies on a catalog lookup. A worker still running the initial COPY would fail its transaction, with the catalog state of the origin rolled back while the shared memory state remains around. The session running the DROP SUBSCRIPTION should be in charge of cleaning up the catalog and the shared memory state, but as there is no data in the catalogs the shared memory state is not removed.

This issue would leave orphaned origin data in shared memory, leading to a confusing state as it would still show up in pg_replication_origin_status. Note that this shared memory data is sticky, being flushed to disk by replorigin_checkpoint at checkpoint. This prevents other origins from reusing a slot position in the shared memory data.

To address this problem, the commit moves the creation of the origin to the end of the transaction that precedes the one executing the initial COPY, making the origin immediately visible in the catalogs for other sessions and giving DROP SUBSCRIPTION a way to know about it. A different solution would have been to clean up the shared memory state using an abort callback within the tablesync worker. The solution of this commit is more consistent with the apply worker, which creates an origin in a short transaction.

A test is added in the subscription test 004_sync.pl, which was able to demonstrate the problem. The test fails when this commit is reverted.

Reported-by: Tenglong Gu <brucegu@amazon.com>
Reported-by: Daisuke Higuchi <higudai@amazon.com>
Analyzed-by: Michael Paquier <michael@paquier.xyz>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/aUTekQTg4OYnw-Co@paquier.xyz
Backpatch-through: 14
8 days | Switch buffile.c/h to use pgoff_t instead of off_t | Michael Paquier
off_t was previously used for offsets, which is 4 bytes on Windows, hence imposing a hard 2GB file-size limit on the backend code. This leads to some simplification in these files, removing some casts based on long, also 4 bytes on Windows. This commit removes one comment introduced in db3c4c3a2d98, which is no longer relevant as pgoff_t is a safe 8-byte alternative on Windows.

This change is surprisingly not invasive, as the callers of BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c, tuplestore.c, etc.) track offsets in local structures that, for the most part, just need to switch from off_t to pgoff_t.

The file is still relying on a maximum file size of MAX_PHYSICAL_FILESIZE (1GB). This change allows the code to make this maximum potentially larger in the future, or larger on demand.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
8 days | Fix another typo in gininsert.c | Michael Paquier
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNkRJ9DMFZMQKWQ32U+OTBR78KeGh2=9Wy5jEeWDxMVFcQ@mail.gmail.com
9 days | Remove obsolete name_ops index-only scan comments. | Peter Geoghegan
nbtree index-only scans of an index that uses btree/name_ops as one of its index columns' input opclasses are no longer at any risk of reading past the end of currTuples. We're no longer reliant on such scans being able to at least read from the start of markTuples storage (which uses space from the same allocation as currTuples) to avoid a segfault: StoreIndexTuple (from nodeIndexonlyscan.c) won't actually read past the end of a cstring datum from a name_ops index. In other words, we already have the "special-case treatment for name_ops" that the removed comment supposed we could avoid by relying on markTuples in this way.

Oversight in commit a63224be49, which added special case handling of name_ops cstrings to StoreIndexTuple, but missed these comments.
11 days | heapam: Move logic to handle HEAP_MOVED into a helper function | Andres Freund
Before, we dealt with this in six nearly identical copies and one very similar one.

The helper function errors out when encountering a HEAP_MOVED_IN/HEAP_MOVED_OUT tuple with xvac considered current or in-progress. It'd be preferable to make that change separately, but otherwise it'd not be possible to deduplicate the handling in HeapTupleSatisfiesVacuum().

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
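A hypothetical shape for such a helper, based on the description above; the helper's name and error text are assumptions, while HEAP_MOVED, HeapTupleHeaderGetXvac() and the transaction-status checks are real APIs:

```c
static inline bool
HeapTupleCheckMoved(HeapTupleHeader tuple)      /* hypothetical name */
{
    TransactionId xvac;

    if (!(tuple->t_infomask & HEAP_MOVED))
        return false;               /* common case: not an old-style move */

    xvac = HeapTupleHeaderGetXvac(tuple);

    /* Pre-9.0 VACUUM FULL moves cannot still be in flight today. */
    if (TransactionIdIsCurrentTransactionId(xvac) ||
        TransactionIdIsInProgress(xvac))
        elog(ERROR, "HEAP_MOVED tuple with in-progress xvac %u", xvac);

    return true;    /* caller resolves HEAP_MOVED_IN vs HEAP_MOVED_OUT */
}
```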
11 days | bufmgr: Optimize & harmonize LockBufHdr(), LWLockWaitListLock() | Andres Freund
The main optimization is for LockBufHdr() to delay initializing SpinDelayStatus, similar to what LWLockWaitListLock() already did. The initialization is sufficiently expensive, and buffer header lock acquisitions sufficiently frequent, to make it worthwhile to instead have a fastpath (via a likely() branch) that does not initialize the SpinDelayStatus.

While LWLockWaitListLock() already had the aforementioned optimization, it did not use likely(), and inspection of the assembly shows that this indeed leads to worse code generation (also observed in a microbenchmark). Fix that by adding the likely().

While the LockBufHdr() improvement is a small gain on its own, it is mainly aimed at preventing a regression after a future commit, which requires additional locking to set hint bits.

While touching both, also make the comments more similar to each other.

Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
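The delayed-initialization fastpath reads roughly like this hedged sketch (modeled on the in-tree spin-delay API; the function name is an illustrative stand-in, not the committed diff):

```c
uint32
LockBufHdrSketch(BufferDesc *desc)      /* illustrative stand-in */
{
    uint32      old_buf_state;

    /* Fastpath: the uncontended case pays nothing for SpinDelayStatus. */
    old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
    if (likely(!(old_buf_state & BM_LOCKED)))
        return old_buf_state | BM_LOCKED;

    /* Slow path: only now initialize the spin-delay machinery. */
    {
        SpinDelayStatus delayStatus;

        init_local_spin_delay(&delayStatus);
        while (old_buf_state & BM_LOCKED)
        {
            perform_spin_delay(&delayStatus);
            old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        }
        finish_spin_delay(&delayStatus);
        return old_buf_state | BM_LOCKED;
    }
}
```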
11 days | Use proper type for RestoreTransactionSnapshot's PGPROC arg | Heikki Linnakangas
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/08cbaeb5-aaaf-47b6-9ed8-4f7455b0bc4b@iki.fi
11 days | Fix typos in gininsert.c | Michael Paquier
Introduced by 8492feb98f6d.

Author: Xingbin She <xingbin.she@qq.com>
Discussion: https://postgr.es/m/tencent_C254AE962588605F132DB4A6F87205D6A30A@qq.com
11 days | Add guard to prevent recursive memory context logging. | Fujii Masao
Previously, if memory context logging was triggered repeatedly and rapidly while a previous request was still being processed, it could result in recursive calls to ProcessLogMemoryContextInterrupt(). This could lead to infinite recursion and potentially crash the process.

This commit adds a guard to prevent such recursion. If ProcessLogMemoryContextInterrupt() is already in progress and logging memory contexts, subsequent calls will exit immediately, avoiding unintended recursive calls. While this scenario is unlikely in practice, it's not impossible, so the safety check is worth having.

Back-patch to v14, where memory context logging was introduced.

Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Artem Gavrilov <artem.gavrilov@percona.com>
Discussion: https://postgr.es/m/CA+TgmoZMrv32tbNRrFTvF9iWLnTGqbhYSLVcrHGuwZvCtph0NA@mail.gmail.com
Backpatch-through: 14
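Such a guard presumably amounts to a simple reentrancy flag around the logging work; a hedged sketch (PG_TRY/PG_FINALLY are the real error-handling macros, the flag and the elided body are assumptions):

```c
void
ProcessLogMemoryContextInterrupt(void)
{
    static bool in_progress = false;    /* assumption: per-backend guard */

    if (in_progress)
        return;                 /* already dumping contexts: bail out */
    in_progress = true;

    PG_TRY();
    {
        /* ... walk and log the memory context tree ... */
    }
    PG_FINALLY();
    {
        in_progress = false;    /* re-arm even if logging errored out */
    }
    PG_END_TRY();
}
```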
12 days | Use table/index_close() more consistently | Michael Paquier
All the code paths updated here have been using relation_close() to close a relation that has already been opened with table_open() or index_open(), where a relkind check is enforced. table_close() and index_close() do the same thing as relation_close(), so there was no harm, but being inconsistent could lead to issues if the internals of these close() functions begin to introduce some logic specific to each relkind in the future.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aUKamYGiDKO6byp5@ip-10-97-1-34.eu-west-3.compute.internal
12 days | Do not emit WAL for unlogged BRIN indexes | Heikki Linnakangas
Operations on unlogged relations should not be WAL-logged. The brin_initialize_empty_new_buffer() function didn't get the memo. The function is only called when a concurrent update to a brin page uses up space that we're just about to insert to, which makes it pretty hard to hit. If you do manage to hit it, a full-page WAL record is erroneously emitted for the unlogged index. If you then crash, crash recovery will fail on that record with an error like this:

FATAL: could not create file "base/5/32819": File exists

Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/CALdSSPhpZXVFnWjwEBNcySx_vXtXHwB2g99gE6rK0uRJm-3GgQ@mail.gmail.com
Backpatch-through: 14
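The usual fix pattern is to guard WAL emission with RelationNeedsWAL(); a hedged sketch of what the function would need (RelationNeedsWAL() and log_newpage_buffer() are real APIs, the variable names are assumed, and this is not the committed diff):

```c
/*
 * Emit the full-page image only for relations that are WAL-logged;
 * unlogged and temporary indexes must skip this.
 */
if (RelationNeedsWAL(idxrel))
    (void) log_newpage_buffer(buf, true);
```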
13 days | Change pgstat_report_vacuum() to use Relation | Michael Paquier
This change makes pgstat_report_vacuum() more consistent with pgstat_report_analyze(), which also uses a Relation. This enforces a policy that callers of this routine should open and lock the relation whose statistics are updated before calling this routine. We are unlikely to have many callers of this routine in the tree, but it seems like a good idea to imply this requirement in the long run.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUEA6UZZkDCQFgSA@ip-10-97-1-34.eu-west-3.compute.internal
14 days | Remove useless code in InjectionPointAttach() | Michael Paquier
strlcpy() ensures that a target string is zero-terminated, so there is no need to enforce it a second time in this code. This simplification could have been done in 0eb23285a257.

Author: Feilong Meng <feelingmeng@foxmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/tencent_771178777C5BC17FCB7F7A1771CD1FFD5708@qq.com
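For reference, strlcpy() always NUL-terminates its destination (truncating if necessary), which is why a second manual termination is redundant. A standalone illustration, with a hypothetical string:

```c
#include <stdio.h>
#include <string.h>     /* strlcpy(): BSD/macOS; glibc 2.38+; else libbsd */

int
main(void)
{
    char buf[8];

    /* Copies at most sizeof(buf) - 1 bytes and always writes a '\0'. */
    strlcpy(buf, "injection-point-name", sizeof(buf));
    printf("'%s'\n", buf);      /* prints 'injecti' */
    return 0;
}
```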
14 days | Avoid global LC_CTYPE dependency in pg_locale_icu.c. | Jeff Davis
ICU still depends on libc for compatibility with certain historical behavior for single-byte encodings. Make the dependency explicit by holding a locale_t object when required.

We should consider a better solution in the future, such as decoding the text to UTF-32 and using u_tolower(). That would be a behavior change and require additional infrastructure though; so for now, just avoid the global LC_CTYPE dependency.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
14 days | downcase_identifier(): use method table from locale provider. | Jeff Davis
Previously, libc's tolower() was always used for lowercasing identifiers, regardless of the database locale (though only characters beyond 127 in single-byte encodings were affected). Refactor to allow each provider to supply its own implementation of identifier downcasing. For historical compatibility, when using a single-byte encoding, ICU still relies on tolower().

One minor behavior change is that, before the database default locale is initialized, it uses ASCII semantics to downcase the identifiers. Previously, it would use the postmaster's LC_CTYPE setting from the environment. While that could have some effect during GUC processing, for example, it would have been fragile to rely on the environment setting anyway. (Also, it only matters when the encoding is single-byte.)

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
14 days | Switch memory contexts in ReinitializeParallelDSM. | Robert Haas
We already do this in CreateParallelContext, InitializeParallelDSM, and LaunchParallelWorkers. I suspect the reason why the matching logic was omitted from ReinitializeParallelDSM is that I failed to realize that any memory allocation was happening here -- but shm_mq_attach does allocate, which could result in a shm_mq_handle being allocated in a shorter-lived context than the ParallelContext which points to it. That could result in a crash if the shorter-lived context is freed before the parallel context is destroyed.

As far as I am currently aware, there is no way to reach a crash using only code that is present in core PostgreSQL, but extensions could potentially trip over this. Fixing this in the back-branches appears low-risk, so back-patch to all supported versions.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
Backpatch-through: 14
Discussion: http://postgr.es/m/CAKZiRmwfVripa3FGo06=5D1EddpsLu9JY2iJOTgbsxUQ339ogQ@mail.gmail.com
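The matching logic in the other three functions is the usual context-switch bracket; a hedged sketch of what ReinitializeParallelDSM presumably gained (the choice of TopTransactionContext is an assumption based on where parallel contexts live, and the body is elided):

```c
void
ReinitializeParallelDSM(ParallelContext *pcxt)
{
    MemoryContext oldcontext;

    /*
     * Allocations below (e.g. from shm_mq_attach) must not land in a
     * context that dies before pcxt does.
     */
    oldcontext = MemoryContextSwitchTo(TopTransactionContext);

    /* ... reset error queues, shm_mq_attach() per worker, etc. ... */

    MemoryContextSwitchTo(oldcontext);
}
```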
14 days | Add explanatory comment to prune_freeze_setup() | Melanie Plageman
heap_page_prune_and_freeze() fills in PruneState->deadoffsets, the array of OffsetNumbers of dead tuples. It is returned to the caller in the PruneFreezeResult. To avoid having two copies of the array, the PruneState saves only a pointer to the array. This was a bit unusual and confusing, so add a clarifying comment.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2=jiD1nqch4JQN+odAxZSD7mRvdoHUGJYN2r6tQG_66yQ@mail.gmail.com
14 days | Fix const qualification in prune_freeze_setup() | Melanie Plageman
The const qualification of the presult argument to prune_freeze_setup() is later cast away, so it was not correct. Remove it and add a comment explaining that presult should not be modified.

Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fb97d0ae-a0bc-411d-8a87-f84e7e146488%40eisentraut.org
2025-12-16 | Separate out bytea sort support from varlena.c | John Naylor
In the wake of commit b45242fd3, bytea_sortsupport() still called out to varstr_sortsupport(). Treating bytea as a kind of text/varchar required varstr_sortsupport() to allow for the possibility of NUL bytes, but only for C collation. This was confusing. For better separation of concerns, create an independent sortsupport implementation in bytea.c.

The heuristics for bytea_abbrev_abort() remain the same as for varstr_abbrev_abort(). It's possible that the bytea case warrants different treatment, but that is left for future investigation.

In passing, adjust some strange looking comparisons in varstr_abbrev_abort().

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TP1bAbEhUJa6+rgceN6QJWMSsxhg1=mqfSN=Nb-n6DAKg@mail.gmail.com
2025-12-16 | Add TAP test to check recovery when redo LSN is missing | Michael Paquier
This commit provides test coverage for dc7c77f825d7, where the redo record and the checkpoint record finish on different WAL segments, with the start of recovery able to detect that the redo record is missing. This test uses a wait injection point placed in the critical section of a checkpoint, a method that requires not one but two wait injection points to avoid any memory allocations within the critical section of the checkpoint:

- Checkpoint run with a background psql.
- A first wait point is run by the checkpointer before the critical section, allocating the shared memory required by the DSM registry for the wait machinery in the library injection_points.
- First point is woken up.
- Second wait point is loaded before the critical section, allocating the memory to build the path to the library loaded, then run in the critical section once the checkpoint redo record has been logged.
- WAL segment is switched while waiting on the second point.
- Checkpoint completes.
- Stop cluster with immediate mode.
- The segment that includes the redo record is removed.
- Start; recovery fails as the redo record cannot be found.

The error message introduced in dc7c77f825d7 is now reduced to a FATAL, meaning that the information is still provided while making it possible to write a test for it. Nitin provided a basic version of the test, which I have enhanced to make it portable using the two points. Without dc7c77f825d7, the cluster crashes in this test, not with a PANIC but due to the pointer dereference at the beginning of recovery, the failure mentioned in the other commit.

Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
2025-12-16 | Fail recovery when missing redo checkpoint record without backup_label | Michael Paquier
This commit adds an extra check at the beginning of recovery to ensure that the redo record of a checkpoint exists before attempting WAL replay, logging a PANIC if the redo record referenced by the checkpoint record could not be found. This is the same level of failure as when a checkpoint record is missing. This check is added when a cluster is started without a backup_label, after retrieving its checkpoint record. The redo LSN used for the check is retrieved from the checkpoint record successfully read. In the case where a backup_label exists, the startup process already fails if the redo record cannot be found after reading a checkpoint record at the beginning of recovery.

Previously, the presence of the redo record was not checked. If the redo and checkpoint records were located on different WAL segments, it would be possible to miss an entire range of WAL records that should have been replayed but were just ignored. The consequences of missing the redo record depend on the version dealt with, becoming worse the older the version:

- On HEAD, v18 and v17, recovery fails with a pointer dereference at the beginning of the redo loop, as the redo record is expected but cannot be found. These versions are good students, because we detect a failure before doing anything, even if the failure is misleading in the shape of a segmentation fault, giving no information that the redo record is missing.
- In v16 and v15, problems show at the end of recovery within FinishWalRecovery(), the startup process using a buggy LSN to decide from where to start writing WAL. The cluster gets corrupted, still it is noisy about it.
- v14 and older versions are worse: a cluster gets corrupted but it is entirely silent about the matter. The missing redo record causes the startup process to skip recovery entirely, because a missing record is the same as no redo being required at all. This leads to data loss, as everything is missed between the redo record and the checkpoint record.

Note that I have tested that down to 9.4, reproducing the issue with a slightly modified version of the author's reproducer. The code is wrong since at least 9.2, but I did not look for the exact point of origin.

This problem has been found by debugging a cluster where the WAL segment including the redo record was missing due to an operator error, leading to a crash, based on an investigation in v15. Requesting archive recovery with the creation of a recovery.signal or a standby.signal even without a backup_label would mitigate the issue: if the record cannot be found in pg_wal/, the missing segment can be retrieved with a restore_command when checking that the redo record exists. This was already the case without this commit, where recovery would re-fetch the WAL segment that includes the redo record. The check introduced by this commit makes the segment be retrieved earlier, to make sure that the redo record can be found.

On HEAD, the code will be slightly changed in a follow-up commit to not rely on a PANIC, to include a test able to emulate the original problem. This is a minimal backpatchable fix, kept separate for clarity.

Reported-by: Andres Freund <andres@anarazel.de>
Analyzed-by: Andres Freund <andres@anarazel.de>
Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
Backpatch-through: 14
2025-12-15 | Allow passing a pointer to GetNamedDSMSegment()'s init callback. | Nathan Bossart
This commit adds a new "void *arg" parameter to GetNamedDSMSegment() that is passed to the initialization callback function. This is useful for reusing an initialization callback function for multiple DSM segments.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFMjh8TrT9ZhWgjVTzBDkYZi2a84BnZ8bM%2BfLPuq7Cirzg%40mail.gmail.com
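A hedged sketch of how one callback can now serve several segments; the segment names, the struct, the callback shape, and the exact parameter order are illustrative assumptions, not the committed API:

```c
typedef struct MyStats { int nslots; } MyStats;     /* illustrative */

static void
my_init_cb(void *ptr, void *arg)    /* assumed two-argument callback shape */
{
    MyStats    *stats = (MyStats *) ptr;

    stats->nslots = *(int *) arg;   /* per-segment parameter */
}

/* ... */
bool        found;
int         a_slots = 16,
            b_slots = 64;

MyStats    *a = GetNamedDSMSegment("my_ext_a", sizeof(MyStats),
                                   my_init_cb, &a_slots, &found);
MyStats    *b = GetNamedDSMSegment("my_ext_b", sizeof(MyStats),
                                   my_init_cb, &b_slots, &found);
```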
2025-12-15 | Revisit cosmetics of "For inplace update, send nontransactional invalidations." | Noah Misch
This removes a never-used CacheInvalidateHeapTupleInplace() parameter. It adds README content about inplace update visibility in logical decoding. It rewrites other comments.

Back-patch to v18, where commit 243e9b40f1b2dd09d6e5bf91ebf6e822a2cd3704 first appeared. Since this removes a CacheInvalidateHeapTupleInplace() parameter, expect a v18 ".abi-compliance-history" edit to follow. PGXN contains no calls to that function.

Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reported-by: Ilyasov Ian <ianilyasov@outlook.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Surya Poondla <s_poondla@apple.com>
Discussion: https://postgr.es/m/CA+renyU+LGLvCqS0=fHit-N1J-2=2_mPK97AQxvcfKm+F-DxJA@mail.gmail.com
Backpatch-through: 18
2025-12-15 | Correct comments of "Fix data loss at inplace update after heap_update()". | Noah Misch
This corrects commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257.

Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reported-by: Surya Poondla <s_poondla@apple.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://postgr.es/m/CA+renyWCW+_2QvXERBQ+mna6ANwAVXXmHKCA-WzL04bZRsjoBA@mail.gmail.com
2025-12-15 | Remove unused single-byte char_is_cased() API. | Jeff Davis
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-15 | Use multibyte-aware extraction of pattern prefixes. | Jeff Davis
Previously, like_fixed_prefix() used char-at-a-time logic, which forced it to be too conservative for case-insensitive matching. Introduce like_fixed_prefix_ci(), and use that for case-insensitive pattern prefixes. It uses multibyte and locale-aware logic, along with the new pg_iswcased() API introduced in 630706ced0.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-15 | Add offnum range checks to suppress compile warnings with UBSAN. | Tom Lane
Late-model gcc with -fsanitize=undefined enabled issues warnings about uses of PageGetItemId() when it can't prove that the offsetNumber is > 0. The call sites where this happens are checking that the offnum is <= PageGetMaxOffsetNumber(page), so it seems reasonable to add an explicit check that offnum >= 1 too.

While at it, rearrange the code to be less contorted and avoid duplicate checks on PageGetMaxOffsetNumber. Maybe the compiler would optimize away the duplicate logic or maybe not, but the existing coding has little to recommend it anyway.

There are multiple instances of this identical coding pattern in heapam.c and heapam_xlog.c. Current gcc only complains about two of them, but I fixed them all in the name of consistency.

Potentially this could be back-patched in the name of silencing warnings; but I think enabling UBSAN is mainly something people would do on HEAD, so for now it seems not worth the trouble.

Discussion: https://postgr.es/m/1699806.1765746897@sss.pgh.pa.us
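The rearranged call sites presumably take the shape below (a hedged sketch, with the error wording assumed; FirstOffsetNumber is 1, and PageGetItemId()/PageGetMaxOffsetNumber() are the real page APIs):

```c
ItemId      lp;

/* Check both bounds in one place, so the sanitizer can see offnum >= 1. */
if (offnum >= FirstOffsetNumber && offnum <= PageGetMaxOffsetNumber(page))
    lp = PageGetItemId(page, offnum);
else
    elog(ERROR, "invalid offset number %u", offnum);    /* wording assumed */
```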
2025-12-15 | Improve sanity checks on multixid members length | Heikki Linnakangas
In the server, check explicitly for multixids with zero members. We used to have an assertion for it; commit d4b7bde418 replaced that with more extensive runtime checks, but missed the original case of zero members.

In the upgrade code, a negative length never makes sense, so better to check for it explicitly. Commit d4b7bde418 added a similar sanity check to the corresponding server code on master, and in backbranches the 'length' is passed to palloc, which would fail with an "invalid memory alloc request size" error.

Clarify the comments on what kind of invalid entries are tolerated by the upgrade code and which ones are reported as fatal errors.

Coverity complained about 'length' in the upgrade code being tainted. That's bogus because we trust the data on disk at least to some extent, but hopefully this will silence the complaint. If not, I'll dismiss it manually.

Discussion: https://www.postgresql.org/message-id/7b505284-c6e9-4c80-a7ee-816493170abc@iki.fi
2025-12-15 | Fix typo in tablecmds.c | David Rowley
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2%3DAib%2BcatZn6wHKmz0BWe8-q10NAhpxu8wUDT19SddKNA%40mail.gmail.com
2025-12-15 | Add retry logic to pg_sync_replication_slots(). | Amit Kapila
Previously, pg_sync_replication_slots() would finish without synchronizing slots that didn't meet requirements, rather than failing outright. This could leave some failover slots unsynchronized if required catalog rows or WAL segments were missing or at risk of removal, while the standby continued removing needed data.

To address this, the function now waits for the primary slot to advance to a position where all required data is available on the standby before completing synchronization. It retries cyclically until all failover slots that existed on the primary at the start of the call are synchronized. Slots created after the function begins are not included. If the standby is promoted during this wait, the function exits gracefully and the temporary slots will be removed.

Author: Ajin Cherian <itsajin@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yilin Zhang <jiezhilove@126.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDZAA%2BgWDntpa5ucqKKba41%3DtXmoXqN3q4rpjO9cdxgQrw%40mail.gmail.com
2025-12-15 | Allow cumulative statistics to read/write auxiliary data from/to disk | Michael Paquier
Cumulative stats kinds gain the capability to write additional per-entry data when flushing the stats at shutdown, and to read this data back when loading the stats at startup. This is a good fit, for example, for variable-length data (like normalized query strings), making it possible to link the shared memory stats entries to data that is stored in a different area, like a DSA segment.

Three new optional callbacks are added to PgStat_KindInfo, available to variable-numbered stats kinds:

* to_serialized_data: writes auxiliary data for an entry.
* from_serialized_data: reads auxiliary data for an entry.
* finish: performs actions after read/write/discard operations. This is invoked after processing all the entries of a kind, allowing extensions to close file handles and clean up resources.

Stats kinds have the option to store this data in the existing pgstats file, but can as well store it in one or more additional files whose names can be built upon the entry keys. The new serialization callbacks are called once an entry key is read or written from the main stats file. A file descriptor to the main pgstats file is available in the arguments of the callbacks.

Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s9SDOu+Z6veoJCHWk+kDeTktAtC-KY9fQ9Z6BJdDUirQ@mail.gmail.com
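A hedged sketch of how a variable-numbered stats kind might opt into the new callbacks; the three callback field names come from the commit text, while the callback functions, their signatures, and the remaining fields are assumptions:

```c
static const PgStat_KindInfo my_stats_kind = {
    .name = "my_extension_stats",
    .fixed_amount = false,
    /* ... the usual entry-size and flush callbacks ... */

    /* write per-entry auxiliary data (e.g. a string kept in a DSA) */
    .to_serialized_data = my_stats_to_serialized,
    /* read it back at startup and re-link the shared entry to it */
    .from_serialized_data = my_stats_from_serialized,
    /* close side files etc. once all entries of the kind are processed */
    .finish = my_stats_serialize_finish,
};
```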