| Age | Commit message (Collapse) | Author |
|
estimate_hash_bucket_stats is defined to return zero to *mcv_freq if
it cannot obtain a value for the frequency of the most common value.
Its sole caller final_cost_hashjoin ignored this provision and would
blindly believe the zero value, resulting in computing zero for the
largest bucket size. In consequence, the safety check that intended
to prevent the largest bucket from exceeding get_hash_memory_limit()
was ineffective, allowing very silly plans to be chosen if statistics
were missing.
After fixing final_cost_hashjoin to disregard zero results for
mcv_freq, a second problem appeared: some cases that should use hash
joins failed to. This is because estimate_hash_bucket_stats was
unaware of the fact that ANALYZE won't store MCV statistics if it
doesn't find any multiply-occurring values. Thus the lack of an MCV
stats entry doesn't necessarily mean that we know nothing; we may
well know that the column is unique. The former coding returned zero
for *mcv_freq in this case, which was pretty close to correct, but now
final_cost_hashjoin doesn't believe it and disables the hash join.
So check to see if there is a HISTOGRAM stats entry; if so, ANALYZE
has in fact run for this column and must have found it to be unique.
In that case report the MCV frequency as 1 / rows, instead of claiming
ignorance.
Reporting a more accurate *mcv_freq in this case can also affect the
bucket-size skew adjustment further down in estimate_hash_bucket_stats,
causing hash-join cost estimates to change slightly. This affects
some plan choices in the core regression tests. The first diff in
join.out corresponds to a case where we have no stats and should not
risk a hash join, but the remaining changes are caused by producing
a better bucket-size estimate for unique join columns. Those are all
harmless changes so far as I can tell.
The existing behavior was introduced in commit 4867d7f62 in v11.
It appears from the commit log that disabling the bucket-size safety
check in the absence of statistics was intentional; but we've now seen
a case where the ensuing behavior is bad enough to make that seem like
a poor decision. In any case the lack of other problems with that
safety check after several years helps to justify enforcing it more
strictly. However, we won't risk back-patching this, in case any
applications are depending on the existing behavior.
Bug: #19363
Reported-by: Jinhui Lai <jinhui.lai@qq.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2380165.1766871097@sss.pgh.pa.us
Discussion: https://postgr.es/m/19363-8dd32fc7600a1153@postgresql.org
|
|
When looking up statistical data about an expression, we failed to
look through PlaceHolderVar nodes, treating them as opaque. This
could prevent us from matching an expression to base columns, index
expressions, or extended statistics, as examine_variable() relies on
strict structural matching.
As a result, queries involving PlaceHolderVar nodes often fell back to
default selectivity estimates, potentially leading to poor plan
choices.
This patch updates examine_variable() to strip PlaceHolderVars before
analysis. This is safe during estimation because PlaceHolderVars are
transparent for the purpose of statistics lookup: they do not alter
the value distribution of the underlying expression.
To minimize performance overhead on this hot path, a lightweight
walker first checks for the presence of PlaceHolderVars. The more
expensive mutator is invoked only when necessary.
There is one ensuing plan change in the regression tests, which is
expected and demonstrates the fix: the rowcount estimate becomes much
more accurate with this patch.
Back-patch to v18. Although this issue exists before that, changes in
this version made it common enough to notice. Given the lack of field
reports for older versions, I am not back-patching further.
Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
|
|
When pulling up a subquery, we may need to wrap its targetlist items
in PlaceHolderVars to enforce separate identity or as a result of
outer joins. However, this causes any upper-level WHERE clauses
referencing these outputs to contain PlaceHolderVars, which prevents
indxpath.c from recognizing that they could be matched to index
columns or index expressions, potentially affecting the planner's
ability to use indexes.
To fix, explicitly strip PlaceHolderVars from index operands. A
PlaceHolderVar appearing in a relation-scan-level expression is
effectively a no-op. Nevertheless, to play it safe, we strip only
PlaceHolderVars that are not marked nullable.
The stripping is performed recursively to handle cases where
PlaceHolderVars are nested or interleaved with other node types. To
minimize performance impact, we first use a lightweight walker to
check for the presence of strippable PlaceHolderVars. The expensive
mutator is invoked only if a candidate is found, avoiding unnecessary
memory allocation and tree copying in the common case where no
PlaceHolderVars are present.
Back-patch to v18. Although this issue exists before that, changes in
this version made it common enough to notice. Given the lack of field
reports for older versions, I am not back-patching further.
Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
|
|
Here, Datum was used to pass around an opaque pointer between a group
of functions. But one might as well use void * for that; the use of
Datum doesn't achieve anything here and is just distracting.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/1c5d23cb-288b-4154-b1cd-191fe2301707%40eisentraut.org
|
|
This change makes more readable code diffs when adding new items or
removing old items, while ensuring that lines do not get excessively
long. Some SUBDIRS, PROGRAMS and REGRESS lists are split.
Note that there are a few more REGRESS lists that could be split,
particularly in contrib/.
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Co-Authored-By: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/DF6HDGB559U5.3MPRFCWPONEAE@jeltef.nl
|
|
pg_stat_get_backend_activity() calls pgstat_clip_activity() to ensure
that the reported query string is correctly truncated when it finishes
with an incomplete multi-byte sequence. However, the result returned by
the function was not what pgstat_clip_activity() generated, but the
non-truncated, original, contents from PgBackendStatus.st_activity_raw.
Oversight in 54b6cd589ac2, so backpatch all the way down.
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2mDzwc48q2EK9tSXS6iJMJ35wvxNQnHX+rXjy5VgLvJQw@mail.gmail.com
Backpatch-through: 14
|
|
This change has the advantage of removing some weird type casts, caused
by offset calculations based on pgoff_t but saved as int (on older
branches we use off_t, which could be 4 or 8 bytes depending on the
environment). These are safe currently because capped by
MAX_PHYSICAL_FILESIZE, but we would run into problems when to make
MAX_PHYSICAL_FILESIZE larger or allow callers of these routines to use a
larger physical max size on demand.
While on it, this improves BufFileDumpBuffer() so as we do not use an
offset for "availbytes". It is not a file offset per-set, but a number
of available bytes.
This change should lead to no functional changes.
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
|
|
Introduced by 213a1b895270.
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNku-jz-FPKeJVk25fZ1pV2buYh5vpeqGDOB=bFQhKxXhw@mail.gmail.com
|
|
Many of the operations done for attribute stats in attribute_stats.c
share the same logic as extended stats, as done by a patch under
discussion to add support for extended stats import and export. All the
pieces necessary for extended statistics are moved to stats_utils.c,
which is the file where common facilities are shared for stats files.
The following renames are done:
* get_attr_stat_type() -> statatt_get_type()
* init_empty_stats_tuple() -> statatt_init_empty_tuple()
* set_stats_slot() -> statatt_set_slot()
* get_elem_stat_type() -> statatt_get_elem_type()
While on it, this commit adds more documentation for all these
functions, describing more their internals and the dependencies that
have been implied for attribute statistics. The same concepts apply to
extended statistics, at some degree.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
|
|
If there are any SRFs in a PathTarget, we must separate it into
SRF-computing and SRF-free targets. This is because the executor can
only handle SRFs that appear at the top level of the targetlist of a
ProjectSet plan node.
If we find a subexpression that matches an expression already computed
in the previous plan level, we should treat it like a Var and should
not split it again. setrefs.c will later replace the expression with
a Var referencing the subplan output.
However, when processing the grouping target for grouping sets, the
planner can fail to recognize that an expression is already computed
in the scan/join phase. The root cause is a mismatch in the
nullingrels bits. Expressions in the grouping target carry the
grouping nulling bit in their nullingrels to indicate that they can be
nulled by the grouping step. However, the corresponding expressions
in the scan/join target do not have these bits.
As a result, the exact match check in list_member() fails, leading the
planner to incorrectly believe that the expression needs to be
re-evaluated from its arguments, which are often not available in the
subplan. This can lead to planner errors such as "variable not found
in subplan target list".
To fix, ignore the grouping nulling bit when checking whether an
expression from the grouping target is available in the pre-grouping
input target. This aligns with the matching logic in setrefs.c.
Backpatch to v18, where this issue was introduced.
Bug: #19353
Reported-by: Marian MULLER REBEYROL <marian.muller@serli.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/19353-aaa179bba986a19b@postgresql.org
Backpatch-through: 18
|
|
CREATE SUBSCRIPTION with copy_data=true and origin='none' previously
failed when the publisher was running a version earlier than PostgreSQL 19,
even though this combination should be supported.
The failure occurred because the command issued a query calling
pg_get_publication_sequences function on the publisher. That function
does not exist before PG19 and the query is only needed for logical
replication sequence synchronization, which is supported starting in PG19.
This commit fixes this issue by skipping that query when the
publisher runs a version earlier than PG19.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com
|
|
The retain_dead_tuples subscription option is supported only when
the publisher runs PostgreSQL 19 or later. However, it could previously
be enabled even when the publisher was running an earlier version.
This was caused by check_pub_dead_tuple_retention() comparing
the publisher server version against 19000 instead of 190000.
Fix this typo so that the version check correctly enforces the PG19+
requirement.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com
|
|
Currently, the function expr_is_nonnullable() checks only Const and
Var expressions to determine if an expression is non-nullable. This
patch extends the detection logic to handle more expression types.
This can enable several downstream optimizations, such as reducing
NullTest quals to constant truth values (e.g., "COALESCE(var, 1) IS
NULL" becomes FALSE) and converting "COUNT(expr)" to the more
efficient "COUNT(*)" when the expression is proven non-nullable.
This breaks a test case in test_predtest.sql, since we now simplify
"ARRAY[] IS NULL" to constant FALSE, preventing it from weakly
refuting a strict ScalarArrayOpExpr ("x = any(ARRAY[])"). To ensure
the refutation logic is still exercised as intended, wrap the array
argument in opaque_array().
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
|
|
We break ROW(...) IS [NOT] NULL into separate tests on its component
fields. During this breakdown, we can improve efficiency by utilizing
expr_is_nonnullable() to detect fields that are provably non-nullable.
If a component field is proven non-nullable, it affects the outcome
based on the test type. For an IS NULL test, a single non-nullable
field refutes the whole NullTest, reducing it to constant FALSE. For
an IS NOT NULL test, the check for that specific field is guaranteed
to succeed, so we can discard it from the list of component tests.
This extends the existing optimization logic, which previously only
handled Const fields, to support any expression that can be proven
non-nullable.
In passing, update the existing constant folding of NullTests to use
expr_is_nonnullable() instead of var_is_nonnullable(), enabling it to
benefit from future improvements to that function.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
|
|
The COALESCE function returns the first of its arguments that is not
null. When an argument is proven non-null, if it is the first
non-null-constant argument, the entire COALESCE expression can be
replaced by that argument. If it is a subsequent argument, all
following arguments can be dropped, since they will never be reached.
Currently, we perform this simplification only for Const arguments.
This patch extends the simplification to support any expression that
can be proven non-nullable.
This can help avoid the overhead of evaluating unreachable arguments.
It can also lead to better plans when the first argument is proven
non-nullable and replaces the expression, as the planner no longer has
to treat the expression as non-strict, and can also leverage index
scans on the resulting expression.
There is an ensuing plan change in generated_virtual.out, and we have
to modify the test to ensure that it continues to test what it is
intended to.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
|
|
Author: Zizhen Qiao <zizhen_qiao@163.com>
Discussion: https://postgr.es/m/5ee635f9.49f7.19b4ed9e803.Coremail.zizhen_qiao@163.com
|
|
The logical replication parallel apply worker could incorrectly advance
the origin progress during an error or failed apply. This behavior risks
transaction loss because such transactions will not be resent by the
server.
Commit 3f28b2fcac addressed a similar issue for both the apply worker and
the table sync worker by registering a before_shmem_exit callback to reset
origin information. This prevents the worker from advancing the origin
during transaction abortion on shutdown. This patch applies the same fix
to the parallel apply worker, ensuring consistent behavior across all
worker types.
As with 3f28b2fcac, we are backpatching through version 16, since parallel
apply mode was introduced there and the issue only occurs when changes are
applied before the transaction end record (COMMIT or ABORT) is received.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16
Discussion: https://postgr.es/m/TY4PR01MB169078771FB31B395AB496A6B94B4A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com
|
|
Previously logical decoding required wal_level to be set to 'logical'
at server start. This meant that users had to incur the overhead of
logical-level WAL logging even when no logical replication slots were
in use.
This commit adds functionality to automatically control logical
decoding availability based on logical replication slot presence. The
newly introduced module logicalctl.c allows logical decoding to be
dynamically activated when needed when wal_level is set to
'replica'.
When the first logical replication slot is created, the system
automatically increases the effective WAL level to maintain
logical-level WAL records. Conversely, after the last logical slot is
dropped or invalidated, it decreases back to 'replica' WAL level.
While activation occurs synchronously right after creating the first
logical slot, deactivation happens asynchronously through the
checkpointer process. This design avoids a race condition at the end
of recovery; a concurrent deactivation could happen while the startup
process enables logical decoding at the end of recovery, but WAL
writes are still not permitted until recovery fully completes. The
checkpointer will handle it after recovery is done. Asynchronous
deactivation also avoids excessive toggling of the logical decoding
status in workloads that repeatedly create and drop a single logical
slot. On the other hand, this lazy approach can delay changes to
effective_wal_level and the disabling logical decoding, especially
when the checkpointer is busy with other tasks. We chose this lazy
approach in all deactivation paths to keep the implementation simple,
even though laziness is strictly required only for end-of-recovery
cases. Future work might address this limitation either by using a
dedicated worker instead of the checkpointer, or by implementing
synchronous waiting during slot drops if workloads are significantly
affected by the lazy deactivation of logical decoding.
The effective WAL level, determined internally by XLogLogicalInfo, is
allowed to change within a transaction until an XID is assigned. Once
an XID is assigned, the value becomes fixed for the remainder of the
transaction. This behavior ensures that the logging mode remains
consistent within a writing transaction, similar to the behavior of
GUC parameters.
A new read-only GUC parameter effective_wal_level is introduced to
monitor the actual WAL level in effect. This parameter reflects the
current operational WAL level, which may differ from the configured
wal_level setting.
Bump PG_CONTROL_VERSION as it adds a new field to CheckPoint struct.
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com
|
|
After waiting for a concurrent updater to finish, heap_lock_tuple()
followed the update chain to lock all tuple versions. However, when
stepping from the initial tuple to the next one, it failed to check
that the next tuple's XMIN matches the initial tuple's XMAX. That's an
important check whenever following an update chain, and the recursive
part that follows the chain did it, but the initial step missed it.
Without the check, if the updating transaction aborts, the updated
tuple is vacuumed away and replaced by an unrelated tuple, the
unrelated tuple might get incorrectly locked.
Author: Jasper Smit <jasper.smit@servicenow.com>
Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com
Backpatch-through: 14
|
|
Since ce0fdbfe9722, a replication slot and an origin are created by each
tablesync worker, whose information is stored in both a catalog and
shared memory (once the origin is set up in the latter case). The
transaction where the origin is created is the same as the one that runs
the initial COPY, with the catalog state of the origin becoming visible
for other sessions only once the COPY transaction has committed. The
catalog state is coupled with a state in shared memory, initialized at
the same time as the origin created in the catalogs. Note that the
transaction doing the initial data sync can take a long time, time that
depends on the amount of data to transfer from a publication node to its
subscriber node.
Now, when a DROP SUBSCRIPTION is executed, all its workers are stopped
with the origins removed. The removal of each origin relies on a
catalog lookup. A worker still running the initial COPY would fail its
transaction, with the catalog state of the origin rolled back while the
shared memory state remains around. The session running the DROP
SUBSCRIPTION should be in charge of cleaning up the catalog and the
shared memory state, but as there is no data in the catalogs the shared
memory state is not removed. This issue would leave orphaned origin
data in shared memory, leading to a confusing state as it would still
show up in pg_replication_origin_status. Note that this shared memory
data is sticky, being flushed on disk in replorigin_checkpoint at
checkpoint. This prevents other origins from reusing a slot position
in the shared memory data.
To address this problem, the commit moves the creation of the origin at
the end of the transaction that precedes the one executing the initial
COPY, making the origin immediately visible in the catalogs for other
sessions, giving DROP SUBSCRIPTION a way to know about it. A different
solution would have been to clean up the shared memory state using an
abort callback within the tablesync worker. The solution of this commit
is more consistent with the apply worker that creates an origin in a
short transaction.
A test is added in the subscription test 004_sync.pl, which was able to
display the problem. The test fails when this commit is reverted.
Reported-by: Tenglong Gu <brucegu@amazon.com>
Reported-by: Daisuke Higuchi <higudai@amazon.com>
Analyzed-by: Michael Paquier <michael@paquier.xyz>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/aUTekQTg4OYnw-Co@paquier.xyz
Backpatch-through: 14
|
|
off_t was previously used for offsets, which is 4 bytes on Windows,
hence limiting the backend code to a hard limit for files longer than
2GB. This leads to some simplification in these files, removing some
casts based on long, also 4 bytes on Windows.
This commit removes one comment introduced in db3c4c3a2d98, not relevant
anymore as pgoff_t is a safe 8-byte alternative on Windows.
This change is surprisingly not invasive, as the callers of
BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c,
tuplestore.c, etc.) track offsets in local structures that just to
switch from off_t to pgoff_t for the most part.
The file is still relying on a maximum file size of
MAX_PHYSICAL_FILESIZE (1GB). This change allows the code to make this
maximum potentially larger in the future, or larger on a per-demand
basis.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
|
|
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNkRJ9DMFZMQKWQ32U+OTBR78KeGh2=9Wy5jEeWDxMVFcQ@mail.gmail.com
|
|
nbtree index-only scans of an index that uses btree/name_ops as one of
its index column's input opclasses are no longer at any risk of reading
past the end of currTuples. We're no longer reliant on such scans being
able to at least read from the start of markTuples storage (which uses
space from the same allocation as currTuples) to avoid a segfault:
StoreIndexTuple (from nodeIndexonlyscan.c) won't actually read past the
end of a cstring datum from a name_ops index. In other words, we
already have the "special-case treatment for name_ops" that the removed
comment supposed we could avoid by relying on markTuples in this way.
Oversight in commit a63224be49, which added special case handling of
name_ops cstrings to StoreIndexTuple, but missed these comments.
|
|
Before we dealt with this in 6 near identical and one very similar copy.
The helper function errors out when encountering a
HEAP_MOVED_IN/HEAP_MOVED_OUT tuple with xvac considered current or
in-progress. It'd be preferrable to do that change separately, but otherwise
it'd not be possible to deduplicate the handling in
HeapTupleSatisfiesVacuum().
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
|
|
The main optimization is for LockBufHdr() to delay initializing
SpinDelayStatus, similar to what LWLockWaitListLock already did. The
initialization is sufficiently expensive & buffer header lock acquisitions are
sufficiently frequent, to make it worthwhile to instead have a fastpath (via a
likely() branch) that does not initialize the SpinDelayStatus.
While LWLockWaitListLock() already the aforementioned optimization, it did not
use likely(), and inspection of the assembly shows that this indeed leads to
worse code generation (also observed in a microbenchmark). Fix that by adding
the likely().
While the LockBufHdr() improvement is a small gain on its own, it mainly is
aimed at preventing a regression after a future commit, which requires
additional locking to set hint bits.
While touching both, also make the comments more similar to each other.
Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
|
|
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/08cbaeb5-aaaf-47b6-9ed8-4f7455b0bc4b@iki.fi
|
|
Introduced by 8492feb98f6d.
Author: Xingbin She <xingbin.she@qq.com>
Discussion: https://postgr.es/m/tencent_C254AE962588605F132DB4A6F87205D6A30A@qq.com
|
|
Previously, if memory context logging was triggered repeatedly and
rapidly while a previous request was still being processed, it could
result in recursive calls to ProcessLogMemoryContextInterrupt().
This could lead to infinite recursion and potentially crash the process.
This commit adds a guard to prevent such recursion.
If ProcessLogMemoryContextInterrupt() is already in progress and
logging memory contexts, subsequent calls will exit immediately,
avoiding unintended recursive calls.
While this scenario is unlikely in practice, it's not impossible.
This change adds a safety check to prevent such failures.
Back-patch to v14, where memory context logging was introduced.
Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Artem Gavrilov <artem.gavrilov@percona.com>
Discussion: https://postgr.es/m/CA+TgmoZMrv32tbNRrFTvF9iWLnTGqbhYSLVcrHGuwZvCtph0NA@mail.gmail.com
Backpatch-through: 14
|
|
All the code paths updated here have been using relation_close() to
close a relation that has already been opened with table_open() or
index_open(), where a relkind check is enforced.
table_close() and index_open() do the same thing as relation_close(), so
there was no harm, but being inconsistent could lead to issues if the
internals of these close() functions begin to introduce some logic
specific to each relkind in the future.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aUKamYGiDKO6byp5@ip-10-97-1-34.eu-west-3.compute.internal
|
|
Operations on unlogged relations should not be WAL-logged. The
brin_initialize_empty_new_buffer() function didn't get the memo.
The function is only called when a concurrent update to a brin page
uses up space that we're just about to insert to, which makes it
pretty hard to hit. If you do manage to hit it, a full-page WAL record
is erroneously emitted for the unlogged index. If you then crash,
crash recovery will fail on that record with an error like this:
FATAL: could not create file "base/5/32819": File exists
Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/CALdSSPhpZXVFnWjwEBNcySx_vXtXHwB2g99gE6rK0uRJm-3GgQ@mail.gmail.com
Backpatch-through: 14
|
|
This change makes pgstat_report_vacuum() more consistent with
pgstat_report_analyze(), that also uses a Relation. This enforces a
policy that callers of this routine should open and lock the relation
whose statistics are updated before calling this routine. We will
unlikely have a lot of callers of this routine in the tree, but it seems
like a good idea to imply this requirement in the long run.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUEA6UZZkDCQFgSA@ip-10-97-1-34.eu-west-3.compute.internal
|
|
strlcpy() ensures that a target string is zero-terminated, so there is
no need to enforce it a second time in this code. This simplification
could have been done in 0eb23285a257.
Author: Feilong Meng <feelingmeng@foxmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/tencent_771178777C5BC17FCB7F7A1771CD1FFD5708@qq.com
|
|
ICU still depends on libc for compatibility with certain historical
behavior for single-byte encodings. Make the dependency explicit by
holding a locale_t object when required.
We should consider a better solution in the future, such as decoding
the text to UTF-32 and using u_tolower(). That would be a behavior
change and require additional infrastructure though; so for now, just
avoid the global LC_CTYPE dependency.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
|
|
Previously, libc's tolower() was always used for lowercasing
identifiers, regardless of the database locale (though only characters
beyond 127 in single-byte encodings were affected). Refactor to allow
each provider to supply its own implementation of identifier
downcasing.
For historical compatibility, when using a single-byte encoding, ICU
still relies on tolower().
One minor behavior change is that, before the database default locale
is initialized, it uses ASCII semantics to downcase the
identifiers. Previously, it would use the postmaster's LC_CTYPE
setting from the environment. While that could have some effect during
GUC processing, for example, it would have been fragile to rely on the
environment setting anyway. (Also, it only matters when the encoding
is single-byte.)
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
|
|
We already do this in CreateParallelContext, InitializeParallelDSM, and
LaunchParallelWorkers. I suspect the reason why the matching logic was
omitted from ReinitializeParallelDSM is that I failed to realize that
any memory allocation was happening here -- but shm_mq_attach does
allocate, which could result in a shm_mq_handle being allocated in a
shorter-lived context than the ParallelContext which points to it.
That could result in a crash if the shorter-lived context is freed
before the parallel context is destroyed. As far as I am currently
aware, there is no way to reach a crash using only code that is
present in core PostgreSQL, but extensions could potentially trip
over this. Fixing this in the back-branches appears low-risk, so
back-patch to all supported versions.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
Backpatch-through: 14
Discussion: http://postgr.es/m/CAKZiRmwfVripa3FGo06=5D1EddpsLu9JY2iJOTgbsxUQ339ogQ@mail.gmail.com
|
|
heap_page_prune_and_freeze() fills in PruneState->deadoffsets, the array
of OffsetNumbers of dead tuples. It is returned to the caller in the
PruneFreezeResult. To avoid having two copies of the array, the
PruneState saves only a pointer to the array. This was a bit unusual and
confusing, so add a clarifying comment.
Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2=jiD1nqch4JQN+odAxZSD7mRvdoHUGJYN2r6tQG_66yQ@mail.gmail.com
|
|
The const qualification of the presult argument to prune_freeze_setup()
is later cast away, so it was not correct. Remove it and add a comment
explaining that presult should not be modified.
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fb97d0ae-a0bc-411d-8a87-f84e7e146488%40eisentraut.org
|
|
In the wake of commit b45242fd3, bytea_sortsupport() still called out
to varstr_sortsupport(). Treating bytea as a kind of text/varchar
required varstr_sortsupport() to allow for the possibility of
NUL bytes, but only for C collation. This was confusing. For
better separation of concerns, create an independent sortsupport
implementation in bytea.c.
The heuristics for bytea_abbrev_abort() remain the same as for
varstr_abbrev_abort(). It's possible that the bytea case warrants
different treatment, but that is left for future investigation.
In passing, adjust some strange looking comparisons in
varstr_abbrev_abort().
Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TP1bAbEhUJa6+rgceN6QJWMSsxhg1=mqfSN=Nb-n6DAKg@mail.gmail.com
|
|
This commit provides test coverage for dc7c77f825d7, where the redo
record and the checkpoint record finish on different WAL segments with
the start of recovery able to detect that the redo record is missing.
This test uses a wait injection point done in the critical section of a
checkpoint, method that requires not one but actually two wait injection
points to avoid any memory allocations within the critical section of
the checkpoint:
- Checkpoint run with a background psql.
- One first wait point is run by the checkpointer before the critical
section, allocating the shared memory required by the DSM registry for
the wait machinery in the library injection_points.
- First point is woken up.
- Second wait point is loaded before the critical section, allocating
the memory to build the path to the library loaded, then run in the
critical section once the checkpoint redo record has been logged.
- WAL segment is switched while waiting on the second point.
- Checkpoint completes.
- Stop cluster with immediate mode.
- The segment that includes the redo record is removed.
- Start, recovery fails as the redo record cannot be found.
The error message introduced in dc7c77f825d7 is now reduced to a FATAL,
meaning that the information is still provided while being able to use a
test for it. Nitin has provided a basic version of the test, that I
have enhanced to make it portable with two points. Without
dc7c77f825d7, the cluster crashes in this test, not on a PANIC but due
to the pointer dereference at the beginning of recovery, failure
mentioned in the other commit.
Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
|
|
This commit adds an extra check at the beginning of recovery to ensure
that the redo record of a checkpoint exists before attempting WAL
replay, logging a PANIC if the redo record referenced by the checkpoint
record could not be found. This is the same level of failure as when a
checkpoint record is missing. This check is added when a cluster is
started without a backup_label, after retrieving its checkpoint record.
The redo LSN used for the check is retrieved from the checkpoint record
successfully read.
In the case where a backup_label exists, the startup process already
fails if the redo record cannot be found after reading a checkpoint
record at the beginning of recovery.
Previously, the presence of the redo record was not checked. If the
redo and checkpoint records were located on different WAL segments, it
would be possible to miss a entire range of WAL records that should have
been replayed but were just ignored. The consequences of missing the
redo record depend on the version dealt with, these becoming worse the
older the version used:
- On HEAD, v18 and v17, recovery fails with a pointer dereference at the
beginning of the redo loop, as the redo record is expected but cannot be
found. These versions are good students, because we detect a failure
before doing anything, even if the failure is misleading in the shape of
a segmentation fault, giving no information that the redo record is
missing.
- In v16 and v15, problems show at the end of recovery within
FinishWalRecovery(), the startup process using a buggy LSN to decide
from where to start writing WAL. The cluster gets corrupted, still it
is noisy about it.
- v14 and older versions are worse: a cluster gets corrupted but it is
entirely silent about the matter. The redo record missing causes the
startup process to skip entirely recovery, because a missing record is
the same as not redo being required at all. This leads to data loss, as
everything is missed between the redo record and the checkpoint record.
Note that I have tested that down to 9.4, reproducing the issue with a
version of the author's reproducer slightly modified. The code is wrong
since at least 9.2, but I did not look at the exact point of origin.
This problem has been found by debugging a cluster where the WAL segment
including the redo segment was missing due to an operator error, leading
to a crash, based on an investigation in v15.
Requesting archive recovery with the creation of a recovery.signal or
a standby.signal even without a backup_label would mitigate the issue:
if the record cannot be found in pg_wal/, the missing segment can be
retrieved with a restore_command when checking that the redo record
exists. This was already the case without this commit, where recovery
would re-fetch the WAL segment that includes the redo record. The check
introduced by this commit makes the segment to be retrieved earlier to
make sure that the redo record can be found.
On HEAD, the code will be slightly changed in a follow-up commit to not
rely on a PANIC, to include a test able to emulate the original problem.
This is a minimal backpatchable fix, kept separated for clarity.
Reported-by: Andres Freund <andres@anarazel.de>
Analyzed-by: Andres Freund <andres@anarazel.de>
Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
Backpatch-through: 14
|
|
This commit adds a new "void *arg" parameter to
GetNamedDSMSegment() that is passed to the initialization callback
function. This is useful for reusing an initialization callback
function for multiple DSM segments.
Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFMjh8TrT9ZhWgjVTzBDkYZi2a84BnZ8bM%2BfLPuq7Cirzg%40mail.gmail.com
|
|
This removes a never-used CacheInvalidateHeapTupleInplace() parameter.
It adds README content about inplace update visibility in logical
decoding. It rewrites other comments.
Back-patch to v18, where commit 243e9b40f1b2dd09d6e5bf91ebf6e822a2cd3704
first appeared. Since this removes a CacheInvalidateHeapTupleInplace()
parameter, expect a v18 ".abi-compliance-history" edit to follow. PGXN
contains no calls to that function.
Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reported-by: Ilyasov Ian <ianilyasov@outlook.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Surya Poondla <s_poondla@apple.com>
Discussion: https://postgr.es/m/CA+renyU+LGLvCqS0=fHit-N1J-2=2_mPK97AQxvcfKm+F-DxJA@mail.gmail.com
Backpatch-through: 18
|
|
This corrects commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257.
Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reported-by: Surya Poondla <s_poondla@apple.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://postgr.es/m/CA+renyWCW+_2QvXERBQ+mna6ANwAVXXmHKCA-WzL04bZRsjoBA@mail.gmail.com
|
|
https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
|
|
Previously, like_fixed_prefix() used char-at-a-time logic, which
forced it to be too conservative for case-insensitive matching.
Introduce like_fixed_prefix_ci(), and use that for case-insensitive
pattern prefixes. It uses multibyte and locale-aware logic, along with
the new pg_iswcased() API introduced in 630706ced0.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
|
|
Late-model gcc with -fsanitize=undefined enabled issues warnings
about uses of PageGetItemId() when it can't prove that the
offsetNumber is > 0. The call sites where this happens are
checking that the offnum is <= PageGetMaxOffsetNumber(page), so
it seems reasonable to add an explicit check that offnum >= 1 too.
While at it, rearrange the code to be less contorted and avoid
duplicate checks on PageGetMaxOffsetNumber. Maybe the compiler
would optimize away the duplicate logic or maybe not, but the
existing coding has little to recommend it anyway.
There are multiple instances of this identical coding pattern in
heapam.c and heapam_xlog.c. Current gcc only complains about two
of them, but I fixed them all in the name of consistency.
Potentially this could be back-patched in the name of silencing
warnings; but I think enabling UBSAN is mainly something people
would do on HEAD, so for now it seems not worth the trouble.
Discussion: https://postgr.es/m/1699806.1765746897@sss.pgh.pa.us
|
|
In the server, check explicitly for multixids with zero members. We
used to have an assertion for it, but commit d4b7bde418 replaced it
with more extensive runtime checks, but it missed the original case of
zero members.
In the upgrade code, a negative length never makes sense, so better
check for it explicitly. Commit d4b7bde418 added a similar sanity
check to the corresponding server code on master, and in backbranches,
the 'length' is passed to palloc which would fail with "invalid memory
alloc request size" error. Clarify the comments on what kind of
invalid entries are tolerated by the upgrade code and which ones are
reported as fatal errors.
Coverity complained about 'length' in the upgrade code being
tainted. That's bogus because we trust the data on disk at least to
some extent, but hopefully this will silence the complaint. If not,
I'll dismiss it manually.
Discussion: https://www.postgresql.org/message-id/7b505284-c6e9-4c80-a7ee-816493170abc@iki.fi
|
|
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2%3DAib%2BcatZn6wHKmz0BWe8-q10NAhpxu8wUDT19SddKNA%40mail.gmail.com
|
|
Previously, pg_sync_replication_slots() would finish without synchronizing
slots that didn't meet requirements, rather than failing outright. This
could leave some failover slots unsynchronized if required catalog rows or
WAL segments were missing or at risk of removal, while the standby
continued removing needed data.
To address this, the function now waits for the primary slot to advance to
a position where all required data is available on the standby before
completing synchronization. It retries cyclically until all failover slots
that existed on the primary at the start of the call are synchronized.
Slots created after the function begins are not included. If the standby
is promoted during this wait, the function exits gracefully and the
temporary slots will be removed.
Author: Ajin Cherian <itsajin@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yilin Zhang <jiezhilove@126.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDZAA%2BgWDntpa5ucqKKba41%3DtXmoXqN3q4rpjO9cdxgQrw%40mail.gmail.com
|
|
Cumulative stats kinds gain the capability to write additional per-entry
data when flushing the stats at shutdown, and read this data when
loading back the stats at startup. This can be fit for example in the
case of variable-length data (like normalized query strings), so as it
becomes possible to link the shared memory stats entries to data that is
stored in a different area, like a DSA segment.
Three new optional callbacks are added to PgStat_KindInfo, available to
variable-numbered stats kinds:
* to_serialized_data: writes auxiliary data for an entry.
* from_serialized_data: reads auxiliary data for an entry.
* finish: performs actions after read/write/discard operations. This is
invoked after processing all the entries of a kind, allowing extensions
to close file handles and clean up resources.
Stats kinds have the option to store this data in the existing pgstats
file, but can as well store it in one or more additional files whose
names can be built upon the entry keys. The new serialized callbacks
are called once an entry key is read or written from the main stats
file. A file descriptor to the main pgstats file is available in the
arguments of the callbacks.
Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s9SDOu+Z6veoJCHWk+kDeTktAtC-KY9fQ9Z6BJdDUirQ@mail.gmail.com
|