Age | Commit message (Collapse) | Author |
|
If the requested recovery timeline is not reachable, the logged
checkpoint and timeline should to be the values read from the
backup_label when it is defined. The message generated used the values
from the control file in this case, which is fine when recovering from
the control file without a backup_label, but not if there is a
backup_label.
Issue introduced in ee994272ca50. v15 has introduced xlogrecovery.c and
more simplifications in this area (4a92a1c3d1c3, a27048cbcb58), making
this change a bit simpler to think about, so backpatch only down to this
version.
Author: David Steele <david@pgbackrest.org>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Benoit Lobréau <benoit.lobreau@dalibo.com>
Discussion: https://postgr.es/m/c3d617d4-1696-4aa7-8a4d-5a7d19cc5618@pgbackrest.org
Backpatch-through: 15
|
|
Commit a99cc6c6b4 introduced the PqMsg_Progress macro but missed
updating HandleParallelMessage() accordingly.
Backpatch-through: 17
|
|
Assume twophase.c is the performance-sensitive caller, and preserve its
choice of unlikely() branch hint. Add some retrospective rationale for
that choice. Back-patch to v17, for the next commit to use it.
Reviewed (in earlier versions) by Michael Paquier.
Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com
|
|
XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources. As written, the check was too
generic, applying to any target LSN. Based on an analysis by Kyotaro
Horiguchi, what really matters is to make sure that the page header is
checked when attempting to read a LSN at the boundary of a segment, to
handle the case of a continuation record that spawns across multiple
pages when dealing with multiple segments, as WAL receivers are spawned
they request WAL from the beginning of a segment. This fix has been
proposed by Kyotaro Horiguchi.
This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.
Some regression tests are added to check such scenarios, able to
reproduce the original problem. In the test, the contents of a
continuation record are overwritten with junk zeros on its follow-up
page, and replayed on standbys. This is inspired by 039_end_of_wal.pl,
and is enough to show how standbys should react on promotion by not
being stuck. Without the fix, the test would fail with a timeout. The
test to reproduce the problem has been written by Alexander Kukushkin.
The original check has been introduced in 066871980183, for a similar
problem.
Author: Kyotaro Horiguchi, Alexander Kukushkin
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Backpatch-through: 13
|
|
This commit reverts 8f67f994e8ea (down to v13) and c3de0f9eed38 (down to
v17), as these are proving to not be completely correct regarding two
aspects:
- In v17 and newer branches, c3de0f9eed38's check for epoch handling is
incorrect, and does not correctly handle frozen epochs. A logic closer
to widen_snapshot_xid() should be used. The 2PC code should try to
integrate deeper with FullTransactionIds, 5a1dfde8334b being not enough.
- In v13 and newer branches, 8f67f994e8ea is a workaround for the real
issue, which is that we should not attempt CLOG lookups without reaching
consistency. This exists since 728bd991c3c4, and this is reachable with
ProcessTwoPhaseBuffer() called by restoreTwoPhaseData() at the beginning
of recovery.
Per discussion with Noah Misch.
Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com
Backpatch-through: 13
|
|
The originally submitted code (using bit masking) was correct when the
number of slots was restricted to be a power of two -- but that
limitation was removed during development that led to commit
53c2a97a9266, which made the bank selection code incorrect. This led to
always using a smaller number of banks than available. Change said code
to use integer modulo instead, which works correctly with an arbitrary
number of banks.
It's likely that we could improve on this to avoid runtime use of
integer division. But with this change we're, at least, not wasting
memory on unused banks, and more banks mean less contention, which is
likely to have a much higher performance impact than a single
instruction's latency.
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru
|
|
At the beginning of recovery, an orphaned two-phase file in an epoch
different than the one defined in the checkpoint record could not be
removed based on the assumptions that AdjustToFullTransactionId() relies
on, assuming that all files would be either from the current epoch or
from the previous epoch.
If the checkpoint epoch was 0 while the 2PC file was orphaned and in the
future, AdjustToFullTransactionId() would underflow the epoch used to
build the 2PC file path. In non-assert builds, this would create a
WARNING message referring to a 2PC file with an epoch of "FFFFFFFF" (or
UINT32_MAX), as an effect of the underflow calculation, leaving the
orphaned file around.
Some tests are added with dummy 2PC files in the past and the future,
checking that these are properly removed.
Issue introduced by 5a1dfde8334b, that has switched two-phase state
files to use FullTransactionIds.
Reported-by: Vitaly Davydov
Author: Michael Paquier
Reviewed-by: Vitaly Davydov
Discussion: https://postgr.es/m/13b5b6-676c3080-4d-531db900@47931709
Backpatch-through: 17
|
|
Before 728bd991c3c4, that has improved the support for 2PC files during
recovery, the initial logic scanning files in pg_twophase was done so as
files in the future of the transaction ID horizon were checked first,
followed by a check if a transaction ID is aborted or committed which
could involve a pg_xact lookup. After this commit, these checks have
been done in reverse order.
Files detected as in the future do not have a state that can be checked
in pg_xact, hence this caused recovery to fail abruptly should an
orphaned 2PC file in the future of the transaction ID horizon exist in
pg_twophase at the beginning of recovery.
A test is added to check for this scenario, using an empty 2PC with a
transaction ID large enough to be in the future when running the test.
This test is added in 16 and older versions for now. 17 and newer
versions are impacted by a second bug caused by the addition of the
epoch in the 2PC file names. An equivalent test will be added in these
branches in a follow-up commit, once the second set of issues reported
are fixed.
Author: Vitaly Davydov, Michael Paquier
Discussion: https://postgr.es/m/11e597-676ab680-8d-374f23c0@145466129
Backpatch-through: 13
|
|
Cause parallel workers to not check datallowconn, rolcanlogin, and
ACL_CONNECT privileges. The leader already checked these things
(except for rolcanlogin which might have been checked for a different
role). Re-checking can accomplish little except to induce unexpected
failures in applications that might not even be aware that their query
has been parallelized. We already had the principle that parallel
workers rely on their leader to pass a valid set of authorization
information, so this change just extends that a bit further.
Also, modify the ReservedConnections, datconnlimit and rolconnlimit
logic so that these limits are only enforced against regular backends,
and only regular backends are counted while checking if the limits
were already reached. Previously, background processes that had an
assigned database or role were subject to these limits (with rather
random exclusions for autovac workers and walsenders), and the set of
existing processes that counted against each limit was quite haphazard
as well. The point of these limits, AFAICS, is to ensure the
availability of PGPROC slots for regular backends. Since all other
types of processes have their own separate pools of PGPROC slots, it
makes no sense either to enforce these limits against them or to count
them while enforcing the limit.
While edge-case failures of these sorts have been possible for a
long time, the problem got a good deal worse with commit 5a2fed911
(CVE-2024-10978), which caused parallel workers to make some of these
checks using the leader's current role where before we had used its
AuthenticatedUserId, thus allowing parallel queries to fail after
SET ROLE. The previous behavior was fairly accidental and I have
no desire to return to it.
This patch includes reverting 73c9f91a1, which was an emergency hack
to suppress these same checks in some cases. It wasn't complete,
as shown by a recent bug report from Laurenz Albe. We can also revert
fd4d93d26 and 492217301, which hacked around the same problems in one
regression test.
In passing, remove the special case for autovac workers in
CheckMyDatabase; it seems cleaner to have AutoVacWorkerMain pass
the INIT_PG_OVERRIDE_ALLOW_CONNS flag, now that that does what's
needed.
Like 5a2fed911, back-patch to supported branches (which sadly no
longer includes v12).
Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us
|
|
This reverts commit d77f91214fb7 on all stable branches, due to concerns
regarding the compatility side effects this could create in a minor
release. The change still exists on HEAD.
Discussion: https://postgr.es/m/CA+TgmoZqRgeFTg4+Yf_CMRRXiHuNz1u6ZC4FvVk+rxw0RmOPnw@mail.gmail.com
Backpatch-through: 13
|
|
When using a pipeline, a transaction starts from the first command and
is committed with a Sync message or when the pipeline ends.
Functions like IsInTransactionBlock() or PreventInTransactionBlock()
were already able to understand a pipeline as being in a transaction
block, but it was not the case of CheckTransactionBlock(). This
function is called for example to generate a WARNING for SET LOCAL,
complaining that it is used outside of a transaction block.
The current state of the code caused multiple problems, like:
- SET LOCAL executed at any stage of a pipeline issued a WARNING, even
if the command was at least second in line where the pipeline is in a
transaction state.
- LOCK TABLE failed when invoked at any step of a pipeline, even if it
should be able to work within a transaction block.
The pipeline protocol assumes that the first command of a pipeline is
not part of a transaction block, and that any follow-up commands is
considered as within a transaction block.
This commit changes the backend so as an implicit transaction block is
started each time the first Execute message of a pipeline has finished
processing, with this implicit transaction block ended once a sync is
processed. The checks based on XACT_FLAGS_PIPELINING in the routines
checking if we are in a transaction block are not necessary: it is
enough to rely on the existing ones.
Some tests are added to pgbench, that can be backpatched down to v17
when \syncpipeline is involved and down to v14 where \startpipeline and
\endpipeline are available. This is unfortunately limited regarding the
error patterns that can be checked, but it provides coverage for various
pipeline combinations to check if these succeed or fail. These tests
are able to capture the case of SET LOCAL's WARNING. The author has
proposed a different feature to improve the coverage by adding similar
meta-commands to psql where error messages could be checked, something
more useful for the cases where commands cannot be used in transaction
blocks, like REINDEX CONCURRENTLY or VACUUM. This is considered as
future work for v18~.
Author: Anthonin Bonnefoy
Reviewed-by: Jelte Fennema-Nio, Michael Paquier
Discussion: https://postgr.es/m/CAO6_XqrWO8uNBQrSu5r6jh+vTGi5Oiyk4y8yXDORdE2jbzw8xw@mail.gmail.com
Backpatch-through: 13
|
|
The SQL spec mandates that SET SESSION AUTHORIZATION implies
SET ROLE NONE. We tried to implement that within the lowest-level
functions that manipulate these settings, but that was a bad idea.
In particular, guc.c assumes that it doesn't matter in what order
it applies GUC variable updates, but that was not the case for these
two variables. This problem, compounded by some hackish attempts to
work around it, led to some security-grade issues:
* Rolling back a transaction that had done SET SESSION AUTHORIZATION
would revert to SET ROLE NONE, even if that had not been the previous
state, so that the effective user ID might now be different from what
it had been.
* The same for SET SESSION AUTHORIZATION in a function SET clause.
* If a parallel worker inspected current_setting('role'), it saw
"none" even when it should see something else.
Also, although the parallel worker startup code intended to cope
with the current role's pg_authid row having disappeared, its
implementation of that was incomplete so it would still fail.
Fix by fully separating the miscinit.c functions that assign
session_authorization from those that assign role. To implement the
spec's requirement, teach set_config_option itself to perform "SET
ROLE NONE" when it sets session_authorization. (This is undoubtedly
ugly, but the alternatives seem worse. In particular, there's no way
to do it within assign_session_authorization without incompatible
changes in the API for GUC assign hooks.) Also, improve
ParallelWorkerMain to directly set all the relevant user-ID variables
instead of relying on some of them to get set indirectly. That
allows us to survive not finding the pg_authid row during worker
startup.
In v16 and earlier, this includes back-patching 9987a7bf3 which
fixed a violation of GUC coding rules: SetSessionAuthorization
is not an appropriate place to be throwing errors from.
Security: CVE-2024-10978
|
|
Commit ac04aa84a put the shutoff for this into the planner, which is
not ideal because it doesn't prevent us from re-using a previously
made parallel plan. Revert the planner change and instead put the
shutoff into InitializeParallelDSM, modeling it on the existing code
there for recovering from failure to allocate a DSM segment.
However, that code path is mostly untested, and testing a bit harder
showed there's at least one bug: ExecHashJoinReInitializeDSM is not
prepared for us to have skipped doing parallel DSM setup. I also
thought the Assert in ReinitializeParallelWorkers is pretty
ill-advised, and replaced it with a silent Min() operation.
The existing test case added by ac04aa84a serves fine to test this
version of the fix, so no change needed there.
Patch by me, but thanks to Noah Misch for the core idea that we
could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED.
Back-patch to v12, as ac04aa84a was.
Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
|
|
This reverts commit 95c5acb3fc261067ab65ddc0b2dca8e162f09442 (v17) and
counterparts in each other non-master branch. If released, that commit
would have caused a worst-in-years minor release regression, via
undetected LWLock self-deadlock. This commit and its self-deadlock fix
warrant more bake time in the master branch.
Reported by Alexander Lakhin.
Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
|
|
The inplace update survives ROLLBACK. The inval didn't, so another
backend's DDL could then update the row without incorporating the
inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER
TABLE resulted in a table with an index, yet relhasindex=f. That is a
source of index corruption. Back-patch to v12 (all supported versions).
The back branch versions don't change WAL, because those branches just
added end-of-recovery SIResetAll(). All branches change the ABI of
extern function PrepareToInvalidateCacheTuple(). No PGXN extension
calls that, and there's no apparent use case in extensions.
Reviewed by Nitin Motiani and (in earlier versions) Andres Freund.
Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
|
|
An inplace update's invalidation messages are part of its transaction's
commit record. However, the update survives even if its transaction
aborts or we stop recovery before replaying its transaction commit.
After recovery, a backend that started in recovery could update the row
without incorporating the inplace update. That could result in a table
with an index, yet relhasindex=f. That is a source of index corruption.
This bulk invalidation avoids the functional consequences. A future
change can fix the !RecoveryInProgress() scenario without changing the
WAL format. Back-patch to v17 - v12 (all supported versions). v18 will
instead add invalidations to WAL.
Discussion: https://postgr.es/m/20240618152349.7f.nmisch@google.com
|
|
COMMIT PREPARED removes on-disk 2PC files near its end, but the state
checked if a file is on-disk or not gets read from shared memory while
not holding the two-phase state lock.
Because of that, there was a small window where a second backend doing a
PREPARE TRANSACTION could reuse the GlobalTransaction put back into the
2PC free list by the COMMIT PREPARED, overwriting the "ondisk" flag read
afterwards by the COMMIT PREPARED to decide if its on-disk two-phase
state file should be removed, preventing the file deletion.
This commit fixes this issue so as the "ondisk" flag in the
GlobalTransaction is read while holding the two-phase state lock, not
from shared memory after its entry has been added to the free list.
Orphaned two-phase state files flushed to disk after a checkpoint are
discarded at the beginning of recovery. However, a truncation of
pg_xact/ would make the startup process issue a FATAL when it cannot
read the SLRU page holding the state of the transaction whose 2PC file
was orphaned, which is a necessary step to decide if the 2PC file should
be removed or not. Removing manually the file would be necessary in
this case.
Issue introduced by effe7d9552dd, so backpatch all the way down.
Mea culpa.
Author: wuchengwen
Discussion: https://postgr.es/m/tencent_A7F059B5136A359625C7B2E4A386B3C3F007@qq.com
Backpatch-through: 12
|
|
|
|
This is a continuation of c9e24573905b, containing changes included into
the proposed patch that have been missed in the actual commit. I have
managed to miss these diffs while doing a rebase of the original patch.
Thanks to Noah Misch, Peter Eisentraut and Alexander Korotkov for the
pokes.
Discussion: https://postgr.es/m/92fe572d-638e-4162-aef6-1c42a2936f25@eisentraut.org
Discussion: https://postgr.es/m/20240810175055.cd.nmisch@google.com
Backpatch-through: 17
|
|
One of the two slot scans in SlruSelectLRUPage was not walking only the
slots in the specific bank where the buffer could be; change it to do
that.
Oversight in 53c2a97a9266.
Author: Sergey Sargsyan <sergey.sargsyan.2001@gmail.com>
Discussion: https://postgr.es/m/18582-5f301dd30ba91a38@postgresql.org
|
|
This reverts commit 5887dd4894db5ac1c6411615160555ac6e57e49b.
Some buildfarm animals are failing with "cannot change
"client_encoding" during a parallel operation". It looks like
assign_client_encoding is unhappy at being asked to roll back a
client_encoding setting after a parallel worker encounters a
failure. There must be more to it though: why didn't I see this
during local testing? In any case, it's clear that moving the
RestoreGUCState() call is not as side-effect-free as I thought.
Given that the bug f5f30c22e intended to fix has gone unreported
for years, it's not something that's urgent to fix; I'm not
willing to risk messing with it further with only days to our
next release wrap.
|
|
Parallel workers failed after a sequence like
BEGIN;
CREATE USER foo;
SET SESSION AUTHORIZATION foo;
because check_session_authorization could not see the uncommitted
pg_authid row for "foo". This is because we ran RestoreGUCState()
in a separate transaction using an ordinary just-created snapshot.
The same disease afflicts any other GUC that requires catalog lookups
and isn't forgiving about the lookups failing.
To fix, postpone RestoreGUCState() into the worker's main transaction
after we've set up a snapshot duplicating the leader's. This affects
check_transaction_isolation and check_transaction_deferrable, which
think they should only run during transaction start. Make them
act like check_transaction_read_only, which already knows it should
silently accept the value when InitializingParallelWorker.
Per bug #18545 from Andrey Rachitskiy. Back-patch to all
supported branches, because this has been wrong for awhile.
Discussion: https://postgr.es/m/18545-feba138862f19aaa@postgresql.org
|
|
This is a continuation of 3937cadfd438, taking care of more areas I have
managed to miss previously.
Reported-by: Noah Misch
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/20240724130059.1f.nmisch@google.com
Backpatch-through: 17
|
|
When a standby is promoted, CleanupAfterArchiveRecovery() may decide
to rename the final WAL file from the old timeline by adding ".partial"
to the name. If WAL summarization is enabled and this file is renamed
before its partial contents are summarized, WAL summarization breaks:
the summarizer gets stuck at that point in the WAL stream and just
errors out.
To fix that, first make the startup process wait for WAL summarization
to catch up before renaming the file. Generally, this should be quick,
and if it's not, the user can shut off summarize_wal and try again.
To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.
In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.
Finally, add a test case that would have caught this bug and the
previously-fixed bug that WAL summarization sometimes needs to back up
when the timeline changes.
Discussion: https://postgr.es/m/CA+TgmoZGEsZodXC4f=XZNkAeyuDmWTSkpkjCEOcF19Am0mt_OA@mail.gmail.com
|
|
clog.c, async.c and predicate.c included some SLRU page numbers still
handled as 4-byte integers, while int64 should be used for this purpose.
These holes have been introduced in 4ed8f0913bfd, that has introduced
the use of 8-byte integers for SLRU page numbers, still forgot about the
code paths updated by this commit.
Reported-by: Noah Misch
Author: Aleksander Alekseev, Michael Paquier
Discussion: https://postgr.es/m/20240626002747.dc.nmisch@google.com
Backpatch-through: 17
|
|
slru.h described incorrectly how SLRU segment names are formatted
depending on the segment number and if long or short segment names are
used. This commit closes the gap with a better description, fitting
with the reality.
Reported-by: Noah Misch
Author: Aleksander Alekseev
Discussion: https://postgr.es/m/20240626002747.dc.nmisch@google.com
Backpatch-through: 17
|
|
As per Coverity and Tom Lane, commit 402b586d0 (back-patched to v17
as 2b5819e2b) forgot to initialize this new structure member in this
code path.
|
|
To do this, we must include the wal_level in the first WAL record
covered by each summary file; so add wal_level to struct Checkpoint
and the payload of XLOG_CHECKPOINT_REDO and XLOG_END_OF_RECOVERY.
This, in turn, requires bumping XLOG_PAGE_MAGIC and, since the
Checkpoint is also stored in the control file, also
PG_CONTROL_VERSION. It's not great to do that so late in the release
cycle, but the alternative seems to ship v17 without robust
protections against this scenario, which could result in corrupted
incremental backups.
A side effect of this patch is that, when a server with
wal_level=replica is started with summarize_wal=on for the first time,
summarization will no longer begin with the oldest WAL that still
exists in pg_wal, but rather from the first checkpoint after that.
This change should be harmless, because a WAL summary for a partial
checkpoint cycle can never make an incremental backup possible when
it would otherwise not have been.
Report by Fujii Masao. Patch by me. Review and/or testing by Jakub
Wartak and Fujii Masao.
Discussion: http://postgr.es/m/6e30082e-041b-4e31-9633-95a66de76f5d@oss.nttdata.com
|
|
Before this commit, when the WAL summarizer started up or recovered
from an error, it would resume summarization from wherever it left
off. That was OK normally, but wrong if summarize_wal=off had been
turned off temporary, allowing some WAL to be removed, and then turned
back on again. In such cases, the WAL summarizer would simply hang
forever. This commit changes the reinitialization sequence for WAL
summarizer to rederive the starting position in the way we were
already doing at initial startup, fixing the problem.
Per report from Israel Barth Rubio. Reviewed by Tom Lane.
Discussion: http://postgr.es/m/CA+TgmoYN6x=YS+FoFOS6=nr6=qkXZFWhdiL7k0oatGwug2hcuA@mail.gmail.com
|
|
We did not recover the subtransaction IDs of prepared transactions
when starting a hot standby from a shutdown checkpoint. As a result,
such subtransactions were considered as aborted, rather than
in-progress. That would lead to hint bits being set incorrectly, and
the subtransactions suddenly becoming visible to old snapshots when
the prepared transaction was committed.
To fix, update pg_subtrans with prepared transactions's subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.
Backport to all supported versions.
Discussion: https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0@iki.fi
|
|
1. TruncateMultiXact() performs the SLRU truncations in a critical
section. Deleting the SLRU segments calls ForwardSyncRequest(), which
will try to compact the request queue if it's full
(CompactCheckpointerRequestQueue()). That in turn allocates memory,
which is not allowed in a critical section. Backtrace:
TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
/lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
To fix, bail out in CompactCheckpointerRequestQueue() without doing
anything, if it's called in a critical section. That covers the above
call path, as well as any other similar cases where
RegisterSyncRequest might be called in a critical section.
2. After fixing that, another problem became apparent: Autovacuum
process doing that truncation can deadlock with the checkpointer
process. TruncateMultiXact() sets "MyProc->delayChkptFlags |=
DELAY_CHKPT_START". If the sync request queue is full and cannot be
compacted, the process will repeatedly sleep and retry, until there is
room in the queue. However, if the checkpointer is trying to start a
checkpoint at the same time, and is waiting for the DELAY_CHKPT_START
processes to finish, the queue will never shrink.
More concretely, the autovacuum process is stuck here:
#0 0x00007fc934926dc3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000056220b24348b in WaitEventSetWaitBlock (set=0x56220c2e4b50, occurred_events=0x7ffe7856d040, nevents=1, cur_timeout=<optimized out>) at ../src/backend/storage/ipc/latch.c:1570
#2 WaitEventSetWait (set=0x56220c2e4b50, timeout=timeout@entry=10, occurred_events=<optimized out>, occurred_events@entry=0x7ffe7856d040, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=150994949) at ../src/backend/storage/ipc/latch.c:1516
#3 0x000056220b243224 in WaitLatch (latch=<optimized out>, latch@entry=0x0, wakeEvents=wakeEvents@entry=40, timeout=timeout@entry=10, wait_event_info=wait_event_info@entry=150994949)
at ../src/backend/storage/ipc/latch.c:538
#4 0x000056220b26cf46 in RegisterSyncRequest (ftag=ftag@entry=0x7ffe7856d0a0, type=type@entry=SYNC_FORGET_REQUEST, retryOnError=true) at ../src/backend/storage/sync/sync.c:614
#5 0x000056220af9db0a in SlruInternalDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1495
#6 0x000056220af9dab1 in SlruDeleteSegment (ctl=ctl@entry=0x56220b7beb60 <MultiXactMemberCtlData>, segno=segno@entry=11350) at ../src/backend/access/transam/slru.c:1566
#7 0x000056220af98e1b in PerformMembersTruncation (oldestOffset=<optimized out>, newOldestOffset=<optimized out>) at ../src/backend/access/transam/multixact.c:3006
#8 TruncateMultiXact (newOldestMulti=newOldestMulti@entry=3221225472, newOldestMultiDB=newOldestMultiDB@entry=4) at ../src/backend/access/transam/multixact.c:3201
#9 0x000056220b098303 in vac_truncate_clog (frozenXID=749, minMulti=<optimized out>, lastSaneFrozenXid=749, lastSaneMinMulti=3221225472) at ../src/backend/commands/vacuum.c:1917
#10 vac_update_datfrozenxid () at ../src/backend/commands/vacuum.c:1760
#11 0x000056220b1c3f76 in do_autovacuum () at ../src/backend/postmaster/autovacuum.c:2550
#12 0x000056220b1c2c3d in AutoVacWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/autovacuum.c:1569
and the checkpointer is stuck here:
#0 0x00007fc9348ebf93 in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fc9348fe353 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x000056220b40ecb4 in pg_usleep (microsec=microsec@entry=10000) at ../src/port/pgsleep.c:50
#3 0x000056220afb43c3 in CreateCheckPoint (flags=flags@entry=108) at ../src/backend/access/transam/xlog.c:7098
#4 0x000056220b1c6e86 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:464
To fix, add AbsorbSyncRequests() to the loops where the checkpointer
waits for DELAY_CHKPT_START or DELAY_CHKPT_COMPLETE operations to
finish.
Backpatch to v14. Before that, SLRU deletion didn't call
RegisterSyncRequest, which avoided this failure. I'm not sure if there
are other similar scenarios on older versions, but we haven't had
any such reports.
Discussion: https://www.postgresql.org/message-id/ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi
|
|
The macros were confused about the argument data types. All the
arguments were called 'xid', and some of the macros included casts to
TransactionId, even though the arguments were actually either
MultiXactIds or MultiXactOffsets. It compiles to the same thing,
because TransactionId, MultiXactId and MultiXactOffset are all
typedefs of uint32, but it was highly misleading.
Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezbLUG-OD1osAW3OchOMxZtdxHh2itYR9Zhh-a13wEBEQw%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/ff143b24-a093-40da-9833-d36b83726bdf%40iki.fi
|
|
The purpose of the function is to reduce the effective
autovacuum_multixact_freeze_max_age if the multixact members SLRU is
approaching wraparound, to make multixid freezing more aggressive.
The returned value should therefore never be greater than plain
autovacuum_multixact_freeze_max_age.
Reviewed-by: Robert Haas
Discussion: https://www.postgresql.org/message-id/85fb354c-f89f-4d47-b3a2-3cbd461c90a3@iki.fi
Backpatch-through: 12, all supported versions
|
|
ReorderBufferImmediateInvalidation() executes invalidation messages in
an aborted transaction. However, RelationFlushRelation sometimes
required a valid resource owner, to temporarily increment the refcount
of the relache entry. Commit b8bff07daa worked around that in the main
subtransaction abort function, AbortSubTransaction(), but missed this
similar case in ReorderBufferImmediateInvalidation().
To fix, introduce a separate function to invalidate a relcache
entry. It does the same thing as RelationClearRelation(rebuild==true)
does when outside a transaction, but can be called without
incrementing the refcount.
Add regression test. Before this fix, it failed with:
ERROR: ResourceOwnerEnlarge called after release started
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/e56be7d9-14b1-664d-0bfc-00ce9772721c@gmail.com
|
|
After further review, we want to move in the direction of always
quoting GUC names in error messages, rather than the previous (PG16)
wildly mixed practice or the intermittent (mid-PG17) idea of doing
this depending on how possibly confusing the GUC name is.
This commit applies appropriate quotes to (almost?) all mentions of
GUC names in error messages. It partially supersedes a243569bf65 and
8d9978a7176, which had moved things a bit in the opposite direction
but which then were abandoned in a partial state.
Author: Peter Smith <smithpb2250@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHut%2BPv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w%40mail.gmail.com
|
|
Author: Alexander Lakhin
Discussion: https://postgr.es/m/ae9f2fcb-4b24-5bb0-4240-efbbbd944ca1@gmail.com
|
|
fefd9a3fed turned tail recursion of CommitTransactionCommand() and
AbortCurrentTransaction() into iteration. However, it splits the handling of
cases between different functions.
This commit puts the handling of all the cases into
AbortCurrentTransactionInternal() and CommitTransactionCommandInternal().
Now CommitTransactionCommand() and AbortCurrentTransaction() are just doing
the repeated calls of internal functions.
Reported-by: Andres Freund
Discussion: https://postgr.es/m/20240415224834.w6piwtefskoh32mv%40awork3.anarazel.de
Author: Andres Freund
|
|
This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2,
ee79928441, and 74eaf66f98 per review by Heikki Linnakangas.
Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
|
|
In one multixact.c edge case, we need a mechanism to wait for one
multixact offset to be written before being allowed to read the next
one. We used to handle this case by sleeping for one millisecond and
retrying, but such sleeps have been reported as problematic in
production cases. We can avoid the problem by using a condition
variable: readers sleep on it and then every creator of multixacts
broadcasts into the CV when creation is sufficiently far along.
Author: Kyotaro Horiguchi <horikyotajntt@gmail.com>
Reviewed-by: Andrey Borodin <amborodin@acm.org>
Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru
Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt
|
|
This tracks the position of WAL that's been fully copied into WAL
buffers by all processes emitting WAL. (For some reason we call that
"WAL insertion"). This is updated using atomic monotonic advance during
WaitXLogInsertionsToFinish, which is not when the insertions actually
occur, but it's the only place where we know where have all the
insertions have completed.
This value is useful in WALReadFromBuffers, which can verify that
callers don't try to read past what has been inserted. (However, more
infrastructure is needed in order to actually use WAL after the flush
point, since it could be lost.)
The value is also useful in WaitXLogInsertionsToFinish() itself, since
we can now exit quickly when all WAL has been already inserted, without
even having to take any locks.
|
|
Even though waiting for replay LSN happens without explicit transaction,
AbortTransaction() is responsible for the cleanup of the shared memory if
the error is thrown in a stored procedure. So, we need to do WaitLSNCleanup()
there to clean up after some unexpected error happened while waiting for
replay LSN.
Discussion: https://postgr.es/m/202404051815.eri4u5q6oj26%40alvherre.pgsql
Author: Alvaro Herrera
|
|
This removes the need to hold both the info_lck spinlock and
WALWriteLock to update them. We use stock atomic write instead, with
WALWriteLock held. Readers can use atomic read, without any locking.
This allows for some code to be reordered: some places were a bit
contorted to avoid repeated spinlock acquisition, but that's no longer a
concern, so we can turn them to more natural coding. Some further
changes are possible (maybe to performance wins), but in this commit I
did rather minimal ones only, to avoid increasing the blast radius.
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
|
|
After this change we have XLogCtl->logWriteResult and ->logFlushResult.
There's no functional change, other than the fact that the assignment
from shared memory to local is no longer done via struct assignment, but
instead using a macro that copies each member separately.
The current representation is inconvenient going forward; notably, we
would like to add a new member "Copy" (to keep track of the last
position copied into WAL buffers), so the symmetry between the values in
shared memory vs. those in local would be lost.
This also gives us freedom to later change the concurrency model for the
values in shared memory: we can make them use atomics instead of relying
on the info_lck spinlock.
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
|
|
06c418e163 introduced pg_wal_replay_wait() procedure allowing to wait for
the particular LSN to be replayed on standby. The waiters were stored in
the flat array. Even though scanning small arrays is fast, that might be a
problem at scale (a lot of waiting processes).
This commit replaces the flat shared memory array with the pairing heap,
which holds the waiter with the least LSN at the top. This gives us O(log N)
complexity for both inserting and removing waiters.
Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
|
|
This adds errcodes to a set of PANIC and FATAL errors in xlog.c
and relcache.c, which previously had no errcode at all set, in
order to make fleetwide analysis of errorlogs easier. There are
many more ereport/elogs left which could benefit from having an
errcode but this at least makes a dent in the issue.
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
|
|
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.
The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.
pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock. This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.
Catversion is bumped.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
|
|
Allow use of BeginInternalSubTransaction() in parallel mode, so long
as the subtransaction doesn't attempt to acquire an XID or increment
the command counter. Given those restrictions, the other parallel
processes don't need to know about the subtransaction at all, so
this should be safe. The benefit is that it allows subtransactions
intended for error recovery, such as pl/pgsql exception blocks,
to be used in PARALLEL SAFE functions.
Another reason for doing this is that the API of
BeginInternalSubTransaction() doesn't allow reporting failure.
pl/python for one, and perhaps other PLs, copes very poorly with an
error longjmp out of BeginInternalSubTransaction(). The headline
feature of this patch removes the only easily-triggerable failure
case within that function. There remain some resource-exhaustion
and similar cases, which we now deal with by promoting them to FATAL
errors, so that callers need not try to clean up. (It is likely
that such errors would leave us with corrupted transaction state
inside xact.c, making recovery difficult if not impossible anyway.)
Although this work started because of a report of a pl/python crash,
we're not going to do anything about that in the back branches.
Back-patching this particular fix is obviously not very wise.
While we could contemplate some narrower band-aid, pl/python is
already an untrusted language, so it seems okay to classify this
as a "so don't do that" case.
Patch by me, per report from Hao Zhang. Thanks to Robert Haas for
review.
Discussion: https://postgr.es/m/CALY6Dr-2yLVeVPhNMhuBnRgOZo1UjoTETgtKBx1B2gUi8yy+3g@mail.gmail.com
|
|
Similar to commit 7e735035f20.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4-WhpCFMbXCjtJ%2BFzmjfPrp7Hw1pk4p%2BZpU95Kh3ofZ1A%40mail.gmail.com
|
|
Two FATALs and one PANIC gain details about the LSNs they fail at:
- When restoring from a backup_label, the FATAL log generated when not
finding the checkpoint record now reports its LSN.
- When restoring from a backup_label, the FATAL log generated when not
finding the redo record referenced by a checkpoint record now shows both
the redo and checkpoint record LSNs.
- When not restoring from a backup_label, the PANIC error generated when
not finding the checkpoint record now reports its LSN.
This information is useful when debugging corruption issues, and these
LSNs may not show up in the logs depending on the level of logging
configured in the backend.
Author: David Steele
Discussion: https://postgr.es/m/0e90da89-77ca-4ccf-872c-9626d755e288@pgmasters.net
|
|
The function recurses, but didn't perform stack-depth checks. It's
just a debugging aid, so instead of the usual check_stack_depth()
call, stop the printing if we'd risk stack overflow.
Here's an example of how to test this:
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null
In the passing, swap building the list of child XIDs and recursing to
parent. That saves memory while recursing, reducing the risk of out of
memory errors with lots of subtransactions. The saving is not very
significant in practice, but this order seems more logical anyway.
Report by Egor Chindyaskin and Alexander Lakhin.
Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru
Author: Heikki Linnakangas
Reviewed-by: Robert Haas, Andres Freund, Alexander Korotkov
|