Age | Commit message (Collapse) | Author |
|
When the startup process restores a WAL file from the archive, it deletes
any old file with the same name and renames the new file in its place. On
Windows, however, when a file is deleted, it still lingers as long as a
process holds a file handle open on it. With cascading replication, a
walsender process can hold the old file open, so the rename() in the startup
process would fail. To fix that, rename the old file to a temporary name, to
make the original file name available for reuse, before deleting the old
file.
|
|
Give the correct name of the GUC parameter being complained of.
Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE,
not the default INTERNAL_ERROR).
Gurjeet Singh, errcode adjustment by me
|
|
Reported by Peter Eisentraut.
|
|
The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, ie. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.
This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, ie. has the right size, but only N first bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.
|
|
This gets rid of a dangerous-looking use of the not-volatile XLogCtl
pointer in a couple of spinlock-protected sections, where the normal
coding rule is that you should only access shared memory through a
pointer-to-volatile. I think the risk is only hypothetical not actual,
since for there to be a bug the compiler would have to move the spinlock
acquire or release across the memcpy() call, which one sincerely hopes
it will not. Still, it looks cleaner this way.
Per comment from Daniel Farina and subsequent discussion.
|
|
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.
I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h. In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
|
|
The heapam XLog functions are used by other modules, not all of which
are interested in the rest of the heapam API. With this, we let them
get just the XLog stuff in which they are interested and not pollute
them with unrelated includes.
Also, since heapam.h no longer requires xlog.h, many files that do
include heapam.h no longer get xlog.h automatically, including a few
headers. This is useful because heapam.h is getting pulled in by
execnodes.h, which is in turn included by a lot of files.
|
|
This lets files that are mere users of ResourceOwner not automatically
include the headers for stuff that is managed by the resowner mechanism.
|
|
This improves on commit 51fed14d73ed3acd2282b531fb1396877e44e86a by
eliminating the assumption that we can form <some pointer value> +
<some offset> without overflow. The entire point of those tests is that
we don't trust the offset value, so coding them in a way that could wrap
around if the buffer happens to be near the top of memory doesn't seem
sound. Instead, track the remaining space as a size_t variable and
compare offsets against that.
Also, improve comment about why we need the extra early check on
xl_tot_len.
|
|
If a WAL record header was split across pages, but xl_tot_len was 0, we
would get confused and conclude that we had already read the whole record,
and proceed to CRC check it. That can lead to a crash in RecordIsValid(),
which isn't careful to not read beyond end-of-record, as defined by
xl_tot_len.
Add an explicit sanity check for xl_tot_len <= SizeOfXlogRecord. Also,
make RecordIsValid() more robust by checking in each step that it doesn't
try to access memory beyond end of record, even if a length field in the
record's or a backup block's header is bogus.
Per report and analysis by Tom Lane.
|
|
level, per report from Tom.
|
|
Cascading replication copied the incoming file into pg_xlog but
didn't set path correctly, so the first attempt to open file failed
causing it to loop around and look for file in pg_xlog. So the
earlier coding worked, but accidentally rather than by design.
Spotted by Fujii Masao, fix by Fujii Masao and Simon Riggs
|
|
This was broken in commit ed0b409d22346b1b027a4c2099ca66984d94b6dd,
which revised the GlobalTransactionData struct to not include the
associated PGPROC as its first member, but overlooked one place where
a cast was used in reliance on that equivalence.
The most effective way of fixing this seems to be to create a new function
that looks up the GlobalTransactionData struct given the XID, and make
both TwoPhaseGetDummyBackendId and TwoPhaseGetDummyProc rely on that.
Per report from Robert Ross.
|
|
Dave Kerr
|
|
mdinit() was misusing IsBootstrapProcessingMode() to decide whether to
create an fsync pending-operations table in the current process. This led
to creating a table not only in the startup and checkpointer processes as
intended, but also in the bgwriter process, not to mention other auxiliary
processes such as walwriter and walreceiver. Creation of the table in the
bgwriter is fatal, because it absorbs fsync requests that should have gone
to the checkpointer; instead they just sit in bgwriter local memory and are
never acted on. So writes performed by the bgwriter were not being fsync'd
which could result in data loss after an OS crash. I think there is no
live bug with respect to walwriter and walreceiver because those never
perform any writes of shared buffers; but the potential is there for
future breakage in those processes too.
To fix, make AuxiliaryProcessMain() export the current process's
AuxProcType as a global variable, and then make mdinit() test directly for
the types of aux process that should have a pendingOpsTable. Having done
that, we might as well also get rid of the random bool flags such as
am_walreceiver that some of the aux processes had grown. (Note that we
could not have fixed the bug by examining those variables in mdinit(),
because it's called from BaseInit() which is run by AuxiliaryProcessMain()
before entering any of the process-type-specific code.)
Back-patch to 9.2, where the problem was introduced by the split-up of
bgwriter and checkpointer processes. The bogus pendingOpsTable exists
in walwriter and walreceiver processes in earlier branches, but absent
any evidence that it causes actual problems there, I'll leave the older
branches alone.
|
|
Commit f11e8be3e812cdbbc139c1b4e49141378b118dee broke this; it was right
in Peter's original patch, but I messed it up before committing.
|
|
This avoids discriminating against wal_sync_method = open_sync or
open_datasync.
Fujii Masao, reviewed by Andres Freund
|
|
|
|
Per gripe from Tom Lane.
|
|
Instead of letting every backend participating in a group commit wait
independently, have the first one that becomes ready to flush WAL wait
for the configured delay, and let all the others wait just long enough
for that first process to complete its flush. This greatly increases
the chances of being able to configure a commit_delay setting that
actually improves performance.
As a side consequence of this change, commit_delay now affects all WAL
flushes, rather than just commits. There was some discussion on
pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay,
but in the absence of consensus I am leaving it alone for now.
Peter Geoghegan, with some changes, mostly to the documentation, by me.
|
|
Per testing by Andres Freund, this improves replication performance
and reduces replication latency and latency jitter. I was a bit
concerned about moving more work into XLogInsert, but testing seems
to show that it's not a problem in practice.
Along the way, improve comments for WaitLatchOrSocket.
Andres Freund. Review and stylistic cleanup by me.
|
|
If the record header is garbled, we're now quite likely to notice it before
we try to make a bogus memory allocation and run out of memory. That can
still happen, if the xlog record is split across pages (we cannot verify
the record header until reading the next page in that scenario), but this
reduces the chances. An out-of-memory is treated as a corrupt record
anyway, so this isn't a correctness issue, just a case of giving a better
error message.
Per Amit Kapila's suggestion.
|
|
This bug was introduced by commit 20d98ab6e4110087d1816cd105a40fcc8ce0a307,
so backpatch this to 9.0-9.2 like that one.
This fixes bug #6710, reported by Tarvi Pillessaar
|
|
Spotted by Amit Kapila
|
|
Fujii Masao
|
|
Reported by Amit Kapila.
|
|
Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in
proc.h with xlogdefs.h.
|
|
I didn't notice this on my laptop as I don't HAVE_FSYNC_WRITETHROUGH.
|
|
This simplifies code that needs to do arithmetic on XLogRecPtrs.
To avoid changing on-disk format of data pages, the LSN on data pages is
still stored in the old format. That should keep pg_upgrade happy. However,
we have XLogRecPtrs embedded in the control file, and in the structs that
are sent over the replication protocol, so this changes breaks compatibility
of pg_basebackup and server. I didn't do anything about this in this patch,
per discussion on -hackers, the right thing to do would to be to change the
replication protocol to be architecture-independent, so that you could use
a newer version of pg_receivexlog, for example, against an older server
version.
|
|
This saves a few bytes of WAL space, but the real motivation is to make it
predictable how much WAL space a record requires, as it no longer depends
on whether we need to waste the last few bytes at end of WAL page because
the header doesn't fit.
The total length field of WAL record, xl_tot_len, is moved to the beginning
of the WAL record header, so that it is still always found on the first page
where a WAL record begins.
Bump WAL version number again as this is an incompatible change.
|
|
The continuation record only contained one field, xl_rem_len, so it makes
things simpler to just include it in the WAL page header. This wastes four
bytes on pages that don't begin with a continuation from previos page, plus
four bytes on every page, because of padding.
The motivation of this is to make it easier to calculate how much space a
WAL record needs. Before this patch, it depended on how many page boundaries
the record crosses. The motivation of that, in turn, is to separate the
allocation of space in the WAL from the copying of the record data to the
allocated space. Keeping the calculation of space required simple helps to
keep the critical section of allocating the space from WAL short. But that's
not included in this patch yet.
Bump WAL version number again, as this is an incompatible change.
|
|
The comments claimed that wasting the last segment made it easier to do
calculations with XLogRecPtrs, because you don't have problems representing
last-byte-position-plus-1 that way. In my experience, however, it only made
things more complicated, because the there was two ways to represent the
boundary at the beginning of a logical log file: logid = n+1 and xrecoff = 0,
or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were
picky about which representation was used.
Also, use a 64-bit segment number instead of the log/seg combination, to
point to a certain WAL segment. We assume that all platforms have a working
64-bit integer type nowadays.
This is an incompatible change in WAL format, so bumping WAL version number.
|
|
Gurjeet Singh, with corrections by me.
|
|
Darold Gilles, reviewed by Gabriele Bartolini and others, rebased by
Marco Nenciarini. Stylistic cleanup and OID fixes by me.
|
|
There's no need to hold onto the locks until the files are needed,
and by doing it this way, we reduce the impact on other backends who
may be awaiting locks we hold.
Noah Misch
|
|
This reverts commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724. Per
discussion, it does not seem like a good idea to allow committed changes to
go un-checkpointed indefinitely, as could happen in a low-traffic server;
that makes us entirely reliant on the WAL stream with no redundancy that
might aid data recovery in case of disk failure.
This re-introduces the original problem of hot-standby setups generating a
small continuing stream of WAL traffic even when idle, but there are other
ways to address that without compromising crash recovery, so we'll revisit
that issue in a future release cycle.
|
|
commit-fest.
|
|
This provides a speedup of about 4X when NBuffers is large enough.
There is also a useful reduction in sinval traffic, since we
only do CacheInvalidateSmgr() once not once per fork.
Simon Riggs, reviewed and somewhat revised by Tom Lane
|
|
WALSender now woken up after each background flush by WALwriter, avoiding
multi-second replication delay for an all-async commit workload.
Replication delay reduced from 7s with default settings to 200ms and often
much less, allowing significantly reduced data loss at failover.
Andres Freund and Simon Riggs
|
|
|
|
AbortOutOfAnyTransaction failed to do anything if the state it saw on
entry corresponded to failing partway through StartTransaction. I fixed
AbortCurrentTransaction to cope with that case way back in commit
60b2444cc3ba037630c9b940c3c9ef01b954b87b, but evidently overlooked that
AbortOutOfAnyTransaction should do likewise.
Back-patch to all supported branches. It's not clear that this omission
has any more-than-cosmetic consequences, but it's also not clear that it
doesn't, so back-patching seems the least risky choice.
|
|
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.
Noah Misch
|
|
|
|
|
|
Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk. The
original version of the walwriter hibernation patch broke that. Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.
|
|
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption. It reverts to
normal as soon as it's done any successful flushes. It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.
Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for
reducing the server's power consumption when idle.
In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch. Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.
Peter Geoghegan, somewhat simplified by Tom
|
|
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds. By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)
The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time
The first four of these are new in 9.2, so there is no compatibility issue
from changing them. The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
|
|
Noted by Peter Geoghegan.
|
|
Josh Kupershmidt
|
|
Josh Kupershmidt
|