summaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
AgeCommit message (Collapse)Author
2014-01-31Introduce replication slots.Robert Haas
Replication slots are a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation. In a few places, we refer to the type of replication slots introduced by this patch as "physical" slots, because forthcoming patches for logical decoding will also have slots, but with somewhat different properties. Andres Freund and Robert Haas
2014-01-25Add recovery_target='immediate' option.Heikki Linnakangas
This allows ending recovery as a consistent state has been reached. Without this, there was no easy way to e.g restore an online backup, without replaying any extra WAL after the backup ended. MauMau and me.
2014-01-23Allow use of "z" flag in our printf calls, and use it where appropriate.Tom Lane
Since C99, it's been standard for printf and friends to accept a "z" size modifier, meaning "whatever size size_t has". Up to now we've generally dealt with printing size_t values by explicitly casting them to unsigned long and using the "l" modifier; but this is really the wrong thing on platforms where pointers are wider than longs (such as Win64). So let's start using "z" instead. To ensure we can do that on all platforms, teach src/port/snprintf.c to understand "z", and add a configure test to force use of that implementation when the platform's version doesn't handle "z". Having done that, modify a bunch of places that were using the unsigned-long hack to use "z" instead. This patch doesn't pretend to have gotten everyplace that could benefit, but it catches many of them. I made an effort in particular to ensure that all uses of the same error message text were updated together, so as not to increase the number of translatable strings. It's possible that this change will result in format-string warnings from pre-C99 compilers. We might have to reconsider if there are any popular compilers that will warn about this; but let's start by seeing what the buildfarm thinks. Andres Freund, with a little additional work by me
2014-01-14Fix multiple bugs in index page locking during hot-standby WAL replay.Tom Lane
In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
2014-01-09Refactor checking whether we've reached the recovery target.Heikki Linnakangas
Makes the replay loop slightly more readable, by separating the concerns of whether to stop and whether to delay, and how to extract the timestamp from a record. This has the user-visible change that the timestamp of the last applied record is now updated after actually applying it. Before, it was updated just before applying it. That meant that pg_last_xact_replay_timestamp() could return the timestamp of a commit record that is in process of being replayed, but not yet applied. Normally the difference is small, but if min_recovery_apply_delay is set, there could be a significant delay between reading a record and applying it. Another behavioral change is that if you recover to a restore point, we stop after the restore point record, not before it. It makes no difference as far as running queries on the server is concerned, as applying a restore point record changes nothing, but if examine the timeline history you will see that the new timeline branched off just after the restore point record, not before it. One practical consequence is that if you do PITR to the new timeline, and set recovery target to the same named restore point again, it will find and stop recovery at the same restore point. Conceptually, I think it makes more sense to consider the restore point as part of the new timeline's history than not. In principle, setting the last-replayed timestamp before actually applying the record was a bug all along, but it doesn't seem worth the risk to backpatch, since min_recovery_apply_delay was only added in 9.4.
2014-01-08Fix pause_at_recovery_target + recovery_target_inclusive combination.Heikki Linnakangas
If pause_at_recovery_target is set, recovery pauses *before* applying the target record, even if recovery_target_inclusive is set. If you then continue with pg_xlog_replay_resume(), it will apply the target record before ending recovery. In other words, if you log in while it's paused and verify that the database looks OK, ending recovery changes its state again, possibly destroying data that you were tring to salvage with PITR. Backpatch to 9.1, this has been broken since pause_at_recovery_target was added.
2014-01-08If multiple recovery_targets are specified, use the latest one.Heikki Linnakangas
The docs say that only one of recovery_target_xid, recovery_target_time, or recovery_target_name can be specified. But the code actually did something different, so that a name overrode time, and xid overrode both time and name. Now the target specified last takes effect, whether it's an xid, time or name. With this patch, we still accept multiple recovery_target settings, even though docs say that only one can be specified. It's a general property of the recovery.conf file parser that you if you specify the same option twice, the last one takes effect, like with postgresql.conf.
2014-01-08Fix bug in determining when recovery has reached consistency.Heikki Linnakangas
When starting WAL replay from an online checkpoint, the last replayed WAL record variable was initialized using the checkpoint record's location, even though the records between the REDO location and the checkpoint record had not been replayed yet. That was noted as "slightly confusing" but harmless in the comment, but in some cases, it fooled CheckRecoveryConsistency to incorrectly conclude that we had already reached a consistent state immediately at the beginning of WAL replay. That caused the system to accept read-only connections in hot standby mode too early, and also PANICs with message "WAL contains references to invalid pages". Fix by initializing the variables to the REDO location instead. In 9.2 and above, change CheckRecoveryConsistency() to use lastReplayedEndRecPtr variable when checking if backup end location has been reached. It was inconsistently using EndRecPtr for that check, but lastReplayedEndRecPtr when checking min recovery point. It made no difference before this patch, because in all the places where CheckRecoveryConsistency was called the two variables were the same, but it was always an accident waiting to happen, and would have been wrong after this patch anyway. Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0, where hot standby was introduced.
2014-01-07Update copyright for 2014Bruce Momjian
Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
2014-01-07Move permissions check from do_pg_start_backup to pg_start_backupMagnus Hagander
And the same for do_pg_stop_backup. The code in do_pg_* is not allowed to access the catalogs. For manual base backups, the permissions check can be handled in the calling function, and for streaming base backups only users with the required permissions can get past the authentication step in the first place. Reported by Antonin Houska, diagnosed by Andres Freund
2014-01-01Rename walLogHints to wal_log_hints for easier grepping.Robert Haas
Michael Paquier
2013-12-21Rename wal_log_hintbits to wal_log_hints, per discussion on pgsql-hackers.Fujii Masao
Sawada Masahiko
2013-12-13Fix more instances of "the the" in comments.Heikki Linnakangas
Plus one instance of "to to" in the docs.
2013-12-13Add GUC to enable WAL-logging of hint bits, even with checksums disabled.Heikki Linnakangas
WAL records of hint bit updates is useful to tools that want to examine which pages have been modified. In particular, this is required to make the pg_rewind tool safe (without checksums). This can also be used to test how much extra WAL-logging would occur if you enabled checksums, without actually enabling them (which you can't currently do without re-initdb'ing). Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with further changes by me.
2013-12-12Allow time delayed standbys and recoverySimon Riggs
Set min_recovery_apply_delay to force a delay in recovery apply for commit and restore point WAL records. Other records are replayed immediately. Delay is measured between WAL record time and local standby time. Robert Haas, Fabrízio de Royes Mello and Simon Riggs Detailed review by Mitsumasa Kondo
2013-12-11Remove bogus executable permissions on xlog.c.Tom Lane
Apparently fat-fingered in 1a3d104475ce01326fc00601ed66ac4d658e37e5. Noted by Peter Geoghegan.
2013-12-10Add new wal_level, logical, sufficient for logical decoding.Robert Haas
When wal_level=logical, we'll log columns from the old tuple as configured by the REPLICA IDENTITY facility added in commit 07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65. This makes it possible a properly-configured logical replication solution to correctly follow table updates even if they change the chosen key columns, or, with REPLICA IDENTITY FULL, even if the table has no key at all. Note that updates which do not modify the replica identity column won't log anything extra, making the choice of a good key (i.e. one that will rarely be changed) important to performance when wal_level=logical is configured. Each insert, update, or delete to a catalog table will also log the CMIN and/or CMAX values of stamped by the current transaction. This is necessary because logical decoding will require access to historical snapshots of the catalog in order to decode some data types, and the CMIN/CMAX values that we may need in order to judge row visibility may have been overwritten by the time we need them. Andres Freund, reviewed in various versions by myself, Heikki Linnakangas, KONDO Mitsumasa, and many others.
2013-11-29Truncate pg_multixact/'s contents during crash recoveryAlvaro Herrera
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash recovery, because there was no guarantee that enough state had been setup, and because it wasn't deemed to be a good idea to remove data during crash recovery anyway. Since then, due to Hot-Standby, streaming replication and PITR, the amount of time a cluster can spend doing crash recovery has increased significantly, to the point that a cluster may even never come out of it. This has made not truncating the content of pg_multixact/ not defensible anymore. To fix, take care to setup enough state for multixact truncation before crash recovery starts (easy since checkpoints contain the required information), and move the current end-of-recovery actions to a new TrimMultiXact() function, analogous to TrimCLOG(). At some later point, this should probably done similarly to the way clog.c is doing it, which is to just WAL log truncations, but we can't do that for the back branches. Back-patch to 9.0. 8.4 also has the problem, but since there's no hot standby there, it's much less pressing. In 9.2 and earlier, this patch is simpler than in newer branches, because multixact access during recovery isn't required. Add appropriate checks to make sure that's not happening. Andres Freund
2013-11-22Avoid acquiring spinlock when checking if recovery has finished, for speed.Heikki Linnakangas
RecoveryIsInProgress() can be called very frequently. During normal operation, it just checks a backend-local variable and returns quickly, but during hot standby, it checks a spinlock-protected shared variable. Those spinlock acquisitions can become a point of contention on a busy hot standby system. Replace the spinlock acquisition with a memory barrier. Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.
2013-10-31Use appendStringInfoString instead of appendStringInfo where possible.Robert Haas
This shaves a few cycles, and generally seems like good programming practice. David Rowley
2013-10-08TYPEALIGN doesn't work on int64 on 32-bit platforms.Heikki Linnakangas
The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with values larger than intptr_t, because TYPEALIGN casts the argument to intptr_t to do the arithmetic. That's not a problem when dealing with pointers or lengths or offsets related to pointers, but the XLogInsert scaling patch added a call to MAXALIGN with an XLogRecPtr argument. To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64, which are just like the existing variants but work with uint64 instead of intptr_t. Report and patch by David Rowley, analysis by Andres Freund.
2013-09-16Add a GUC to report whether data page checksums are enabled.Heikki Linnakangas
Bernd Helmle
2013-09-04Revert WAL posix_fallocate() patches.Jeff Davis
This reverts commit 269e780822abb2e44189afaccd6b0ee7aefa7ddd and commit 5b571bb8c8d2bea610e01ae1ee7bc05adcfff528. Unfortunately, the initial patch had insufficient performance testing, and resulted in a regression. Per report by Thom Brown.
2013-09-04Keep heavily-contended fields in XLogCtlInsert on different cache lines.Heikki Linnakangas
Performance testing shows that if the insertpos_lck spinlock and the fields that it protects are on the same cache line with other variables that are frequently accessed, the false sharing can hurt performance a lot. Keep them apart by adding some padding.
2013-08-19Rename the "fast_promote" file to just "promote".Heikki Linnakangas
This keeps the usual trigger file name unchanged from 9.2, avoiding nasty issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa. The fallback behavior of creating a full checkpoint before starting up is now triggered by a file called "fallback_promote". That can be useful for debugging purposes, but we don't expect any users to have to resort to that and we might want to remove that in the future, which is why the fallback mechanism is undocumented.
2013-08-09Message punctuation and pluralization fixesPeter Eisentraut
2013-07-28Message style improvementsPeter Eisentraut
2013-07-17Fix variable names mentioned in comment to match the code.Heikki Linnakangas
Also, in another comment, explain why holding an insertion slot is a critical section. Per review by Amit Kapila.
2013-07-17Fix assert failure at end of recovery, broken by XLogInsert scaling patch.Heikki Linnakangas
Initialization of the first XLOG buffer at end-of-recovery was broken for the case that the last read WAL record ended at a page boundary. Instead of trying to copy the last full xlog page to the buffer cache in that case, just set shared state so that the next page is initialized when the first WAL record after startup is inserted. (that's what we did in earlier version, too) To make the shared state required for that case less surprising, replace the XLogCtl->curridx variable, which was the index of the latest initialized buffer, with an XLogRecPtr of how far the buffers have been initialized. That also allows us to get rid of the XLogRecEndPtrToBufIdx macro. While we're at it, make a similar change for XLogCtl->Write.curridx, getting rid of that variable and calculating the next buffer to write from XLogCtl->LogwrtResult instead.
2013-07-08Fix Windows build.Heikki Linnakangas
Was broken by my xloginsert scaling patch. XLogCtl global variable needs to be initialized in each process, as it's not inherited by fork() on Windows.
2013-07-08Improve scalability of WAL insertions.Heikki Linnakangas
This patch replaces WALInsertLock with a number of WAL insertion slots, allowing multiple backends to insert WAL records to the WAL buffers concurrently. This is particularly useful for parallel loading large amounts of data on a system with many CPUs. This has one user-visible change: switching to a new WAL segment with pg_switch_xlog() now fills the remaining unused portion of the segment with zeros. This potentially adds some overhead, but it has been a very common practice by DBA's to clear the "tail" of the segment with an external pg_clearxlogtail utility anyway, to make the WAL files compress better. With this patch, it's no longer necessary to do that. This patch adds a new GUC, xloginsert_slots, to tune the number of WAL insertion slots. Performance testing suggests that the default, 8, works pretty well for all kinds of worklods, but I left the GUC in place to allow others with different hardware to test that easily. We might want to remove that before release. Reviewed by Andres Freund.
2013-07-06Handle posix_fallocate() errors.Jeff Davis
On some platforms, posix_fallocate() is available but may still return EINVAL if the underlying filesystem does not support it. So, in case of an error, fall through to the alternate implementation that just writes zeros. Per buildfarm failure and analysis by Tom Lane.
2013-07-05Use posix_fallocate() for new WAL files, where available.Jeff Davis
This function is more efficient than actually writing out zeroes to the new file, per microbenchmarks by Jon Nelson. Also, it may reduce the likelihood of WAL file fragmentation. Jon Nelson, with review by Andres Freund, Greg Smith and me.
2013-07-04Add new GUC, max_worker_processes, limiting number of bgworkers.Robert Haas
In 9.3, there's no particular limit on the number of bgworkers; instead, we just count up the number that are actually registered, and use that to set MaxBackends. However, that approach causes problems for Hot Standby, which needs both MaxBackends and the size of the lock table to be the same on the standby as on the master, yet it may not be desirable to run the same bgworkers in both places. 9.3 handles that by failing to notice the problem, which will probably work fine in nearly all cases anyway, but is not theoretically sound. A further problem with simply counting the number of registered workers is that new workers can't be registered without a postmaster restart. This is inconvenient for administrators, since bouncing the postmaster causes an interruption of service. Moreover, there are a number of applications for background processes where, by necessity, the background process must be started on the fly (e.g. parallel query). While this patch doesn't actually make it possible to register new background workers after startup time, it's a necessary prerequisite. Patch by me. Review by Michael Paquier.
2013-07-01Retry short writes when flushing WAL.Heikki Linnakangas
We don't normally bother retrying when the number of bytes written by write() is short of what was requested. It is generally assumed that a write() to disk doesn't return short, unless you run out of disk space. While writing the WAL, however, it seems prudent to try a bit harder, because a failure leads to PANIC. The write() is also much larger than most write()s in the backend (up to wal_buffers), so there's more room for surprises. Also retry on EINTR. All signals used in the backend are flagged SA_RESTART nowadays, so it shouldn't happen, but better to be defensive.
2013-06-23Ensure no xid gaps during Hot Standby startupSimon Riggs
In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund
2013-06-17Add buffer_std flag to MarkBufferDirtyHint().Jeff Davis
MarkBufferDirtyHint() writes WAL, and should know if it's got a standard buffer or not. Currently, the only callers where buffer_std is false are related to the FSM. In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive. Back-patch to 9.3.
2013-06-13Remove special-case treatment of LOG severity level in standalone mode.Tom Lane
elog.c has historically treated LOG messages as low-priority during bootstrap and standalone operation. This has led to confusion and even masked a bug, because the normal expectation of code authors is that elog(LOG) will put something into the postmaster log, and that wasn't happening during initdb. So get rid of the special-case rule and make the priority order the same as it is in normal operation. To keep from cluttering initdb's output and the behavior of a standalone backend, tweak the severity level of three messages routinely issued by xlog.c during startup and shutdown so that they won't appear in these cases. Per my proposal back in December.
2013-06-12Observe array length in HaveVirtualXIDsDelayingChkpt().Noah Misch
Since commit f21bb9cfb5646e1793dcc9c0ea697bab99afa523, this function ignores the caller-provided length and loops until it finds a terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the previous loop control logic. In passing, revert the addition of an unused variable by the same commit, presumably a debugging relic.
2013-06-06Fix typo in comment.Heikki Linnakangas
2013-06-03Code review of recycling WAL segments in a restartpoint.Heikki Linnakangas
Seems cleaner to get the currently-replayed TLI in the same call to GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the comment what the code does when recovery has already ended (RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make resetting ThisTimeLineID afterwards more explicit.
2013-06-01Post-pgindent cleanupStephen Frost
Make slightly better decisions about indentation than what pgindent is capable of. Mostly breaking out long function calls into one line per argument, with a few other minor adjustments. No functional changes- all whitespace. pgindent ran cleanly (didn't change anything) after. Passes all regressions.
2013-05-29pgindent run for release 9.3Bruce Momjian
This is the first run of the Perl-based pgindent script. Also update pgindent instructions.
2013-05-21After fast promotion use CHECKPOINT_FORCESimon Riggs
Not necessary for correctness, just to make log_checkpoints output look less singular. Requested by Fujii Masao
2013-05-21Maintain ThisTimeLineID correctly in checkpointerSimon Riggs
checkpointer needs to reset ThisTimeLineID after a restartpoint to allow installing/recycling new WAL files. If recovery has already ended this would leave ThisTimeLineID set incorrectly and so we must reset it otherwise later checkpoints do not have the correct timeline. Bug report by Heikki Linnakangas. Further investigation by Heikki and myself.
2013-05-19Init crash recovery using the latest available TLISimon Riggs
This simplifies the handling of crashes after fast promotion and various minor cases that can exist in short timing windows around that case. Broad fix to bug reported by Michael Paquier on -hackers, approach prompted by Heikki Linnakangas
2013-05-19Emit msg correctly for timeline-crossing crashSimon Riggs
2013-05-19Remove single space on end of a line in xlog.cSimon Riggs
Michael Paquier
2013-05-08Fix walsender failure at promotion.Heikki Linnakangas
If a standby server has a cascading standby server connected to it, it's possible that WAL has already been sent up to the next WAL page boundary, splitting a WAL record in the middle, when the first standby server is promoted. Don't throw an assertion failure or error in walsender if that happens. Also, fix a variant of the same bug in pg_receivexlog: if it had already received WAL on previous timeline up to a segment boundary, when the upstream standby server is promoted so that the timeline switch record falls on the previous segment, pg_receivexlog would miss the segment containing the timeline switch. To fix that, have walsender send the position of the timeline switch at end-of-streaming, in addition to the next timeline's ID. It was previously assumed that the switch happened exactly where the streaming stopped. Note: this is an incompatible change in the streaming protocol. You might get an error if you try to stream over timeline switches, if the client is running 9.3beta1 and the server is more recent. It should be fine after a reconnect, however. Reported by Fujii Masao.
2013-04-30Record data_checksum_version in control file.Simon Riggs
The value is not used anywhere in code, but will allow future changes to the checksum version should that become necessary in the future.