summaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
AgeCommit message (Collapse)Author
2007-10-12When telling the bgwriter that we need a checkpoint because too much xlogTom Lane
has been consumed, recheck against the latest value of RedoRecPtr before really sending the signal. This avoids useless checkpoint activity if XLogWrite is executed when we have a very stale local copy of RedoRecPtr. The potential for useless checkpoint is very much worse in 8.3 because of the walwriter process (which never does XLogInsert), so while this behavior was intentional, it needs to be changed. Per report from Itagaki Takahiro.
2007-09-30Adjust recovery PS display as agreed with Simon: 'waiting for XXX'Tom Lane
while the restore_command does its thing, then 'recovering XXX' while processing the segment file. These operations are heavyweight enough that an extra PS display set shouldn't bother anyone.
2007-09-29Make recovery show the current input WAL segment name in the startupTom Lane
process' PS display. After a suggestion by Simon (not exactly his patch though).
2007-09-29Make archive recovery always start a new timeline, rather than only when aTom Lane
recovery stop time was used. This avoids a corner-case risk of trying to overwrite an existing archived copy of the last WAL segment, and seems simpler and cleaner all around than the original definition. Per example from Jon Colverson and subsequent analysis by Simon.
2007-09-26Minor improvements in backup and recovery:Tom Lane
- create a separate archive_mode GUC, on which archive_command is dependent - %r option in recovery.conf sends last restartpoint to recovery command - %r used in pg_standby, updated README - minor other code cleanup in pg_standby - doc on Warm Standby now mentions pg_standby and %r - log_restartpoints recovery option emits LOG message at each restartpoint - end of recovery now displays last transaction end time, as requested by Warren Little; also shown at each restartpoint - restart archiver if needed to carry away WAL files at shutdown Simon Riggs
2007-09-08Replace the former method of determining snapshot xmax --- to wit, callingTom Lane
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid" variable that is updated during transaction commit or abort. Since latestCompletedXid is written only in places that had to lock ProcArrayLock exclusively anyway, and is read only in places that had to lock ProcArrayLock shared anyway, it adds no new locking requirements to the system despite being cluster-wide. Moreover, removing ReadNewTransactionId from snapshot acquisition eliminates the need to take both XidGenLock and ProcArrayLock at the same time. Since XidGenLock is sometimes held across I/O this can be a significant win. Some preliminary benchmarking suggested that this patch has no effect on average throughput but can significantly improve the worst-case transaction times seen in pgbench. Concept by Florian Pflug, implementation by Tom Lane.
2007-09-05Implement lazy XID allocation: transactions that do not modify any databaseTom Lane
rows will normally never obtain an XID at all. We already did things this way for subtransactions, but this patch extends the concept to top-level transactions. In applications where there are lots of short read-only transactions, this should improve performance noticeably; not so much from removal of the actual XID-assignments, as from reduction of overhead that's driven by the rate of XID consumption. We add a concept of a "virtual transaction ID" so that active transactions can be uniquely identified even if they don't have a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed over the short term, and no on-disk record is made about them. Florian Pflug, with some editorialization by Tom.
2007-08-28Add a debug logging message when a resource manager rejects an attemptedTom Lane
restart point. Per suggestion from Simon Riggs.
2007-08-13Fix two bugs induced in VACUUM FULL by async-commit patch.Tom Lane
First, we cannot assume that XLogAsyncCommitFlush guarantees hint bits will be settable, because clog.c's inexact LSN bookkeeping results in windows where a previously flushed transaction is considered unhintable because it shares an LSN slot with a later unflushed transaction. But repair_frag requires XMIN_COMMITTED to be correct so that it can distinguish tuples moved by the current vacuum. Since not being able to set the bit is an uncommon corner case, the most practical way of dealing with it seems to be to abandon shrinking (ie, don't invoke repair_frag) when we find a non-dead tuple whose XMIN_COMMITTED bit couldn't be set. Second, it is possible for the same reason that a RECENTLY_DEAD tuple does not get its XMAX_COMMITTED bit set during scan_heap. But by the time repair_frag examines the tuple it might be possible to set the bit. We therefore must take buffer content lock when calling HeapTupleSatisfiesVacuum a second time, else we can get an Assert failure in SetBufferCommitInfoNeedsSave. This latter bug is latent in existing releases, but I think it cannot actually occur without async commit, since the first HeapTupleSatisfiesVacuum call should always have set the bit. So I'm not going to back-patch it. In passing, reduce the existing "cannot shrink relation" messages from NOTICE to LOG level. The new message must be no higher than LOG if we don't want unpredictable regression test failures, and consistency seems like a good idea. Also arrange that only one such message is reported per VACUUM FULL; in typical scenarios you could get spammed with many such messages, which seems a bit useless.
2007-08-04Switch over to using the src/timezone functions for formatting timestampsTom Lane
displayed in the postmaster log. This avoids Windows-specific problems with localized time zone names that are in the wrong encoding, and generally seems like a good idea to forestall other potential platform-dependent issues. To preserve the existing behavior that all backends will log in the same time zone, create a new GUC variable log_timezone that can only be changed on a system-wide basis, and reference log-related calculations to that zone instead of the TimeZone variable. This fixes the issue reported by Hiroshi Saito that timestamps printed by xlog.c startup could be improperly localized on Windows. We still need a simpler patch for that problem in the back branches, however.
2007-08-01Support an optional asynchronous commit mode, in which we don't flush WALTom Lane
before reporting a transaction committed. Data consistency is still guaranteed (unlike setting fsync = off), but a crash may lose the effects of the last few transactions. Patch by Simon, some editorialization by Tom.
2007-07-24Create a new dedicated Postgres process, "wal writer", which exists to writeTom Lane
and fsync WAL at convenient intervals. For the moment it just tries to offload this work from backends, but soon it will be responsible for guaranteeing a maximum delay before asynchronously-committed transactions will be flushed to disk. This is a portion of Simon Riggs' async-commit patch, committed to CVS separately because a background WAL writer seems like it might be a good idea independently of the async-commit feature. I rebased walwriter.c on bgwriter.c because it seemed like a more appropriate way of handling signals; while the startup/shutdown logic in postmaster.c is more like autovac because we want walwriter to quit before we start the shutdown checkpoint.
2007-06-30Improve logging of checkpoints. Patch by Greg Smith, worked overTom Lane
by Heikki and a little bit by me.
2007-06-28Implement "distributed" checkpoints in which the checkpoint I/O is spreadTom Lane
over a fairly long period of time, rather than being spat out in a burst. This happens only for background checkpoints carried out by the bgwriter; other cases, such as a shutdown checkpoint, are still done at full speed. Remove the "all buffers" scan in the bgwriter, and associated stats infrastructure, since this seems no longer very useful when the checkpoint itself is properly throttled. Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas, and some minor API editorialization by me.
2007-05-31Make some messages more consistentPeter Eisentraut
2007-05-31Downgrade some low-level startup messages to DEBUG1.Peter Eisentraut
2007-05-30Make large sequential scans and VACUUMs work in a limited-size "ring" ofTom Lane
buffers, rather than blowing out the whole shared-buffer arena. Aside from avoiding cache spoliation, this fixes the problem that VACUUM formerly tended to cause a WAL flush for every page it modified, because we had it hacked to use only a single buffer. Those flushes will now occur only once per ring-ful. The exact ring size, and the threshold for seqscans to switch into the ring usage pattern, remain under debate; but the infrastructure seems done. The key bit of infrastructure is a new optional BufferAccessStrategy object that can be passed to ReadBuffer operations; this replaces the former StrategyHintVacuum API. This patch also changes the buffer usage-count methodology a bit: we now advance usage_count when first pinning a buffer, rather than when last unpinning it. To preserve the behavior that a buffer's lifetime starts to decrease when it's released, the clock sweep code is modified to not decrement usage_count of pinned buffers. Work not done in this commit: teach GiST and GIN indexes to use the vacuum BufferAccessStrategy for vacuum-driven fetches. Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-20To support external compression of archived WAL data, add a flag bit toTom Lane
WAL records that shows whether it is safe to remove full-page images (ie, whether or not an on-line backup was in progress when the WAL entry was made). Also make provision for an XLOG_NOOP record type that can be used to fill in the extra space when decompressing the data for restore. This is the portion of Koichi Suzuki's "full page writes" patch that has to go into the core database. The remainder of that work is two external compression and decompression programs, which for the time being will undergo separate development on pgfoundry. Per discussion. Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be possible to compress them (the previous coding caused essential info to be omitted). The other commonly-used record types seem OK already, with the possible exception of GIN and GIST WAL records, which I don't understand well enough to opine on.
2007-04-30Change the timestamps recorded in transaction commit/abort xlog recordsTom Lane
from time_t to TimestampTz representation. This provides full gettimeofday() resolution of the timestamps, which might be useful when attempting to do point-in-time recovery --- previously it was not possible to specify the stop point with sub-second resolution. But mostly this is to get rid of TimestampTz-to-time_t conversion overhead during commit. Per my proposal of a day or two back.
2007-04-03Remove the CheckpointStartLock in favor of having backends show whether theyTom Lane
are in their commit critical sections via flags in the ProcArray. Checkpoint can watch the ProcArray to determine when it's safe to proceed. This is a considerably better solution to the original problem of race conditions between checkpoint and transaction commit: it speeds up commit, since there's one less lock to fool with, and it prevents the problem of checkpoint being delayed indefinitely when there's a constant flow of commits. Heikki, with some kibitzing from Tom.
2007-04-03Decouple the values of TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE.Tom Lane
Add the latter to the values checked in pg_control, since it can't be changed without invalidating toast table content. This commit in itself shouldn't change any behavior, but it lays some necessary groundwork for experimentation with these toast-control numbers. Note: while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some thought still needs to be given to needs_toast_table() in toasting.c before unleashing random changes.
2007-03-03Remove undo information from pg_controldata --- never used.Bruce Momjian
Florian G. Pflug
2007-02-14Move fsync method macro defines into /include/access/xlogdefs.h so theyBruce Momjian
can be used by src/tools/fsync/test_fsync.c.
2007-02-08Normalize fgets() calls to use sizeof() for calculating the buffer sizePeter Eisentraut
where possible, and fix some sites that apparently thought that fgets() will overwrite the buffer by one byte. Also add some strlcpy() to eliminate some weird memory handling.
2007-02-07Remove the xlog-centric "database system is ready" message and replace it withTom Lane
"database system is ready to accept connections", which is issued by the postmaster when it really is ready to accept connections. Per proposal from Markus Schiltknecht and subsequent discussion.
2007-02-01Wording cleanup for error messages. Also change can't -> cannot.Bruce Momjian
Standard English uses "may", "can", and "might" in different ways: may - permission, "You may borrow my rake." can - ability, "I can lift that log." might - possibility, "It might rain today." Unfortunately, in conversational English, their use is often mixed, as in, "You may use this variable to do X", when in fact, "can" is a better choice. Similarly, "It may crash" is better stated, "It might crash".
2007-01-05Update CVS HEAD for 2007 copyright. Back branches are typically notBruce Momjian
back-stamped for this.
2006-12-08Remove the logId/logSeg fields from pg_control, because they are not neededTom Lane
in normal operation, and we can avoid rewriting pg_control at every log segment switch if we don't insist that these values be valid. Reducing the number of pg_control updates is a good idea for both performance and reliability. It does make pg_resetxlog's life a bit harder, but that seems a good tradeoff; and anyway the change to pg_resetxlog amounts to automating something people formerly needed to do by hand, namely look at the existing pg_xlog files to make sure the new WAL start point was past them. In passing, change the wording of xlog.c's "database system was interrupted" messages: describe the pg_control timestamp as "last known up at" rather than implying it is the exact time of service interruption. With this change the timestamp will generally be the time of the last checkpoint, which could be many minutes before the failure; and we've already seen indications that people tend to misinterpret the old wording. initdb forced due to change in pg_control layout. Simon Riggs and Tom Lane
2006-11-30Minor adjustments to make failures in startup/shutdown behave more cleanly.Tom Lane
StartupXLOG and ShutdownXLOG no longer need to be critical sections, because in all contexts where they are invoked, elog(ERROR) would be translated to elog(FATAL) anyway. (One change in bgwriter.c is needed to make this true: set ExitOnAnyError before trying to exit. This is a good fix anyway since the existing code would have gone into an infinite loop on elog(ERROR) during shutdown.) That avoids a misleading report of PANIC during semi-orderly failures. Modify the postmaster to include the startup process in the set of processes that get SIGTERM when a fast shutdown is requested, and also fix it to not try to restart the bgwriter if the bgwriter fails while trying to write the shutdown checkpoint. Net result is that "pg_ctl stop -m fast" does something reasonable for a system in warm standby mode, and so should Unix system shutdown (ie, universal SIGTERM). Per gripe from Stephen Harris and some corner-case testing of my own.
2006-11-21On systems that have setsid(2) (which should be just about everything exceptTom Lane
Windows), arrange for each postmaster child process to be its own process group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole process group not only the direct child process. This provides saner behavior for archive and recovery scripts; in particular, it's possible to shut down a warm-standby recovery server using "pg_ctl stop -m immediate", since delivery of SIGQUIT to the startup subprocess will result in killing the waiting recovery_command. Also, this makes Query Cancel and statement_timeout apply to scripts being run from backends via system(). (There is no support in the core backend for that, but it's widely done using untrusted PLs.) Per gripe from Stephen Harris and subsequent discussion.
2006-11-16String fixPeter Eisentraut
2006-11-10Clean up some misleading references to %p being a full path, per Simon.Tom Lane
2006-11-08Change Windows rename and unlink substitutes so that they time out afterTom Lane
30 seconds instead of retrying forever. Also modify xlog.c so that if it fails to rename an old xlog segment up to a future slot, it will unlink the segment instead. Per discussion of bug #2712, in which it became apparent that Windows can handle unlinking a file that's being held open, but not renaming it.
2006-11-05Fix recently-understood problems with handling of XID freezing, particularlyTom Lane
in PITR scenarios. We now WAL-log the replacement of old XIDs with FrozenTransactionId, so that such replacement is guaranteed to propagate to PITR slave databases. Also, rather than relying on hint-bit updates to be preserved, pg_clog is not truncated until all instances of an XID are known to have been replaced by FrozenTransactionId. Add new GUC variables and pg_autovacuum columns to allow management of the freezing policy, so that users can trade off the size of pg_clog against the amount of freezing work done. Revise the already-existing code that forces autovacuum of tables approaching the wraparound point to make it more bulletproof; also, revise the autovacuum logic so that anti-wraparound vacuuming is done per-table rather than per-database. initdb forced because of changes in pg_class, pg_database, and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-10-18Add some code to CREATE DATABASE to check for pre-existing subdirectoriesTom Lane
that conflict with the OID that we want to use for the new database. This avoids the risk of trying to remove files that maybe we shouldn't remove. Per gripe from Jon Lapham and subsequent discussion of 27-Sep.
2006-10-06Message style improvementsPeter Eisentraut
2006-10-04pgindent run for 8.2.Bruce Momjian
2006-08-21Make the server track an 'XID epoch', that is, maintain higher-order bitsTom Lane
of the transaction ID counter. Nothing is done with the epoch except to store it in checkpoint records, but this provides a foundation with which add-on code can pretend that XIDs never wrap around. This is a severely trimmed and rewritten version of the xxid patch submitted by Marko Kreen. Per discussion, the epoch counter seems the only part of xxid that really needs to be in the core server.
2006-08-17Implement archive_timeout feature to force xlog file switches to occur no moreTom Lane
than N seconds apart. This allows a simple, if not very high performance, means of guaranteeing that a PITR archive is no more than N seconds behind real time. Also make pg_current_xlog_location return the WAL Write pointer, add pg_current_xlog_insert_location to return the Insert pointer, and fix pg_xlogfile_name_offset to return its results as a two-element record instead of a smashed-together string, as per recent discussion. Simon Riggs
2006-08-07Make recovery from WAL be restartable, by executing a checkpoint-likeTom Lane
operation every so often. This improves the usefulness of PITR log shipping for hot standby: formerly, if the standby server crashed, it was necessary to restart it from the last base backup and replay all the WAL since then. Now it will only need to reread about the same amount of WAL as the master server would. The behavior might also come in handy during a long PITR replay sequence. Simon Riggs, with some editorialization by Tom Lane.
2006-08-06Add support for forcing a switch to a new xlog file; cause such a switchTom Lane
to happen automatically during pg_stop_backup(). Add some functions for interrogating the current xlog insertion point and for easily extracting WAL filenames from the hex WAL locations displayed by pg_stop_backup and friends. Simon Riggs with some editorialization by Tom Lane.
2006-07-30Modify snapshot definition so that lazy vacuums are ignored by otherAlvaro Herrera
vacuums. This allows a OLTP-like system with big tables to continue regular vacuuming on small-but-frequently-updated tables while the big tables are being vacuumed. Original patch from Hannu Krossing, rewritten by Tom Lane and updated by me.
2006-07-14Remove 576 references of include files that were not needed.Bruce Momjian
2006-07-13Allow include files to compile own their own.Bruce Momjian
Strip unused include files out unused include files, and add needed includes to C files. The next step is to remove unused include files in C files.
2006-06-27Put #ifdef NOT_USED around posix_fadvise call. We may want to resurrectTom Lane
this someday, but right now it seems that posix_fadvise is immature to the point of being broken on many platforms ... and we don't have any benchmark evidence proving it's worth spending time on.
2006-06-22pg_stop_backup was calling XLogArchiveNotify() twice for the newly createdTom Lane
backup history file. Bug introduced by the 8.1 change to make pg_stop_backup delete older history files. Per report from Masao Fujii.
2006-06-18Don't try to call posix_fadvise() unless <fcntl.h> supplies a declarationTom Lane
for it. Hopefully will fix core dump evidenced by some buildfarm members since fadvise patch went in. The actual definition of the function is not ABI-compatible with compiler's default assumption in the absence of any declaration, so it's clearly unsafe to try to call it without seeing a declaration.
2006-06-16Test for POSIX_FADV_DONTNEED to use posix_fadvise().Bruce Momjian
2006-06-15Use posix_fadvise() to avoid kernel caching of WAL contents on WAL fileBruce Momjian
close. ITAGAKI Takahiro
2006-04-20Ensure that we validate the page header of the first page of a WAL fileTom Lane
whenever we start to read within that file. The first page carries extra identification information that really ought to be checked, but as the code stood, this was only checked when we switched sequentially into a new WAL file, or if by chance the starting checkpoint record was within the first page. This patch ensures that we will detect bogus 'long header' information before we start replaying the WAL sequence.