user/sven/postgresql.git

Age	Commit message	Author
2012-01-13	Correctly initialise shared recoveryLastRecPtr in recovery.	Simon Riggs
	Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed. Fujii Masao
2012-01-11	Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always	Heikki Linnakangas
	passed as 'true'.
2012-01-11	Refactor XLogInsert a bit. The rdata entries for backup blocks are now	Heikki Linnakangas
	constructed before acquiring WALInsertLock, which slightly reduces the time the lock is held. Although I could not measure any benefit in benchmarks, the code is more readable this way.
2012-01-06	Make the number of CLOG buffers adaptive, based on shared_buffers.	Robert Haas
	Previously, this was hardcoded: we always had 8. Performance testing shows that isn't enough, especially on big SMP systems, so we allow it to scale up as high as 32 when there's adequate memory. On the flip side, when shared_buffers is very small, drop the number of CLOG buffers down to as little as 4, so that we can start the postmaster even when very little shared memory is available. Per extensive discussion with Simon Riggs, Tom Lane, and others on pgsql-hackers.
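The scaling rule described above can be sketched as a clamp. The bounds 4 and 32 come from the commit message; the divisor (one CLOG buffer per 512 shared buffers) and the function name are assumptions made for illustration, not taken from the commit.

```c
#include <assert.h>

/* Bounds stated in the commit message. */
#define MIN_CLOG_BUFFERS 4
#define MAX_CLOG_BUFFERS 32

/* Assumed ratio, for illustration only: one CLOG buffer per 512 shared buffers. */
#define SHARED_BUFFERS_PER_CLOG_BUFFER 512

/*
 * Hypothetical helper: scale the CLOG buffer count with shared_buffers
 * (measured in buffers, not bytes) and clamp it to [4, 32].
 */
int
clog_buffers_for(int shared_buffers)
{
    int n;

    assert(shared_buffers >= 0);

    n = shared_buffers / SHARED_BUFFERS_PER_CLOG_BUFFER;
    if (n < MIN_CLOG_BUFFERS)
        n = MIN_CLOG_BUFFERS;
    if (n > MAX_CLOG_BUFFERS)
        n = MAX_CLOG_BUFFERS;
    return n;
}
```

Under these assumptions a very small setting such as 16 shared buffers yields 4 CLOG buffers, 4096 shared buffers yields the old default of 8, and anything from 16384 shared buffers upward is capped at 32.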
2012-01-01	Update copyright notices for year 2012.	Bruce Momjian

2011-12-31	Send new protocol keepalive messages to standby servers.	Simon Riggs
	Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.
2011-12-20	Avoid crashing when we have problems unlinking files post-commit.	Tom Lane
	smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit 3396000684b41e7e9467d1abc67152b39e697035, which put an smgrexists call in front of it; smgrexists does throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.
2011-12-17	Fix some long-obsolete references to XLogOpenRelation.	Tom Lane
	These were missed in commit a213f1ee6c5a1bbe1f074ca201975e76ad2ed50c, which removed that function.
2011-12-17	Add SP-GiST (space-partitioned GiST) index access method.	Tom Lane
	SP-GiST is comparable to GiST in flexibility, but supports non-balanced partitioned search structures rather than balanced trees. As described at PGCon 2011, this new indexing structure can beat GiST in both index build time and query speed for search problems that it is well matched to. There are a number of areas that could still use improvement, but at this point the code seems committable. Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
2011-12-12	Move BKP_REMOVABLE bit from individual WAL records to WAL page headers.	Tom Lane
	Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future. When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so. Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
2011-12-09	Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery,	Heikki Linnakangas
	we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.
2011-12-02	During recovery, if we reach consistent state and still have entries in the	Heikki Linnakangas
	invalid-page hash table, PANIC immediately. Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server. Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better. Fujii Masao
2011-11-25	Move "hot" members of PGPROC into a separate PGXACT array.	Robert Haas
	This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-13	Wakeup WALWriter as needed for asynchronous commit performance.	Simon Riggs
	Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.
2011-11-04	Move user functions related to WAL into xlogfuncs.c	Simon Riggs

2011-11-02	Update more comments about checkpoints being done by bgwriter	Simon Riggs

2011-11-02	Reduce checkpoints and WAL traffic on low activity database server	Simon Riggs
	Previously, we skipped a checkpoint if no WAL had been written since the last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until we have written at least enough WAL to switch to the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.
2011-11-02	Refactor xlog.c to create src/backend/postmaster/startup.c	Simon Riggs
	Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c
2011-11-02	Derive oldestActiveXid at correct time for Hot Standby.	Simon Riggs
	There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop
2011-11-02	Fix timing of Startup CLOG and MultiXact during Hot Standby	Simon Riggs
	Patch by me, bug report by Chris Redekop, analysis by Florian Pflug
2011-11-01	Comment changes to show bgwriter no longer performs checkpoints.	Simon Riggs

2011-10-22	Support synchronization of snapshots through an export/import procedure.	Tom Lane
	A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are no security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane
2011-10-18	Suppress -Wunused-result warnings about write() and fwrite().	Tom Lane
	This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.
2011-10-04	Fix uninitialized-variable bug.	Tom Lane

2011-10-04	Use callbacks in SlruScanDirectory for the actual action	Alvaro Herrera
	Previously, the code assumed that the only possible action to take was to delete files behind a certain cutoff point. The async notify code was already a crock: it used a different "pagePrecedes" function for truncation than for regular operation. By allowing it to pass a callback to SlruScanDirectory it can do cleanly exactly what it needs to do. The clog.c code also had its own use for SlruScanDirectory, which is made a bit simpler with this.
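The callback approach can be illustrated with a small sketch. It does not use the real SlruScanDirectory signature: every name below is hypothetical, the "directory" is an in-memory array of segment numbers, and the plain `<` comparison ignores the wraparound handling a real "pagePrecedes" function would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: return true to stop the scan early. */
typedef bool (*scan_callback) (int segno, void *arg);

/* Visit every segment and let the callback decide what to do with it. */
bool
scan_segments(const int *segnos, size_t n, scan_callback cb, void *arg)
{
    assert(cb != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (cb(segnos[i], arg))
            return true;        /* callback asked to stop */
    }
    return false;
}

/* Callback 1: only report whether some segment precedes the cutoff. */
bool
report_before_cutoff(int segno, void *arg)
{
    int cutoff = *(int *) arg;

    return segno < cutoff;
}

/* Callback 2: act on every segment before the cutoff (counting stands in for deleting). */
struct count_state
{
    int cutoff;
    int count;
};

bool
count_before_cutoff(int segno, void *arg)
{
    struct count_state *state = arg;

    if (segno < state->cutoff)
        state->count++;
    return false;               /* keep scanning */
}
```

The scan loop no longer knows what "the action" is, so a caller that needs a different comparison or a different action supplies its own callback rather than working around a hardcoded delete.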
2011-10-02	Restructure error handling in reading of postgresql.conf.	Tom Lane
	This patch has two distinct purposes: to report multiple problems in postgresql.conf rather than always bailing out after the first one, and to change the policy for whether changes are applied when there are unrelated errors in postgresql.conf. Formerly the policy was to apply no changes if any errors could be detected, but that had a significant consistency problem, because in some cases specific values might be seen as valid by some processes but invalid by others. This meant that the latter processes would fail to adopt changes in other parameters even though the former processes had done so. The new policy is that during SIGHUP, the file is rejected as a whole if there are any errors in the "name = value" syntax, or if any lines attempt to set nonexistent built-in parameters, or if any lines attempt to set custom parameters whose prefix is not listed in (the new value of) custom_variable_classes. These tests should always give the same results in all processes, and provide what seems a reasonably robust defense against loading values from badly corrupted config files. If these tests pass, all processes will apply all settings that they individually see as good, ignoring (but logging) any they don't. In addition, the postmaster does not abandon reading a configuration file after the first syntax error, but continues to read the file and report syntax errors (up to a maximum of 100 syntax errors per file). The postmaster will still refuse to start up if the configuration file contains any errors at startup time, but these changes allow multiple errors to be detected and reported before quitting. Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?) with some additional hacking by Tom Lane
2011-09-26	Allow snapshot references to still work during transaction abort.	Tom Lane
	In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do GetTransactionSnapshot() between AbortTransaction and CleanupTransaction failed, because GetTransactionSnapshot would recompute the transaction snapshot (which is already wrong, given the isolation mode) and then re-register it in the TopTransactionResourceOwner, leading to an Assert because the TopTransactionResourceOwner should be empty of resources after AbortTransaction. This is the root cause of bug #6218 from Yamamoto Takashi. While changing plancache.c to avoid requesting a snapshot when handling a ROLLBACK masks the problem, I think this is really a snapmgr.c bug: it's lower-level than the resource manager mechanism and should not be shutting itself down before we unwind resource manager resources. However, just postponing the release of the transaction snapshot until cleanup time didn't work because of the circular dependency with TopTransactionResourceOwner. Fix by managing the internal reference to that snapshot manually instead of depending on TopTransactionResourceOwner. This saves a few cycles as well as making the module layering more straightforward. predicate.c's dependencies on TopTransactionResourceOwner go away too. I think this is a longstanding bug, but there's no evidence that it's more than a latent bug, so it doesn't seem worth any risk of back-patching.
2011-09-09	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h.	Tom Lane
	As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.
2011-09-07	Partially revoke attempt to improve performance with many savepoints.	Simon Riggs
	Maintain difference between subtransaction release and commit introduced by earlier patch.
2011-09-05	Adjust translator comment format to xgettext expectations	Alvaro Herrera

2011-09-05	Mark some untranslatable messages with errmsg_internal	Alvaro Herrera

2011-09-04	Clean up the #include mess a little.	Tom Lane
	walsender.h should depend on xlog.h, not vice versa. (Actually, the inclusion was circular until a couple hours ago, which was even sillier; but Bruce broke it in the expedient rather than logically correct direction.) Because of that poor decision, plus blind application of pgrminclude, we had a situation where half the system was depending on xlog.h to include such unrelated stuff as array.h and guc.h. Clean up the header inclusion, and manually revert a lot of what pgrminclude had done so things build again. This episode reinforces my feeling that pgrminclude should not be run without adult supervision. Inclusion changes in header files in particular need to be reviewed with great care. More generally, it'd be good if we had a clearer notion of module layering to dictate which headers can sanely include which others ... but that's a big task for another day.
2011-09-03	Whitespace adjustment for consistency in the file	Peter Eisentraut

2011-09-01	Remove unnecessary #include references, per pgrminclude script.	Bruce Momjian

2011-08-29	Remove some tabs from README file.	Robert Haas
	Some of the ASCII art expected 8-space tab stops, and some of it expected 4-space tab stops. Per report from YAMAMOTO Takashi.
2011-08-26	Add missing includes after pgrminclude run.	Bruce Momjian

2011-08-17	Fix comment about which version had BACKUP METHOD line in backup_label, again.	Heikki Linnakangas
	It was invalidated again by Fujii's patch to 9.1.
2011-08-16	Fix race condition in relcache init file invalidation.	Tom Lane
	The previous code tried to synchronize by unlinking the init file twice, but that doesn't actually work: it leaves a window wherein a third process could read the already-stale init file but miss the SI messages that would tell it the data is stale. The result would be bizarre failures in catalog accesses, typically "could not read block 0 in file ..." later during startup. Instead, hold RelCacheInitLock across both the unlink and the sending of the SI messages. This is more straightforward, and might even be a bit faster since only one unlink call is needed. This has been wrong since it was put in (in 2002!), so back-patch to all supported releases.
2011-08-16	Fix bogus comment that claimed that the new BACKUP METHOD line in	Heikki Linnakangas
	backup_label was new in 9.0. Spotted by Fujii Masao.
2011-08-10	Change the autovacuum launcher to use WaitLatch instead of a poll loop.	Tom Lane
	In pursuit of this (and with the expectation that WaitLatch will be needed in more places), convert the latch field that was already added to PGPROC for sync rep into a generic latch that is activated for all PGPROC-owning processes, and change many of the standard backend signal handlers to set that latch when a signal happens. This will allow WaitLatch callers to be wakened properly by these signals. In passing, fix a whole bunch of signal handlers that had been hacked to do things that might change errno, without adding the necessary save/restore logic for errno. Also make some minor fixes in unix_latch.c, and clean up bizarre and unsafe scheme for disowning the process's latch. Much of this has to be back-patched into 9.1. Peter Geoghegan, with additional work by Tom
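The errno save/restore logic mentioned above follows a well-known pattern, sketched here. This is not the PostgreSQL handler code: the handler, the notification descriptor, and the installer function are hypothetical names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical state: a flag for the main loop and a descriptor to poke. */
volatile sig_atomic_t got_signal = 0;
int notify_fd = -1;

/*
 * A handler that calls anything able to set errno (here, write()) must
 * save errno on entry and restore it on exit. Otherwise the interrupted
 * code may see a clobbered errno after its own system call.
 */
void
wakeup_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    got_signal = 1;

    if (notify_fd >= 0)
    {
        /* Failure is deliberately ignored; it may still change errno. */
        if (write(notify_fd, "x", 1) < 0)
        {
            /* nothing useful to do inside a signal handler */
        }
    }

    errno = save_errno;
}

/* Hypothetical installer: remember the descriptor and hook SIGUSR1. */
int
install_wakeup_handler(int fd)
{
    struct sigaction sa;

    assert(fd >= 0);
    notify_fd = fd;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = wakeup_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}
```

Without the save and restore, a failed write() inside the handler could replace, say, EINTR with EBADF in code that was interrupted between a system call and its errno check.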
2011-08-10	If backup-end record is not seen, and we reach end of recovery from a	Heikki Linnakangas
	streamed backup, throw an error and refuse to start up. The restore has not finished correctly in that case and the data directory is possibly corrupt. We already errored out in case of archive recovery, but could not during crash recovery because we couldn't distinguish between the case that pg_start_backup() was called and the database then crashed (must not error, data is OK), and the case that we're restoring from a backup and not all the needed WAL was replayed (data can be corrupt). To distinguish those cases, add a line to backup_label to indicate whether the backup was taken with pg_start/stop_backup(), or by streaming (ie. pg_basebackup). This requires re-initdb, because of a new field added to the control file.
2011-08-09	Measure WaitLatch's timeout parameter in milliseconds, not microseconds.	Tom Lane
	The original definition had the problem that timeouts exceeding about 2100 seconds couldn't be specified on 32-bit machines. Milliseconds seem like sufficient resolution, and finer grain than that would be fantasy anyway on many platforms. Back-patch to 9.1 so that this aspect of the latch API won't change between 9.1 and later releases. Peter Geoghegan
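The limit quoted above follows from simple arithmetic, sketched here on the assumption that the timeout is held in a signed 32-bit integer. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest timeout, in whole seconds, that fits in a signed 32-bit count
 * of the given unit (1000000 for microseconds, 1000 for milliseconds).
 */
int32_t
max_timeout_seconds(int32_t units_per_second)
{
    assert(units_per_second > 0);
    return INT32_MAX / units_per_second;
}
```

With microseconds the ceiling is 2147 seconds, a little under 36 minutes, which matches the "about 2100 seconds" in the commit message. With milliseconds it rises to 2147483 seconds, roughly 24 days.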
2011-07-19	Remove O(N^2) performance issue with multiple SAVEPOINTs.	Simon Riggs
	Subtransaction locks now released en masse at main commit, rather than repeatedly re-scanning for locks as we ascend the nested transaction tree. Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT and TBLOCK_SUBRELEASE to allow the commit path to be optimised using the existing code in ResourceOwnerRelease() which appears to have been intended for this usage, judging from comments therein.
2011-07-19	Cascading replication feature for streaming log-based replication.	Simon Riggs
	Standby servers can now have WALSender processes, which can work with either WALReceiver or archive_commands to pass data. Fully updated docs, including new conceptual terms of sending server, upstream and downstream servers. WALSenders are terminated on promotion to master. Fujii Masao, review, rework and doc rewrite by Simon Riggs
2011-07-08	Introduce a pipe between postmaster and each backend, which can be used to	Heikki Linnakangas
	detect postmaster death. Postmaster keeps the write-end of the pipe open, so when it dies, children get EOF in the read-end. That can conveniently be waited for in select(), which allows eliminating some of the polling loops that check for postmaster death. This patch doesn't yet change all the loops to use the new mechanism; expect a follow-on patch to do that. This changes the interface to WaitLatch, so that it takes as argument a bitmask of events that it waits for. Possible events are latch set, timeout, postmaster death, and socket becoming readable or writeable. The pipe method behaves slightly differently from the kill() method previously used in PostmasterIsAlive() in the case that postmaster has died, but its parent has not yet read its exit code with waitpid(). The pipe returns EOF as soon as the process dies, but kill() continues to return true until waitpid() has been called (IOW while the process is a zombie). Because of that, change PostmasterIsAlive() to use the pipe too, otherwise WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while PostmasterIsAlive() would claim it's still alive. That could easily lead to busy-waiting while postmaster is in zombie state. Peter Geoghegan with further changes by me, reviewed by Fujii Masao and Florian Pflug.
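The EOF-on-pipe mechanism can be sketched from the child's side as below. This is a minimal illustration, not the WaitLatch implementation: the function name is hypothetical, and it assumes that nothing is ever written to the pipe, so the read end becomes readable only when every write end has been closed, which is what happens when the process holding it exits.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the write end of the pipe to go away.
 * Returns 1 on EOF (the writer is gone), 0 on timeout or if unexpected
 * data arrived, and -1 on error.
 */
int
postmaster_pipe_closed(int read_fd, int timeout_ms)
{
    fd_set          rfds;
    struct timeval  tv;
    char            c;
    int             rc;
    ssize_t         n;

    /* FD_SET is only defined for descriptors below FD_SETSIZE. */
    assert(read_fd >= 0 && read_fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(read_fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(read_fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;               /* timed out: writer still holds the pipe */

    /* Readable with nothing written means read() reports EOF. */
    n = read(read_fd, &c, 1);
    if (n < 0)
        return -1;
    return (n == 0) ? 1 : 0;
}
```

In a single process the effect can be reproduced by creating a pipe with pipe(), calling the function while the write end is open (it times out and returns 0), then closing the write end and calling it again (it returns 1). Because the kernel closes the descriptor as soon as the owning process dies, this check does not depend on the parent having called waitpid(), unlike the kill() probe described above.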
2011-06-29	Unify spelling of "canceled", "canceling", "cancellation"	Peter Eisentraut
	We had previously (af26857a2775e7ceb0916155e931008c2116632f) established the U.S. spellings as standard.
2011-06-28	Introduce compact WAL record for the common case of commit (non-DDL).	Simon Riggs
	XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes, saving considerable space for the vast majority of transaction commits. XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier. Leonardo Francalanci and Simon Riggs
2011-06-21	Make the visibility map crash-safe.	Robert Haas
	This involves two main changes from the previous behavior. First, when we set a bit in the visibility map, emit a new WAL record of type XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and the visibility map bit. Second, when inserting, updating, or deleting a tuple, we can no longer get away with clearing the visibility map bit after releasing the lock on the corresponding heap page, because an intervening crash might leave the visibility map bit set and the page-level bit clear. Making this work requires a bit of interface refactoring. In passing, a few minor but related cleanups: change the test in visibilitymap_set and visibilitymap_clear to throw an error if the wrong page (or no page) is pinned, rather than silently doing nothing; this case should never occur. Also, remove duplicate definitions of InvalidXLogRecPtr. Patch by me, review by Noah Misch.
2011-06-16	pgindent run of recent SSI changes. Also, remove an unnecessary #include.	Heikki Linnakangas
	Kevin Grittner
2011-06-14	Oops, forgot to change the order of entries in 2PC callback arrays when I	Heikki Linnakangas
	renumbered the resource managers. This should fix the buildfarm.