user/sven/postgresql.git

Age	Commit message (Collapse)	Author
2013-06-23	Ensure no xid gaps during Hot Standby startup	Simon Riggs
	In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund
2013-06-06	Fix typo in comment.	Heikki Linnakangas

2013-05-13	Fix handling of OID wraparound while in standalone mode.	Tom Lane
	If OID wraparound should occur while in standalone mode (unlikely but possible), we want to advance the counter to FirstNormalObjectId not FirstBootstrapObjectId. Otherwise, user objects might be created with OIDs in the system-reserved range. That isn't immediately harmful but it poses a risk of conflicts during future pg_upgrade operations. Noted by Andres Freund. Back-patch to all supported branches, since all of them are supported sources for pg_upgrade operations.
2013-05-02	Fix thinko in comment.	Heikki Linnakangas
	WAL segment means a 16 MB physical WAL file; this comment meant a logical 4 GB log file. Amit Langote. Apply to backbranches only, as the comment is gone in master.
2013-03-27	Fix buffer pin leak in heap update redo routine.	Heikki Linnakangas
	In a heap update, if the old and new tuple were on different pages, and the new page no longer existed (because it was subsequently truncated away by vacuum), heap_xlog_update forgot to release the pin on the old buffer. This bug was introduced by the "Fix multiple problems in WAL replay" patch, commit 3bbf668de9f1bc172371681e80a4e769b6d014c8 (on master branch). With full_page_writes=off, this triggered an "incorrect local pin count" error later in replay, if the old page was vacuumed. This fixes bug #7969, reported by Yunong Xiao. Backpatch to 9.0, like the commit that introduced this bug.
2013-02-10	Further cleanup of gistsplit.c.	Tom Lane
	After further reflection I was unconvinced that the existing coding is guaranteed to return valid union datums in every code path for multi-column indexes. Fix that by forcing a gistunionsubkey() call at the end of the recursion. Having done that, we can remove some clearly-redundant calls elsewhere. This should be a little faster for multi-column indexes (since the previous coding would uselessly do such a call for each column while unwinding the recursion), as well as much harder to break. Also, simplify the handling of cases where one side or the other of a primary split contains only don't-care tuples. The previous coding used a very ugly hack in removeDontCares() that essentially forced one random tuple to be treated as non-don't-care, providing a random initial choice of seed datum for the secondary split. It seems unlikely that that method will give better-than-random splits. Instead, treat such a split as degenerate and just let the next column determine the split, the same way that we handle fully degenerate cases where the two sides produce identical union datums.
2013-02-10	Remove useless picksplit-doesn't-support-secondary-split log spam.	Tom Lane
	This LOG message was put in over five years ago with the evident expectation that we'd make all GiST opclasses support secondary split directly. However, no such thing ever happened, and indeed the number of opclasses supporting it decreased to zero in 9.2. The reason is that improving on the default implementation isn't that easy --- the opclass-specific code that did exist, before 9.2, doesn't appear to have been any improvement over the default. Hence, remove the message altogether. There's certainly no point in nagging users about this in released branches, but I doubt that we'll ever implement complete opclass-specific support anyway.
2013-02-10	Document and clean up gistsplit.c.	Tom Lane
	Improve comments, rename some variables and functions, slightly simplify a couple of APIs, in an attempt to make this code readable by people other than its original author. Even though this is essentially just cosmetic, back-patch to all active branches, because otherwise it's going to make back-patching future fixes in this file very painful.
2013-02-08	Fix gist_box_same and gist_point_consistent to handle fuzziness correctly.	Tom Lane
	While there's considerable doubt that we want fuzzy behavior in the geometric operators at all (let alone as currently implemented), nobody is stepping forward to redesign that stuff. In the meantime it behooves us to make sure that index searches agree with the behavior of the underlying operators. This patch fixes two problems in this area. First, gist_box_same was using fuzzy equality, but it really needs to use exact equality to prevent not-quite-identical upper index keys from being treated as identical, which for example would prevent an existing upper key from being extended by an amount less than epsilon. This would result in inconsistent indexes. (The next release notes will need to recommend that users reindex GiST indexes on boxes, polygons, circles, and points, since all four opclasses use gist_box_same.) Second, gist_point_consistent used exact comparisons for upper-page comparisons in ~= searches, when it needs to use fuzzy comparisons to ensure it finds all matches; and it used fuzzy comparisons for point <@ box searches, when it needs to use exact comparisons because that's what the <@ operator (rather inconsistently) does. The added regression test cases illustrate all three misbehaviors. Back-patch to all active branches. (8.4 did not have GiST point_ops, but it still seems prudent to apply the gist_box_same patch to it.) Alexander Korotkov, reviewed by Noah Misch
2013-02-07	Repair bugs in GiST page splitting code for multi-column indexes.	Tom Lane
	When considering a non-last column in a multi-column GiST index, gistsplit.c tries to improve on the split chosen by the opclass-specific pickSplit function by considering penalties for the next column. However, there were two bugs in this code: it failed to recompute the union keys for the leftmost index columns, even though these might well change after reassigning tuples; and it included the old union keys in the recomputation for the columns it did recompute, so that those keys couldn't get smaller even if they should. The first problem could result in an invalid index in which searches wouldn't find index entries that are in fact present; the second would make the index less efficient to search. Both of these errors were caused by misuse of gistMakeUnionItVec, whose API was designed in a way that just begged such errors to be made. There is no situation in which it's safe or useful to compute the union keys for a subset of the index columns, and there is no caller that wants any previous union keys to be included in the computation; so the undocumented choice to treat the union keys as in/out rather than pure output parameters is a waste of code as well as being dangerous. Hence, rather than just making a minimal patch, I've changed the API of gistMakeUnionItVec to remove the "startkey" parameter (it now always processes all index columns) and treat the attr/isnull arrays as purely output parameters. In passing, also get rid of a couple of unnecessary and dangerous uses of static variables in gistutil.c. It's remarkable that the one in gistMakeUnionKey hasn't given us portability troubles before now, because in addition to posing a re-entrancy hazard, it was unsafely assuming that a static char[] array would have at least Datum alignment. Per investigation of a trouble report from Tomas Vondra. (There are also some bugs in contrib/btree_gist to be fixed, but that seems like material for a separate patch.) Back-patch to all supported branches.
2013-01-24	Fix rare missing cancellations in Hot Standby.	Simon Riggs
	The machinery around XLOG_HEAP2_CLEANUP_INFO failed to correctly pass through the necessary information on latestRemovedXid, avoiding cancellations in some infrequent concurrent update/cleanup scenarios. Backpatchable fix to 9.0 Detailed bug report and fix by Noah Misch, backpatchable version by me.
2012-12-11	Consistency check should compare last record replayed, not last record read.	Heikki Linnakangas
	EndRecPtr is the last record that we've read, but not necessarily yet replayed. CheckRecoveryConsistency should compare minRecoveryPoint with the last replayed record instead. This caused recovery to think it's reached consistency too early. Now that we do the check in CheckRecoveryConsistency correctly, we have to move the call of that function to after redoing a record. The current place, after reading a record but before replaying it, is wrong. In particular, if there are no more records after the one ending at minRecoveryPoint, we don't enter hot standby until one extra record is generated and read by the standby, and CheckRecoveryConsistency is called. These two bugs conspired to make the code appear to work correctly, except for the small window between reading the last record that reaches minRecoveryPoint, and replaying it. In the passing, rename recoveryLastRecPtr, which is the last record replayed, to lastReplayedEndRecPtr. This makes it slightly less confusing with replayEndRecPtr, which is the last record read that we're about to replay. Original report from Kyotaro HORIGUCHI, further diagnosis by Fujii Masao. Backpatch to 9.0, where Hot Standby subtly changed the test from "minRecoveryPoint < EndRecPtr" to "minRecoveryPoint <= EndRecPtr". The former works because where the test is performed, we have always read one more record than we've replayed.
2012-12-10	Update minimum recovery point on truncation.	Heikki Linnakangas
	If a file is truncated, we must update minRecoveryPoint. Once a file is truncated, there's no going back; it would not be safe to stop recovery at a point earlier than that anymore. Per report from Kyotaro HORIGUCHI. Backpatch to 8.4. Before that, minRecoveryPoint was not updated during recovery at all.
2012-12-02	Don't advance checkPoint.nextXid near the end of a checkpoint sequence.	Tom Lane
	This reverts commit c11130690d6dca64267201a169cfb38c1adec5ef in favor of actually fixing the problem: namely, that we should never have been modifying the checkpoint record's nextXid at this point to begin with. The nextXid should match the state as of the checkpoint's logical WAL position (ie the redo point), not the state as of its physical position. It's especially bogus to advance it in some wal_levels and not others. In any case there is no need for the checkpoint record to carry the same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by LogStandbySnapshot, as any replay operation will already have adopted that value as current. This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug #6291 from Daniel Farina, in that if a checkpoint were in progress at the instant of XID wraparound, the epoch bump would be lost as reported. (And, of course, these days there's at least a 50-50 chance of a checkpoint being in progress at any given instant.) Diagnosed by me and independently by Andres Freund. Back-patch to all branches supporting hot standby.
2012-12-02	XidEpoch++ if wraparound during checkpoint.	Simon Riggs
	If wal_level = hot_standby we update the checkpoint nextxid, though in the case where a wraparound occurred half-way through a checkpoint we would neglect updating the epoch also. Updating the nextxid is arguably the wrong thing to do, but changing that may introduce subtle bugs into hot standby startup, while updating the value doesn't cause any known bugs yet. Minimal fix now to HEAD and backbranches, wider fix later in HEAD. Bug reported in #6291 by Daniel Farina and slightly differently in Cause analysis and recommended fixes from Tom Lane and Andres Freund. Applied patch is minimal version of Andres Freund's work.
2012-11-29	Fix assorted bugs in CREATE INDEX CONCURRENTLY.	Tom Lane
	This patch changes CREATE INDEX CONCURRENTLY so that the pg_index flag changes it makes without exclusive lock on the index are made via heap_inplace_update() rather than a normal transactional update. The latter is not very safe because moving the pg_index tuple could result in concurrent SnapshotNow scans finding it twice or not at all, thus possibly resulting in index corruption. In addition, fix various places in the code that ought to check to make sure that the indexes they are manipulating are valid and/or ready as appropriate. These represent bugs that have existed since 8.2, since a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid index behind, and we ought not try to do anything that might fail with such an index. Also fix RelationReloadIndexInfo to ensure it copies all the pg_index columns that are allowed to change after initial creation. Previously we could have been left with stale values of some fields in an index relcache entry. It's not clear whether this actually had any user-visible consequences, but it's at least a bug waiting to happen. This is a subset of a patch already applied in 9.2 and HEAD. Back-patch into all earlier supported branches. Tom Lane and Andres Freund
2012-11-22	Avoid bogus "out-of-sequence timeline ID" errors in standby-mode.	Heikki Linnakangas
	When startup process opens a WAL segment after replaying part of it, it validates the first page on the WAL segment, even though the page it's really interested in later in the file. As part of the validation, it checks that the TLI on the page header is >= the TLI it saw on the last page it read. If the segment contains a timeline switch, and we have already replayed it, and then re-open the WAL segment (because of streaming replication got disconnected and reconnected, for example), the TLI check will fail when the first page is validated. Fix that by relaxing the TLI check when re-opening a WAL segment. Backpatch to 9.0. Earlier versions had the same code, but before standby mode was introduced in 9.0, recovery never tried to re-read a segment after partially replaying it. Reported by Amit Kapila, while testing a new feature.
2012-11-12	Fix multiple problems in WAL replay.	Tom Lane
	Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.
2012-09-05	Fix inappropriate error messages for Hot Standby misconfiguration errors.	Tom Lane
	Give the correct name of the GUC parameter being complained of. Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE, not the default INTERNAL_ERROR). Gurjeet Singh, errcode adjustment by me
2012-08-30	Back-patch recent fixes for gistchoose and gistRelocateBuildBuffersOnSplit.	Tom Lane
	This back-ports commits c8ba697a4bdb934f0c51424c654e8db6133ea255 and e5db11c5582b469c04a11f217a0f32c827da5dd7, which fix one definite and one speculative bug in gistchoose, and make the code a lot more intelligible as well. In 9.2 only, this also affects the largely-copied-and-pasted logic in gistRelocateBuildBuffersOnSplit. The impact of the bugs was that the functions might make poor decisions as to which index tree branch to push a new entry down into, resulting in GiST index bloat and poor performance. The fixes rectify these decisions for future insertions, but a REINDEX would be needed to clean up any existing index bloat. Alexander Korotkov, Robert Haas, Tom Lane
2012-08-07	fsync backup_label after pg_start_backup()	Simon Riggs
	Dave Kerr, backpatched by Simon Riggs
2012-06-29	Initialize shared memory copy of ckptXidEpoch correctly when not in recovery.	Heikki Linnakangas
	This bug was introduced by commit 20d98ab6e4110087d1816cd105a40fcc8ce0a307, so backpatch this to 9.0-9.2 like that one. This fixes bug #6710, reported by Tarvi Pillessaar
2012-06-07	Revert "Wake WALSender to reduce data loss at failover for async commit."	Tom Lane
	This reverts commit 090e8a984cf1a8a3ef7f6db6dc919f843902d80c. Since WalSndWakeup does not exist in 9.0, it's clear that this patch wasn't even compiled in this branch. Perhaps some variant of it is appropriate in 9.0, but for the moment I'm just going to un-break the buildfarm.
2012-06-07	Wake WALSender to reduce data loss at failover for async commit.	Simon Riggs
	WALSender now woken up after each background flush by WALwriter, avoiding multi-second replication delay for an all-async commit workload. Replication delay reduced from 7s with default settings to 200ms, allowing significantly reduced data loss at failover. Andres Freund and Simon Riggs
2012-06-01	Avoid early reuse of btree pages, causing incorrect query results.	Simon Riggs
	When we allowed read-only transactions to skip assigning XIDs we introduced the possibility that a fully deleted btree page could be reused. This broke the index link sequence which could then lead to indexscans silently returning fewer rows than would have been correct. The actual incidence of silent errors from this is thought to be very low because of the exact workload required and locking pre-conditions. Fix is to remove pages only if index page opaque->btpo.xact precedes RecentGlobalXmin. Noah Misch, reviewed by Simon Riggs
2012-05-31	Revert back-branch changes in behavior of age(xid).	Tom Lane
	Per discussion, it does not seem like a good idea to change the behavior of age(xid) in a minor release, even though the old definition causes the function to fail on hot standby slaves. Therefore, revert commit 5829387381d2e4edf84652bb5a712f6185860670 and follow-on commits in the back branches only.
2012-05-28	Teach AbortOutOfAnyTransaction to clean up partially-started transactions.	Tom Lane
	AbortOutOfAnyTransaction failed to do anything if the state it saw on entry corresponded to failing partway through StartTransaction. I fixed AbortCurrentTransaction to cope with that case way back in commit 60b2444cc3ba037630c9b940c3c9ef01b954b87b, but evidently overlooked that AbortOutOfAnyTransaction should do likewise. Back-patch to all supported branches. It's not clear that this omission has any more-than-cosmetic consequences, but it's also not clear that it doesn't, so back-patching seems the least risky choice.
2012-05-26	Prevent synchronized scanning when systable_beginscan chooses a heapscan.	Tom Lane
	The only interesting-for-performance case wherein we force heapscan here is when we're rebuilding the relcache init file, and the only such case that is likely to be examining a catalog big enough to be syncscanned is RelationBuildTupleDesc. But the early-exit optimization in that code gets broken if we start the scan at a random place within the catalog, so that allowing syncscan is actually a big deoptimization if pg_attribute is large (at least for the normal case where the rows for core system catalogs have never been changed since initdb). Hence, prevent syncscan here. Per my testing pursuant to complaints from Jeff Frost and Greg Sabino Mullane, though neither of them seem to have actually hit this specific problem. Back-patch to 8.3, where syncscan was introduced.
2012-05-22	Ensure that seqscans check for interrupts at least once per page.	Tom Lane
	If a seqscan encounters many consecutive pages containing only dead tuples, it can remain in the loop in heapgettup for a long time, and there was no CHECK_FOR_INTERRUPTS anywhere in that loop. This meant there were real-world situations where a query would be effectively uncancelable for long stretches. Add a check placed to occur once per page, which should be enough to provide reasonable response time without adding any measurable overhead. Report and patch by Merlin Moncure (though I tweaked it a bit). Back-patch to all supported branches.
2012-05-12	Ensure backwards compatibility for GetStableLatestTransactionId()	Simon Riggs

2012-05-11	Ensure age() returns a stable value rather than the latest value	Simon Riggs

2012-04-06	Fix misleading output from gin_desc().	Tom Lane
	XLOG_GIN_UPDATE_META_PAGE and XLOG_GIN_DELETE_LISTPAGE records were printed with a list link field labeled as "blkno", which was confusing, especially when the link was empty (InvalidBlockNumber). Print the metapage block number instead, since that's what's actually being updated. We could include the link values too as a separate field, but not clear it's worth the trouble. Back-patch to 8.4 where the dubious code was added.
2012-03-29	Correct epoch of txid_current() when executed on a Hot Standby server.	Simon Riggs
	Initialise ckptXidEpoch from starting checkpoint and maintain the correct value as we roll forwards. This allows GetNextXidAndEpoch() to return the correct epoch when executed during recovery. Backpatch to 9.0 when the problem is first observable by a user. Bug report from Daniel Farina
2012-02-26	Fix some more bugs in GIN's WAL replay logic.	Tom Lane
	In commit 4016bdef8aded77b4903c457050622a5a1815c16 I fixed a bunch of ginxlog.c bugs having to do with not handling XLogReadBuffer failures correctly. However, in ginRedoUpdateMetapage and ginRedoDeleteListPages, I unaccountably thought that failure to read the metapage would be impossible and just put in an elog(PANIC) call. This is of course wrong: failure is exactly what will happen if the index got dropped (or rebuilt) between creation of the WAL record and the crash we're trying to recover from. I believe this explains Nicholas Wilson's recent report of these errors getting reached. Also, fix memory leak in forgetIncompleteSplit. This wasn't of much concern when the code was written, but in a long-running standby server page split records could be expected to accumulate indefinitely. Back-patch to 8.4 --- before that, GIN didn't have a metapage.
2012-02-22	Correctly initialise shared recoveryLastRecPtr in recovery.	Simon Riggs
	Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed. Fujii Masao
2012-02-21	Don't clear btpo_cycleid during _bt_vacuum_one_page.	Tom Lane
	When "vacuuming" a single btree page by removing LP_DEAD tuples, we are not actually within a vacuum operation, but rather in an ordinary insertion process that could well be running concurrently with a vacuum. So clearing the cycleid is incorrect, and could cause the concurrent vacuum to miss removing tuples that it needs to remove. This is a longstanding bug introduced by commit e6284649b9e30372b3990107a082bc7520325676 of 2006-07-25. I believe it explains Maxim Boguk's recent report of index corruption, and probably some other previously unexplained reports. In 9.0 and up this is a one-line fix; before that we need to introduce a flag to tell _bt_delitems what to do.
2012-02-06	Avoid problems with OID wraparound during WAL replay.	Tom Lane
	Fix a longstanding thinko in replay of NEXTOID and checkpoint records: we tried to advance nextOid only if it was behind the value in the WAL record, but the comparison would draw the wrong conclusion if OID wraparound had occurred since the previous value. Better to just unconditionally assign the new value, since OID assignment shouldn't be happening during replay anyway. The consequences of a failure to update nextOid would be pretty minimal, since we have long had the code set up to obtain another OID and try again if the generated value is already in use. But in the worst case there could be significant performance glitches while such loops iterate through many already-used OIDs before finding a free one. The odds of a wraparound happening during WAL replay would be small in a crash-recovery scenario, and the length of any ensuing OID-assignment stall quite limited anyway. But neither of these statements hold true for a replication slave that follows a WAL stream for a long period; its behavior upon going live could be almost unboundedly bad. Hence it seems worth back-patching this fix into all supported branches. Already fixed in HEAD in commit c6d76d7c82ebebb7210029f7382c0ebe2c558bca.
2012-02-05	Fix transient clobbering of shared buffers during WAL replay.	Tom Lane
	RestoreBkpBlocks was in the habit of zeroing and refilling the target buffer; which was perfectly safe when the code was written, but is unsafe during Hot Standby operation. The reason is that we have coding rules that allow backends to continue accessing a tuple in a heap relation while holding only a pin on its buffer. Such a backend could see transiently zeroed data, if WAL replay had occasion to change other data on the page. This has been shown to be the cause of bug #6425 from Duncan Rance (who deserves kudos for developing a sufficiently-reproducible test case) as well as Bridget Frey's re-report of bug #6200. It most likely explains the original report as well, though we don't yet have confirmation of that. To fix, change the code so that only bytes that are supposed to change will change, even transiently. This actually saves cycles in RestoreBkpBlocks, since it's not writing the same bytes twice. Also fix seq_redo, which has the same disease, though it has to work a bit harder to meet the requirement. So far as I can tell, no other WAL replay routines have this type of bug. In particular, the index-related replay routines, which would certainly be broken if they had to meet the same standard, are not at risk because we do not have coding rules that allow access to an index page when not holding a buffer lock on it. Back-patch to 9.0 where Hot Standby was added.
2012-01-12	Fix CLUSTER/VACUUM FULL for toast values owned by recently-updated rows.	Tom Lane
	In commit 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05, I made CLUSTER and VACUUM FULL try to preserve toast value OIDs from the original toast table to the new one. However, if we have to copy both live and recently-dead versions of a row that has a toasted column, those versions may well reference the same toast value with the same OID. The patch then led to duplicate-key failures as we tried to insert the toast value twice with the same OID. (The previous behavior was not very desirable either, since it would have silently inserted the same value twice with different OIDs. That wastes space, but what's worse is that the toast values inserted for already-dead heap rows would not be reclaimed by subsequent ordinary VACUUMs, since they go into the new toast table marked live not deleted.) To fix, check if the copied OID already exists in the new toast table, and if so, assume that it stores the desired value. This is reasonably safe since the only case where we will copy an OID from a previous toast pointer is when toast_insert_or_update was given that toast pointer and so we just pulled the data from the old table; if we got two different values that way then we have big problems anyway. We do have to assume that no other backend is inserting items into the new toast table concurrently, but that's surely safe for CLUSTER and VACUUM FULL. Per bug #6393 from Maxim Boguk. Back-patch to 9.0, same as the previous patch.
2011-12-20	Avoid crashing when we have problems unlinking files post-commit.	Tom Lane
	smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit 3396000684b41e7e9467d1abc67152b39e697035, which put an smgrexists call in front of it; smgrexists does throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.
2011-12-09	Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery,	Heikki Linnakangas
	we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.
2011-11-25	Fix erroneous replay of GIN_UPDATE_META_PAGE WAL records.	Tom Lane
	A simple thinko in ginRedoUpdateMetapage, namely failing to increment a loop counter, led to inserting records into the last pending-list page in the wrong order (the opposite of that intended). So far as I can tell, this would not upset the code that eventually flushes pending items into the main part of the GIN index. But it did break the code that searched the pending list for matches, resulting in transient failure to find matching entries during index lookups, as illustrated in bug #6307 from Maksym Boguk. Back-patch to 8.4 where the incorrect code was introduced.
2011-11-04	Don't assume that a tuple's header size is unchanged during toasting.	Tom Lane
	This assumption can be wrong when the toaster is passed a raw on-disk tuple, because the tuple might pre-date an ALTER TABLE ADD COLUMN operation that added columns without rewriting the table. In such a case the tuple's natts value is smaller than what we expect from the tuple descriptor, and so its t_hoff value could be smaller too. In fact, the tuple might not have a null bitmap at all, and yet our current opinion of it is that it contains some trailing nulls. In such a situation, toast_insert_or_update did the wrong thing, because to save a few lines of code it would use the old t_hoff value as the offset where heap_fill_tuple should start filling data. This did not leave enough room for the new nulls bitmap, with the result that the first few bytes of data could be overwritten with null flag bits, as in a recent report from Hubert Depesz Lubaczewski. The particular case reported requires ALTER TABLE ADD COLUMN followed by CREATE TABLE AS SELECT * FROM ... or INSERT ... SELECT * FROM ..., and further requires that there be some out-of-line toasted fields in one of the tuples to be copied; else we'll not reach the troublesome code. The problem can only manifest in this form in 8.4 and later, because before commit a77eaa6a95009a3441e0d475d1980259d45da072, CREATE TABLE AS or INSERT/SELECT wouldn't result in raw disk tuples getting passed directly to heap_insert --- there would always have been at least a junkfilter in between, and that would reconstitute the tuple header with an up-to-date t_natts and hence t_hoff. But I'm backpatching the tuptoaster change all the way anyway, because I'm not convinced there are no older code paths that present a similar risk.
2011-11-02	Revert "Stop btree indexscans upon reaching nulls in either direction."	Tom Lane
	This reverts commit 7357f92a3ef1faff0ce85f55f3c5a7f4dc820eaa. As pointed out by Naoya Anzai, we need to do more work to make that idea handle end-of-index cases, and it is looking like too much risk for a back-patch. So bug #6278 is only going to be fixed in HEAD.
2011-11-02	Derive oldestActiveXid at correct time for Hot Standby.	Simon Riggs
	There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop
2011-11-02	Fix timing of Startup CLOG and MultiXact during Hot Standby	Simon Riggs
	Patch by me, bug report by Chris Redekop, analysis by Florian Pflug
2011-11-01	Fix race condition with toast table access from a stale syscache entry.	Tom Lane
	If a tuple in a syscache contains an out-of-line toasted field, and we try to fetch that field shortly after some other transaction has committed an update or deletion of the tuple, there is a race condition: vacuum could come along and remove the toast tuples before we can fetch them. This leads to transient failures like "missing chunk number 0 for toast value NNNNN in pg_toast_2619", as seen in recent reports from Andrew Hammond and Tim Uckun. The design idea of syscache is that access to stale syscache entries should be prevented by relation-level locks, but that fails for at least two cases where toasted fields are possible: ANALYZE updates pg_statistic rows without locking out sessions that might want to plan queries on the same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without any meaningful lock at all. The least risky fix seems to be an idea that Heikki suggested when we were dealing with a related problem back in August: forcibly detoast any out-of-line fields before putting a tuple into syscache in the first place. This avoids the problem because at the time we fetch the parent tuple from the catalog, we should be holding an MVCC snapshot that will prevent removal of the toast tuples, even if the parent tuple is outdated immediately after we fetch it. (Note: I'm not convinced that this statement holds true at every instant where we could be fetching a syscache entry at all, but it does appear to hold true at the times where we could fetch an entry that could have a toasted field. We will need to be a bit wary of adding toast tables to low-level catalogs that don't have them already.) An additional benefit is that subsequent uses of the syscache entry should be faster, since they won't have to detoast the field. Back-patch to all supported versions. The problem is significantly harder to reproduce in pre-9.0 releases, because of their willingness to flush every entry in a syscache whenever the underlying catalog is vacuumed (cf CatalogCacheFlushRelation); but there is still a window for trouble.
2011-10-31	Stop btree indexscans upon reaching nulls in either direction.	Tom Lane
	The existing scan-direction-sensitive tests were overly complex, and failed to stop the scan in cases where it's perfectly legitimate to do so. Per bug #6278 from Maksym Boguk. Back-patch to 8.3, which is as far back as the patch applies easily. Doesn't seem worth sweating over a relatively minor performance issue in 8.2 at this late date. (But note that this was a performance regression from 8.1 and before, so 8.2 is being left as an outlier.)
2011-09-16	gistendscan() forgot to free so->giststate.	Tom Lane
	This oversight led to a massive memory leak --- upwards of 10KB per tuple --- during creation-time verification of an exclusion constraint based on a GIST index. In most other scenarios it'd just be a leak of 10KB that would be recovered at end of query, so not too significant; though perhaps the leak would be noticeable in a situation where a GIST index was being used in a nestloop inner indexscan. In any case, it's a real leak of long standing, so patch all supported branches. Per report from Harald Fuchs.
2011-08-16	Preserve toast value OIDs in toast-swap-by-content for CLUSTER/VACUUM FULL.	Tom Lane
	This works around the problem that a catalog cache entry might contain a toast pointer that we try to dereference just as a VACUUM FULL completes on that catalog. We will see the sinval message on the cache entry when we acquire lock on the toast table, but by that point we've already told tuptoaster.c "here's the pointer to fetch", so it's difficult from a code structural standpoint to update the pointer before we use it. Much less painful to ensure that toast pointers are not invalidated in the first place. We have to add a bit of code to deal with the case that a value that previously wasn't toasted becomes so; but that should be a seldom-exercised corner case, so the inefficiency shouldn't be significant. Back-patch to 9.0. In prior versions, we didn't allow CLUSTER on system catalogs, and VACUUM FULL didn't result in reassignment of toast OIDs, so there was no problem.