summaryrefslogtreecommitdiff
path: root/src/backend/access
AgeCommit message (Collapse)Author
2014-05-25Fix error when trying to delete page with half-dead left sibling.Heikki Linnakangas
The new page deletion code didn't cope with the case the target page's right sibling was marked half-dead. It failed a sanity check which checked that the downlinks in the parent page match the lower level, because a half-dead page has no downlink. To cope, check for that condition, and just give up on the deletion if it happens. The vacuum will finish the deletion of the half-dead page when it gets there, and on the next vacuum after that the empty can be deleted. Reported by Jeff Janes.
2014-05-19Fix backup-block numbering in redo of b-tree split.Heikki Linnakangas
I got the backup block numbers off-by-one in the commit that changed the way incomplete-splits are handled. I blame the comments, which said "backup block 1" and "backup block 2", even though the backup blocks are numbered starting from 0, in the macros and functions used in replay. Fix the comments and the code. Per Jeff Janes' bug report about corruption caused by torn page writes. The incorrect code is new in git master, but backpatch the comment change down to 9.0, where the numbering in the redo-side macros was changed.
2014-05-17Fix a bunch of functions that were declared static then defined not-static.Tom Lane
Per testing with a compiler that whines about this.
2014-05-17Update README, we don't do post-recovery cleanup actions anymore.Heikki Linnakangas
transam/README explained how B-tree incomplete splits were tracked and fixed after recovery, as an example of handling complex actions that need multiple WAL records, but that's not how it works anymore. Explain the new paradigm.
2014-05-16Initialize tsId and dbId fields in WAL record of COMMIT PREPARED.Heikki Linnakangas
Commit dd428c79 added dbId and tsId to the xl_xact_commit struct but missed that prepared transaction commits reuse that struct. Fix that. Because those fields were left unitialized, replaying a commit prepared WAL record in a hot standby node would fail to remove the relcache init file. That can lead to "could not open file" errors on the standby. Relcache init file only needs to be removed when a system table/index is rewritten in the transaction using two phase commit, so that should be rare in practice. In HEAD, the incorrect dbId/tsId values are also used for filtering in logical replication code, causing the transaction to always be filtered out. Analysis and fix by Andres Freund. Backpatch to 9.0 where hot standby was introduced.
2014-05-15Fix race condition in preparing a transaction for two-phase commit.Heikki Linnakangas
To lock a prepared transaction's shared memory entry, we used to mark it with the XID of the backend. When the XID was no longer active according to the proc array, the entry was implicitly considered as not locked anymore. However, when preparing a transaction, the backend's proc array entry was cleared before transfering the locks (and some other state) to the prepared transaction's dummy PGPROC entry, so there was a window where another backend could finish the transaction before it was in fact fully prepared. To fix, rewrite the locking mechanism of global transaction entries. Instead of an XID, just have simple locked-or-not flag in each entry (we store the locking backend's backend id rather than a simple boolean, but that's just for debugging purposes). The backend is responsible for explicitly unlocking the entry, and to make sure that that happens, install a callback to unlock it on abort or process exit. Backpatch to all supported versions.
2014-05-14Code review for recent changes in relcache.c.Tom Lane
rd_replidindex should be managed the same as rd_oidindex, and rd_keyattr and rd_idattr should be managed like rd_indexattr. Omissions in this area meant that the bitmapsets computed for rd_keyattr and rd_idattr would be leaked during any relcache flush, resulting in a slow but permanent leak in CacheMemoryContext. There was also a tiny probability of relcache entry corruption if we ran out of memory at just the wrong point in RelationGetIndexAttrBitmap. Otherwise, the fields were not zeroed where expected, which would not bother the code any AFAICS but could greatly confuse anyone examining the relcache entry while debugging. Also, create an API function RelationGetReplicaIndex rather than letting non-relcache code be intimate with the mechanisms underlying caching of that value (we won't even mention the memory leak there). Also, fix a relcache flush hazard identified by Andres Freund: RelationGetIndexAttrBitmap must not assume that rd_replidindex stays valid across index_open. The aspects of this involving rd_keyattr date back to 9.3, so back-patch those changes.
2014-05-10Rename min_recovery_apply_delay to recovery_min_apply_delay.Tom Lane
Per discussion, this seems like a more consistent choice of name. Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut; some additional documentation wordsmithing by me
2014-05-10Fix bug in lossy-page handling in GINHeikki Linnakangas
When returning rows from a bitmap, as done with partial match queries, we would get stuck in an infinite loop if the bitmap contained a lossy page reference. This bug is new in master, it was introduced by the patch to allow skipping items refuted by other entries in GIN scans. Report and fix by Alexander Korotkov
2014-05-09Remove overeager assertion in logical_heap_begin_rewrite.Robert Haas
It's legal to configure wal_level=logical and max_replication_slots=0 simultaneously. Andres Freund
2014-05-08Protect against torn pages when deleting GIN list pages.Heikki Linnakangas
To-be-deleted list pages contain no useful information, as they are being deleted, but we must still protect the writes from being torn by a crash after a partial write. To do that, re-initialize the pages on WAL replay. Jeff Janes caught this with a test program to test partial writes. Backpatch to all supported versions.
2014-05-06pgindent run for 9.4Bruce Momjian
This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
2014-05-06Correct comment in Hot Standby nbtree handlingSimon Riggs
Logic is correct, matching handling of LP_DEAD elsewhere.
2014-05-05Assert that pre/post-fix updated tuples are on the same page during replay.Heikki Linnakangas
If they were not 'oldtup.t_data' would be dereferenced while set to NULL in case of a full page image for block 0. Do so primarily to silence coverity; but also to make sure this prerequisite isn't changed without adapting the replay routine as that would appear to work in many cases. Andres Freund
2014-05-01Fix failure to detoast fields in composite elements of structured types.Tom Lane
If we have an array of records stored on disk, the individual record fields cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are only prepared to deal with TOAST pointers appearing in top-level fields of a stored row. The same applies for ranges over composite types, nested composites, etc. However, the existing code only took care of expanding sub-field TOAST pointers for the case of nested composites, not for other structured types containing composites. For example, given a command such as UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ... where x is a direct reference to a field of an on-disk tuple, if that field is long enough to be toasted out-of-line then the TOAST pointer would be inserted as-is into the array column. If the source record for x is later deleted, the array field value would become a dangling pointer, leading to errors along the line of "missing chunk number 0 for toast value ..." when the value is referenced. A reproducible test case for this was provided by Jan Pecek, but it seems likely that some of the "missing chunk number" reports we've heard in the past were caused by similar issues. Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to produce a self-contained Datum value if the Datum is of composite type. Seen in this light, the problem is not just confined to arrays and ranges, but could also affect some other places where detoasting is done in that way, for example form_index_tuple(). I tried teaching the array code to apply toast_flatten_tuple_attribute() along with PG_DETOAST_DATUM() when the array element type is composite, but this was messy and imposed extra cache lookup costs whether or not any TOAST pointers were present, indeed sometimes when the array element type isn't even composite (since sometimes it takes a typcache lookup to find that out). The idea of extending that approach to all the places that currently use PG_DETOAST_DATUM() wasn't attractive at all. This patch instead solves the problem by decreeing that composite Datum values must not contain any out-of-line TOAST pointers in the first place; that is, we expand out-of-line fields at the point of constructing a composite Datum, not at the point where we're about to insert it into a larger tuple. This rule is applied only to true composite Datums, not to tuples that are being passed around the system as tuples, so it's not as invasive as it might sound at first. With this approach, the amount of code that has to be touched for a full solution is greatly reduced, and added cache lookup costs are avoided except when there actually is a TOAST pointer that needs to be inlined. The main drawback of this approach is that we might sometimes dereference a TOAST pointer that will never actually be used by the query, imposing a rather large cost that wasn't there before. On the other side of the coin, if the field value is used multiple times then we'll come out ahead by avoiding repeat detoastings. Experimentation suggests that common SQL coding patterns are unaffected either way, though. Applications that are very negatively affected could be advised to modify their code to not fetch columns they won't be using. In future, we might consider reverting this solution in favor of detoasting only at the point where data is about to be stored to disk, using some method that can drill down into multiple levels of nested structured types. That will require defining new APIs for structured types, though, so it doesn't seem feasible as a back-patchable fix. Note that this patch changes HeapTupleGetDatum() from a macro to a function call; this means that any third-party code using that macro will not get protection against creating TOAST-pointer-containing Datums until it's recompiled. The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER(). It seems likely that this is not a big problem in practice: most of the tuple-returning functions in core and contrib produce outputs that could not possibly be toasted anyway, and the same probably holds for third-party extensions. This bug has existed since TOAST was invented, so back-patch to all supported branches.
2014-04-30Rationalize common/relpath.[hc].Tom Lane
Commit a73018392636ce832b09b5c31f6ad1f18a4643ea created rather a mess by putting dependencies on backend-only include files into include/common. We really shouldn't do that. To clean it up: * Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in catalog/catalog.h. We won't consider this symbol part of the FE/BE API. * Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider relpath.h as the source of truth for fork numbers, since relpath.c was already partially serving that function, and anyway relfilenode.h was kind of a random place for that enum. * So, relfilenode.h now includes relpath.h rather than vice-versa. This direction of dependency is fine. (That allows most, but not quite all, of the existing explicit #includes of relpath.h to go away again.) * Push forkname_to_number from catalog.c to relpath.c, just to centralize fork number stuff a bit better. * Push GetDatabasePath from catalog.c to relpath.c; it was rather odd that the previous commit didn't keep this together with relpath(). * To avoid needing relfilenode.h in common/, redefine the underlying function (now called GetRelationPath) as taking separate OID arguments, and make the APIs using RelFileNode or RelFileNodeBackend into macro wrappers. (The macros have a potential multiple-eval risk, but none of the existing call sites have an issue with that; one of them had such a risk already anyway.) * Fix failure to follow the directions when "init" fork type was added; specifically, the errhint in forkname_to_number wasn't updated, and neither was the SGML documentation for pg_relation_size(). * Fix tablespace-path-too-long check in CreateTableSpace() to account for fork-name component of maximum-length pathnames. This requires putting FORKNAMECHARS into a header file, but it was rather useless (and actually unreferenced) where it was. The last couple of items are potentially back-patchable bug fixes, if anyone is sufficiently excited about them; but personally I'm not. Per a gripe from Christoph Berg about how include/common wasn't self-contained.
2014-04-28Fix two bugs in WAL-logging of GIN pending-list pages.Heikki Linnakangas
In writeListPage, never take a full-page image of the page, because we have all the information required to re-initialize in the WAL record anyway. Before this fix, a full-page image was always generated, unless full_page_writes=off, because when the page is initialized its LSN is always 0. In stable-branches, keep the code to restore the backup blocks if they exist, in case that the WAL is generated with an older minor version, but in master Assert that there are no full-page images. In the redo routine, add missing "off++". Otherwise the tuples are added to the page in reverse order. That happens to be harmless because we always scan and remove all the tuples together, but it was clearly wrong. Also, it was masked by the first bug unless full_page_writes=off, because the page was always restored from a full-page image. Backpatch to all supported versions.
2014-04-26Improve generation algorithm for database system identifier.Tom Lane
As noted some time ago, the original coding had a typo ("|" for "^") that made the result less unique than intended. Even the intended behavior is obsolete since it was based on wanting to produce a usable value even if we didn't have int64 arithmetic --- a limitation we stopped supporting years ago. Instead, let's redefine the system identifier as tv_sec in the upper 32 bits (same as before), tv_usec in the next 20 bits, and the low 12 bits of getpid() in the remaining bits. This is still hardly guaranteed-universally-unique, but it's noticeably better than before. Per my proposal at <29019.1374535940@sss.pgh.pa.us>
2014-04-24Fix race when updating a tuple concurrently locked by another processAlvaro Herrera
If a tuple is locked, and this lock is later upgraded either to an update or to a stronger lock, and in the meantime some other process tries to lock, update or delete the same tuple, it (the tuple) could end up being updated twice, or having conflicting locks held. The reason for this is that the second updater checks for a change in Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing the first lock; and if there's a change, it restarts and re-evaluates its ability to update the tuple. But it neglected to check for changes in lock strength or in lock-vs-update status when those two properties stayed the same. This would lead it to take the wrong decision and continue with its own update, when in reality it shouldn't do so but instead restart from the top. This could lead to either an assertion failure much later (when a multixact containing multiple updates is detected), or duplicate copies of tuples. To fix, make sure to compare the other relevant infomask bits alongside the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if necessary. Also, in the belt-and-suspenders spirit, add a check to MultiXactCreateFromMembers that a multixact being created does not have two or more members that are claimed to be updates. This should protect against other bugs that might cause similar bogus situations. Backpatch to 9.3, where the possibility of multixacts containing updates was introduced. (In prior versions it was possible to have the tuple lock upgraded from shared to exclusive, and an update would not restart from the top; yet we're protected against a bug there because there's always a sleep to wait for the locking transaction to complete before continuing to do anything. Really, the fact that tuple locks always conflicted with concurrent updates is what protected against bugs here.) Per report from Andrew Dunstan and Josh Berkus in thread at http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com Bug analysis by Andres Freund.
2014-04-24Reset pg_stat_activity.xact_start during PREPARE TRANSACTION.Tom Lane
Once we've completed a PREPARE, our session is not running a transaction, so its entry in pg_stat_activity should show xact_start as null, rather than leaving the value as the start time of the now-prepared transaction. I think possibly this oversight was triggered by faulty extrapolation from the adjacent comment that says PrepareTransaction should not call AtEOXact_PgStat, so tweak the wording of that comment. Noted by Andres Freund while considering bug #10123 from Maxim Boguk, although this error doesn't seem to explain that report. Back-patch to all active branches.
2014-04-23Update obsolete comments.Heikki Linnakangas
We no longer have a TLI field in the page header.
2014-04-23Fix typos in comment.Heikki Linnakangas
2014-04-23Cleanup of new b-tree page deletion code.Heikki Linnakangas
When marking a branch as half-dead, a pointer to the top of the branch is stored in the leaf block's hi-key. During normal operation, the high key was left in place, and the block number was just stored in the ctid field of the high key tuple, but in WAL replay, the high key was recreated as a truncated tuple with zero columns. For the sake of easier debugging, also truncate the tuple in normal operation, so that the page is identical after WAL replay. Also, rename the 'downlink' field in the WAL record to 'topparent', as that seems like a more descriptive name. And make sure it's set to invalid when unlinking the leaf page.
2014-04-22Fix broken logic in logical_heap_rewrite_flush_mappings().Tom Lane
It's blatantly obvious that commit 4d0d607a454ee832574afd52a3c515099cc85eb3 wasn't tested. The leak's real enough, though.
2014-04-22revert 4d0d607a454ee832574afd52a3c515099cc85eb3Bruce Momjian
Revert due to contrib/test_decoding regression failure
2014-04-22release memory used while flushing logical mappingsBruce Momjian
Patch by Ants Aasma
2014-04-22Fix bug in the new B-tree incomplete-split code.Heikki Linnakangas
Forgot to update LSN of left sibling's page, when creating a new root. I fixed this for regular insertions and page splits earlier, but missed new root creation.
2014-04-22Fix Gin README.Heikki Linnakangas
The README incorrectly claimed that GIN posting tree pages contain an array of uncompressed items in addition to compressed posting lists. Earlier versions of the GIN posting list compression patch worked that way, but not the one that was committed.
2014-04-22Fix bug in new B-tree page deletion code.Heikki Linnakangas
When modifying a page, must hold an exclusive lock. A shared lock is obviously not good enough.
2014-04-22Retain original physical order of tuples in redo of b-tree splits.Heikki Linnakangas
It makes no difference to the system, but minimizing the differences between a master and standby makes debugging simpler.
2014-04-22Fix rm_desc routine of b-tree page delete records.Heikki Linnakangas
A couple of typos from my refactoring of the page deletion patch.
2014-04-20Fix typo.Robert Haas
Etsuro Fujita
2014-04-18Fix typoMagnus Hagander
Amit Langote
2014-04-17report stat() error in trigger file checkBruce Momjian
Permissions might prevent the existence of the trigger file from being checked. Per report from Andres Freund
2014-04-16Use correctly-sized buffer when zero-filling a WAL file.Heikki Linnakangas
I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is allocated a couple of weeks ago. With the default settings, they are both 8k, but they can be changed at compile-time.
2014-04-14Set pd_lower on internal GIN posting tree pages.Heikki Linnakangas
This allows squeezing out the unused space in full-page writes. And more importantly, it can be a useful debugging aid. In hindsight we should've done this back when GIN was added - we wouldn't need the 'maxoff' field in the page opaque struct if we had used pd_lower and pd_upper like on normal pages. But as long as there can be pages in the index that have been binary-upgraded from pre-9.4 versions, we can't rely on that, and have to continue using 'maxoff'. Most of the code churn comes from renaming some macros, now that they're used on internal pages, too. This change is completely backwards-compatible, no effect on pg_upgrade.
2014-04-14Fix bogus handling of bad strategy number in GIST consistent() functions.Tom Lane
Make sure we throw an error instead of silently doing the wrong thing when fed a strategy number we don't recognize. Also, in the places that did already throw an error, spell the error message in a way more consistent with our message style guidelines. Per report from Paul Jones. Although this is a bug, it won't occur unless a superuser tries to do something he shouldn't, so it doesn't seem worth back-patching.
2014-04-14Remove dead checks for invalid left page in ginDeletePage.Heikki Linnakangas
In some places, the function assumes the left page is valid, and in others, it checks if it is valid. Remove all the checks.
2014-04-14GIN entry pages follow the standard page layout - tell XLogInsert.Heikki Linnakangas
The entry B-tree pages all follow the standard page layout. The 9.3 code has this right. I inadvertently changed this at some point during the big refactorings in git master.
2014-04-10Fix bugs in GIN "fast scan" with partial match.Heikki Linnakangas
There were a couple of bugs here. First, if the fuzzy limit was exceeded, the loop in entryGetItem might drop out too soon if a whole block needs to be skipped because it's < advancePast ("continue" in a while-loop checks the loop condition too). Secondly, the loop checked when stepping to a new page that there is at least one offset on the page < advancePast, but we cannot rely on that on subsequent calls of entryGetItem, because advancePast might change in between. That caused the skipping loop to read bogus items in the TbmIterateResult's offset array. First item and fix by Alexander Korotkov, second bug pointed out by Fabrízio de Royes Mello, by a small variation of Alexander's test query.
2014-04-10Fix typo in comment.Heikki Linnakangas
Tomonari Katsumata
2014-04-08Fix hot standby bug with GiST scans.Heikki Linnakangas
Don't reset the rightlink of a page when replaying a page update record. This was a leftover from pre-hot standby days, when it was not possible to have scans concurrent with WAL replay. Resetting the right-link was not necessary back then either, but it was done for the sake of tidiness. But with hot standby, it's wrong, because a concurrent scan might still need it. Backpatch all versions with hot standby, 9.0 and above.
2014-04-07Zero padding byte at end of GIN posting list.Heikki Linnakangas
This isn't strictly necessary, but helps debugging.
2014-04-07Fix WAL replay bug in the new GIN incomplete-split code.Heikki Linnakangas
Forgot to set the incomplete-split flag on the left page half, in redo of a page split. Spotted this by comparing the page contents on master and standby, after inserting/applying each WAL record.
2014-04-05Fix another palloc in critical section.Heikki Linnakangas
Also add a regression test for a GIN index with enough items with the same key, so that a GIN posting tree gets created. Apparently none of the existing GIN tests were large enough for that. This code is new, no backpatching required.
2014-04-04Fix some compiler warnings that clang emits with -pedantic.Robert Haas
Andres Freund
2014-04-04Move multixid allocation out of critical section.Heikki Linnakangas
It can fail if you run out of memory. This call was added in 9.3, so backpatch to 9.3 only.
2014-04-04In checkpoint, move the check for in-progress xacts out of critical section.Heikki Linnakangas
GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical section. I thought I covered this case with the exemption for the checkpointer, but CreateCheckPoint is also called from the startup process.
2014-04-04Avoid allocations in critical sections.Heikki Linnakangas
If a palloc in a critical section fails, it becomes a PANIC.
2014-04-03Avoid palloc in critical section in GiST WAL-logging.Heikki Linnakangas
Memory allocation can fail if you run out of memory, and inside a critical section that will lead to a PANIC. Use conservatively-sized arrays in stack instead. There was previously no explicit limit on the number of pages a GiST split can produce, it was only limited by the number of LWLocks that can be held simultaneously (100 at the moment). This patch adds an explicit limit of 75 pages. That should be plenty, a typical split shouldn't produce more than 2-3 page halves. The bug has been there forever, but only backpatch down to 9.1. The code was changed significantly in 9.1, and it doesn't seem worth the risk or trouble to adapt this for 9.0 and 8.4.