summaryrefslogtreecommitdiff
path: root/src/backend/access
AgeCommit message (Collapse)Author
2015-01-15Fix thinko in re-setting wal_log_hints flag from a parameter-change record.Heikki Linnakangas
The flag is supposed to be copied from the record. Same issue with track_commit_timestamps, but that's master-only. Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was added.
2015-01-12Tweak heapam's rmgr desc output slightlyAlvaro Herrera
Some spaces were missing, and putting the affected tuple offset first in the lock cases instead of the locking data makes more sense. No backpatch since this is cosmetic and surrounding code has changed.
2015-01-07Don't open a WAL segment for writing at end of recovery.Heikki Linnakangas
Since commit ba94518a, we used XLogFileOpen to open the next segment for writing, but if the end-of-recovery happens exactly at a segment boundary, the new segment might not exist yet. (Before ba94518a, XLogFileOpen was correct, because we would open the previous segment if the switch happened at the boundary.) Instead of trying to create it if necessary, it's simpler to not bother opening the segment at all. XLogWrite() will open or create it soon anyway, after writing the checkpoint or end-of-recovery record. Reported by Andres Freund.
2015-01-06Update copyright for 2015Bruce Momjian
Backpatch certain files through 9.0
2015-01-04Fix thinko in lock mode enumAlvaro Herrera
Commit 0e5680f4737a9c6aa94aa9e77543e5de60411322 contained a thinko mixing LOCKMODE with LockTupleMode. This caused misbehavior in the case where a tuple is marked with a multixact with at most a FOR SHARE lock, and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock; this case should block but doesn't. Include a new isolation tester spec file to explicitely try all the tuple lock combinations; without the fix it shows the problem: starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit step s1_begin: BEGIN; step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo; a 1 step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE; a 1 step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE; a 1 step s1_commit: COMMIT; With the fixed code, step s2_tuplock3 blocks until session 1 commits, which is the correct behavior. All other cases behave correctly. Backpatch to 9.3, like the commit that introduced the problem.
2015-01-04Remove superflous variable from xlogreader's XLogFindNextRecord().Andres Freund
Pointed out by Coverity. Since this is mere, and debatable, cosmetics I'm not backpatching this.
2015-01-03Treat negative values of recovery_min_apply_delay as having no effect.Tom Lane
At one point in the development of this feature, it was claimed that allowing negative values would be useful to compensate for timezone differences between master and slave servers. That was based on a mistaken assumption that commit timestamps are recorded in local time; but of course they're in UTC. Nor is a negative apply delay likely to be a sane way of coping with server clock skew. However, the committed patch still treated negative delays as doing something, and the timezone misapprehension survived in the user documentation as well. If recovery_min_apply_delay were a proper GUC we'd just set the minimum allowed value to be zero; but for the moment it seems better to treat negative settings as if they were zero. In passing do some extra wordsmithing on the parameter's documentation, including correcting a second misstatement that the parameter affects processing of Restore Point records. Issue noted by Michael Paquier, who also provided the code patch; doc changes by me. Back-patch to 9.4 where the feature was introduced.
2014-12-26Grab heavyweight tuple lock only before sleepingAlvaro Herrera
We were trying to acquire the lock even when we were subsequently not sleeping in some other transaction, which opens us up unnecessarily to deadlocks. In particular, this is troublesome if an update tries to lock an updated version of a tuple and finds itself doing EvalPlanQual update chain walking; more than two sessions doing this concurrently will find themselves sleeping on each other because the HW tuple lock acquisition in heap_lock_tuple called from EvalPlanQualFetch races with the same tuple lock being acquired in heap_update -- one of these sessions sleeps on the other one to finish while holding the tuple lock, and the other one sleeps on the tuple lock. Per trouble report from Andrew Sackville-West in http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230 His scenario can be simplified down to a relatively simple isolationtester spec file which I don't include in this commit; the reason is that the current isolationtester is not able to deal with more than one blocked session concurrently and it blocks instead of raising the expected deadlock. In the future, if we improve isolationtester, it would be good to include the spec file in the isolation schedule. I posted it in http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org Hat tip to Mark Kirkwood, who helped diagnose the trouble.
2014-12-25Temporarily revert "Move pg_lzcompress.c to src/common."Tom Lane
This reverts commit 60838df922345b26a616e49ac9fab808a35d1f85. That change needs a bit more thought to be workable. In view of the potentially machine-dependent stuff that went in today, we need all of the buildfarm to be testing those other changes.
2014-12-25Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.Andres Freund
Besides being shorter and much easier to read it changes the logic in LWLockRelease() to release all shared lockers when waking up any. This can yield some significant performance improvements - and the fairness isn't really much worse than before, as we always allowed new shared lockers to jump the queue.
2014-12-25Move pg_lzcompress.c to src/common.Fujii Masao
Exposing compression and decompression APIs of pglz makes possible its use by extensions and contrib modules. pglz_decompress contained a call to elog to emit an error message in case of corrupted data. This function is changed to return a status code to let its callers return an error instead. This commit is required for upcoming WAL compression feature so that the WAL reader facility can decompress the WAL data by using pglz_decompress. Michael Paquier
2014-12-23Revert "Use a bitmask to represent role attributes"Alvaro Herrera
This reverts commit 1826987a46d079458007b7b6bbcbbd852353adbb. The overall design was deemed unacceptable, in discussion following the previous commit message; we might find some parts of it still salvageable, but I don't want to be on the hook for fixing it, so let's wait until we have a new patch.
2014-12-23Use a bitmask to represent role attributesAlvaro Herrera
The previous representation using a boolean column for each attribute would not scale as well as we want to add further attributes. Extra auxilliary functions are added to go along with this change, to make up for the lost convenience of access of the old representation. Catalog version bumped due to change in catalogs and the new functions. Author: Adam Brightwell, minor tweaks by Álvaro Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
2014-12-22Use a pairing heap for the priority queue in kNN-GiST searches.Heikki Linnakangas
This performs slightly better, uses less memory, and needs slightly less code in GiST, than the Red-Black tree previously used. Reviewed by Peter Geoghegan
2014-12-21Fix file descriptor leak at end of recovery.Heikki Linnakangas
XLogFileInit() returns a file descriptor, which needs to be closed. The leak was short-lived, since the startup process exits shortly afterwards, but it was clearly a bug, nevertheless. Per Coverity report.
2014-12-19Fix timestamp in end-of-recovery WAL records.Heikki Linnakangas
We used time(null) to set a TimestampTz field, which gave bogus results. Noticed while looking at pg_xlogdump output. Backpatch to 9.3 and above, where the fast promotion was introduced.
2014-12-18Improve hash_create's API for selecting simple-binary-key hash functions.Tom Lane
Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane
2014-12-18Change how first WAL segment on new timeline after promotion is created.Heikki Linnakangas
Two changes: 1. When copying a WAL segment from old timeline to create the first segment on the new timeline, only copy up to the point where the timeline switch happens, and zero-fill the rest. This avoids corner cases where we might think that the copied WAL from the previous timeline belong to the new timeline. 2. If the timeline switch happens at a segment boundary, don't copy the whole old segment to the new timeline. It's pointless, because it's 100% identical to the old segment.
2014-12-18Fix (re-)starting from a basebackup taken off a standby after a failure.Andres Freund
When starting up from a basebackup taken off a standby extra logic has to be applied to compute the point where the data directory is consistent. Normal base backups use a WAL record for that purpose, but that isn't possible on a standby. That logic had a error check ensuring that the cluster's control file indicates being in recovery. Unfortunately that check was too strict, disregarding the fact that the control file could also indicate that the cluster was shut down while in recovery. That's possible when the a cluster starting from a basebackup is shut down before the backup label has been removed. When everything goes well that's a short window, but when either restore_command or primary_conninfo isn't configured correctly the window can get much wider. That's because inbetween reading and unlinking the label we restore the last checkpoint from WAL which can need additional WAL. To fix simply also allow starting when the control file indicates "shutdown in recovery". There's nicer fixes imaginable, but they'd be more invasive. Backpatch to 9.2 where support for taking basebackups from standbys was added.
2014-12-11Fix assorted confusion between Oid and int32.Tom Lane
In passing, also make some debugging elog's in pgstat.c a bit more consistently worded. Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are really old). Mark Dilger identified and patched the type violations; the message rewordings are mine.
2014-12-08Remove duplicate code in heap_prune_chain()Simon Riggs
No need to set tuple tableOid twice Jim Nasby
2014-12-07Tweaks for recovery_target_actionSimon Riggs
Rename parameter action_at_recovery_target to recovery_target_action suggested by Christoph Berg. Place into recovery.conf suggested by Fujii Masao, replacing (deprecating) earlier parameters, per Michael Paquier.
2014-12-05Print new track_commit_timestamp in rm_desc of a parameter-change record.Heikki Linnakangas
Michael Paquier
2014-12-05Print wal_log_hints in the rm_desc routing of a parameter-change record.Heikki Linnakangas
It was an oversight in the original commit. Also note in the sample config file that changing wal_log_hints requires a restart. Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.
2014-12-03Keep track of transaction commit timestampsAlvaro Herrera
Transactions can now set their commit timestamp directly as they commit, or an external transaction commit timestamp can be fed from an outside system using the new function TransactionTreeSetCommitTsData(). This data is crash-safe, and truncated at Xid freeze point, same as pg_clog. This module is disabled by default because it causes a performance hit, but can be enabled in postgresql.conf requiring only a server restart. A new test in src/test/modules is included. Catalog version bumped due to the new subdirectory within PGDATA and a couple of new SQL functions. Authors: Álvaro Herrera and Petr Jelínek Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven Singer, Peter Eisentraut
2014-12-03Fix typosAlvaro Herrera
2014-12-02Minor cleanup of function declarations for BRIN.Tom Lane
Get rid of PG_FUNCTION_INFO_V1() macros, which are quite inappropriate for built-in functions (possibly leftovers from testing as a loadable module?). Also, fix gratuitous inconsistency between SQL-level and C-level names of the minmax support functions.
2014-11-28Update transaction README for persistent multixactsAlvaro Herrera
Multixacts are now maintained during recovery, but the README didn't get the memo. Backpatch to 9.3, where the divergence was introduced.
2014-11-28Fix assertion failure at end of PITR.Heikki Linnakangas
InitXLogInsert() cannot be called in a critical section, because it allocates memory. But CreateCheckPoint() did that, when called for the end-of-recovery checkpoint by the startup process. In the passing, fix the scratch space allocation in InitXLogInsert to go to the right memory context. Also update the comment at InitXLOGAccess, which hasn't been totally accurate since hot standby was introduced (in a hot standby backend, InitXLOGAccess isn't called at backend startup). Reported by Michael Paquier
2014-11-25action_at_recovery_target recovery config optionSimon Riggs
action_at_recovery_target = pause | promote | shutdown Petr Jelinek Reviewed by Muhammad Asif Naeem, Fujji Masao and Simon Riggs
2014-11-24Add a few paragraphs to B-tree README explaining L&Y algorithm.Heikki Linnakangas
This gives an overview of what Lehman & Yao's paper is all about, so that you can understand the rest of the README without having to read the paper. Per discussion with Peter Geoghegan and others.
2014-11-24Distinguish XLOG_FPI records generated for hint-bit updates.Heikki Linnakangas
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images generated for hint bit updates, when checksums are enabled. The new record type is replayed exactly the same as XLOG_FPI, but allows them to be tallied separately e.g. in pg_xlogdump.
2014-11-21No need to call XLogEnsureRecordSpace when the relation is unlogged.Heikki Linnakangas
Amit Kapila
2014-11-21Fix bogus comments in XLogRecordAssembleHeikki Linnakangas
Pointed out by Michael Paquier
2014-11-20Remove dead code supporting mark/restore in SeqScan, TidScan, ValuesScan.Tom Lane
There seems no prospect that any of this will ever be useful, and indeed it's questionable whether some of it would work if it ever got called; it's certainly not been exercised in a very long time, if ever. So let's get rid of it, and make the comments about mark/restore in execAmi.c less wishy-washy. The mark/restore support for Result nodes is also currently dead code, but that's due to planner limitations not because it's impossible that it could be useful. So I left it in.
2014-11-20Silence compiler warning about variable being used uninitialized.Heikki Linnakangas
It's a false positive - the variable is only used when 'onleft' is true, and it is initialized in that case. But the compiler doesn't necessarily see that.
2014-11-20Revamp the WAL record format.Heikki Linnakangas
Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-18Reduce btree scan overhead for < and > strategiesSimon Riggs
For <, <=, > and >= strategies, mark the first scan key as already matched if scanning in an appropriate direction. If index tuple contains no nulls we can skip the first re-check for each tuple. Author: Rajeev Rastogi Reviewer: Haribabu Kommi Rework of the code and comments by Simon Riggs
2014-11-17Fix WAL-logging of B-tree "unlink halfdead page" operation.Heikki Linnakangas
There was some confusion on how to record the case that the operation unlinks the last non-leaf page in the branch being deleted. _bt_unlink_halfdead_page set the "topdead" field in the WAL record to the leaf page, but the redo routine assumed that it would be an invalid block number in that case. This commit fixes _bt_unlink_halfdead_page to do what the redo routine expected. This code is new in 9.4, so backpatch there.
2014-11-15Ensure unlogged tables are reset even if crash recovery errors out.Andres Freund
Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund
2014-11-14Clean up includes from RLS patchStephen Frost
The initial patch for RLS mistakenly included headers associated with the executor and planner bits in rewrite/rowsecurity.h. Per policy and general good sense, executor headers should not be included in planner headers or vice versa. The include of execnodes.h was a mistaken holdover from previous versions, while the include of relation.h was used for Relation's definition, which should have been coming from utils/relcache.h. This patch cleans these issues up, adds comments to the RowSecurityPolicy struct and the RowSecurityConfigType enum, and changes Relation->rsdesc to Relation->rd_rsdesc to follow Relation field naming convention. Additionally, utils/rel.h was including rewrite/rowsecurity.h, which wasn't a great idea since that was pulling in things not really needed in utils/rel.h (which gets included in quite a few places). Instead, use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments explaining why. Lastly, add an include into access/nbtree/nbtsort.c for utils/sortsupport.h, which was evidently missed due to the above mess. Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the concerns regarding a similar situation in the custom-path commit still need to be addressed.
2014-11-14Allow interrupting GetMultiXactIdMembersAlvaro Herrera
This function has a loop which can lead to uninterruptible process "stalls" (actually infinite loops) when some bugs are triggered. Avoid that unpleasant situation by adding a check for interrupts in a place that shouldn't degrade performance in the normal case. Backpatch to 9.3. Older branches have an identical loop here, but the aforementioned bugs are only a problem starting in 9.3 so there doesn't seem to be any point in backpatching any further.
2014-11-13Fix race condition between hot standby and restoring a full-page image.Heikki Linnakangas
There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.
2014-11-13Fix XLogReadBufferForRedoExtended to get cleanup lock when asked to do so.Heikki Linnakangas
2014-11-13Rename pending_list_cleanup_size to gin_pending_list_limit.Fujii Masao
Since this parameter is only for GIN index, it's better to add "gin" to the parameter name for easier understanding.
2014-11-11Add GUC and storage parameter to set the maximum size of GIN pending list.Fujii Masao
Previously the maximum size of GIN pending list was controlled only by work_mem. But the reasonable value of work_mem and the reasonable size of the list are basically not the same, so it was not appropriate to control both of them by only one GUC, i.e., work_mem. This commit separates new GUC, pending_list_cleanup_size, from work_mem to allow users to control only the size of the list. Also this commit adds pending_list_cleanup_size as new storage parameter to allow users to specify the size of the list per index. This is useful, for example, when users want to increase the size of the list only for the GIN index which can be updated heavily, and decrease it otherwise. Reviewed by Etsuro Fujita.
2014-11-10BRIN: fix bug in xlog backup block countingAlvaro Herrera
The code that generates the BRIN_XLOG_UPDATE removes the buffer reference when the page that's target for the updated tuple is freshly initialized. This is a pretty usual optimization, but was breaking the case where the revmap buffer, which is referenced in the same WAL record, is getting a backup block: the replay code was using backup block index 1, which is not valid when the update target buffer gets pruned; the revmap buffer gets assigned 0 instead. Make sure to use the correct backup block index for revmap when replaying. Bug reported by Fujii Masao.
2014-11-10Further code and wording tweaks in BRINAlvaro Herrera
Besides a couple of typo fixes, per David Rowley, Thom Brown, and Amit Langote, and mentions of BRIN in the general CREATE INDEX page again per David, this includes silencing MSVC compiler warnings (thanks Microsoft) and an additional variable initialization per Coverity scanner.
2014-11-10Fix compiler warning for non-assert builds.Kevin Grittner
Reported by Peter Geoghegan David Rowley
2014-11-08Fix some coding issues in BRINAlvaro Herrera
Reported by David Rowley: variadic macros are a problem. Get rid of them using a trick suggested by Tom Lane: add extra parentheses where needed. In the future we might decide we don't need the calls at all and remove them, but it seems appropriate to keep them while this code is still new. Also from David Rowley: brininsert() was trying to use a variable before initializing it. Fix by moving the brin_form_tuple call (which initializes the variable) to within the locked section. Reported by Peter Eisentraut: can't use "new" as a struct member name, because C++ compilers will choke on it, as reported by cpluspluscheck.