path: root/src/backend/access/transam
2019-05-08  Remove leftover reference to old "flat file" mechanism in a comment.  (Heikki Linnakangas)
The flat file mechanism was removed in PostgreSQL 9.0.
2019-04-10  Avoid counting transaction stats for parallel worker cooperating transaction  (Amit Kapila)
The transaction initiated by a parallel worker to cooperate with the actual transaction started by the main backend to complete the query execution should not be counted as a separate transaction. The other internal transactions started and committed by the parallel worker are still counted as separate transactions, as that is what we do in other places like autovacuum.

This will partially fix the bloat in transaction stats due to additional transactions performed by parallel workers. For a complete fix, we need to decide how we want to show all the transactions that are started internally for various operations, and that is a matter for a separate patch.

Reported-by: Haribabu Kommi
Author: Haribabu Kommi
Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed
Backpatch-through: 9.6
Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com
2019-02-11  Adjust error message  (Peter Eisentraut)
We usually don't use "namespace" in user-facing error messages. Also, in master this was replaced by another error message referring to "temporary objects", so we might as well use that here to avoid introducing too many variants.

Discussion: https://www.postgresql.org/message-id/bbd3f8d9-e3d5-e5aa-4305-7f0121c3fa94@2ndquadrant.com
2019-01-18  Restrict the use of temporary namespace in two-phase transactions  (Michael Paquier)
Attempting to use a temporary table within a two-phase transaction has been forbidden for ages. However, a couple of other object types and commands that work on temporary objects with two-phase commit turn out not to have been covered. In short, trying to create, lock or drop an object in a temporary schema should not be authorized within a two-phase transaction, as it would cause the transaction's state to create dependencies with other sessions, causing all sorts of side effects for the existing session or for sessions spawned later that try to use the same temporary schema name.

Regression tests are added to cover all the cases found. The original report mentioned function creation, but closer monitoring shows many other affected patterns involving LOCK, DROP or CREATE EXTENSION. One symptom of combining the two is that the session which used the temporary schema is not able to shut down completely, waiting to be able to drop the temporary schema, something it cannot complete because of the two-phase transaction involving the temporary objects. In this case the client is able to disconnect, but the session remains alive on the backend side, potentially blocking connection backend slots from being used. Other reported problems could also involve server crashes.

This is back-patched down to v10, which is where 9b013dc has introduced MyXactFlags, something that this patch relies on.

Reported-by: Alexey Bashtanov
Author: Michael Paquier
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/5d910e2e-0db8-ec06-dd5f-baec420513c3@imap.cc
Backpatch-through: 10
2018-12-16  Improve detection of child-process SIGPIPE failures.  (Tom Lane)
Commit ffa4cbd62 added logic to detect SIGPIPE failure of a COPY child process, but it only worked correctly if the SIGPIPE occurred in the immediate child process. Depending on the shell in use and the complexity of the shell command string, we might instead get back an exit code of 128 + SIGPIPE, representing a shell error exit reporting SIGPIPE in the child process.

We could just hack up ClosePipeToProgram() to add the extra case, but it seems like this is a fairly general issue deserving a more general and better-documented solution. I chose to add a couple of functions in src/common/wait_error.c, which is a natural place to know about wait-result encodings, that will test for either a specific child-process signal type or any child-process signal failure. Then, adjust other places that were doing ad-hoc tests of this type to use the common functions.

In RestoreArchivedFile, this fixes a race condition affecting whether the process will report an error or just silently proc_exit(1): before, that depended on whether the intermediate shell got SIGTERM'd itself or reported a child process failing on SIGTERM.

Like the previous patch, back-patch to v10; we could go further but there seems no real need to.

Per report from Erik Rijkers.

Discussion: https://postgr.es/m/f3683f87ab1701bea5d86a7742b22432@xs4all.nl
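A minimal sketch of what such a helper in src/common/wait_error.c could look like (the exact committed signatures may differ; this only illustrates the two wait-status encodings being tested):

    #include <stdbool.h>
    #include <sys/wait.h>

    /*
     * Did the child die from the given signal, either directly (killed
     * by the signal) or via an intermediate shell reporting it with the
     * conventional exit code 128 + signum?
     */
    bool
    wait_result_is_signal(int exit_status, int signum)
    {
        if (WIFSIGNALED(exit_status) && WTERMSIG(exit_status) == signum)
            return true;
        if (WIFEXITED(exit_status) && WEXITSTATUS(exit_status) == 128 + signum)
            return true;
        return false;
    }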
2018-11-19  PANIC on fsync() failure.  (Thomas Munro)
On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data.

Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files.

Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch.

Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
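Sketched under the assumption of a boolean GUC variable data_sync_retry as described above, the error-level promotion at an fsync() failure site could look like this (a sketch of the mechanism, not the exact committed code):

    /* Promote a data-sync error to PANIC unless retrying is allowed. */
    int
    data_sync_elevel(int elevel)
    {
        return data_sync_retry ? elevel : PANIC;
    }

    /* usage at an fsync() failure site */
    if (pg_fsync(fd) != 0)
        ereport(data_sync_elevel(ERROR),
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));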
2018-11-14  Initialize TransactionState and user ID consistently at transaction start  (Michael Paquier)
If a failure happens when a transaction is starting, between the moment the transaction status is changed from TRANS_DEFAULT to TRANS_START and the moment the current user ID and security context flags are fetched via GetUserIdAndSecContext(), or before its basic fields are initialized, then those may get reset to incorrect values when the transaction aborts, leaving the session in an inconsistent state.

One problem reported is that failing a starting transaction at the first query of a session could cause several kinds of system crashes on the follow-up queries.

In order to solve that, move the initialization of the transaction state fields and the call of GetUserIdAndSecContext() in charge of fetching the current user ID close to the point where the transaction status is switched to TRANS_START, where there cannot be any error triggered in between, per an idea of Tom Lane. This properly ensures that the current user ID, the security context flags and the basic fields of TransactionState remain consistent even if the transaction fails while starting.

Reported-by: Richard Guo
Diagnosed-by: Richard Guo
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com
Backpatch-through: 9.4
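A sketch of the reordering in StartTransaction() that this describes; the TransactionState field names here are illustrative, and the point is that nothing between the status switch and the fetch can throw an error:

    s->state = TRANS_START;
    s->transactionId = InvalidTransactionId;    /* until assigned */

    /*
     * Fetch the current user ID and security context immediately after
     * switching to TRANS_START, so that if the transaction later fails
     * while starting, the abort path sees properly initialized values.
     */
    GetUserIdAndSecContext(&s->prevUser, &s->prevSecContext);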
2018-10-06  Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.  (Tom Lane)
Previously, a worker process would establish values for these based on its own start time. In v10 and up, this can trivially be shown to cause misbehavior of transaction_timestamp(), timestamp_in(), and related functions which are (perhaps unwisely?) marked parallel-safe. It seems likely that other behaviors might diverge from what happens in the parent as well.

It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure it's still possible, so back-patch to all branches containing parallel worker infrastructure.

In HEAD only, mark now() and statement_timestamp() as parallel-safe (other affected functions already were). While in theory we could still squeeze that change into v11, it doesn't seem important enough to force a last-minute catversion bump.

Konstantin Knizhnik, whacked around a bit by me

Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
2018-09-28  Fix assertion failure when updating full_page_writes for checkpointer.  (Amit Kapila)
When the checkpointer receives a SIGHUP signal to update its configuration, it may need to update the shared memory for full_page_writes and write a WAL record for it. It is quite possible that the XLOG machinery has not been initialized by that time, which leads to an assertion failure while doing so. The fix is to allow initialization of the XLOG machinery outside the critical section.

This bug was introduced by commit 2c03216d83, which added the XLOG machinery initialization in the RecoveryInProgress code path.

Reported-by: Dilip Kumar
Author: Dilip Kumar
Reviewed-by: Michael Paquier and Amit Kapila
Backpatch-through: 9.5
Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
2018-09-28  Fix WAL recycling on standbys depending on archive_mode  (Michael Paquier)
A restart point or a checkpoint recycling WAL segments treats segments marked with neither ".done" (archiving is done) nor ".ready" (segment is ready to be archived) in archive_status the same way whether archive_mode is "on" or "always". While this is fine for a primary, a standby running a restart point with archive_mode = on would try to mark such a segment as ready for archiving, which is something that will never happen except after the standby is promoted.

Note that this problem applies only to WAL segments coming from the local pg_wal the first time archive recovery is run. Segments that are part of a self-contained base backup are the most common case where this could happen, though even then the .done markers would normally be part of the backup. Segments recovered from an archive are marked as .ready or .done by the startup process, and segments finished streaming are marked as such by the WAL receiver, so those are handled already.

Reported-by: Haruka Takatsuka
Author: Michael Paquier
Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org
Backpatch-through: 9.5, where archive_mode = always has been added.
2018-09-26  Rework activation of commit timestamps during recovery  (Michael Paquier)
The activation and deactivation of commit timestamp tracking has not been handled consistently for a primary or standbys at recovery. The facility can be activated at three different moments of recovery:
- The beginning, where a primary would use the GUC value for the decision-making, and where a standby relies on the contents of the control file.
- When replaying a XLOG_PARAMETER_CHANGE record at redo.
- The end, where both primary and standby rely on the GUC value.

Using the GUC value for a primary at the beginning of recovery causes problems with commit timestamp access when doing crash recovery. Particularly, when replaying transaction commits, an attempt could be made to read the commit timestamp of a transaction which committed at a moment when track_commit_timestamp was disabled.

A test case is added to reproduce the failure. The test works down to v11 as it takes advantage of transaction commits within procedures.

Reported-by: Hailong Li
Author: Masahiko Sawada, Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com
Backpatch-through: 9.5, where commit timestamps have been introduced.
2018-09-20  Defer restoration of libraries in parallel workers.  (Thomas Munro)
Several users of extensions complained of crashes in parallel workers that turned out to be due to syscache access from their _PG_init() functions. Reorder the initialization of parallel workers so that libraries are restored after the caches are initialized, and inside a transaction.

This was reported in bug #15350 and elsewhere. We don't consider it to be a bug: extensions shouldn't do that, because then they can't be used in shared_preload_libraries. However, it's a fairly obscure hazard and these extensions worked in practice before parallel query came along. So let's make it work. Later commits might add a warning message and eventually an error.

Back-patch to 9.6, where parallel query landed.

Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Kieran McCusker, Jimmy
Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
2018-09-13  Attach FPI to the first record after full_page_writes is turned on.  (Amit Kapila)
XLogInsert fails to attach a required FPI to the first record after full_page_writes is turned on by the last checkpoint. This bug was introduced in 9.5 due to code rearrangement in commits 2c03216d83 and 2076db2aea. Fix it by ensuring that XLogInsertRecord performs a recomputation when the given record was generated with FPW off but the flag is found to have been turned on while actually inserting the record.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi
Reviewed-by: Amit Kapila
Backpatch-through: 9.5, where this problem was introduced
Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
2018-09-01  Avoid using potentially-under-aligned page buffers.  (Tom Lane)
There's a project policy against using plain "char buf[BLCKSZ]" local or static variables as page buffers; preferred style is to palloc or malloc each buffer to ensure it is MAXALIGN'd. However, that policy's been ignored in an increasing number of places. We've apparently got away with it so far, probably because (a) relatively few people use platforms on which misalignment causes core dumps and/or (b) the variables chance to be sufficiently aligned anyway. But this is not something to rely on. Moreover, even if we don't get a core dump, we might be paying a lot of cycles for misaligned accesses.

To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock that the compiler must allocate with sufficient alignment, and use those in place of plain char arrays. I used these types even for variables where there's no risk of a misaligned access, since ensuring proper alignment should make kernel data transfers faster. I also changed some places where we had been palloc'ing short-lived buffers, for coding style uniformity and to save palloc/pfree overhead.

Since this seems to be a live portability hazard (despite the lack of field reports), back-patch to all supported versions.

Patch by me; thanks to Michael Paquier for review.

Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
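A sketch of such a union type: giving the union a maximally aligned member forces the compiler to align the whole object, so the char array inherits that alignment:

    typedef union PGAlignedBlock
    {
        char        data[BLCKSZ];
        double      force_align_d;
        int64       force_align_i64;
    } PGAlignedBlock;

    /* usage: a correctly aligned page buffer on the stack */
    PGAlignedBlock buf;

    memset(buf.data, 0, BLCKSZ);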
2018-08-31  Ensure correct minimum consistent point on standbys  (Michael Paquier)
Commit 8d68ee6 improved the startup process's calculation of the incorrect minimum consistent point, ensuring that all available WAL gets replayed when doing crash recovery, but it introduced an incorrect calculation of the minimum recovery point for non-startup processes. This can cause incorrect page references on a standby when, for example, the background writer flushed a couple of pages on disk but did not update the control file to tell a subsequent crash recovery to replay up to where it should have.

The only case where this has been reported to be a problem is when a standby needs to calculate the latest removed xid while replaying a btree deletion record, so one would need connections on a standby that happen just after recovery has thought it reached a consistent point. A background worker started after the consistent point is reached would be the easiest way to get into trouble if it connects to a database. Clients attempting to connect periodically could also be a problem, but the odds of seeing this problem are much lower.

The fix used is pretty simple: give non-startup processes access to the minimum recovery point written in the control file, to use as a reference, while the startup process still initializes its own references to the minimum consistent point, so that the original problem of incorrect page references happening post-promotion with a crash does not show up.

Reported-by: Alexander Kukushkin
Diagnosed-by: Alexander Kukushkin
Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin
Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
Backpatch-through: 9.3
2018-08-05  Reset errno properly before calling write()  (Michael Paquier)
Commit 6cb3372 forces errno to ENOSPC when fewer bytes than expected have been written and errno is unset, but it forgot to reset errno properly before the system call to write(), so errno could potentially be left over from a previous system call.

Reported-by: Tom Lane
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
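The resulting convention around write() looks roughly like this (a sketch of the pattern, not the exact committed code):

    errno = 0;
    if (write(fd, buf, len) != len)
    {
        /* if write() didn't set errno, assume the problem is disk space */
        if (errno == 0)
            errno = ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    }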
2018-07-18  Fix misc typos, mostly in comments.  (Heikki Linnakangas)
A collection of typos I happened to spot while reading code, as well as grepping for common mistakes. Backpatch to all supported versions, as applicable, to avoid conflicts when backporting other commits in the future.
2018-07-16  Add subtransaction handling for table synchronization workers.  (Robert Haas)
Since the old logic was completely unaware of subtransactions, a change made in a subsequently-aborted subtransaction would still cause workers to be stopped at toplevel transaction commit. Fix that by managing a stack of worker lists rather than just one.

Amit Khandekar and Robert Haas

Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
2018-07-09  Rework order of end-of-recovery actions to delay timeline history write  (Michael Paquier)
A critical failure in some of the end-of-recovery actions before the end-of-recovery record is written can cause PostgreSQL to react inconsistently with the rest of the cluster in the event of a crash before the final record is written. Two examples of such failures are an error while processing a two-phase state file, or an error when operating on recovery.conf.

With this commit, the failures are still considered FATAL, but the write of the timeline history file is delayed as much as possible, so that the window between the moment the file is written and the end-of-recovery record is generated is minimized. This way, in the event of a crash or a failure, the new timeline decided at promotion will not seem taken by other nodes in the cluster. It is not really possible to reduce this window to zero, so one could still see failures if a crash happens between the history file write and the end-of-recovery record; any future code should be careful when adding new end-of-recovery actions.

The original report from Magnus Hagander mentioned a renamed recovery.conf as the original end-of-recovery failure, which caused a timeline to be seen as taken, while the subsequent processing of the now-missing recovery.conf caused the startup process to stop on FATAL; at the follow-up startup this made the system inconsistent because of on-disk changes which had already happened.

Processing of two-phase state files still needs some work, as corrupted entries are simply ignored now. This is left as a future item; this commit fixes the original complaint.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
2018-07-05  Prevent references to invalid relation pages after fresh promotion  (Michael Paquier)
If a standby crashes after promotion, before having completed its first post-recovery checkpoint, then the minimum recovery point, which marks the LSN position where the cluster is able to reach consistency, may be set to a position older than the first end-of-recovery checkpoint even though all the available WAL should be replayed. This leads the instance to think that it contains inconsistent pages, causing a PANIC and a hard instance crash for certain sets of replayed records, even though all the available WAL could have been replayed.

When in crash recovery, minRecoveryPoint is expected to always be set to InvalidXLogRecPtr, which forces the recovery to replay all the available WAL, so this commit makes sure that the local copy of minRecoveryPoint from the control file is initialized properly and stays as it is while crash recovery is performed. Once switching to archive recovery, or once crash recovery finishes, the local copy of minRecoveryPoint can be safely updated.

Pavan Deolasee reported and diagnosed the failure in the first place, and the base idea of relying on the local copy of minRecoveryPoint comes from Kyotaro Horiguchi; it has been expanded into a full-fledged patch by me. The test included in this commit has been written by Álvaro Herrera and Pavan Deolasee, which I have modified to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3, which is where the end-of-recovery checkpoint is no longer run by the startup process. The test is easily supported down to 10; still, it has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05  Improve the performance of relation deletes during recovery.  (Fujii Masao)
When multiple relations are deleted in the same transaction, the files of those relations are deleted by one call to smgrdounlinkall(), which scans the whole of shared_buffers only once. Previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was called for each file to delete, which scanned shared_buffers multiple times. Obviously this could greatly increase WAL replay time, especially when shared_buffers was huge.

To alleviate this situation, this commit changes the recovery so that it, too, calls smgrdounlinkall() only once to delete multiple relation files.

This is just a fix for an oversight of commit 279628a0a7, not a new feature. So, per discussion on pgsql-hackers, we concluded to backpatch this to all supported versions.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
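A sketch of the reworked redo path, assuming xnodes[] holds the relation file nodes carried by the commit record; the key point is a single smgrdounlinkall() call, hence a single pass over shared_buffers:

    SMgrRelation *srels = palloc(nrels * sizeof(SMgrRelation));
    int         i;

    /* open every relation being dropped ... */
    for (i = 0; i < nrels; i++)
        srels[i] = smgropen(xnodes[i], InvalidBackendId);

    /* ... then unlink them all in one pass (isRedo = true) */
    smgrdounlinkall(srels, nrels, true);

    for (i = 0; i < nrels; i++)
        smgrclose(srels[i]);
    pfree(srels);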
2018-06-25  Address set of issues with errno handling  (Michael Paquier)
System calls mixed up in error code paths cause two issues which several code paths have not correctly handled:
1) For write() calls, sometimes the system may return fewer bytes than what has been written without setting errno. Some paths were careful enough to consider that case and assumed that errno should be set to ENOSPC; other calls missed that.
2) errno generated by a system call can be overwritten by other system calls which may succeed once an error code path is taken, causing what is reported to the user to be incorrect.

This patch uses the brute-force approach of correcting all those code paths. Some refactoring could happen in the future, but this is left as future work, which is not targeted for back-branches anyway.

Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
2018-06-18  Prevent hard failures of standbys caused by recycled WAL segments  (Michael Paquier)
When a standby's WAL receiver stops reading WAL from a WAL stream, it writes data to the current WAL segment without first zeroing the page currently being written. This can cause the WAL reader to read junk data from a past recycled segment and then try to get a record out of it. While the sanity checks in place provide most of the protection needed, in some rare circumstances, with chances increasing when a record header crosses a page boundary, the startup process can fail violently on an allocation failure, as follows:

FATAL: invalid memory alloc request size XXX

This is confusing for the user and also unhelpful, as in the worst case it requires a manual restart of the instance, potentially impacting the availability of the cluster, and it also makes the WAL data look as if it were corrupted. The chances of seeing failures are higher if the connection between the standby and its root node is unstable, causing WAL pages to be written only partially.

A couple of approaches have been discussed, like zeroing new WAL pages within the WAL receiver itself, but this has the disadvantage of impacting the performance of existing instances, as it breaks the sequential writes done by the WAL receiver. This commit deals with the problem with a simpler approach, which has no performance impact and does not reduce detection of the problem: if a record is found with a length higher than 1GB in a backend, do not try any allocation and report a soft failure, which forces the standby to retry reading WAL. The allocation call could still pass and allocate an unnecessary amount of memory, but follow-up checks on the record would just fail, making the allocation short-lived anyway.

This patch owes a great deal to Tsunakawa Takayuki for reporting the failure first, and then discussing a couple of potential approaches to the problem.

Backpatch down to 9.5, which is where palloc_extended has been introduced.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
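A sketch of the defense in the WAL reader, under the stated assumption that no legitimate record can exceed 1GB (the helper and field names here are illustrative):

    /*
     * A length beyond 1GB cannot belong to a valid record; it is almost
     * certainly junk read from a recycled segment.  Report a soft failure
     * so the standby retries reading WAL instead of failing hard inside
     * the allocator.
     */
    if (total_len > MaxAllocSize)
    {
        report_invalid_record(state,
                              "invalid record length %u", total_len);
        return false;           /* soft failure: caller retries */
    }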
2018-06-06  Fix function code in error report  (Alvaro Herrera)
This bug causes an lseek() failure to be reported as a "could not open" failure in the error message, muddling bug reports. I introduced this copy-and-pasteo in commit 78e122010422.

Noticed while reviewing code for bug report #15221, from lily liang. In version 10 the affected function is only used by multixact.c and commit_ts, and only in corner-case circumstances, neither of which is involved in the reported bug (a pg_subtrans failure.)

Author: Álvaro Herrera
2018-05-18  Fix error message on short read of pg_control  (Magnus Hagander)
Instead of saying "error: success", indicate that we got a working read but it was too short.
2018-05-05  Fix scenario where streaming standby gets stuck at a continuation record.  (Heikki Linnakangas)
If a continuation record is split so that its first half has already been removed from the master, and is only present in pg_wal, and there is a recycled WAL segment in the standby server that looks like it would contain the second half, recovery would get stuck. The code in XLogPageRead() incorrectly started streaming at the beginning of the WAL record, even if we had already read the first page.

Backpatch to 9.4. In principle, older versions have the same problem, but without replication slots, there was no straightforward mechanism to prevent the master from recycling old WAL that was still needed by standby. Without such a mechanism, I think it's reasonable to assume that there's enough slack in how many old segments are kept around to not run into this, or you have a WAL archive.

Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with some extra comments by me.

Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-03  Add HOLD_INTERRUPTS section into FinishPreparedTransaction.  (Teodor Sigaev)
If an interrupt arrives in the middle of FinishPreparedTransaction and any callback decides to call CHECK_FOR_INTERRUPTS (e.g. RemoveTwoPhaseFile can write a warning with ereport, which checks for interrupts), then it's possible to leave the current GXact undeleted.

Backpatch to all supported branches.

Stas Kelvich

Discussion: https://www.postgresql.org/message-id/3AD85097-A3F3-4EBA-99BD-C38EDF8D2949@postgrespro.ru
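The shape of the fix, sketched: hold interrupts across the section that must run to completion, so that a CHECK_FOR_INTERRUPTS() reached inside a callback's ereport(WARNING) cannot abort midway and strand the GXact (the function calls shown are illustrative of twophase.c):

    HOLD_INTERRUPTS();

    /*
     * Run the post-commit callbacks and delete the GXact; any
     * CHECK_FOR_INTERRUPTS() reached in here is now a no-op.
     */
    ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
    RemoveGXact(gxact);

    RESUME_INTERRUPTS();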
2018-04-11  Ignore nextOid when replaying an ONLINE checkpoint.  (Tom Lane)
The nextOid value is from the start of the checkpoint and may well be stale compared to values from more recent XLOG_NEXTOID records. Previously, we adopted it anyway, allowing the OID counter to go backwards during a crash. While this should be harmless, it contributed to the severity of the bug fixed in commit 0408e1ed5, by allowing duplicate TOAST OIDs to be assigned immediately following a crash. Without this error, that issue would only have arisen when TOAST objects just younger than a multiple of 2^32 OIDs were deleted and then not vacuumed in time to avoid a conflict.

Pavan Deolasee

Discussion: https://postgr.es/m/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
2018-02-02  Be more wary about shm_toc_lookup failure.  (Tom Lane)
Commit 445dbd82a basically missed the point of commit d46633506, which was that we shouldn't allow shm_toc_lookup() failure to lead to a core dump or assertion crash, because the odds of such a failure should never be considered negligible. It's correct that we can't expect the PARALLEL_KEY_ERROR_QUEUE TOC entry to be there if we have no workers. But if we have no workers, we're not going to do anything in this function with the lookup result anyway, so let's just skip it. That lets the code use the easy-to-prove-safe noError=false case, rather than anything requiring effort to review.

Back-patch to v10, like the previous commit.

Discussion: https://postgr.es/m/3647.1517601675@sss.pgh.pa.us
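A sketch of the simplified logic: with no workers there is nothing to read from the error queues, so the lookup is skipped entirely and the remaining call site can insist the entry exists (noError = false). The variable names follow parallel.c conventions but are illustrative:

    if (pcxt->nworkers > 0)
    {
        /* entry must exist when workers were requested */
        error_queue_space =
            shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, false);

        /* ... set up each worker's error queue ... */
    }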
2018-01-23  Report an ERROR if a parallel worker fails to start properly.  (Robert Haas)
Commit 28724fd90d2f85a0573a8107b48abad062a86d83 fixed things so that if a background worker fails to start due to fork() failure or because it is terminated before startup succeeds, BGWH_STOPPED will be reported. However, that only helps if the code that uses the background worker machinery notices the change in status, and the code in parallel.c did not.

To fix that, do two things. First, make sure that when a worker exits, it triggers the leader to read from error queues. That way, if a worker which has attached to an error queue exits uncleanly, the leader is sure to throw some error, either the contents of the ErrorResponse sent by the worker, or "lost connection to parallel worker" if it exited without sending one. To cover the case where the worker never starts up in the first place or exits before attaching to the error queue, the ParallelContext now keeps track of which workers have sent at least one message via the error queue. A worker which sends no messages by the time the parallel operation finishes will be checked to see whether it exited before attaching to the error queue; if so, a new error message, "parallel worker failed to initialize", will be reported. If not, we'll continue to wait until it either starts up and exits cleanly, starts up and exits uncleanly, or fails to start, and then take the appropriate action.

Patch by me, reviewed by Amit Kapila.

Discussion: http://postgr.es/m/CA+TgmoYnBgXgdTu6wk5YPdWhmgabYc9nY_pFLq=tB=FSLYkD8Q@mail.gmail.com
2017-12-21  Cancel CV sleep during subtransaction abort.  (Robert Haas)
Generally, error recovery paths that need to do things like LWLockReleaseAll and pgstat_report_wait_end also need to call ConditionVariableCancelSleep, but AbortSubTransaction was missed. Since subtransaction abort might destroy the DSM segment that contains the ConditionVariable stored in cv_sleep_target, this can result in a crash for anything using condition variables.

Reported and diagnosed by Andres Freund.

Discussion: http://postgr.es/m/20171221110048.rxk6464azzl5t2fi@alap3.anarazel.de
2017-12-19  Fix bug in cancellation of non-exclusive backup to avoid assertion failure.  (Fujii Masao)
Previously an assertion failure occurred when pg_stop_backup() for a non-exclusive backup was aborted while waiting for WAL files to be archived. This assertion failure happened in do_pg_abort_backup(), which is called when a non-exclusive backup is canceled. do_pg_abort_backup() assumes that there is at least one non-exclusive backup running when it's called. But pg_stop_backup() can be canceled even after it marks the end of the non-exclusive backup (e.g., while waiting for WAL archiving). This broke the assumption that do_pg_abort_backup() relies on, and caused the assertion failure.

This commit changes do_pg_abort_backup() so that it does nothing when the non-exclusive backup has already been marked as completed. That is, the assumption is also changed, and do_pg_abort_backup() can now handle even the case where it's called when there is no running backup.

Backpatch to 9.6 where SQL-callable non-exclusive backup was added.

Author: Masahiko Sawada and Michael Paquier
Reviewed-by: Robert Haas and Fujii Masao
Discussion: https://www.postgresql.org/message-id/CAD21AoD2L1Fu2c==gnVASMyFAAaq3y-AQ2uEVj-zTCGFFjvmDg@mail.gmail.com
2017-12-08  Fix mistake in comment  (Peter Eisentraut)
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
2017-12-04  Clean up assorted messiness around AllocateDir() usage.  (Tom Lane)
This patch fixes a couple of low-probability bugs that could lead to reporting an irrelevant errno value (and hence possibly a wrong SQLSTATE) concerning directory-open or file-open failures. It also fixes places where we took shortcuts in reporting such errors, either by using elog instead of ereport or by using ereport but forgetting to specify an errcode. And it eliminates a lot of just plain redundant error-handling code.

In service of all this, export fd.c's formerly-static function ReadDirExtended, so that external callers can make use of the coding pattern

    dir = AllocateDir(path);
    while ((de = ReadDirExtended(dir, path, LOG)) != NULL)

if they'd like to treat directory-open failures as mere LOG conditions rather than errors. Also fix FreeDir to be a no-op if we reach it with dir == NULL, as such a coding pattern would cause. Then, remove code at many call sites that was throwing an error or log message for AllocateDir failure, as ReadDir or ReadDirExtended can handle that job just fine. Aside from being a net code savings, this gets rid of a lot of not-quite-up-to-snuff reports, as mentioned above. (In some places these changes result in replacing a custom error message such as "could not open tablespace directory" with more generic wording "could not open directory", but it was agreed that the custom wording buys little as long as we report the directory name.) In some other call sites where we can't just remove code, change the error reports to be fully project-style-compliant.

Also reorder code in restoreTwoPhaseData that was acquiring a lock between AllocateDir and ReadDir; in the unlikely but surely not impossible case that LWLockAcquire changes errno, AllocateDir failures would be misreported. There is no great value in opening the directory before acquiring TwoPhaseStateLock, so just do it in the other order.

Also fix CheckXLogRemoved to guarantee that it preserves errno, as quite a number of call sites are implicitly assuming. (Again, it's unlikely but I think not impossible that errno could change during a SpinLockAcquire. If so, this function was broken for its own purposes as well as breaking callers.)

And change a few places that were using not-per-project-style messages, such as "could not read directory" when "could not open directory" is more correct.

Back-patch the exporting of ReadDirExtended, in case we have occasion to back-patch some fix that makes use of it; it's not needed right now but surely making it global is pretty harmless. Also back-patch the restoreTwoPhaseData and CheckXLogRemoved fixes. The rest of this is essentially cosmetic and need not get back-patched.

Michael Paquier, with a bit of additional work by me

Discussion: https://postgr.es/m/CAB7nPqRpOCxjiirHmebEFhXVTK7V5Jvw4bz82p7Oimtsm3TyZA@mail.gmail.com
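A fuller sketch of that caller pattern, including the FreeDir() behavior this commit guarantees:

    DIR        *dir;
    struct dirent *de;

    dir = AllocateDir(path);    /* open failure handled below, at LOG */
    while ((de = ReadDirExtended(dir, path, LOG)) != NULL)
    {
        /* process de->d_name */
    }
    FreeDir(dir);               /* no-op if dir == NULL */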
2017-11-28  Fix ReinitializeParallelDSM to tolerate finding no error queues.  (Robert Haas)
Commit d4663350646ca0c069a36d906155a0f7e3372eb7 changed things so that shm_toc_lookup would fail with an error rather than silently returning NULL, in the hope that such failures would be reported in a useful way rather than via a system crash. However, it overlooked the fact that the lookup of PARALLEL_KEY_ERROR_QUEUE in ReinitializeParallelDSM is expected to fail when no DSM segment was created in the first place; in that case, we end up with a backend-private memory segment that still contains an entry for PARALLEL_KEY_FIXED but no others. Consequently a benign failure to initialize parallelism can escalate into an elog(ERROR); repair.

Discussion: http://postgr.es/m/CA+Tgmob8LFw55DzH1QEREpBEA9RJ_W_amhBFCVZ6WMwUhVpOqg@mail.gmail.com
2017-11-26  Pad XLogReaderState's main_data buffer more aggressively.  (Tom Lane)
Originally, we palloc'd this buffer just barely big enough to hold the largest xlog record seen so far. It turns out that that can result in valgrind complaints, because some compilers will emit code that assumes it can safely fetch padding bytes at the end of a struct, and those padding bytes were unallocated so far as aset.c was concerned. We can fix that by MAXALIGN'ing the palloc request size, ensuring that it is big enough to include any possible padding that might've been omitted from the on-disk record.

An additional objection to the original coding is that it could result in many repeated palloc cycles, in the worst case where we see a series of gradually larger xlog records. We can ameliorate that cheaply by imposing a minimum buffer size that's large enough for most xlog records. BLCKSZ/2 was chosen after a bit of discussion.

In passing, remove an obsolete comment in struct xl_heap_new_cid that the combocid field is free due to alignment considerations. Perhaps that was true at some point, but it's not now.

Back-patch to 9.5 where this code came in.

Discussion: https://postgr.es/m/E1eHa4J-0006hI-Q8@gemulon.postgresql.org
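A sketch of the resulting allocation rule (variable names are illustrative; the constants are the ones discussed above):

    /*
     * Round the request up so compiler-emitted padding reads stay within
     * the allocation, and impose a floor to avoid repeated repallocs for
     * gradually growing records.
     */
    state->main_data_bufsz = MAXALIGN(Max(main_data_len, BLCKSZ / 2));

    if (state->main_data)
        pfree(state->main_data);
    state->main_data = palloc(state->main_data_bufsz);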
2017-10-29  Fix problems with the "role" GUC and parallel query.  (Robert Haas)
Without this fix, dropping a role can sometimes result in parallel query failures in sessions that have used "SET ROLE" to assume the dropped role, even if that setting isn't active any more.

Report by Pavan Deolasee. Patch by Amit Kapila, reviewed by me.

Discussion: http://postgr.es/m/CABOikdOomRcZsLsLK+Z+qENM1zxyaWnAvFh3MJZzZnnKiF+REg@mail.gmail.com
2017-10-06  Fix access-off-end-of-array in clog.c.  (Tom Lane)
Sloppy loop coding in set_status_by_pages() resulted in fetching one array element more than it should from the subxids[] array. The odds of this resulting in SIGSEGV are pretty small, but we've certainly seen that happen with similar mistakes elsewhere. While at it, we can get rid of an extra TransactionIdToPage() calculation per loop.

Per report from David Binderman. Back-patch to all supported branches, since this code is quite old.

Discussion: https://postgr.es/m/HE1PR0802MB2331CBA919CBFFF0C465EB429C710@HE1PR0802MB2331.eurprd08.prod.outlook.com
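A sketch of the corrected loop shape: the page number is computed from the element about to be processed, never by peeking one slot past the end of subxids[]:

    int         i = 0;

    while (i < nsubxids)
    {
        int         pageno = TransactionIdToPage(subxids[i]);
        int         num_on_page = 0;

        /* count how many consecutive subxids fall on this CLOG page */
        while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
        {
            num_on_page++;
            i++;
        }

        /* ... set the status of those num_on_page xids on pageno ... */
    }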
2017-09-27  Revert to 9.6 treatment of ALTER TYPE enumtype ADD VALUE.  (Tom Lane)
This reverts commit 15bc038f9, along with the followon commits 1635e80d3 and 984c92074 that tried to clean up the problems exposed by bug #14825. The result was incomplete because it failed to address parallel-query requirements. With 10.0 release so close upon us, now does not seem like the time to be adding more code to fix that. I hope we can un-revert this code and add the missing parallel query support during the v11 cycle.

Back-patch to v10.

Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
2017-09-26  Use a blacklist to distinguish original from add-on enum values.  (Tom Lane)
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside transaction blocks, by disallowing the use of the added value later in the same transaction, except under limited circumstances. However, the test for "limited circumstances" was heuristic and could reject references to enum values that were created during CREATE TYPE AS ENUM, not just later. This breaks the use-case of restoring pg_dump scripts in a single transaction, as reported in bug #14825 from Balazs Szilfai.

We can improve this by keeping a "blacklist" table of enum value OIDs created by ALTER TYPE ADD VALUE during the current transaction. Any visible-but-uncommitted value whose OID is not in the blacklist must have been created by CREATE TYPE AS ENUM, and can be used safely because it could not have a lifespan shorter than its parent enum type.

This change also removes the restriction that a renamed enum value can't be used before being committed (unless it was on the blacklist).

Andrew Dunstan, with cosmetic improvements by me. Back-patch to v10.

Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
2017-09-22  For wal_consistency_checking, mask page checksum as well as page LSN.  (Robert Haas)
If the LSN is different, the checksum will be different, too.

Ashwin Agrawal, reviewed by Michael Paquier and Kuntal Ghosh

Discussion: http://postgr.es/m/CALfoeis5iqrAU-+JAN+ZzXkpPr7+-0OAGv7QUHwFn=-wDy4o4Q@mail.gmail.com
2017-09-11  Message style fixes  (Peter Eisentraut)
2017-08-31  Clean up shm_mq cleanup.  (Tom Lane)
The logic around shm_mq_detach was a few bricks shy of a load, because (contrary to the comments for shm_mq_attach) all it did was update the shared shm_mq state. That left us leaking a bit of process-local memory, but much worse, the on_dsm_detach callback for shm_mq_detach was still armed. That means that whenever we ultimately detach from the DSM segment, we'd run shm_mq_detach again for already-detached, possibly long-dead queues. This accidentally fails to fail today, because we only ever re-use a shm_mq's memory for another shm_mq, and multiple detach attempts on the last such shm_mq are fairly harmless. But it's gonna bite us someday, so let's clean it up.

To do that, change shm_mq_detach's API so it takes a shm_mq_handle not the underlying shm_mq. This makes the callers simpler in most cases anyway. Also fix a few places in parallel.c that were just pfree'ing the handle structs rather than doing proper cleanup.

Back-patch to v10 because of the risk that the revenant shm_mq_detach callbacks would cause a live bug sometime. Since this is an API change, it's too late to do it in 9.6. (We could make a variant patch that preserves API, but I'm not excited enough to do that.)

Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
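The API change from a caller's perspective, sketched under the assumptions above (attach returns a backend-private handle; detach now takes that handle):

    shm_mq_handle *mqh = shm_mq_attach(mq, seg, NULL);

    /* ... exchange messages via shm_mq_send() / shm_mq_receive() ... */

    /*
     * Detaching via the handle updates the shared queue state, disarms
     * the on_dsm_detach callback, and releases the handle's local memory.
     */
    shm_mq_detach(mqh);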
2017-08-14  Final pgindent + perltidy run for v10.  (Tom Lane)
2017-08-14  Handle elog(FATAL) during ROLLBACK more robustly.  (Tom Lane)
Stress testing by Andreas Seltenreich disclosed longstanding problems that occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are trying to execute a ROLLBACK of an already-failed transaction. In such a case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction would skip AbortTransaction and go straight to CleanupTransaction. This led to an assert failure in an assert-enabled build (due to the ROLLBACK's portal still having a cleanup hook) or, without assertions, to a FATAL exit complaining about "cannot drop active portal". The latter's not disastrous, perhaps, but it's messy enough to want to improve it.

We don't really want to run all of AbortTransaction in this code path. The minimum required to clean up the open portal safely is to do AtAbort_Memory and AtAbort_Portals. It seems like a good idea to do AtAbort_Memory unconditionally, to be entirely sure that we are starting with a safe CurrentMemoryContext. That means that if the main loop in AbortOutOfAnyTransaction does nothing, we need an extra step at the bottom to restore CurrentMemoryContext = TopMemoryContext, which I chose to do by invoking AtCleanup_Memory. This'll result in calling AtCleanup_Memory twice in many of the paths through this function, but that seems harmless and reasonably inexpensive.

The original motivation for the assertion in AtCleanup_Portals was that we wanted to be sure that any user-defined code executed as a consequence of the cleanup hook runs during AbortTransaction not CleanupTransaction. That still seems like a valid concern, and now that we've seen one case of the assertion firing --- which means that exactly that would have happened in a production build --- let's replace the Assert with a runtime check. If we see the cleanup hook still set, we'll emit a WARNING and just drop the hook unexecuted.

This has been like this a long time, so back-patch to all supported branches.

Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
2017-08-13  Remove AtEOXact_CatCache().  (Tom Lane)
The sole useful effect of this function, to check that no catcache entries have positive refcounts at transaction end, has really been obsolete since we introduced ResourceOwners in PG 8.1. We reduced the checks to assertions years ago, so that the function was a complete no-op in production builds. There have been previous discussions about removing it entirely, but consensus up to now was that it had some small value as a cross-check for bugs in the ResourceOwner logic.

However, it now emerges that it's possible to trigger these assertions if you hit an assert-enabled backend with SIGTERM during a call to SearchCatCacheList, because that function temporarily increases the refcounts of entries it's intending to add to a catcache list construct. In a normal ERROR scenario, the extra refcounts are cleaned up by SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a transaction abort and exit without ever executing PG_CATCH handlers.

There's a case to be made that this is a generic hazard and we should consider restructuring elog(FATAL) handling so that pending PG_CATCH handlers do get run. That's pretty scary though: it could easily create more problems than it solves. Preliminary stress testing by Andreas Seltenreich suggests that there are not many live problems of this ilk, so we rejected that idea.

There are more-localized ways to fix the problem; the most principled one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY. But adding cycles to SearchCatCacheList isn't very appealing. We could also weaken the assertions in AtEOXact_CatCache in some more or less ad-hoc way, but that just makes its raison d'etre even less compelling. In the end, the most reasonable solution seems to be to just remove AtEOXact_CatCache altogether, on the grounds that it's not worth trying to fix it. It hasn't found any bugs for us in many years.

Per report from Jeevan Chalke. Back-patch to all supported branches.

Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
2017-08-10  Remove uses of "slave" in replication contexts  (Peter Eisentraut)
This affects mostly code comments, some documentation, and tests. Official APIs already used "standby".
2017-08-10  Remove incorrect assertion in clog.c  (Robert Haas)
We must advance the oldest XID that can be safely looked up in clog *before* truncating CLOG, and the oldest XID that can't be reused *after* truncating CLOG. This assertion, and the accompanying comment, are confused; remove them.

Reported by Neha Sharma.

Discussion: http://postgr.es/m/CANiYTQumC3T=UMBMd1Hor=5XWZYuCEQBioL3ug0YtNQCMMT5wQ@mail.gmail.com
2017-08-05  Make pg_stop_backup's wait_for_archive flag work on standbys.  (Robert Haas)
Previously, it had no effect. Now, if archive_mode=always, it will work, and if not, you'll get a warning.

Masahiko Sawada, Michael Paquier, and Robert Haas. The patch as submitted also changed the behavior so that we would write and remove history files on standbys, but that seems like material for a separate patch to me.

Discussion: http://postgr.es/m/CAD21AoC2Xw6M=ZJyejq_9d_iDkReC_=rpvQRw5QsyzKQdfYpkw@mail.gmail.com
2017-08-04  Only kill sync workers at commit time in subscription DDL  (Peter Eisentraut)
This allows a transaction abort to avoid killing those workers.

Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>