path: root/src/backend/access/transam
Age  Commit message  Author
2018-11-25Integrate recovery.conf into postgresql.confPeter Eisentraut
recovery.conf settings are now set in postgresql.conf (or other GUC sources). Currently, all the affected settings are PGC_POSTMASTER; this could be refined in the future case by case. Recovery is now initiated by a file recovery.signal. Standby mode is initiated by a file standby.signal. The standby_mode setting is gone. If a recovery.conf file is found, an error is issued. The trigger_file setting has been renamed to promote_trigger_file as part of the move. The documentation chapter "Recovery Configuration" has been integrated into "Server Configuration". pg_basebackup -R now appends settings to postgresql.auto.conf and creates a standby.signal file. Author: Fujii Masao <masao.fujii@gmail.com> Author: Simon Riggs <simon@2ndquadrant.com> Author: Abhijit Menon-Sen <ams@2ndquadrant.com> Author: Sergei Kornilov <sk@zsrv.org> Discussion: https://www.postgresql.org/message-id/flat/607741529606767@web3g.yandex.ru/
2018-11-23Add WL_EXIT_ON_PM_DEATH pseudo-event.Thomas Munro
Users of the WaitEventSet and WaitLatch() APIs can now choose between asking for WL_POSTMASTER_DEATH and then handling it explicitly, or asking for WL_EXIT_ON_PM_DEATH to trigger immediate exit on postmaster death. This reduces code duplication, since almost all callers want the latter. Repair all code that was previously ignoring postmaster death completely, or requesting the event but ignoring it, or requesting the event but then doing an unconditional PostmasterIsAlive() call every time through its event loop (which is an expensive syscall on platforms for which we don't have USE_POSTMASTER_DEATH_SIGNAL support). Assert that callers of WaitLatchXXX() under the postmaster remember to ask for either WL_POSTMASTER_DEATH or WL_EXIT_ON_PM_DEATH, to prevent future bugs. The only process that doesn't handle postmaster death is syslogger. It waits until all backends holding the write end of the syslog pipe (including the postmaster) have closed it by exiting, to be sure to capture any parting messages. By using the WaitEventSet API directly it avoids the new assertion, and as a by-product it may be slightly more efficient on platforms that have epoll(). Author: Thomas Munro Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas, Tom Lane Discussion: https://postgr.es/m/CAEepm%3D1TCviRykkUb69ppWLr_V697rzd1j3eZsRMmbXvETfqbQ%40mail.gmail.com, https://postgr.es/m/CAEepm=2LqHzizbe7muD7-2yHUbTOoF7Q+qkSD5Q41kuhttRTwA@mail.gmail.com
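A minimal sketch of the two calling styles, with an illustrative timeout and wait-event class (the surrounding loop and error handling are omitted):

    /* Old style: request the event and handle postmaster death explicitly */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   1000L, PG_WAIT_EXTENSION);
    if (rc & WL_POSTMASTER_DEATH)
        proc_exit(1);

    /* New style: exit immediately on postmaster death, no explicit check */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   1000L, PG_WAIT_EXTENSION);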
2018-11-20Remove WITH OIDS support, change oid catalog column visibility.Andres Freund
Previously tables declared WITH OIDS, including a significant fraction of the catalog tables, stored the oid column not as a normal column, but as part of the tuple header. This special column was not shown by default, which was somewhat odd, as it's often (consider e.g. pg_class.oid) one of the more important parts of a row. Neither pg_dump nor COPY included the contents of the oid column by default. The fact that the oid column was not an ordinary column necessitated a significant amount of special case code to support oid columns. That was already painful for the existing code, and the upcoming work aiming to make table storage pluggable would have required expanding and duplicating that "specialness" significantly. WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0). Remove it. Removing includes: - CREATE TABLE and ALTER TABLE syntax for declaring the table to be WITH OIDS has been removed (WITH (oids[ = true]) will error out) - pg_dump does not support dumping tables declared WITH OIDS and will issue a warning when dumping one (and ignore the oid column). - restoring a pg_dump archive with pg_restore will warn when restoring a table with oid contents (and ignore the oid column) - COPY will refuse to load a binary dump that includes oids. - pg_upgrade will error out when encountering tables declared WITH OIDS; they have to be altered to remove the oid column first. - Functionality to access the oid of the last inserted row (like plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed. The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false) for CREATE TABLE) is still supported. While that requires a bit of support code, it seems unnecessary to break applications / dumps that do not use oids, and are explicit about not using them. The biggest user of WITH OID columns was postgres' catalog. This commit changes all 'magic' oid columns to be columns that are normally declared and stored. To reduce unnecessary query breakage all the newly added columns are still named 'oid', even if a table's column naming scheme would indicate 'reloid' or such. This obviously requires adapting a lot of code, mostly replacing oid access via HeapTupleGetOid() with access to the underlying Form_pg_*->oid column. The bootstrap process now assigns oids for all oid columns in genbki.pl that do not have an explicit value (starting at the largest oid previously used); only oids assigned later will be above FirstBootstrapObjectId. As the oid column now is a normal column, the special bootstrap syntax for oids has been removed. Oids are not automatically assigned during insertion anymore; all backend code explicitly assigns oids with GetNewOidWithIndex(). For the rare case that insertions into the catalog via SQL are called for, the new pg_nextoid() function can be used (which only works on catalog tables). The fact that oid columns on system tables are now normal columns means that they will be included in the set of columns expanded by * (i.e. SELECT * FROM pg_class will now include the table's oid, previously it did not). It would not technically be hard to hide the oid column by default, but that'd mean confusing behavior would either have to be carried forward forever, or it'd cause breakage down the line. While it's not unlikely that further adjustments are needed, the scope/invasiveness of the patch makes it worthwhile to merge this now. It's painful to maintain externally, too complicated to commit after the code freeze, and a dependency of a number of other patches. 
Catversion bump, for obvious reasons. Author: Andres Freund, with contributions by John Naylor Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
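A sketch of the adaptation described above, reading a catalog tuple's oid before and after this change (illustrative only; pg_class is used as the example catalog):

    /* Before: the oid lived in the tuple header */
    Oid         relid = HeapTupleGetOid(tuple);

    /* After: the oid is an ordinary column of the catalog's Form struct */
    Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
    Oid         relid = classForm->oid;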
2018-11-19PANIC on fsync() failure.Thomas Munro
On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data. Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files. Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic. Back-patch to all supported releases. There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch. Author: Craig Ringer, with some adjustments by Thomas Munro Reported-by: Craig Ringer Reviewed-by: Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
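A sketch of the resulting error-level pattern for data-file sync failures, assuming data_sync_elevel() is the helper that promotes the level to PANIC unless data_sync_retry is enabled (path and fd are illustrative):

    if (pg_fsync(fd) != 0)
        ereport(data_sync_elevel(ERROR),
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));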
2018-11-19Remove unnecessary memcpy when reading WAL record fitting on pageMichael Paquier
When reading a WAL record, its contents are copied into an intermediate buffer. However, doing so is not necessary if the record fits fully into the current page, saving one memcpy for each such record. The allocation handling of the intermediate buffer is also now done only when a record crosses a page boundary, shaving some extra cycles when reading a WAL record. Author: Andrey Lepikhov Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas Discussion: https://postgr.es/m/c2ea54dd-a1d3-80eb-ddbf-7e6f258e615e@postgrespro.ru
2018-11-14Initialize TransactionState and user ID consistently at transaction startMichael Paquier
If a failure happens when a transaction is starting between the moment the transaction status is changed from TRANS_DEFAULT to TRANS_START and the moment the current user ID and security context flags are fetched via GetUserIdAndSecContext(), or before initializing its basic fields, then those may get reset to incorrect values when the transaction aborts, leaving the session in an inconsistent state. One problem reported is that failing a starting transaction at the first query of a session could cause several kinds of system crashes on the follow-up queries. In order to solve that, move the initialization of the transaction state fields and the call of GetUserIdAndSecContext() in charge of fetching the current user ID close to the point where the transaction status is switched to TRANS_START, where there cannot be any error triggered in-between, per an idea of Tom Lane. This properly ensures that the current user ID, the security context flags and that the basic fields of TransactionState remain consistent even if the transaction fails while starting. Reported-by: Richard Guo Diagnosed-By: Richard Guo Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com Backpatch-through: 9.4
2018-11-10Remove volatiles from {procarray,volatile}.c and fix memory ordering issue.Andres Freund
The use of volatiles in procarray.c largely originated from the time when postgres did not have reliable compiler and memory barriers. That's not the case anymore, so we can do better. Several of the functions in procarray.c can be bottlenecks, and removal of volatile yields mildly better code. The new state, with explicit memory barriers, is also more correct. The previous use of volatile did not actually deliver sufficient guarantees on weakly ordered machines, in particular the logic in GetNewTransactionId() does not look safe. It seems unlikely to be a problem in practice, but worth fixing. Thomas and I independently wrote a patch for this. Reported-By: Andres Freund and Thomas Munro Author: Andres Freund, with cherrypicked changes from a patch by Thomas Munro Discussion: https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de https://postgr.es/m/CAEepm=1nff0x=7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g@mail.gmail.com
2018-11-07Use pg_pread() and pg_pwrite() for data files and WAL.Thomas Munro
Cut down on system calls by doing random I/O using offset-based OS routines where available. Remove the code for tracking the 'virtual' seek position. The only reason left to call FileSeek() was to get the file's size, so provide a new function FileSize() instead. Author: Oskari Saarenmaa, Thomas Munro Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
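Roughly, the syscall reduction looks like this (a sketch with error handling omitted, not the exact md.c code):

    /* Before: position the file, then read -- two system calls */
    lseek(fd, offset, SEEK_SET);
    nread = read(fd, buffer, amount);

    /* After: a single positioned read, no virtual seek position to track */
    nread = pg_pread(fd, buffer, amount, offset);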
2018-11-02Fix spelling errors and typos in commentsMagnus Hagander
Author: Daniel Gustafsson <daniel@yesql.se>
2018-10-31Fix typo in xlog.c.Andres Freund
Author: Daniel Gustafsson Discussion: https://postgr.es/m/A6817958-949E-4A5B-895D-FA421B6640C2@yesql.se
2018-10-25Add pg_promote functionMichael Paquier
This new SQL-callable function makes it possible to promote a standby. Execution access can be granted to non-superusers so that failover tools can observe the principle of least privilege. Catalog version is bumped. Author: Laurenz Albe Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
2018-10-12Simplify use of AllocSetContextCreate() wrapper macro.Tom Lane
We can allow this macro to accept either abbreviated or non-abbreviated allocation parameters by making use of __VA_ARGS__. As noted by Andres Freund, it's unlikely that any compiler would have __builtin_constant_p but not __VA_ARGS__, so this gives up little or no error checking, and it avoids a minor but annoying API break for extensions. With this change, there is no reason for anybody to call AllocSetContextCreateExtended directly, so in HEAD I renamed it to AllocSetContextCreateInternal. It's probably too late for an ABI break like that in 11, though. Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
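Both call forms now go through the same macro; a sketch (the context name is illustrative):

    /* abbreviated form */
    cxt = AllocSetContextCreate(CurrentMemoryContext,
                                "example context",
                                ALLOCSET_DEFAULT_SIZES);

    /* non-abbreviated form, accepted by the same __VA_ARGS__ macro */
    cxt = AllocSetContextCreate(CurrentMemoryContext,
                                "example context",
                                ALLOCSET_DEFAULT_MINSIZE,
                                ALLOCSET_DEFAULT_INITSIZE,
                                ALLOCSET_DEFAULT_MAXSIZE);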
2018-10-09Relax transactional restrictions on ALTER TYPE ... ADD VALUE (redux).Thomas Munro
Originally committed as 15bc038f (plus some follow-ups), this was reverted in 28e07270 due to a problem discovered in parallel workers. This new version corrects that problem by sending the list of uncommitted enum values to parallel workers. Here follows the original commit message describing the change: To prevent possibly breaking indexes on enum columns, we must keep uncommitted enum values from getting stored in tables, unless we can be sure that any such column is new in the current transaction. Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE from being executed at all in a transaction block, unless the target enum type had been created in the current transaction. This patch removes that restriction, and instead insists that an uncommitted enum value can't be referenced unless it belongs to an enum type created in the same transaction as the value. Per discussion, this should be a bit less onerous. It does require each function that could possibly return a new enum value to SQL operations to check this restriction, but there aren't so many of those that this seems unmaintainable. Author: Andrew Dunstan and Tom Lane, with parallel query fix by Thomas Munro Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAEepm%3D0Ei7g6PaNTbcmAh9tCRahQrk%3Dr5ZWLD-jr7hXweYX3yg%40mail.gmail.com Discussion: https://postgr.es/m/4075.1459088427%40sss.pgh.pa.us
2018-10-08Advance transaction timestamp for intra-procedure transactions.Tom Lane
Per discussion, this behavior seems less astonishing than not doing so. Peter Eisentraut and Tom Lane Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
2018-10-06Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.Tom Lane
Previously, a worker process would establish values for these based on its own start time. In v10 and up, this can trivially be shown to cause misbehavior of transaction_timestamp(), timestamp_in(), and related functions which are (perhaps unwisely?) marked parallel-safe. It seems likely that other behaviors might diverge from what happens in the parent as well. It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure it's still possible, so back-patch to all branches containing parallel worker infrastructure. In HEAD only, mark now() and statement_timestamp() as parallel-safe (other affected functions already were). While in theory we could still squeeze that change into v11, it doesn't seem important enough to force a last-minute catversion bump. Konstantin Knizhnik, whacked around a bit by me Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
2018-09-28Fix assertion failure when updating full_page_writes for checkpointer.Amit Kapila
When the checkpointer receives a SIGHUP signal to update its configuration, it may need to update the shared memory for full_page_writes and write a WAL record for it. Now, it is quite possible that the XLOG machinery has not been initialized by that time, and this leads to an assertion failure while doing that. The fix is to allow the initialization of the XLOG machinery outside the critical section. This bug was introduced by commit 2c03216d83, which added the XLOG machinery initialization in the RecoveryInProgress code path. Reported-by: Dilip Kumar Author: Dilip Kumar Reviewed-by: Michael Paquier and Amit Kapila Backpatch-through: 9.5 Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
2018-09-28Fix WAL recycling on standbys depending on archive_modeMichael Paquier
A restart point or a checkpoint recycling WAL segments treats segments marked with neither ".done" (archiving is done) nor ".ready" (segment is ready to be archived) in archive_status the same way for archive_mode being "on" or "always". While for a primary this is fine, a standby running a restart point with archive_mode = on would try to mark such a segment as ready for archiving, which is something that will never happen except after the standby is promoted. Note that this problem applies only to WAL segments coming from the local pg_wal the first time archive recovery is run. Segments part of a self-contained base backup are the most common case where this could happen; however, even in this case the .done markers would normally be part of the backup. Segments recovered from an archive are marked as .ready or .done by the startup process, and segments finished streaming are marked as such by the WAL receiver, so they are handled already. Reported-by: Haruka Takatsuka Author: Michael Paquier Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org Backpatch-through: 9.5, where archive_mode = always has been added.
2018-09-26Rework activation of commit timestamps during recoveryMichael Paquier
The activation and deactivation of commit timestamp tracking has not been handled consistently for a primary or standbys at recovery. The facility can be activated at three different moments of recovery: - The beginning, where a primary would use the GUC value for the decision-making, and where a standby relies on the contents of the control file. - When replaying an XLOG_PARAMETER_CHANGE record at redo. - The end, where both primary and standby rely on the GUC value. Using the GUC value for a primary at the beginning of recovery causes problems with commit timestamp access when doing crash recovery. Particularly, when replaying transaction commits, an attempt could be made to read the commit timestamp of a transaction which committed at a moment when track_commit_timestamp was disabled. A test case is added to reproduce the failure. The test works down to v11 as it takes advantage of transaction commits within procedures. Reported-by: Hailong Li Author: Masahiko Sawada, Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com Backpatch-through: 9.5, where commit timestamps have been introduced.
2018-09-20Defer restoration of libraries in parallel workers.Thomas Munro
Several users of extensions complained of crashes in parallel workers that turned out to be due to syscache access from their _PG_init() functions. Reorder the initialization of parallel workers so that libraries are restored after the caches are initialized, and inside a transaction. This was reported in bug #15350 and elsewhere. We don't consider it to be a bug: extensions shouldn't do that, because then they can't be used in shared_preload_libraries. However, it's a fairly obscure hazard and these extensions worked in practice before parallel query came along. So let's make it work. Later commits might add a warning message and eventually an error. Back-patch to 9.6, where parallel query landed. Author: Thomas Munro Reviewed-by: Amit Kapila Reported-by: Kieran McCusker, Jimmy Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
2018-09-19Fix minor error message style guide violation.Tom Lane
No periods at the ends of primary error messages, please. Daniel Gustafsson Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
2018-09-13Attach FPI to the first record after full_page_writes is turned on.Amit Kapila
XLogInsert fails to attach a required FPI to the first record after full_page_writes is turned on by the last checkpoint. This bug got introduced in 9.5 due to code rearrangement in commits 2c03216d83 and 2076db2aea. Fix it by ensuring that XLogInsertRecord performs a recomputation when the given record was generated with FPW off but the flag is found to have been turned on while actually inserting the record. Reported-by: Kyotaro Horiguchi Author: Kyotaro Horiguchi Reviewed-by: Amit Kapila Backpatch-through: 9.5 where this problem was introduced Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
2018-09-08Remove duplicated words split across lines in commentsMichael Paquier
This has been detected using some interesting tricks with sed, and the method used is described in detail in the discussion below. Author: Justin Pryzby Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com
2018-09-07Improve handling of corrupted two-phase state files at recoveryMichael Paquier
When a corrupted two-phase state file is found by WAL replay, be it for crash recovery or archive recovery, the file is simply skipped and a WARNING is logged to the user, causing the transaction to be silently lost. Corruption of an on-disk two-phase state file is as likely to happen as corruption of what is stored in WAL records, but WAL records are already able to fail hard if there is a CRC mismatch. On-disk two-phase state files, on the contrary, are simply ignored if corrupted. Note that when restoring the initial two-phase data state at recovery, files newer than the horizon XID are discarded, hence no files present in pg_twophase/ should be torn, as they have been made durable by a previous checkpoint, so recovery should never see any corrupted two-phase state file by design. The situation got better since 978b2f6 which has added two-phase state information directly in WAL instead of using on-disk files, so the risk is limited to two-phase transactions which live across at least one checkpoint for long periods. Backups having legit two-phase state files on-disk could also silently lose transactions when restored if things get corrupted. This behavior has existed since two-phase commit was introduced; no back-patch is done for now given the lack of complaints about this problem. Author: Michael Paquier Discussion: https://postgr.es/m/20180709050309.GM1467@paquier.xyz
2018-09-07Use C99 designated initializers for some structsPeter Eisentraut
These are just a few particularly egregious cases that were hard to read and write, and error prone because of many similar adjacent types. Discussion: https://www.postgresql.org/message-id/flat/4c9f01be-9245-2148-b569-61a8562ef190%402ndquadrant.com
2018-09-01Avoid using potentially-under-aligned page buffers.Tom Lane
There's a project policy against using plain "char buf[BLCKSZ]" local or static variables as page buffers; preferred style is to palloc or malloc each buffer to ensure it is MAXALIGN'd. However, that policy's been ignored in an increasing number of places. We've apparently got away with it so far, probably because (a) relatively few people use platforms on which misalignment causes core dumps and/or (b) the variables chance to be sufficiently aligned anyway. But this is not something to rely on. Moreover, even if we don't get a core dump, we might be paying a lot of cycles for misaligned accesses. To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock that the compiler must allocate with sufficient alignment, and use those in place of plain char arrays. I used these types even for variables where there's no risk of a misaligned access, since ensuring proper alignment should make kernel data transfers faster. I also changed some places where we had been palloc'ing short-lived buffers, for coding style uniformity and to save palloc/pfree overhead. Since this seems to be a live portability hazard (despite the lack of field reports), back-patch to all supported versions. Patch by me; thanks to Michael Paquier for review. Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
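A sketch of the preferred style after this change (variable names are illustrative):

    /* old style, against project policy: alignment is not guaranteed */
    char        pagebuf[BLCKSZ];

    /* new style: the union forces adequate alignment; use .data as the page */
    PGAlignedBlock page;
    memset(page.data, 0, BLCKSZ);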
2018-08-31Ensure correct minimum consistent point on standbysMichael Paquier
Commit 8d68ee6 improved the startup process's calculation of the minimum consistent point, which ensures that all available WAL gets replayed when doing crash recovery, but it introduced an incorrect calculation of the minimum recovery point for non-startup processes. This can cause incorrect page references on a standby when, for example, the background writer flushed a couple of pages on-disk but did not update the control file to let a subsequent crash recovery replay to where it should have. The only case where this has been reported to be a problem is when a standby needs to calculate the latest removed xid when replaying a btree deletion record, so one would need connections on a standby that happen just after recovery has thought it reached a consistent point. Using a background worker which is started after the consistent point is reached would be the easiest way to get into problems if it connects to a database. Having clients which attempt to connect periodically could also be a problem, but the odds of seeing this problem are much lower. The fix used is pretty simple: give non-startup processes access to the minimum recovery point written in the control file so that they use it as a reference, while the startup process still initializes its own reference of the minimum consistent point, so that the original problem with incorrect page references happening post-promotion with a crash does not show up. Reported-by: Alexander Kukushkin Diagnosed-by: Alexander Kukushkin Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org Backpatch-through: 9.3
2018-08-16Improve comment in GetNewObjectId().Thomas Munro
The previous comment gave the impression that skipping OIDs before FirstNormalObjectId was merely an optimization to avoid likely collisions. In fact other parts of the system have been relying on this threshold to detect system-created objects since commit 8e18d04d4da, so adjust the wording. Author: Thomas Munro Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAEepm%3D33JASACeOayr_W3%3DCSjy2jiPxM-k89axu0akFbHdjnjA%40mail.gmail.com
2018-08-13Make autovacuum more aggressive to remove orphaned temp tablesMichael Paquier
Commit dafa084, added in 10, made the removal of temporary orphaned tables more aggressive. This commit takes an extra step in that direction by adding a flag in each backend's MyProc which tracks any temporary namespace currently in use. The flag is set when the namespace gets created and can be reset if the temporary namespace has been created in a transaction or sub-transaction which is aborted. The flag value assignment is assumed to be atomic, so this can be done in a lock-less fashion like other flags already present in PGPROC such as databaseId or backendId; still, the fact that the temporary namespace and tables created remain locked until the creating transaction commits acts as a barrier for other backends. This new flag is used by autovacuum to more aggressively discard orphaned tables by additionally checking the database a backend is connected to as well as its temporary namespace in use, removing orphaned temporary relations even if a backend reuses the same slot as one which created temporary relations in a past session. The base idea of this patch comes from Robert Haas; its first version was written by Tsunakawa Takayuki, then heavily reviewed by me. Author: Tsunakawa Takayuki Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05 Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI breakages on already released versions.
2018-08-10Handle parallel index builds on mapped relations.Peter Geoghegan
Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to propagate relmapper.c backend local cache state to parallel worker processes. This could result in parallel index builds against mapped catalog relations where the leader process (participating as a worker) scans the new, pristine relfilenode, while worker processes scan the obsolescent relfilenode. When this happened, the final index structure was typically not consistent with the owning table's structure. The final index structure could contain entries formed from both heap relfilenodes. Only rebuilds on mapped catalog relations that occur as part of a VACUUM FULL or CLUSTER could become corrupt in practice, since their mapped relation relfilenode swap is what allows the inconsistency to arise. On master, fix the problem by propagating the required relmapper.c backend state as part of standard parallel initialization (Cf. commit 29d58fd3). On v11, simply disallow builds against mapped catalog relations by deeming them parallel unsafe. Author: Peter Geoghegan Reported-By: "death lock" Reviewed-By: Tom Lane, Amit Kapila Bug: #15309 Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-05Reset properly errno before calling write()Michael Paquier
6cb3372 forces errno to ENOSPC when fewer bytes than expected have been written and errno is unset, though it forgot to properly reset errno before doing a system call to write(), so errno could potentially come from a previous system call. Reported-by: Tom Lane Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
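The pattern being enforced, sketched below; the ENOSPC mapping applies when a short write leaves errno unset (fd, buf, len and path are illustrative):

    errno = 0;
    if (write(fd, buf, len) != len)
    {
        /* if write() did not set errno, assume the problem is disk space */
        if (errno == 0)
            errno = ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    }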
2018-07-24Fix calculation for WAL segment recycling and removalMichael Paquier
Commit 4b0d28de06 has removed the prior checkpoint and related facilities but has left WAL recycling based on the LSN of the prior checkpoint, which causes incorrect calculations for WAL removal and recycling for max_wal_size and min_wal_size. This commit changes things so that the base calculation point is the last checkpoint generated. Reported-by: Kyotaro Horiguchi Author: Kyotaro Horiguchi Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/20180723.135748.42558387.horiguchi.kyotaro@lab.ntt.co.jp Backpatch: 11-, where the prior checkpoint has been removed.
2018-07-23Add proper errcodes to new error messages for read() failuresMichael Paquier
Those would use the default ERRCODE_INTERNAL_ERROR, but for foreseeable failures an errcode ought to be set, ERRCODE_DATA_CORRUPTED making the most sense here. While on the way, fix one errcode_for_file_access that has been missing in origin.c since the code was created, and remove one assignment of errno to 0 before calling read(); the latter was there to fit with what was present before 811b6e36, where errno would not be set when not enough bytes are read. I noticed the first one, and Tom pinged me about the second one. Author: Michael Paquier Reported-by: Tom Lane Discussion: https://postgr.es/m/27265.1531925836@sss.pgh.pa.us
2018-07-23Make more consistent some error messages for file-related operationsMichael Paquier
Some error messages which report something about a file operation also include context which is already provided by the path being worked on, making things rather duplicated. This creates more work for translators, and does not actually bring clarity. More could be done, but in a lot of cases the context used is actually useful; still, this patch cuts things down nicely. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Tom Lane Discussion: https://postgr.es/m/20180718044711.GA8565@paquier.xyz
2018-07-18Use a ResourceOwner to track buffer pins in all cases.Tom Lane
Historically, we've allowed auxiliary processes to take buffer pins without tracking them in a ResourceOwner. However, that creates problems for error recovery. In particular, we've seen multiple reports of assertion crashes in the startup process when it gets an error while holding a buffer pin, as for example if it gets ENOSPC during a write. In a non-assert build, the process would simply exit without releasing the pin at all. We've gotten away with that so far just because a failure exit of the startup process translates to a database crash anyhow; but any similar behavior in other aux processes could result in stuck pins and subsequent problems in vacuum. To improve this, institute a policy that we must *always* have a resowner backing any attempt to pin a buffer, which we can enforce just by removing the previous special-case code in resowner.c. Add infrastructure to make it easy to create a process-lifespan AuxProcessResourceOwner and clear out its contents at appropriate times. Replace existing ad-hoc resowner management in bgwriter.c and other aux processes with that. (Thus, while the startup process gains a resowner where it had none at all before, some other aux process types are replacing an ad-hoc resowner with this code.) Also use the AuxProcessResourceOwner to manage buffer pins taken during StartupXLOG and ShutdownXLOG, even when those are being run in a bootstrap process or a standalone backend rather than a true auxiliary process. In passing, remove some other ad-hoc resource owner creations that had gotten cargo-culted into various other places. As far as I can tell that was all unnecessary, and if it had been necessary it was incomplete, due to lacking any provision for clearing those resowners later. (Also worth noting in this connection is that a process that hasn't called InitBufferPoolBackend has no business accessing buffers; so there's more to do than just add the resowner if we want to touch buffers in processes not covered by this patch.) Although this fixes a very old bug, no back-patch, because there's no evidence of any significant problem in non-assert builds. Patch by me, pursuant to a report from Justin Pryzby. Thanks to Robert Haas and Kyotaro Horiguchi for reviews. Discussion: https://postgr.es/m/20180627233939.GA10276@telsasoft.com
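Roughly, the new helpers give an aux process a process-lifespan owner to pin buffers under; a sketch of the intended usage, assuming the helper names introduced by this commit:

    /* at process startup, before any buffer access */
    CreateAuxProcessResourceOwner();

    /* ... normal work; buffer pins are now tracked by the resowner ... */

    /* on shutdown (or before exiting), release anything still held */
    ReleaseAuxProcessResources(true);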
2018-07-18Fix misc typos, mostly in comments.Heikki Linnakangas
A collection of typos I happened to spot while reading code, as well as grepping for common mistakes. Backpatch to all supported versions, as applicable, to avoid conflicts when backporting other commits in the future.
2018-07-18Fix casting in error message for two-phase fileMichael Paquier
This error comes from 811b6e3, which causes compilation warnings with OSX 10.3 and clang. Reported by Tom Lane, via buildfarm member longfin.
2018-07-18Rework error messages around file handlingMichael Paquier
Some error messages related to file handling are using the code path context to define their state. For example, 2PC-related errors are referring to "two-phase status files", or "relation mapping file" is used for catalog-to-filenode mapping; however, those prove to be difficult to translate, and are not more helpful than just referring to the path of the file being worked on. So simplify all those error messages by just referring to the path of the file used. In some cases, like the manipulation of WAL segments, the context is actually helpful so those are kept. Calls to the system function read() have also been rather inconsistent with their error handling, sometimes not reporting the number of bytes read, and some other code paths trying to use an errno which has not been set. The in-core functions are using a more consistent pattern with this patch, which checks both for errno if set and for an inconsistent read. So as to care about pluralization when reading an unexpected number of byte(s), "could not read: read %d of %zu" is used as the error message, with the %d field being the output result of read() and %zu the expected size. This simplifies the work of translators with fewer variations of the same message. Author: Michael Paquier Reviewed-by: Álvaro Herrera Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
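A sketch of the unified read() error pattern described above (variable names are illustrative; the 2018-07-23 entry above additionally attaches ERRCODE_DATA_CORRUPTED to such short-read reports):

    errno = 0;
    r = read(fd, buf, len);
    if (r != len)
    {
        if (r < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read file \"%s\": %m", path)));
        else
            ereport(ERROR,
                    (errmsg("could not read file \"%s\": read %d of %zu",
                            path, r, (Size) len)));
    }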
2018-07-16Add subtransaction handling for table synchronization workers.Robert Haas
Since the old logic was completely unaware of subtransactions, a change made in a subsequently-aborted subtransaction would still cause workers to be stopped at toplevel transaction commit. Fix that by managing a stack of worker lists rather than just one. Amit Khandekar and Robert Haas Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
2018-07-13Clean up temporary WAL segments after an instance crashMichael Paquier
Temporary WAL segments are created in pg_wal and named as xlogtemp.pid before being renamed to the real deal when creating a new segment. If an instance crashes after the temporary segment is created and before the rename is done, then the server would finish with unremovable data. After an instance crash, scan pg_wal and remove any such segments. With repetitive unlucky crashes this would contribute to disk bloat and presents risks of ENOSPC especially with max_wal_size close to the maximum allowed. Author: Michael Paquier Reviewed-by: Yugo Nagata, Heikki Linnakangas Discussion: https://postgr.es/m/20180514054955.GF1528@paquier.xyz
2018-07-10Remove dynamic_shared_memory_type=nonePeter Eisentraut
PostgreSQL nowadays offers some kind of dynamic shared memory feature on all supported platforms. Having the choice of "none" prevents us from relying on DSM in core features. So this patch removes the choice of "none". Author: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
2018-07-09Flip argument order in XLogSegNoOffsetToRecPtrAlvaro Herrera
Commit fc49e24fa69a added an input argument after the existing output argument. Flip those. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20180708182345.imdgovmkffgtihhk@alvherre.pgsql
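After the flip, the input arguments precede the output, roughly as below (a sketch; check xlog_internal.h for the authoritative definition):

    XLogRecPtr  recptr;

    /* inputs first (segment number, offset, segment size), output last */
    XLogSegNoOffsetToRecPtr(segno, offset, wal_segment_size, recptr);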
2018-07-09Rework order of end-of-recovery actions to delay timeline history writeMichael Paquier
A critical failure in some of the end-of-recovery actions before the end-of-recovery record is written can cause PostgreSQL to react inconsistently with the rest of the cluster in the event of a crash before the final record is written. Two such failures are, for example, an error while processing two-phase state files or when operating on recovery.conf. With this commit, the failures are still considered FATAL, but the write of the timeline history file is delayed as much as possible so that the window between the moment the file is written and the end-of-recovery record is generated is minimized. This way, in the event of a crash or a failure, the new timeline decided at promotion will not seem taken by other nodes in the cluster. It is not really possible to reduce this window to zero, hence one could still see failures if a crash happens between the history file write and the end-of-recovery record, so any future code should be careful when adding new end-of-recovery actions. The original report from Magnus Hagander mentioned a renamed recovery.conf as the original end-of-recovery failure, which caused a timeline to be seen as taken, but the subsequent processing of the now-missing recovery.conf caused the startup process to stop on FATAL, which at follow-up startup made the system inconsistent because of on-disk changes which had already happened. Processing of two-phase state files still needs some work as corrupted entries are simply ignored now. This is left as a future item and this commit fixes the original complaint. Reported-by: Magnus Hagander Author: Heikki Linnakangas Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
2018-07-05Prevent references to invalid relation pages after fresh promotionMichael Paquier
If a standby crashes after promotion before having completed its first post-recovery checkpoint, then the minimum recovery point, which marks the LSN position where the cluster is able to reach consistency, may be set to a position older than the first end-of-recovery checkpoint, even though all the available WAL should be replayed. This leads to the instance thinking that it contains inconsistent pages, causing a PANIC and a hard instance crash for certain sets of replayed records, even though all the available WAL has not yet been replayed. When in crash recovery, minRecoveryPoint is expected to always be set to InvalidXLogRecPtr, which forces the recovery to replay all the WAL available, so this commit makes sure that the local copy of minRecoveryPoint from the control file is initialized properly and stays as it is while crash recovery is performed. Once switching to archive recovery, or once crash recovery finishes, the local copy of minRecoveryPoint can be safely updated. Pavan Deolasee reported and diagnosed the failure in the first place, and the base fix idea to rely on the local copy of minRecoveryPoint comes from Kyotaro Horiguchi; it has been expanded into a full-fledged patch by me. The test included in this commit has been written by Álvaro Herrera and Pavan Deolasee; I have modified it to make it faster and more reliable with sleep phases. Backpatch down to all supported versions where the bug appears, aka 9.3, which is where the end-of-recovery checkpoint is not run by the startup process anymore. The test gets easily supported down to 10, still it has been tested on all branches. Reported-by: Pavan Deolasee Diagnosed-by: Pavan Deolasee Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05Improve the performance of relation deletes during recovery.Fujii Masao
When multiple relations are deleted in the same transaction, the files of those relations are deleted by one call to smgrdounlinkall(), which leads to scanning the whole shared_buffers only once. OTOH, previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was called for each file to delete, which led to scanning shared_buffers multiple times. Obviously this could increase the WAL replay time greatly, especially when shared_buffers was huge. To alleviate this situation, this commit changes the recovery so that it also calls smgrdounlinkall() only one time to delete multiple relation files. This is just a fix for an oversight in commit 279628a0a7, not a new feature. So, per discussion on pgsql-hackers, we concluded to backpatch this to all supported versions. Author: Fujii Masao Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
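The batching amounts to collecting the SMgrRelation handles and unlinking them in one call, roughly as below (a sketch, not the exact redo code; ndelrels and the delrels array of RelFileNode are illustrative):

    SMgrRelation *srels = palloc(ndelrels * sizeof(SMgrRelation));

    for (i = 0; i < ndelrels; i++)
        srels[i] = smgropen(delrels[i], InvalidBackendId);

    /* one pass over shared_buffers instead of one per relation */
    smgrdounlinkall(srels, ndelrels, true);

    for (i = 0; i < ndelrels; i++)
        smgrclose(srels[i]);
    pfree(srels);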
2018-07-02Add wait event for fsync of WAL segmentsMichael Paquier
This was visibly a forgotten spot in the first implementation of wait events for I/O added by 249cf07; what has been missing is the fsync call for WAL segments, which is a wrapper reacting to the value of the GUC wal_sync_method. Reported-by: Konstantin Knizhnik Author: Konstantin Knizhnik Reviewed-by: Craig Ringer, Michael Paquier Discussion: https://postgr.es/m/4a243897-0ad8-f471-aa40-242591f2476e@postgrespro.ru
2018-06-30pgindent run prior to branchingAndrew Dunstan
2018-06-25Address set of issues with errno handlingMichael Paquier
System calls mixed up in error code paths are causing two issues which several code paths have not correctly handled: 1) For write() calls, sometimes the system may write fewer bytes than what was requested without errno being set. Some paths were careful enough to consider that case and assumed that errno should be set to ENOSPC; other calls missed that. 2) errno generated by a system call is overwritten by other system calls which may succeed once an error code path is taken, causing what is reported to the user to be incorrect. This patch uses the brute-force approach of correcting all those code paths. Some refactoring could happen in the future, but this is left as future work, which is not targeted for back-branches anyway. Author: Michael Paquier Reviewed-by: Ashutosh Sharma Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
2018-06-22Fix typo in comment of commit_ts.c for incorrect reference to CLOGMichael Paquier
Author: Shao Bret
2018-06-18Prevent hard failures of standbys caused by recycled WAL segmentsMichael Paquier
When a standby's WAL receiver stops reading WAL from a WAL stream, it writes data to the current WAL segment without having previously zeroed the page currently written to, which can cause the WAL reader to read junk data from a past recycled segment and then try to get a record from it. While the sanity checks in place provide most of the protection needed, in some rare circumstances, with chances increasing when a record header crosses a page boundary, the startup process could fail violently on an allocation failure, as follows: FATAL: invalid memory alloc request size XXX This is confusing for the user and also unhelpful, as in the worst case this requires a manual restart of the instance, potentially impacting the availability of the cluster, and it also makes the WAL data look like it is in a corrupted state. The chances of seeing failures are higher if the connection between the standby and its root node is unstable, causing WAL pages to be written only in part. A couple of approaches have been discussed, like zeroing new WAL pages within the WAL receiver itself, but this has the disadvantage of impacting the performance of existing instances, as it breaks the sequential writes done by the WAL receiver. This commit deals with the problem with a simpler approach, which has no performance impact and does not reduce detection of the problem: if a record is found with a length higher than 1GB for backends, then do not try any allocation and report a soft failure which will force the standby to retry reading WAL. It could be possible that the allocation call passes and that an unnecessary amount of memory is allocated; however, follow-up checks on records would just fail, making this allocation short-lived anyway. This patch owes a great deal to Tsunakawa Takayuki for reporting the failure first, and then discussing a couple of potential approaches to the problem. Backpatch down to 9.5, which is where palloc_extended has been introduced. Reported-by: Tsunakawa Takayuki Reviewed-by: Tsunakawa Takayuki Author: Michael Paquier Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
2018-06-16Remove AELs from subxids correctly on standbySimon Riggs
Issues relate only to subtransactions that hold AccessExclusiveLocks when replayed on standby. Prior to PG10, aborting subtransactions that held an AccessExclusiveLock failed to release the lock until top level commit or abort. 49bff5300d527 fixed that. However, 49bff5300d527 also introduced a similar bug where subtransaction commit would fail to release an AccessExclusiveLock, leaving the lock to be removed sometimes early and sometimes late. This commit fixes that bug also. Backpatch to PG10 needed. Tested by observation. Note need for multi-node isolationtester to improve test coverage for this and other HS cases. Reported-by: Simon Riggs Author: Simon Riggs