summaryrefslogtreecommitdiff
path: root/src/backend/access/transam
AgeCommit message (Collapse)Author
2021-01-16Prevent excess SimpleLruTruncate() deletion.Noah Misch
Every core SLRU wraps around. With the exception of pg_notify, the wrap point can fall in the middle of a page. Account for this in the PagePrecedes callback specification and in SimpleLruTruncate()'s use of said callback. Update each callback implementation to fit the new specification. This changes SerialPagePrecedesLogically() from the style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes(). (Whereas pg_clog and pg_serial share a key space, pg_serial is nothing like pg_notify.) The bug fixed here has the same symptoms and user followup steps as 592a589a04bd456410b853d86bd05faa9432cbbb. Back-patch to 9.5 (all supported versions). Reviewed by Andrey Borodin and (in earlier versions) by Tom Lane. Discussion: https://postgr.es/m/20190202083822.GC32531@gust.leadboat.com
2021-01-12Fix thinko in commentAlvaro Herrera
This comment has been wrong since its introduction in commit 2c03216d8311. Author: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoAzz6qipFJBbGEaHmyWxvvNDp8httbwLR9tUQWaTjUs2Q@mail.gmail.com
2020-11-10Fix and simplify some usages of TimestampDifference().Tom Lane
Introduce TimestampDifferenceMilliseconds() to simplify callers that would rather have the difference in milliseconds, instead of the select()-oriented seconds-and-microseconds format. This gets rid of at least one integer division per call, and it eliminates some apparently-easy-to-mess-up arithmetic. Two of these call sites were in fact wrong: * pg_prewarm's autoprewarm_main() forgot to multiply the seconds by 1000, thus ending up with a delay 1000X shorter than intended. That doesn't quite make it a busy-wait, but close. * postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute microseconds not milliseconds, thus ending up with a delay 1000X longer than intended. Somebody along the way had noticed this problem but misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather than fixing the units. This was relatively harmless in context, because we don't care that much about exactly how long this delay is; still, it's wrong. There are a few more callers of TimestampDifference() that don't have a direct need for seconds-and-microseconds, but can't use TimestampDifferenceMilliseconds() either because they do need microsecond precision or because they might possibly deal with intervals long enough to overflow 32-bit milliseconds. It might be worth inventing another API to improve that, but that seems outside the scope of this patch; so those callers are untouched here. Given the fact that we are fixing some bugs, and the likelihood that future patches might want to back-patch code that uses this new API, back-patch to all supported branches. Alexey Kondratov and Tom Lane Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
2020-11-09In security-restricted operations, block enqueue of at-commit user code.Noah Misch
Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred triggers within index expressions and materialized view queries. An attacker having permission to create non-temp objects in at least one schema could execute arbitrary SQL functions under the identity of the bootstrap superuser. One can work around the vulnerability by disabling autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW. (Don't restore from pg_dump, since it runs some of those commands.) Plain VACUUM (without FULL) is safe, and all commands are fine when a trusted user owns the target object. Performance may degrade quickly under this workaround, however. Back-patch to 9.5 (all supported versions). Reviewed by Robert Haas. Reported by Etienne Stalmans. Security: CVE-2020-25695
2020-09-24Fix missing fsync of SLRU directories.Thomas Munro
Harmonize behavior by moving reponsibility for fsyncing directories down into slru.c. In 10 and later, only the multixact directories were missed (see commit 1b02be21), and in older branches all SLRUs were missed. Back-patch to all supported releases. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com
2020-09-14Message fixes and style improvementsPeter Eisentraut
2020-08-15Prevent concurrent SimpleLruTruncate() for any given SLRU.Noah Misch
The SimpleLruTruncate() header comment states the new coding rule. To achieve this, add locktype "frozenid" and two LWLocks. This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. Data loss is more likely in pg_multixact, due to released branches' thin margin between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild standbys of that primary regardless of symptoms. At less risk is a cluster having emitted "not accepting commands" errors or "must be vacuumed" warnings at some point. One can test a cluster for this data loss by running VACUUM FREEZE in every database. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
2020-07-20Rename wal_keep_segments to wal_keep_size.Fujii Masao
max_slot_wal_keep_size that was added in v13 and wal_keep_segments are the GUC parameters to specify how much WAL files to retain for the standby servers. While max_slot_wal_keep_size accepts the number of bytes of WAL files, wal_keep_segments accepts the number of WAL files. This difference of setting units between those similar parameters could be confusing to users. To alleviate this situation, this commit renames wal_keep_segments to wal_keep_size, and make users specify the WAL size in it instead of the number of WAL files. There was also the idea to rename max_slot_wal_keep_size to max_slot_wal_keep_segments, in the discussion. But we have been moving away from measuring in segments, for example, checkpoint_segments was replaced by max_wal_size. So we concluded to rename wal_keep_segments to wal_keep_size. Back-patch to v13 where max_slot_wal_keep_size was added. Author: Fujii Masao Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com
2020-07-13Fix uninitialized value in segno calculationAlvaro Herrera
Remove previous hack in KeepLogSeg that added a case to deal with a (badly represented) invalid segment number. This was added for the sake of GetWALAvailability. But it's not needed if in that function we initialize the segment number to be retreated to the currently being written segment, so do that instead. Per valgrind-running buildfarm member skink, and some sparc64 animals. Discussion: https://postgr.es/m/1724648.1594230917@sss.pgh.pa.us
2020-07-08Fix incorrect variable datatype.Fujii Masao
Since slot_keep_segs indicates the number of WAL segments not LSN, its datatype should not be XLogRecPtr. Back-patch to v13 where this issue was added. Reported-by: Atsushi Torikoshi Author: Atsushi Torikoshi, tweaked by Fujii Masao Discussion: https://postgr.es/m/ebd0d674f3e050222238a960cac5251a@oss.nttdata.com
2020-07-07Morph pg_replication_slots.min_safe_lsn to safe_wal_sizeAlvaro Herrera
The previous definition of the column was almost universally disliked, so provide this updated definition which is more useful for monitoring purposes: a large positive value is good, while zero or a negative value means danger. This should be operationally more convenient. Backpatch to 13, where the new column to pg_replication_slots (and the feature it represents) were added. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://postgr.es/m/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com
2020-06-24Adjust max_slot_wal_keep_size behavior per reviewAlvaro Herrera
In pg_replication_slot, change output from normal/reserved/lost to reserved/extended/unreserved/ lost, which better expresses the possible states particularly near the time where segments are no longer safe but checkpoint has not run yet. Under the new definition, reserved means the slot is consuming WAL that's still under the normal WAL size constraints; extended means it's consuming WAL that's being protected by wal_keep_segments or the slot itself, whose size is below max_slot_wal_keep_size; unreserved means the WAL is no longer safe, but checkpoint has not yet removed those files. Such as slot is in imminent danger, but can still continue for a little while and may catch up to the reserved WAL space. Also, there were some bugs in the calculations used to report the status; fixed those. Backpatch to 13. Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
2020-06-24Add parens to ConvertToXSegs macroAlvaro Herrera
The current definition is dangerous. No bugs exist in our code at present, but backpatch to 11 nonetheless where it was introduced. Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
2020-06-08Fix locking bugs that could corrupt pg_control.Thomas Munro
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire ControlFileLock before modifying ControlFile->checkPointCopy, or the checkpointer could write out a control file with a bad checksum. Likewise, XLogReportParameters() must acquire ControlFileLock before modifying ControlFile and calling UpdateControlFile(). Back-patch to all supported releases. Author: Nathan Bossart <bossartn@amazon.com> Author: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
2020-06-08Fix crash in WAL sender when starting physical replicationMichael Paquier
Since database connections can be used with WAL senders in 9.4, it is possible to use physical replication. This commit fixes a crash when starting physical replication with a WAL sender using a database connection, caused by the refactoring done in 850196b. There have been discussions about forbidding the use of physical replication in a database connection, but this is left for later, taking care only of the crash new to 13. While on it, add a test to check for a failure when attempting logical replication if the WAL sender does not have a database connection. This part is extracted from a larger patch by Kyotaro Horiguchi. Reported-by: Vladimir Sitnikov Author: Michael Paquier, Kyotaro Horiguchi Reviewed-by: Kyotaro Horiguchi, Álvaro Herrera Discussion: https://postgr.es/m/CAB=Je-GOWMj1PTPkeUhjqQp-4W3=nW-pXe2Hjax6rJFffB5_Aw@mail.gmail.com Backpatch-through: 13
2020-05-16Mop-up for wait event naming issues.Tom Lane
Synchronize the event names for parallel hash join waits with other event names, by getting rid of the slashes and dropping "-ing" suffixes. Rename ClogGroupUpdate to XactGroupUpdate, to match the new SLRU name. Move the ProcSignalBarrier event to the IPC category; it doesn't belong under IO. Also a bit more wordsmithing in the wait event documentation tables. Discussion: https://postgr.es/m/4505.1589640417@sss.pgh.pa.us
2020-05-15Rename SLRU structures and associated LWLocks.Tom Lane
Originally, the names assigned to SLRUs had no purpose other than being shmem lookup keys, so not a lot of thought went into them. As of v13, though, we're exposing them in the pg_stat_slru view and the pg_stat_reset_slru function, so it seems advisable to take a bit more care. Rename them to names based on the associated on-disk storage directories (which fortunately we *did* think about, to some extent; since those are also visible to DBAs, consistency seems like a good thing). Also rename the associated LWLocks, since those names are likewise user-exposed now as wait event names. For the most part I only touched symbols used in the respective modules' SimpleLruInit() calls, not the names of other related objects. This renaming could have been taken further, and maybe someday we will do so. But for now it seems undesirable to change the names of any globally visible functions or structs, so some inconsistency is unavoidable. (But I *did* terminate "oldserxid" with prejudice, as I found that name both unreadable and not descriptive of the SLRU's contents.) Table 27.12 needs re-alphabetization now, but I'll leave that till after the other LWLock renamings I have in mind. Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
2020-05-14Initial pgindent and pgperltidy run for v13.Tom Lane
Includes some manual cleanup of places that pgindent messed up, most of which weren't per project style anyway. Notably, it seems some people didn't absorb the style rules of commit c9d297751, because there were a bunch of new occurrences of function calls with a newline just after the left paren, all with faulty expectations about how the rest of the call would get indented.
2020-05-14Collect built-in LWLock tranche names statically, not dynamically.Tom Lane
There is little point in using the LWLockRegisterTranche mechanism for built-in tranche names. It wastes cycles, it creates opportunities for bugs (since failing to register a tranche name is a very hard-to-detect problem), and the lack of any centralized list of names encourages sloppy nonconformity in name choices. Moreover, since we have a centralized list of the tranches anyway in enum BuiltinTrancheIds, we're certainly not buying any flexibility in return for these disadvantages. Hence, nuke all the backend-internal LWLockRegisterTranche calls, and instead provide a const array of the builtin tranche names. (I have in mind to change a bunch of these names shortly, but this patch is just about getting them into one place.) Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us
2020-05-13Improve management of SLRU statistics collection.Tom Lane
Instead of re-identifying which statistics bucket to use for a given SLRU on every counter increment, do it once during shmem initialization. This saves a fair number of cycles, and there's no real cost because we could not have a bucket assignment that varies over time or across backends anyway. Also, get rid of the ill-considered decision to let pgstat.c pry directly into SLRU's shared state; it's cleaner just to have slru.c pass the stats bucket number. In consequence of these changes, there's no longer any need to store an SLRU's LWLock tranche info in shared memory, so get rid of that, making this a net reduction in shmem consumption. (That partly reverts fe702a7b3.) This is basically code review for 28cac71bd, so I also cleaned up some comments, removed a dangling extern declaration, fixed some things that should be static and/or const, etc. Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
2020-05-13Adjust walsender usage of xlogreader, simplify APIsAlvaro Herrera
* Have both physical and logical walsender share a 'xlogreader' state struct for tracking state. This replaces the existing globals sendSeg and sendCxt. * Change WALRead not to receive XLogReaderState->seg and ->segcxt as separate arguments anymore; just use the ones from 'state'. This is made possible by the above change. * have the XLogReader segment_open contract require the callbacks to install the file descriptor in the state struct themselves instead of returning it. xlogreader was already ignoring any possible failed return from the callbacks, relying solely on them never returning. (This point is not altogether excellent, as it means the callbacks have to know more of XLogReaderState; but to really improve on that we would have to pass back error info from the callbacks to xlogreader. And the complexity would not be saved but instead just transferred to the callers of WALRead, which would have to learn how to throw errors from the open_segment callback in addition of, as currently, from pg_pread.) * segment_open no longer receives the 'segcxt' as a separate argument, since it's part of the XLogReaderState argument. Per comments from Kyotaro Horiguchi. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql
2020-05-12Fix comment in xlogutils.cMichael Paquier
The existing callers of XLogReadDetermineTimeline() performing recovery need to check a replay LSN position when determining on which timeline to read a WAL page. A portion of the comment describing this function said exactly that, while referring to a routine for fetching a write LSN, something not available in recovery. Author: Kyotaro Horiguchi Discussion: https://postgr.es/m/20200511.101619.2043820539323292957.horikyota.ntt@gmail.com
2020-05-08Rework XLogReader callback systemAlvaro Herrera
Code review for 0dc8ead46363, prompted by a bug closed by 91c40548d5f7. XLogReader's system for opening and closing segments had gotten too complicated, with callbacks being passed at both the XLogReaderAllocate level (read_page) as well as at the WALRead level (segment_open). This was confusing and hard to follow, so restructure things so that these callbacks are passed together at XLogReaderAllocate time, and add another callback to the set (segment_close) to make it a coherent whole. Also, ensure XLogReaderState is an argument to all the callbacks, so that they can grab at the ->private data if necessary. Document the whole arrangement more clearly. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
2020-05-08Report missing wait event for timeline history file.Fujii Masao
TimelineHistoryRead and TimelineHistoryWrite wait events are reported during waiting for a read and write of a timeline history file, respectively. However, previously, TimelineHistoryRead wait event was not reported while readTimeLineHistory() was reading a timeline history file. Also TimelineHistoryWrite was not reported while writeTimeLineHistory() was writing one line with the details of the timeline split, at the end. This commit fixes these issues. Back-patch to v10 where wait events for a timeline history file was added. Author: Masahiro Ikeda Reviewed-by: Michael Paquier, Fujii Masao Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com
2020-05-05Change the display of WAL usage statistics in Explain.Amit Kapila
In commit 33e05f89c5, we have added the option to display WAL usage statistics in Explain and auto_explain. The display format used two spaces between each field which is inconsistent with Buffer usage statistics which is using one space between each field. Change the format to make WAL usage statistics consistent with Buffer usage statistics. This commit also changed the usage of "full page writes" to "full page images" for WAL usage statistics to make it consistent with other parts of code and docs. Author: Julien Rouhaud, Amit Kapila Reviewed-by: Justin Pryzby, Kyotaro Horiguchi and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
2020-04-27Fix some typosMichael Paquier
Author: Justin Pryzby Discussion: https://postgr.es/m/20200408165653.GF2228@telsasoft.com
2020-04-26Fix typoPeter Eisentraut
from 303640199d0
2020-04-24Fix handling of WAL segments ready to be archived during crash recoveryMichael Paquier
78ea8b5 has fixed an issue related to the recycling of WAL segments on standbys depending on archive_mode. However, it has introduced a regression with the handling of WAL segments ready to be archived during crash recovery, causing those files to be recycled without getting archived. This commit fixes the regression by tracking in shared memory if a live cluster is either in crash recovery or archive recovery as the handling of WAL segments ready to be archived is different in both cases (those WAL segments should not be removed during crash recovery), and by using this new shared memory state to decide if a segment can be recycled or not. Previously, it was not possible to know if a cluster was in crash recovery or archive recovery as the shared state was able to track only if recovery was happening or not, leading to the problem. A set of TAP tests is added to close the gap here, making sure that WAL segments ready to be archived are correctly handled when a cluster is in archive or crash recovery with archive_mode set to "on" or "always", for both standby and primary. Reported-by: Benoît Lobréau Author: Jehan-Guillaume de Rorthais Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost Backpatch-through: 9.5
2020-04-21Fix possible crash during FATAL exit from reindexing.Tom Lane
index.c supposed that it could just use a PG_TRY block to clean up the state associated with an active REINDEX operation. However, that code doesn't run if we do a FATAL exit --- for example, due to a SIGTERM shutdown signal --- while the REINDEX is happening. And that state does get consulted during catalog accesses, which makes it problematic if we do any catalog accesses during shutdown --- for example, to clean up any temp tables created in the session. If this combination of circumstances occurred, we could find ourselves trying to access already-freed memory. In debug builds that'd fairly reliably cause an assertion failure. In production we might often get away with it, but with some bad luck it could cause a core dump. Another possible bad outcome is an erroneous conclusion that an index-to-be-accessed is being reindexed; but it looks like that would be unlikely to have any consequences worse than failing to drop temp tables right away. (They'd still get dropped by the next session that uses that temp schema.) To fix, get rid of the use of PG_TRY here, and instead hook into the transaction abort mechanisms to clean up reindex state. Per bug #16378 from Alexander Lakhin. This has been wrong for a very long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/16378-7a70ca41b3ec2009@postgresql.org
2020-04-14Comments and doc fixes for commit 40d964ec99.Amit Kapila
Reported-by: Justin Pryzby Author: Justin Pryzby, with few changes by me Reviewed-by: Amit Kapila and Sawada Masahiko Discussion: https://postgr.es/m/20200322021801.GB2563@telsasoft.com
2020-04-13Cosmetic fixups for WAL usage work.Amit Kapila
Reported-by: Justin Pryzby and Euler Taveira Author: Justin Pryzby and Julien Rouhaud Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
2020-04-08Rationalize GetWalRcv{Write,Flush}RecPtr().Thomas Munro
GetWalRcvWriteRecPtr() previously reported the latest *flushed* location. Adopt the conventional terminology used elsewhere in the tree by renaming it to GetWalRcvFlushRecPtr(), and likewise for some related variables that used the term "received". Add a new definition of GetWalRcvWriteRecPtr(), which returns the latest *written* value. This will allow later patches to use the value for non-data-integrity purposes, without having to wait for the flush pointer to advance. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2020-04-08Revert 0f5ca02f53Alexander Korotkov
0f5ca02f53 introduces 3 new keywords. It appears to be too much for relatively small feature. Given now we past feature freeze, it's already late for discussion of the new syntax. So, revert. Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us
2020-04-07snapshot scalability: Move delayChkpt from PGXACT to PGPROC.Andres Freund
The goal of separating hotly accessed per-backend data from PGPROC into PGXACT is to make accesses fast (GetSnapshotData() in particular). But delayChkpt is not actually accessed frequently; only when starting a checkpoint. As it is frequently modified (multiple times in the course of a single transaction), storing it in the same cacheline as hotly accessed data unnecessarily dirties a contended cacheline. Therefore move delayChkpt to PGPROC. This is part of a larger series of patches intending to improve GetSnapshotData() scalability. It is committed and pushed separately, as it is independently beneficial (small but measurable win, limited by the other frequent modifications of PGXACT). Author: Andres Freund Reviewed-By: Robert Haas, Thomas Munro, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-04-08Track SLRU page hits in SimpleLruReadPage_ReadOnlyTomas Vondra
SLRU page hits were tracked only in SimpleLruReadPage, but that's not enough because we may hit the page in SimpleLruReadPage_ReadOnly in which case we don't call SimpleLruReadPage at all. Reported-by: Kuntal Ghosh Discussion: https://postgr.es/m/20200119143707.gyinppnigokesjok@development
2020-04-07Fix XLogReader FD leak that makes backends unusable after 2PC usage.Andres Freund
Before the fix every 2PC commit/abort leaked a file descriptor. As the files are opened using BasicOpenFile(), that quickly leads to the backend running out of file descriptors. Once enough 2PC abort/commit have caused enough FDs to leak, any IO in the backend will fail with "Too many open files", as BasicOpenFilePerm() will have triggered all open files known to fd.c to be closed. The leak causing the problem at hand is a consequence of 0dc8ead46, but is only exascerbated by it. Previously most XLogPageReadCB callbacks used static variables to cache one open file, but after the commit the cache is private to each XLogReader instance. There never was infrastructure to close FDs at the time of XLogReaderFree, but the way XLogReader was used limited the leak to one FD. This commit just closes the during XLogReaderFree() if the FD is stored in XLogReaderState.seg.ws_segno. This may not be the way to solve this medium/long term, but at least unbreaks 2PC. Discussion: https://postgr.es/m/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de
2020-04-07Allow users to limit storage reserved by replication slotsAlvaro Herrera
Replication slots are useful to retain data that may be needed by a replication system. But experience has shown that allowing them to retain excessive data can lead to the primary failing because of running out of space. This new feature allows the user to configure a maximum amount of space to be reserved using the new option max_slot_wal_keep_size. Slots that overrun that space are invalidated at checkpoint time, enabling the storage to be released. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp
2020-04-07Implement waiting for given lsn at transaction startAlexander Korotkov
This commit adds following optional clause to BEGIN and START TRANSACTION commands. WAIT FOR LSN lsn [ TIMEOUT timeout ] New clause pospones transaction start till given lsn is applied on standby. This clause allows user be sure, that changes previously made on primary would be visible on standby. New shared memory struct is used to track awaited lsn per backend. Recovery process wakes up backend once required lsn is applied. Author: Ivan Kartyshov, Anna Akenteva Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs Reviewed-by: Amit Kapila, Alexander Korotkov Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
2020-04-08Prevent archive recovery from scanning non-existent WAL files.Fujii Masao
Previously when there were multiple timelines listed in the history file of the recovery target timeline, archive recovery searched all of them, starting from the newest timeline to the oldest one, to find the segment to read. That is, archive recovery had to continuously fail scanning the segment until it reached the timeline that the segment belonged to. These scans for non-existent segment could be harmful on the recovery performance especially when archival area was located on the remote storage and each scan could take a long time. To address the issue, this commit changes archive recovery so that it skips scanning the timeline that the segment to read doesn't belong to. Author: Kyotaro Horiguchi, tweaked a bit by Fujii Masao Reviewed-by: David Steele, Pavel Suderevsky, Grigory Smolkin Discussion: https://postgr.es/m/16159-f5a34a3a04dc67e0@postgresql.org Discussion: https://postgr.es/m/20200129.120222.1476610231001551715.horikyota.ntt@gmail.com
2020-04-04Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch
Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN. Future servers accept older WAL, so this bump is discretionary. Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04Revert "Improve handling of parameter differences in physical replication"Peter Eisentraut
This reverts commit 246f136e76ecd26844840f2b2057e2c87ec9868d. That patch wasn't quite complete enough. Discussion: https://www.postgresql.org/message-id/flat/E1jIpJu-0007Ql-CL%40gemulon.postgresql.org
2020-04-04Add infrastructure to track WAL usage.Amit Kapila
This allows gathering the WAL generation statistics for each statement execution. The three statistics that we collect are the number of WAL records, the number of full page writes and the amount of WAL bytes generated. This helps the users who have write-intensive workload to see the impact of I/O due to WAL. This further enables us to see approximately what percentage of overall WAL is due to full page writes. In the future, we can extend this functionality to allow us to compute the the exact amount of WAL data due to full page writes. This patch in itself is just an infrastructure to compute WAL usage data. The upcoming patches will expose this data via explain, auto_explain, pg_stat_statements and verbose (auto)vacuum output. Author: Kirill Bychik, Julien Rouhaud Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
2020-04-03Generate backup manifests for base backups, and validate them.Robert Haas
A manifest is a JSON document which includes (1) the file name, size, last modification time, and an optional checksum for each file backed up, (2) timelines and LSNs for whatever WAL will need to be replayed to make the backup consistent, and (3) a checksum for the manifest itself. By default, we use CRC-32C when checksumming data files, because we are trying to detect corruption and user error, not foil an adversary. However, pg_basebackup and the server-side BASE_BACKUP command now have options to select a different algorithm, so users wanting a cryptographic hash function can select SHA-224, SHA-256, SHA-384, or SHA-512. Users not wanting file checksums at all can disable them, or disable generating of the backup manifest altogether. Using a cryptographic hash function in place of CRC-32C consumes significantly more CPU cycles, which may slow down backups in some cases. A new tool called pg_validatebackup can validate a backup against the manifest. If no checksums are present, it can still check that the right files exist and that they have the expected sizes. If checksums are present, it can also verify that each file has the expected checksum. Additionally, it calls pg_waldump to verify that the expected WAL files are present and parseable. Only plain format backups can be validated directly, but tar format backups can be validated after extracting them. Robert Haas, with help, ideas, review, and testing from David Steele, Stephen Frost, Andrew Dunstan, Rushabh Lathia, Suraj Kharage, Tushar Ahuja, Rajkumar Raghuwanshi, Mark Dilger, Davinder Singh, Jeevan Chalke, Amit Kapila, Andres Freund, and Noah Misch. Discussion: http://postgr.es/m/CA+TgmoZV8dw1H2bzZ9xkKwdrk8+XYa+DC9H=F7heO2zna5T6qg@mail.gmail.com
2020-04-02Collect statistics about SLRU cachesTomas Vondra
There's a number of SLRU caches used to access important data like clog, commit timestamps, multixact, asynchronous notifications, etc. Until now we had no easy way to monitor these shared caches, compute hit ratios, number of reads/writes etc. This commit extends the statistics collector to track this information for a predefined list of SLRUs, and also introduces a new system view pg_stat_slru displaying the data. The list of built-in SLRUs is fixed, but additional SLRUs may be defined in extensions. Unfortunately, there's no suitable registry of SLRUs, so this patch simply defines a fixed list of SLRUs with entries for the built-in ones and one entry for all additional SLRUs. Extensions adding their own SLRU are fairly rare, so this seems acceptable. This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes are still fixed (hard-coded in the code) and it's not entirely clear which of the SLRUs might need a GUC to tune size. In a way, allowing us to determine that is one of the goals of this patch. Bump catversion as the patch introduces new functions and system view. Author: Tomas Vondra Reviewed-by: Alvaro Herrera Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
2020-04-01Improve the message logged when recovery is paused.Fujii Masao
When recovery target is reached and recovery is paused because of recovery_target_action=pause, executing pg_wal_replay_resume() causes the standby to promote, i.e., the recovery to end. So, in this case, the previous message "Execute pg_wal_replay_resume() to continue" logged was confusing because pg_wal_replay_resume() doesn't cause the recovery to continue. This commit improves the message logged when recovery is paused, and the proper message is output based on what (pg_wal_replay_pause or recovery_target_action) causes recovery to be paused. Author: Sergei Kornilov, revised by Fujii Masao Reviewed-by: Robert Haas Discussion: https://postgr.es/m/19168211580382043@myt5-b646bde4b8f3.qloud-c.yandex.net
2020-03-31Move routine definitions of xlogarchive.c to a new header fileMichael Paquier
The definitions of the routines defined in xlogarchive.c have been part of xlog_internal.h which is included by several frontend tools, but all those routines are only called by the backend. More cleanup could be done within xlog_internal.h, but that's already a nice cut. This will help a follow-up patch for pg_rewind where handling of restore_command is added for frontends. Author: Alexey Kondratov, Michael Paquier Reviewed-by: Álvaro Herrera, Alexander Korotkov Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
2020-03-30Improve handling of parameter differences in physical replicationPeter Eisentraut
When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. The correspondence of settings between primary and standby is required because those settings influence certain shared memory sizings that are required for processing WAL records that the primary might send. For example, if the primary sends a prepared transaction, the standby must have had max_prepared_transaction set appropriately or it won't be able to process those WAL records. However, fatally shutting down the standby immediately upon receipt of the parameter change record might be a bit of an overreaction. The resources related to those settings are not required immediately at that point, and might never be required if the activity on the primary does not exhaust all those resources. If we just let the standby roll on with recovery, it will eventually produce an appropriate error when those resources are used. So this patch relaxes this a bit. Upon receipt of XLOG_PARAMETER_CHANGE, we still check the settings but only issue a warning and set a global flag if there is a problem. Then when we actually hit the resource issue and the flag was set, we issue another warning message with relevant information. At that point we pause recovery, so a hot standby remains usable. We also repeat the last warning message once a minute so it is harder to miss or ignore. Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
2020-03-27Allow walreceiver configuration to change on reloadAlvaro Herrera
The parameters primary_conninfo, primary_slot_name and wal_receiver_create_temp_slot can now be changed with a simple "reload" signal, no longer requiring a server restart. This is achieved by signalling the walreceiver process to terminate and having it start again with the new values. Thanks to Andres Freund, Kyotaro Horiguchi, Fujii Masao for discussion. Author: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/19513901543181143@sas1-19a94364928d.qloud-c.yandex.net
2020-03-27Set wal_receiver_create_temp_slot PGC_POSTMASTERAlvaro Herrera
Commit 329730827848 gave walreceiver the ability to create and use a temporary replication slot, and made it controllable by a GUC (enabled by default) that can be changed with SIGHUP. That's useful but has two problems: one, it's possible to cause the origin server to fill its disk if the slot doesn't advance in time; and also there's a disconnect between state passed down via the startup process and GUCs that walreceiver reads directly. We handle the first problem by setting the option to disabled by default. If the user enables it, its on their head to make sure that disk doesn't fill up. We handle the second problem by passing the flag via startup rather than having walreceiver acquire it directly, and making it PGC_POSTMASTER (which ensures a walreceiver always has the fresh value). A future commit can relax this (to PGC_SIGHUP again) by having the startup process signal walreceiver to shutdown whenever the value changes. Author: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200122055510.GH174860@paquier.xyz
2020-03-24Prefer standby promotion over recovery pause.Fujii Masao
Previously if a promotion was triggered while recovery was paused, the paused state continued. Also recovery could be paused by executing pg_wal_replay_pause() even while a promotion was ongoing. That is, recovery pause had higher priority over a standby promotion. But this behavior was not desirable because most users basically wanted the recovery to complete as soon as possible and the server to become the master when they requested a promotion. This commit changes recovery so that it prefers a promotion over recovery pause. That is, if a promotion is triggered while recovery is paused, the paused state ends and a promotion continues. Also this commit makes recovery pause functions like pg_wal_replay_pause() throw an error if they are executed while a promotion is ongoing. Internally, this commit adds new internal function PromoteIsTriggered() that returns true if a promotion is triggered. Since the name of this function and the existing function IsPromoteTriggered() are confusingly similar, the commit changes the name of IsPromoteTriggered() to IsPromoteSignaled, as more appropriate name. Author: Fujii Masao Reviewed-by: Atsushi Torikoshi, Sergei Kornilov Discussion: https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com