summaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlogutils.c
AgeCommit message (Collapse)Author
2022-03-29Revert "Fix replay of create database records on standby"Alvaro Herrera
This reverts commit 49d9cfc68bf4. The approach taken by this patch has problems, so we'll come up with a radically different fix. Discussion: https://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
2022-03-25Fix replay of create database records on standbyAlvaro Herrera
Crash recovery on standby may encounter missing directories when replaying create database WAL records. Prior to this patch, the standby would fail to recover in such a case. However, the directories could be legitimately missing. Consider a sequence of WAL records as follows: CREATE DATABASE DROP DATABASE DROP TABLESPACE If, after replaying the last WAL record and removing the tablespace directory, the standby crashes and has to replay the create database record again, the crash recovery must be able to move on. This patch adds a mechanism similar to invalid-page tracking, to keep a tally of missing directories during crash recovery. If all the missing directory references are matched with corresponding drop records at the end of crash recovery, the standby can safely continue following the primary. Backpatch to 13, at least for now. The bug is older, but fixing it in older branches requires more careful study of the interactions with commit e6d8069522c8, which appeared in 13. A new TAP test file is added to verify the condition. However, because it depends on commit d6d317dbf615, it can only be added to branch master. I (Álvaro) manually verified that the code behaves as expected in branch 14. It's a bit nervous-making to leave the code uncovered by tests in older branches, but leaving the bug unfixed is even worse. Also, the main reason this fix took so long is precisely that we couldn't agree on a good strategy to approach testing for the bug, so perhaps this is the best we can do. Diagnosed-by: Paul Guo <paulguo@gmail.com> Author: Paul Guo <paulguo@gmail.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Asim R Praveen <apraveen@pivotal.io> Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
2020-05-13Adjust walsender usage of xlogreader, simplify APIsAlvaro Herrera
* Have both physical and logical walsender share a 'xlogreader' state struct for tracking state. This replaces the existing globals sendSeg and sendCxt. * Change WALRead not to receive XLogReaderState->seg and ->segcxt as separate arguments anymore; just use the ones from 'state'. This is made possible by the above change. * have the XLogReader segment_open contract require the callbacks to install the file descriptor in the state struct themselves instead of returning it. xlogreader was already ignoring any possible failed return from the callbacks, relying solely on them never returning. (This point is not altogether excellent, as it means the callbacks have to know more of XLogReaderState; but to really improve on that we would have to pass back error info from the callbacks to xlogreader. And the complexity would not be saved but instead just transferred to the callers of WALRead, which would have to learn how to throw errors from the open_segment callback in addition of, as currently, from pg_pread.) * segment_open no longer receives the 'segcxt' as a separate argument, since it's part of the XLogReaderState argument. Per comments from Kyotaro Horiguchi. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql
2020-05-12Fix comment in xlogutils.cMichael Paquier
The existing callers of XLogReadDetermineTimeline() performing recovery need to check a replay LSN position when determining on which timeline to read a WAL page. A portion of the comment describing this function said exactly that, while referring to a routine for fetching a write LSN, something not available in recovery. Author: Kyotaro Horiguchi Discussion: https://postgr.es/m/20200511.101619.2043820539323292957.horikyota.ntt@gmail.com
2020-05-08Rework XLogReader callback systemAlvaro Herrera
Code review for 0dc8ead46363, prompted by a bug closed by 91c40548d5f7. XLogReader's system for opening and closing segments had gotten too complicated, with callbacks being passed at both the XLogReaderAllocate level (read_page) as well as at the WALRead level (segment_open). This was confusing and hard to follow, so restructure things so that these callbacks are passed together at XLogReaderAllocate time, and add another callback to the set (segment_close) to make it a coherent whole. Also, ensure XLogReaderState is an argument to all the callbacks, so that they can grab at the ->private data if necessary. Document the whole arrangement more clearly. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
2020-04-04Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch
Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN. Future servers accept older WAL, so this bump is discretionary. Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-22Revert "Skip WAL for new relfilenodes, under wal_level=minimal."Noah Misch
This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-21Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch
Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Back-patch to 9.5 (all supported versions). This introduces a new WAL record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As always, update standby systems before master systems. This changes sizeof(RelationData) and sizeof(IndexStmt), breaking binary compatibility for affected extensions. (The most recent commit to affect the same class of extensions was 089e4d405d0f3b94c74a2c6a54357a84a681754b.) Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-01-22Add GUC ignore_invalid_pages.Fujii Masao
Detection of WAL records having references to invalid pages during recovery causes PostgreSQL to raise a PANIC-level error, aborting the recovery. Setting ignore_invalid_pages to on causes the system to ignore those WAL records (but still report a warning), and continue recovery. This behavior may cause crashes, data loss, propagate or hide corruption, or other serious problems. However, it may allow you to get past the PANIC-level error, to finish the recovery, and to cause the server to start up. Author: Fujii Masao Reviewed-by: Michael Paquier Discussion: https://www.postgresql.org/message-id/CAHGQGwHCK6f77yeZD4MHOnN+PaTf6XiJfEB+Ce7SksSHjeAWtg@mail.gmail.com
2020-01-01Update copyrights for 2020Bruce Momjian
Backpatch-through: update all files in master, backpatch legal files through 9.4
2019-12-03Remove XLogFileNameP() from the treeMichael Paquier
XLogFileNameP() is a wrapper routine able to build a palloc'd string for a WAL segment name, which is used for error string generation. There were several code paths where it gets called in a critical section, where memory allocation is not allowed. This results in triggering an assertion failure instead of generating the wanted error message. Another, more annoying, problem is that if the allocation to generate the WAL segment name fails on OOM, then the failure would be escalated to a PANIC. This removes the routine and all its callers are replaced with a logic using a fixed-size buffer. This way, all the existing mistakes are fixed and future ones are prevented. Author: Masahiko Sawada Reviewed-by: Michael Paquier, Álvaro Herrera Discussion: https://postgr.es/m/CA+fd4k5gC9H4uoWMLg9K_QfNrnkkdEw+-AFveob9YX7z8JnKTA@mail.gmail.com
2019-11-25Refactor WAL file-reading code into WALRead()Alvaro Herrera
XLogReader, walsender and pg_waldump all had their own routines to read data from WAL files to memory, with slightly different approaches according to the particular conditions of each environment. There's a lot of commonality, so we can refactor that into a single routine WALRead in XLogReader, and move the differences to a separate (simpler) callback that just opens the next WAL-segment. This results in a clearer (ahem) code flow. The error reporting needs are covered by filling in a new error-info struct, WALReadError, and it's the caller's responsibility to act on it. The backend has WALReadRaiseError() to do so. We no longer ever need to seek in this interface; switch to using pg_pread(). Author: Antonin Houska, with contributions from Álvaro Herrera Reviewed-by: Michaël Paquier, Kyotaro Horiguchi Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-09-24Rework WAL-reading supporting structsAlvaro Herrera
The state-tracking of WAL reading in various places was pretty messy, mostly because the ancient physical-replication WAL reading code wasn't using the XLogReader abstraction. This led to some untidy code. Make it prettier by creating two additional supporting structs, WALSegmentContext and WALOpenSegment which keep track of WAL-reading state. This makes code cleaner, as well as supports more future cleanup. Author: Antonin Houska Reviewed-by: Álvaro Herrera and (older versions) Robert Haas Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-08-19Fix inconsistencies and typos in the tree, take 11Michael Paquier
This fixes various typos in docs and comments, and removes some orphaned definitions. Author: Alexander Lakhin Discussion: https://postgr.es/m/5da8e325-c665-da95-21e0-c8a99ea61fbf@gmail.com
2019-01-02Update copyright for 2019Bruce Momjian
Backpatch-through: certain files through 9.4
2018-06-25Address set of issues with errno handlingMichael Paquier
System calls mixed up in error code paths are causing two issues which several code paths have not correctly handled: 1) For write() calls, sometimes the system may return less bytes than what has been written without errno being set. Some paths were careful enough to consider that case, and assumed that errno should be set to ENOSPC, other calls missed that. 2) errno generated by a system call is overwritten by other system calls which may succeed once an error code path is taken, causing what is reported to the user to be incorrect. This patch uses the brute-force approach of correcting all those code paths. Some refactoring could happen in the future, but this is let as future work, which is not targeted for back-branches anyway. Author: Michael Paquier Reviewed-by: Ashutosh Sharma Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
2018-04-09Further cleanup of client dependencies on src/include/catalog headers.Tom Lane
In commit 9c0a0de4c, I'd failed to notice that catalog/catalog.h should also be considered a frontend-unsafe header, because it includes (and needs) the full form of pg_class.h, not to mention relcache.h. However, various frontend code was depending on it to get TABLESPACE_VERSION_DIRECTORY, so refactoring of some sort is called for. The cleanest answer seems to be to move TABLESPACE_VERSION_DIRECTORY, as well as the OIDCHARS symbol, to common/relpath.h. Do that, and mop up inclusions as necessary. (I found that quite a few current users of catalog/catalog.h don't seem to need it at all anymore, apparently as a result of the refactorings that created common/relpath.[hc]. And initdb.c needed it only as a route to pg_class_d.h.) Discussion: https://postgr.es/m/6629.1523294509@sss.pgh.pa.us
2018-01-02Update copyright for 2018Bruce Momjian
Backpatch-through: certain files through 9.3
2017-11-29Update typedefs.list and re-run pgindentRobert Haas
Discussion: http://postgr.es/m/CA+TgmoaA9=1RWKtBWpDaj+sF3Stgc8sHgf5z=KGtbjwPLQVDMA@mail.gmail.com
2017-09-23Refactor new file permission handlingPeter Eisentraut
The file handling functions from fd.c were called with a diverse mix of notations for the file permissions when they were opening new files. Almost all files created by the server should have the same permissions set. So change the API so that e.g. OpenTransientFile() automatically uses the standard permissions set, and OpenTransientFilePerm() is a new function that takes an explicit permissions set for the few cases where it is needed. This also saves an unnecessary argument for call sites that are just opening an existing file. While we're reviewing these APIs, get rid of the FileName typedef and use the standard const char * for the file name and mode_t for the file mode. This makes these functions match other file handling functions and removes an unnecessary layer of mysteriousness. We can also get rid of a few casts that way. Author: David Steele <david@pgmasters.net>
2017-09-19Make WAL segment size configurable at initdb time.Andres Freund
For performance reasons a larger segment size than the default 16MB can be useful. A larger segment size has two main benefits: Firstly, in setups using archiving, it makes it easier to write scripts that can keep up with higher amounts of WAL, secondly, the WAL has to be written and synced to disk less frequently. But at the same time large segment size are disadvantageous for smaller databases. So far the segment size had to be configured at compile time, often making it unrealistic to choose one fitting to a particularly load. Therefore change it to a initdb time setting. This includes a breaking changes to the xlogreader.h API, which now requires the current segment size to be configured. For that and similar reasons a number of binaries had to be taught how to recognize the current segment size. Author: Beena Emerson, editorialized by Andres Freund Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-06-21Phase 3 of pgindent updates.Tom Lane
Don't move parenthesized lines to the left, even if that means they flow past the right margin. By default, BSD indent lines up statement continuation lines that are within parentheses so that they start just to the right of the preceding left parenthesis. However, traditionally, if that resulted in the continuation line extending to the right of the desired right margin, then indent would push it left just far enough to not overrun the margin, if it could do so without making the continuation line start to the left of the current statement indent. That makes for a weird mix of indentations unless one has been completely rigid about never violating the 80-column limit. This behavior has been pretty universally panned by Postgres developers. Hence, disable it with indent's new -lpl switch, so that parenthesized lines are always lined up with the preceding left paren. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-05-17Post-PG 10 beta1 pgindent runBruce Momjian
perltidy run not included.
2017-05-12Replace "transaction log" with "write-ahead log"Peter Eisentraut
This makes documentation and error messages match the renaming of "xlog" to "wal" in APIs and file naming.
2017-04-16Fix typo in commentPeter Eisentraut
Author: Masahiko Sawada <sawada.mshk@gmail.com>
2017-03-22Teach xlogreader to follow timeline switchesSimon Riggs
Uses page-based mechanism to ensure we’re using the correct timeline. Tests are included to exercise the functionality using a cold disk-level copy of the master that's started up as a replica with slots intact, but the intended use of the functionality is with later features. Craig Ringer, reviewed by Simon Riggs and Andres Freund
2017-03-18Create and use wait events for read, write, and fsync operations.Robert Haas
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and 6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from pg_stat_activity when a backend was stuck waiting for another backend, but it's also fairly common for a backend to be stuck waiting for an I/O. Add wait events for those operations, too. Rushabh Lathia, with further hacking by me. Reviewed and tested by Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed. Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-02-09Rename user-facing tools with "xlog" in the name to say "wal".Robert Haas
This means pg_receivexlog because pg_receivewal, pg_resetxlog becomes pg_resetwal, and pg_xlogdump becomes pg_waldump.
2017-02-08Add WAL consistency checking facility.Robert Haas
When the new GUC wal_consistency_checking is set to a non-empty value, it triggers recording of additional full-page images, which are compared on the standby against the results of applying the WAL record (without regard to those full-page images). Allowable differences such as hints are masked out, and the resulting pages are compared; any difference results in a FATAL error on the standby. Kuntal Ghosh, based on earlier patches by Michael Paquier and Heikki Linnakangas. Extensively reviewed and revised by Michael Paquier and by me, with additional reviews and comments from Amit Kapila, Álvaro Herrera, Simon Riggs, and Peter Eisentraut.
2017-01-03Update copyright via script for 2017Bruce Momjian
2016-05-12Fix bogus commentsAlvaro Herrera
Some comments mentioned XLogReplayBuffer, but there's no such function: that was an interim name for a function that got renamed to XLogReadBufferForRedo, before commit 2c03216d831160 was pushed.
2016-05-04Revert timeline following in replication slotsAlvaro Herrera
This reverts commits f07d18b6e94d, 82c83b337202, 3a3b309041b0, and 24c5f1a103ce. This feature has shown enough immaturity that it was deemed better to rip it out before rushing some more fixes at the last minute. There are discussions on larger changes in this area for the next release.
2016-04-20Revert no-op changes to BufferGetPage()Kevin Grittner
The reverted changes were intended to force a choice of whether any newly-added BufferGetPage() calls needed to be accompanied by a test of the snapshot age, to support the "snapshot too old" feature. Such an accompanying test is needed in about 7% of the cases, where the page is being used as part of a scan rather than positioning for other purposes (such as DML or vacuuming). The additional effort required for back-patching, and the doubt whether the intended benefit would really be there, have indicated it is best just to rely on developers to do the right thing based on comments and existing usage, as we do with many other conventions. This change should have little or no effect on generated executable code. Motivated by the back-patching pain of Tom Lane and Robert Haas
2016-04-08Modify BufferGetPage() to prepare for "snapshot too old" featureKevin Grittner
This patch is a no-op patch which is intended to reduce the chances of failures of omission once the functional part of the "snapshot too old" patch goes in. It adds parameters for snapshot, relation, and an enum to specify whether the snapshot age check needs to be done for the page at this point. This initial patch passes NULL for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the third. The follow-on patch will change the places where the test needs to be made.
2016-03-30Enable logical slots to follow timeline switchesAlvaro Herrera
When decoding from a logical slot, it's necessary for xlog reading to be able to read xlog from historical (i.e. not current) timelines; otherwise, decoding fails after failover, because the archives are in the historical timeline. This is required to make "failover logical slots" possible; it currently has no other use, although theoretically it could be used by an extension that creates a slot on a standby and continues to replay from the slot when the standby is promoted. This commit includes a module in src/test/modules with functions to manipulate the slots (which is not otherwise possible in SQL code) in order to enable testing, and a new test in src/test/recovery to ensure that the behavior is as expected. Author: Craig Ringer Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30XLogReader general code cleanupAlvaro Herrera
Some minor tweaks and comment additions, for cleanliness sake and to avoid having the upcoming timeline-following patch be polluted with unrelated cleanup. Extracted from a larger patch by Craig Ringer, reviewed by Andres Freund, with some additions by myself.
2016-01-20Refactor to create generic WAL page read callbackSimon Riggs
Previously we didn’t have a generic WAL page read callback function, surprisingly. Logical decoding has logical_read_local_xlog_page(), which was actually generic, so move that to xlogfunc.c and rename to read_local_xlog_page(). Maintain logical_read_local_xlog_page() so existing callers still work. As requested by Michael Paquier, Alvaro Herrera and Andres Freund
2016-01-02Update copyright for 2016Bruce Momjian
Backpatch certain files through 9.1
2015-12-10Fix bug leading to restoring unlogged relations from empty files.Andres Freund
At the end of crash recovery, unlogged relations are reset to the empty state, using their init fork as the template. The init fork is copied to the main fork without going through shared buffers. Unfortunately WAL replay so far has not necessarily flushed writes from shared buffers to disk at that point. In normal crash recovery, and before the introduction of 'fast promotions' in fd4ced523 / 9.3, the END_OF_RECOVERY checkpoint flushes the buffers out in time. But with fast promotions that's not the case anymore. To fix, force WAL writes targeting the init fork to be flushed immediately (using the new FlushOneBuffer() function). In 9.5+ that flush can centrally be triggered from the code dealing with restoring full page writes (XLogReadBufferForRedoExtended), in earlier releases that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay function. Backpatch to 9.1, even if this currently is only known to trigger in 9.3+. Flushing earlier is more robust, and it is advantageous to keep the branches similar. Typical symptoms of this bug are errors like 'ERROR: index "..." contains unexpected zero page at block 0' shortly after promoting a node. Reported-By: Thom Brown Author: Andres Freund and Michael Paquier Discussion: 20150326175024.GJ451@alap3.anarazel.de Backpatch: 9.1-
2015-07-20Sanity-check that a page zeroed by redo routine is marked with WILL_INIT.Heikki Linnakangas
There was already a sanity-check in the other direction: if a page was marked with WILL_INIT, it had to be initialized by the redo routine. It's not strictly necessary for correctness that a page is marked with WILL_INIT if it's going to be initialized at redo, but it's a missed optimization if nothing else. Fix a few instances of this issue in SP-GiST, where a block in WAL record was not marked with WILL_INIT, but was in fact always initialized at redo. We were creating a full-page image of the page unnecessarily in those cases. Backpatch to 9.5, where the new WILL_INIT flag was added.
2015-01-06Update copyright for 2015Bruce Momjian
Backpatch certain files through 9.0
2014-12-18Improve hash_create's API for selecting simple-binary-key hash functions.Tom Lane
Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane
2014-11-20Revamp the WAL record format.Heikki Linnakangas
Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
2014-11-13Fix race condition between hot standby and restoring a full-page image.Heikki Linnakangas
There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.
2014-11-13Fix XLogReadBufferForRedoExtended to get cleanup lock when asked to do so.Heikki Linnakangas
2014-11-06Move the backup-block logic from XLogInsert to a new file, xloginsert.c.Heikki Linnakangas
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-09-02Refactor per-page logic common to all redo routines to a new function.Heikki Linnakangas
Every redo routine uses the same idiom to determine what to do to a page: check if there's a backup block for it, and if not read, the buffer if the block exists, and check its LSN. Refactor that into a common function, XLogReadBufferForRedo, making all the redo routines shorter and more readable. This has no user-visible effect, and makes no changes to the WAL format. Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
2014-04-30Rationalize common/relpath.[hc].Tom Lane
Commit a73018392636ce832b09b5c31f6ad1f18a4643ea created rather a mess by putting dependencies on backend-only include files into include/common. We really shouldn't do that. To clean it up: * Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in catalog/catalog.h. We won't consider this symbol part of the FE/BE API. * Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider relpath.h as the source of truth for fork numbers, since relpath.c was already partially serving that function, and anyway relfilenode.h was kind of a random place for that enum. * So, relfilenode.h now includes relpath.h rather than vice-versa. This direction of dependency is fine. (That allows most, but not quite all, of the existing explicit #includes of relpath.h to go away again.) * Push forkname_to_number from catalog.c to relpath.c, just to centralize fork number stuff a bit better. * Push GetDatabasePath from catalog.c to relpath.c; it was rather odd that the previous commit didn't keep this together with relpath(). * To avoid needing relfilenode.h in common/, redefine the underlying function (now called GetRelationPath) as taking separate OID arguments, and make the APIs using RelFileNode or RelFileNodeBackend into macro wrappers. (The macros have a potential multiple-eval risk, but none of the existing call sites have an issue with that; one of them had such a risk already anyway.) * Fix failure to follow the directions when "init" fork type was added; specifically, the errhint in forkname_to_number wasn't updated, and neither was the SGML documentation for pg_relation_size(). * Fix tablespace-path-too-long check in CreateTableSpace() to account for fork-name component of maximum-length pathnames. This requires putting FORKNAMECHARS into a header file, but it was rather useless (and actually unreferenced) where it was. The last couple of items are potentially back-patchable bug fixes, if anyone is sufficiently excited about them; but personally I'm not. Per a gripe from Christoph Berg about how include/common wasn't self-contained.
2014-03-07Fix dangling smgr_owner pointer when a fake relcache entry is freed.Heikki Linnakangas
A fake relcache entry can "own" a SmgrRelation object, like a regular relcache entry. But when it was free'd, the owner field in SmgrRelation was not cleared, so it was left pointing to free'd memory. Amazingly this apparently hasn't caused crashes in practice, or we would've heard about it earlier. Andres found this with Valgrind. Report and fix by Andres Freund, with minor modifications by me. Backpatch to all supported versions.
2014-02-12In XLogReadBufferExtended, don't assume P_NEW yields consecutive pages.Tom Lane
In a database that's not yet reached consistency, it's possible that some segments of a relation are not full-size but are not the last ones either. Because of the way smgrnblocks() works, asking for a new page with P_NEW will fill in the last not-full-size segment --- and if that makes it full size, the apparent EOF of the relation will increase by more than one page, so that the next P_NEW request will yield a page past the next consecutive one. This breaks the relation-extension logic in XLogReadBufferExtended, possibly allowing a page update to be applied to some page far past where it was intended to go. This appears to be the explanation for reports of table bloat on replication slaves compared to their masters, and probably explains some corrupted-slave reports as well. Fix the loop to check the page number it actually got, rather than merely Assert()'ing that dead reckoning got it to the desired place. AFAICT, there are no other places that make assumptions about exactly which page they'll get from P_NEW. Problem identified by Greg Stark, though this is not the same as his proposed patch. It's been like this for a long time, so back-patch to all supported branches.