summaryrefslogtreecommitdiff
path: root/src/backend/access
AgeCommit message (Collapse)Author
2020-12-23Fix portability issues with parsing of recovery_target_xidMichael Paquier
The parsing of this parameter has been using strtoul(), which is not portable across platforms. On most Unix platforms, unsigned long has a size of 64 bits, while on Windows it is 32 bits. It is common in recovery scenarios to rely on the output of txid_current() or even the newer pg_current_xact_id() to get a transaction ID for setting up recovery_target_xid. The value returned by those functions includes the epoch in the computed result, which would cause strtoul() to fail where unsigned long has a size of 32 bits once the epoch is incremented. WAL records and 2PC data include only information about 32-bit XIDs and it is not possible to have XIDs across more than one epoch, so discarding the high bits from the transaction ID set has no impact on recovery. On the contrary, the use of strtoul() prevents a consistent behavior across platforms depending on the size of unsigned long. This commit changes the parsing of recovery_target_xid to use pg_strtouint64() instead, available down to 9.6. There is one TAP test stressing recovery with recovery_target_xid, where a tweak based on pg_reset{xlog,wal} is added to bump the XID epoch so as this change gets tested, as per an idea from Alexander Lakhin. Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/16780-107fd0c0385b1035@postgresql.org Backpatch-through: 9.6
2020-11-11Remove duplicate code in brin_memtuple_initializeTomas Vondra
Commit 8bf74967dab moved some of the code from brin_new_memtuple to brin_memtuple_initialize, but this resulted in some of the code being duplicate. Fix by removing the duplicate lines and backpatch to 10. Author: Tomas Vondra Backpatch-through: 10 Discussion: https://postgr.es/m/5eb50c97-9a8e-b691-8c40-1b2a55611c4c%40enterprisedb.com
2020-11-10Fix and simplify some usages of TimestampDifference().Tom Lane
Introduce TimestampDifferenceMilliseconds() to simplify callers that would rather have the difference in milliseconds, instead of the select()-oriented seconds-and-microseconds format. This gets rid of at least one integer division per call, and it eliminates some apparently-easy-to-mess-up arithmetic. Two of these call sites were in fact wrong: * pg_prewarm's autoprewarm_main() forgot to multiply the seconds by 1000, thus ending up with a delay 1000X shorter than intended. That doesn't quite make it a busy-wait, but close. * postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute microseconds not milliseconds, thus ending up with a delay 1000X longer than intended. Somebody along the way had noticed this problem but misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather than fixing the units. This was relatively harmless in context, because we don't care that much about exactly how long this delay is; still, it's wrong. There are a few more callers of TimestampDifference() that don't have a direct need for seconds-and-microseconds, but can't use TimestampDifferenceMilliseconds() either because they do need microsecond precision or because they might possibly deal with intervals long enough to overflow 32-bit milliseconds. It might be worth inventing another API to improve that, but that seems outside the scope of this patch; so those callers are untouched here. Given the fact that we are fixing some bugs, and the likelihood that future patches might want to back-patch code that uses this new API, back-patch to all supported branches. Alexey Kondratov and Tom Lane Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
2020-11-09In security-restricted operations, block enqueue of at-commit user code.Noah Misch
Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred triggers within index expressions and materialized view queries. An attacker having permission to create non-temp objects in at least one schema could execute arbitrary SQL functions under the identity of the bootstrap superuser. One can work around the vulnerability by disabling autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW. (Don't restore from pg_dump, since it runs some of those commands.) Plain VACUUM (without FULL) is safe, and all commands are fine when a trusted user owns the target object. Performance may degrade quickly under this workaround, however. Back-patch to 9.5 (all supported versions). Reviewed by Robert Haas. Reported by Etienne Stalmans. Security: CVE-2020-25695
2020-11-07Properly detoast data in brin_form_tupleTomas Vondra
brin_form_tuple failed to consider the values may be toasted, inserting the toast pointer into the index. This may easily result in index corruption, as the toast data may be deleted and cleaned up by vacuum. The cleanup however does not care about indexes, leaving invalid toast pointers behind, which triggers errors like this: ERROR: missing chunk number 0 for toast value 16433 in pg_toast_16426 A less severe consequence are inconsistent failures due to the index row being too large, depending on whether brin_form_tuple operated on plain or toasted version of the row. For example CREATE TABLE t (val TEXT); INSERT INTO t VALUES ('... long value ...') CREATE INDEX idx ON t USING brin (val); would likely succeed, as the row would likely include toast pointer. Switching the order of INSERT and CREATE INDEX would likely fail: ERROR: index row size 8712 exceeds maximum 8152 for index "idx" because this happens before the row values are toasted. The bug exists since PostgreSQL 9.5 where BRIN indexes were introduced. So backpatch all the way back. Author: Tomas Vondra Reviewed-by: Alvaro Herrera Backpatch-through: 9.5 Discussion: https://postgr.es/m/20201001184133.oq5uq75sb45pu3aw@development Discussion: https://postgr.es/m/20201104010544.zexj52mlldagzowv%40development
2020-09-24Fix missing fsync of SLRU directories.Thomas Munro
Harmonize behavior by moving reponsibility for fsyncing directories down into slru.c. In 10 and later, only the multixact directories were missed (see commit 1b02be21), and in older branches all SLRUs were missed. Back-patch to all supported releases. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com
2020-09-17Update parallel BTree scan state when the scan keys can't be satisfied.Amit Kapila
For parallel btree scan to work for array of scan keys, it should reach BTPARALLEL_DONE state once for every distinct combination of array keys. This is required to ensure that the parallel workers don't try to seize blocks at the same time for different scan keys. We missed to update this state when we discovered that the scan keys can't be satisfied. Author: James Hunter Reviewed-by: Amit Kapila Tested-by: Justin Pryzby Backpatch-through: 10, where it was introduced Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com
2020-08-27Fix code for re-finding scan position in a multicolumn GIN index.Tom Lane
collectMatchBitmap() needs to re-find the index tuple it was previously looking at, after transiently dropping lock on the index page it's on. The tuple should still exist and be at its prior position or somewhere to the right of that, since ginvacuum never removes tuples but concurrent insertions could add one. However, there was a thinko in that logic, to the effect of expecting any inserted tuples to have the same index "attnum" as what we'd been scanning. Since there's no physical separation of tuples with different attnums, it's not terribly hard to devise scenarios where this fails, leading to transient "lost saved point in index" errors. (While I've duplicated this with manual testing, it seems impossible to make a reproducible test case with our available testing technology.) Fix by just continuing the scan when the attnum doesn't match. While here, improve the error message used if we do fail, so that it matches the wording used in btree for a similar case. collectMatchBitmap()'s posting-tree code path was previously not exercised at all by our regression tests. While I can't make a regression test that exhibits the bug, I can at least improve the code coverage here, so do that. The test case I made for this is an extension of one added by 4b754d6c1, so it only works in HEAD and v13; didn't seem worth trying hard to back-patch it. Per bug #16595 from Jesse Kinkead. This has been broken since multicolumn capability was added to GIN (commit 27cb66fdf), so back-patch to all supported branches. Discussion: https://postgr.es/m/16595-633118be8eef9ce2@postgresql.org
2020-08-15Prevent concurrent SimpleLruTruncate() for any given SLRU.Noah Misch
The SimpleLruTruncate() header comment states the new coding rule. To achieve this, add locktype "frozenid" and two LWLocks. This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. Data loss is more likely in pg_multixact, due to released branches' thin margin between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild standbys of that primary regardless of symptoms. At less risk is a cluster having emitted "not accepting commands" errors or "must be vacuumed" warnings at some point. One can test a cluster for this data loss by running VACUUM FREEZE in every database. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
2020-08-13Handle new HOT chains in index-build table scansAlvaro Herrera
When a table is scanned by heapam_index_build_range_scan (née IndexBuildHeapScan) and the table lock being held allows concurrent data changes, it is possible for new HOT chains to sprout in a page that were unknown when the scan of a page happened. This leads to an error such as ERROR: failed to find parent tuple for heap-only tuple at (X,Y) in table "tbl" because the root tuple was not present when we first obtained the list of the page's root tuples. This can be fixed by re-obtaining the list of root tuples, if we see that a heap-only tuple appears to point to a non-existing root. This was reported by Anastasia as occurring for BRIN summarization (which exists since 9.5), but I think it could theoretically also happen with CREATE INDEX CONCURRENTLY (much older) or REINDEX CONCURRENTLY (very recent). It seems a happy coincidence that BRIN forces us to backpatch this all the way to 9.5. Reported-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Diagnosed-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Co-authored-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/602d8487-f0b2-5486-0088-0f372b2549fa@postgrespro.ru Backpatch: 9.5 - master
2020-08-12BRIN: Handle concurrent desummarization properlyAlvaro Herrera
If a page range is desummarized at just the right time concurrently with an index walk, BRIN would raise an error indicating index corruption. This is scary and unhelpful; silently returning that the page range is not summarized is sufficient reaction. This bug was introduced by commit 975ad4e602ff as additional protection against a bug whose actual fix was elsewhere. Backpatch equally. Reported-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Diagnosed-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/2588667e-d07d-7e10-74e2-7e1e46194491@postgrespro.ru Backpatch: 9.5 - master
2020-07-09Remove WARNING message from brin_desummarize_rangeAlvaro Herrera
This message was being emitted on the grounds that only crashed summarization could cause it, but in reality even an aborted vacuum could do it ... which makes it way too noisy, particularly since it shows up in regression tests and makes them die. Reported by Tom Lane. Discussion: https://postgr.es/m/489091.1593534251@sss.pgh.pa.us
2020-06-16Fix buffile.c error handling.Thomas Munro
Convert buffile.c error handling to use ereport. This fixes cases where I/O errors were indistinguishable from EOF or not reported. Also remove "%m" from error messages where errno would be bogus. While we're modifying those strings, add block numbers and short read byte counts where appropriate. Back-patch to all supported releases. Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
2020-06-08Fix locking bugs that could corrupt pg_control.Thomas Munro
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire ControlFileLock before modifying ControlFile->checkPointCopy, or the checkpointer could write out a control file with a bad checksum. Likewise, XLogReportParameters() must acquire ControlFileLock before modifying ControlFile and calling UpdateControlFile(). Back-patch to all supported releases. Author: Nathan Bossart <bossartn@amazon.com> Author: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
2020-05-15Avoid killing btree items that are already deadAlvaro Herrera
_bt_killitems marks btree items dead when a scan leaves the page where they live, but it does so with only share lock (to improve concurrency). This was historicall okay, since killing a dead item has no consequences. However, with the advent of data checksums and wal_log_hints, this action incurs a WAL full-page-image record of the page. Multiple concurrent processes would write the same page several times, leading to WAL bloat. The probability of this happening can be reduced by only killing items if they're not already dead, so change the code to do that. The problem could eliminated completely by having _bt_killitems upgrade to exclusive lock upon seeing a killable item, but that would reduce concurrency so it's considered a cure worse than the disease. Backpatch all the way back to 9.5, since wal_log_hints was introduced in 9.4. Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
2020-05-09Prevent archive recovery from scanning non-existent WAL files.Fujii Masao
Previously when there were multiple timelines listed in the history file of the recovery target timeline, archive recovery searched all of them, starting from the newest timeline to the oldest one, to find the segment to read. That is, archive recovery had to continuously fail scanning the segment until it reached the timeline that the segment belonged to. These scans for non-existent segment could be harmful on the recovery performance especially when archival area was located on the remote storage and each scan could take a long time. To address the issue, this commit changes archive recovery so that it skips scanning the timeline that the segment to read doesn't belong to. Per discussion, back-patch to all supported versions. Author: Kyotaro Horiguchi, tweaked a bit by Fujii Masao Reviewed-by: David Steele, Pavel Suderevsky, Grigory Smolkin Discussion: https://postgr.es/m/16159-f5a34a3a04dc67e0@postgresql.org Discussion: https://postgr.es/m/20200129.120222.1476610231001551715.horikyota.ntt@gmail.com
2020-05-08Report missing wait event for timeline history file.Fujii Masao
TimelineHistoryRead and TimelineHistoryWrite wait events are reported during waiting for a read and write of a timeline history file, respectively. However, previously, TimelineHistoryRead wait event was not reported while readTimeLineHistory() was reading a timeline history file. Also TimelineHistoryWrite was not reported while writeTimeLineHistory() was writing one line with the details of the timeline split, at the end. This commit fixes these issues. Back-patch to v10 where wait events for a timeline history file was added. Author: Masahiro Ikeda Reviewed-by: Michael Paquier, Fujii Masao Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com
2020-04-24Fix handling of WAL segments ready to be archived during crash recoveryMichael Paquier
78ea8b5 has fixed an issue related to the recycling of WAL segments on standbys depending on archive_mode. However, it has introduced a regression with the handling of WAL segments ready to be archived during crash recovery, causing those files to be recycled without getting archived. This commit fixes the regression by tracking in shared memory if a live cluster is either in crash recovery or archive recovery as the handling of WAL segments ready to be archived is different in both cases (those WAL segments should not be removed during crash recovery), and by using this new shared memory state to decide if a segment can be recycled or not. Previously, it was not possible to know if a cluster was in crash recovery or archive recovery as the shared state was able to track only if recovery was happening or not, leading to the problem. A set of TAP tests is added to close the gap here, making sure that WAL segments ready to be archived are correctly handled when a cluster is in archive or crash recovery with archive_mode set to "on" or "always", for both standby and primary. Reported-by: Benoît Lobréau Author: Jehan-Guillaume de Rorthais Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost Backpatch-through: 9.5
2020-04-21Fix possible crash during FATAL exit from reindexing.Tom Lane
index.c supposed that it could just use a PG_TRY block to clean up the state associated with an active REINDEX operation. However, that code doesn't run if we do a FATAL exit --- for example, due to a SIGTERM shutdown signal --- while the REINDEX is happening. And that state does get consulted during catalog accesses, which makes it problematic if we do any catalog accesses during shutdown --- for example, to clean up any temp tables created in the session. If this combination of circumstances occurred, we could find ourselves trying to access already-freed memory. In debug builds that'd fairly reliably cause an assertion failure. In production we might often get away with it, but with some bad luck it could cause a core dump. Another possible bad outcome is an erroneous conclusion that an index-to-be-accessed is being reindexed; but it looks like that would be unlikely to have any consequences worse than failing to drop temp tables right away. (They'd still get dropped by the next session that uses that temp schema.) To fix, get rid of the use of PG_TRY here, and instead hook into the transaction abort mechanisms to clean up reindex state. Per bug #16378 from Alexander Lakhin. This has been wrong for a very long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/16378-7a70ca41b3ec2009@postgresql.org
2020-04-05Use TransactionXmin instead of RecentGlobalXmin in heap_abort_speculative().Andres Freund
There's a very low risk that RecentGlobalXmin could be far enough in the past to be older than relfrozenxid, or even wrapped around. Luckily the consequences of that having happened wouldn't be too bad - the page wouldn't be pruned for a while. Avoid that risk by using TransactionXmin instead. As that's announced via MyPgXact->xmin, it is protected against wrapping around (see code comments for details around relfrozenxid). Author: Andres Freund Discussion: https://postgr.es/m/20200328213023.s4eyijhdosuc4vcj@alap3.anarazel.de Backpatch: 9.5-
2020-04-03Fix bugs in gin_fuzzy_search_limit processing.Tom Lane
entryGetItem()'s three code paths each contained bugs associated with filtering the entries for gin_fuzzy_search_limit. The posting-tree path failed to advance "advancePast" after having decided to filter an item. If we ran out of items on the current page and needed to advance to the next, what would actually happen is that entryLoadMoreItems() would re-load the same page. Eventually, the random dropItem() test would accept one of the same items it'd previously rejected, and we'd move on --- but it could take awhile with small gin_fuzzy_search_limit. To add insult to injury, this case would inevitably cause entryLoadMoreItems() to decide it needed to re-descend from the root, making things even slower. The posting-list path failed to implement gin_fuzzy_search_limit filtering at all, so that all entries in the posting list would be returned. The bitmap-result path used a "gotitem" variable that it failed to update in the one place where it'd actually make a difference, ie at the one "continue" statement. I think this was unreachable in practice, because if we'd looped around then it shouldn't be the case that the entries on the new page are before advancePast. Still, the "gotitem" variable was contributing nothing to either clarity or correctness, so get rid of it. Refactor all three loops so that the termination conditions are more alike and less unreadable. The code coverage report showed that we had no coverage at all for the re-descend-from-root code path in entryLoadMoreItems(), which seems like a very bad thing, so add a test case that exercises it. We also had exactly no coverage for gin_fuzzy_search_limit, so add a simplistic test case that at least hits those code paths a little bit. Back-patch to all supported branches. Adé Heyward and Tom Lane Discussion: https://postgr.es/m/CAEknJCdS-dE1Heddptm7ay2xTbSeADbkaQ8bU2AXRCVC2LdtKQ@mail.gmail.com
2020-03-22Revert "Skip WAL for new relfilenodes, under wal_level=minimal."Noah Misch
This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-21Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch
Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Back-patch to 9.5 (all supported versions). This introduces a new WAL record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As always, update standby systems before master systems. This changes sizeof(RelationData) and sizeof(IndexStmt), breaking binary compatibility for affected extensions. (The most recent commit to affect the same class of extensions was 089e4d405d0f3b94c74a2c6a54357a84a681754b.) Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-21Back-patch log_newpage_range().Noah Misch
Back-patch a subset of commit 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b to v11, v10, 9.6, and 9.5. Include the latest repairs to this function. Use a new XLOG_FPI_MULTI value instead of reusing XLOG_FPI. That way, if an older server reads WAL from this function, that server will PANIC instead of applying just one page of the record. The next commit adds a call to this function. Discussion: https://postgr.es/m/20200304.162919.898938381201316571.horikyota.ntt@gmail.com
2020-03-21During heap rebuild, lock any TOAST index until end of transaction.Noah Misch
swap_relation_files() calls toast_get_valid_index() to find and lock this index, just before swapping with the rebuilt TOAST index. The latter function releases the lock before returning. Potential for mischief is low; a concurrent session can issue ALTER INDEX ... SET (fillfactor = ...), which is not alarming. Nonetheless, changing pg_class.relfilenode without a lock is unconventional. Back-patch to 9.5 (all supported versions), because another fix needs this. Discussion: https://postgr.es/m/20191226001521.GA1772687@rfd.leadboat.com
2020-03-09Avoid assertion failure with targeted recovery in standby mode.Fujii Masao
At the end of recovery, standby mode is turned off to re-fetch the last valid record from archive or pg_wal. Previously, if recovery target was reached and standby mode was turned off while the current WAL source was stream, recovery could try to retrieve WAL file containing the last valid record unexpectedly from stream even though not in standby mode. This caused an assertion failure. That is, the assertion test confirms that WAL file should not be retrieved from stream if standby mode is not true. This commit moves back the current WAL source to archive if it's stream even though not in standby mode, to avoid that assertion failure. This issue doesn't cause the server to crash when built with assertion disabled. In this case, the attempt to retrieve WAL file from stream not in standby mode just fails. And then recovery tries to retrieve WAL file from archive or pg_wal. Back-patch to all supported branches. Author: Kyotaro Horiguchi Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/20200227.124830.2197604521555566121.horikyota.ntt@gmail.com
2020-01-20Fix crash in BRIN inclusion op functions, due to missing datum copy.Heikki Linnakangas
The BRIN add_value() and union() functions need to make a longer-lived copy of the argument, if they want to store it in the BrinValues struct also passed as argument. The functions for the "inclusion operator classes" used with box, range and inet types didn't take into account that the union helper function might return its argument as is, without making a copy. Check for that case, and make a copy if necessary. That case arises at least with the range_union() function, when one of the arguments is an 'empty' range: CREATE TABLE brintest (n numrange); CREATE INDEX brinidx ON brintest USING brin (n); INSERT INTO brintest VALUES ('empty'); INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric)); INSERT INTO brintest VALUES ('(-1, 0)'); SELECT brin_desummarize_range('brinidx', 0); SELECT brin_summarize_range('brinidx', 0); Backpatch down to 9.5, where BRIN was introduced. Discussion: https://www.postgresql.org/message-id/e6e1d6eb-0a67-36aa-e779-bcca59167c14%40iki.fi Reviewed-by: Emre Hasegeli, Tom Lane, Alvaro Herrera
2020-01-08Reimplement nullification of walsender timestampAlvaro Herrera
Make the value null only at pg_stat_activity-output time, as suggested by Tom Lane, instead of messing with the internal state. This should appease buildfarm members with force_parallel_mode=regress, which are running parallel queries on logical replication walsenders. The fact that walsenders can run parallel queries should perhaps be studied more carefully, but for the moment let's get rid of the red blots in buildfarm. Backpatch to pg10, like the previous commit. Discussion: https://postgr.es/m/30804.1578438763@sss.pgh.pa.us
2020-01-07pg_stat_activity: show NULL stmt start time for walsendersAlvaro Herrera
Returning a non-NULL time is pointless, sinc a walsender is not a process that would be running normal transactions anyway, but the code was unintentionally exposing the process start time intermittently, which was not only bogus but it also confused monitoring systems looking for idle transactions. Fix by avoiding all updates in walsenders. Backpatch to pg10: previously I misidentified the branches that show auxiliary processes in pg_stat_activity. Reported-by: Tomas Vondra Discussion: https://postgr.es/m/20191209234409.exe7osmyalwkt5j4@development
2019-11-24Avoid assertion failure with LISTEN in a serializable transaction.Tom Lane
If LISTEN is the only action in a serializable-mode transaction, and the session was not previously listening, and the notify queue is not empty, predicate.c reported an assertion failure. That happened because we'd acquire the transaction's initial snapshot during PreCommit_Notify, which was called *after* predicate.c expects any such snapshot to have been established. To fix, just swap the order of the PreCommit_Notify and PreCommit_CheckForSerializationFailure calls during CommitTransaction. This will imply holding the notify-insertion lock slightly longer, but the difference could only be meaningful in serializable mode, which is an expensive option anyway. It appears that this is just an assertion failure, with no consequences in non-assert builds. A snapshot used only to scan the notify queue could not have been involved in any serialization conflicts, so there would be nothing for PreCommit_CheckForSerializationFailure to do except assign it a prepareSeqNo and set the SXACT_FLAG_PREPARED flag. And given no conflicts, neither of those omissions affect the behavior of ReleasePredicateLocks. This admittedly once-over-lightly analysis is backed up by the lack of field reports of trouble. Per report from Mark Dilger. The bug is old, so back-patch to all supported branches; but the new test case only goes back to 9.6, for lack of adequate isolationtester infrastructure before that. Discussion: https://postgr.es/m/3ac7f397-4d5f-be8e-f354-440020675694@gmail.com Discussion: https://postgr.es/m/13881.1574557302@sss.pgh.pa.us
2019-11-20Revise GIN READMEAlexander Korotkov
We find GIN concurrency bugs from time to time. One of the problems here is that concurrency of GIN isn't well-documented in README. So, it might be even hard to distinguish design bugs from implementation bugs. This commit revised concurrency section in GIN README providing more details. Some examples are illustrated in ASCII art. Also, this commit add the explanation of how is tuple layout in internal GIN B-tree page different in comparison with nbtree. Discussion: https://postgr.es/m/CAPpHfduXR_ywyaVN4%2BOYEGaw%3DcPLzWX6RxYLBncKw8de9vOkqw%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Peter Geoghegan Backpatch-through: 9.4
2019-11-20Fix traversing to the deleted GIN page via downlinkAlexander Korotkov
Current GIN code appears to don't handle traversing to the deleted page via downlink. This commit fixes that by stepping right from the delete page like we do in nbtree. This commit also fixes setting 'deleted' flag to the GIN pages. Now other page flags are not erased once page is deleted. That helps to keep our assertions true if we arrive deleted page via downlink. Discussion: https://postgr.es/m/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Peter Geoghegan Backpatch-through: 9.4
2019-11-20Fix deadlock between ginDeletePage() and ginStepRight()Alexander Korotkov
When ginDeletePage() is about to delete page it locks its left sibling to revise the rightlink. So, it locks pages in right to left manner. Int he same time ginStepRight() locks pages in left to right manner, and that could cause a deadlock. This commit makes ginScanToDelete() keep exclusive lock on left siblings of currently investigated path. That elimites need to relock left sibling in ginDeletePage(). Thus, deadlock with ginStepRight() can't happen anymore. Reported-by: Chen Huajun Discussion: https://postgr.es/m/5c332bd1.87b6.16d7c17aa98.Coremail.chjischj%40163.com Author: Alexander Korotkov Reviewed-by: Peter Geoghegan Backpatch-through: 10
2019-11-07Fix assertion failure when running pgbench -s.Fujii Masao
If there is the WAL page that the continuation WAL record just fits within (i.e., the continuation record ends just at the end of the page) and the LSN in such page is specified with -s option, previously pg_waldump caused an assertion failure. The cause of this assertion failure was that XLogFindNextRecord() that pg_waldump -s calls mistakenly handled such special WAL page. This commit changes XLogFindNextRecord() so that it can handle such WAL page correctly. Back-patch to all supported versions. Author: Andrey Lepikhov Reviewed-by: Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/99303554-5dd5-06e6-f943-b3005ccd6edd@postgrespro.ru
2019-10-18Fix failure of archive recovery with recovery_min_apply_delay enabled.Fujii Masao
recovery_min_apply_delay parameter is intended for use with streaming replication deployments. However, the document clearly explains that the parameter will be honored in all cases if it's specified. So it should take effect even if in archive recovery. But, previously, archive recovery with recovery_min_apply_delay enabled always failed, and caused assertion failure if --enable-caasert is enabled. The cause of this problem is that; the ownership of recoveryWakeupLatch that recovery_min_apply_delay uses was taken only when standby mode is requested. So unowned latch could be used in archive recovery, and which caused the failure. This commit changes recovery code so that the ownership of recoveryWakeupLatch is taken even in archive recovery. Which prevents archive recovery with recovery_min_apply_delay from failing. Back-patch to v9.4 where recovery_min_apply_delay was added. Author: Fujii Masao Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CAHGQGwEyD6HdZLfdWc+95g=VQFPR4zQL4n+yHxQgGEGjaSVheQ@mail.gmail.com
2019-10-09Flush logical mapping files with fd opened for read/write at checkpointMichael Paquier
The file descriptor was opened with read-only to fsync a regular file, which would cause EBADFD errors on some platforms. This is similar to the recent fix done by a586cc4b (which was broken by me with 82a5649), except that I noticed this issue while monitoring the backend code for similar mistakes. Backpatch to 9.4, as this has been introduced since logical decoding exists as of b89e151. Author: Michael Paquier Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20191006045548.GA14532@paquier.xyz Backpatch-through: 9.4
2019-10-02Remove temporary WAL and history files at the end of archive recoveryMichael Paquier
cbc55da has reworked the order of some actions at the end of archive recovery. Unfortunately this overlooked the fact that the startup process needs to remove RECOVERYXLOG (for temporary WAL segment newly recovered from archives) and RECOVERYHISTORY (for temporary history file) at this step, leaving the files around even after recovery ended. Backpatch to 9.5, like the previous commit. Author: Sawada Masahiko Reviewed-by: Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/CAD21AoBO_eDQub6zojFnWtnmutRBWvYf7=cW4Hsqj+U_R26w3Q@mail.gmail.com Backpatch-through: 9.5
2019-09-25Fix failure with lock mode used for custom relation optionsMichael Paquier
In-core relation options can use a custom lock mode since 47167b7, that has lowered the lock available for some autovacuum parameters. However it forgot to consider custom relation options. This causes failures with ALTER TABLE SET when changing a custom relation option, as its lock is not defined. The existing APIs to define a custom reloption does not allow to define a custom lock mode, so enforce its initialization to AccessExclusiveMode which should be safe enough in all cases. An upcoming patch will extend the existing APIs to allow a custom lock mode to be defined. The problem can be reproduced with bloom indexes, so add a test there. Reported-by: Nikolay Sharplov Analyzed-by: Thomas Munro, Michael Paquier Author: Michael Paquier Reviewed-by: Kuntal Ghosh Discussion: https://postgr.es/m/20190920013831.GD1844@paquier.xyz Backpatch-through: 9.6
2019-09-19Fix oversight in backpatch of 6cae9d2c10Alexander Korotkov
During backpatch of 6cae9d2c10 Float8GetDatum() was accidentally removed. This commit turns it back. Reported-by: Erik Rijkers Discussion: https://postgr.es/m/6d51305e1159241cabee132f7efc7eff%40xs4all.nl Author: Tom Lane Backpatch-through: from 11 to 9.5
2019-09-19Improve handling of NULLs in KNN-GiST and KNN-SP-GiSTAlexander Korotkov
This commit improves subject in two ways: * It removes ugliness of 02f90879e7, which stores distance values and null flags in two separate arrays after GISTSearchItem struct. Instead we pack both distance value and null flag in IndexOrderByDistance struct. Alignment overhead should be negligible, because we typically deal with at most few "col op const" expressions in ORDER BY clause. * It fixes handling of "col op NULL" expression in KNN-SP-GiST. Now, these expression are not passed to support functions, which can't deal with them. Instead, NULL result is implicitly assumed. It future we may decide to teach support functions to deal with NULL arguments, but current solution is bugfix suitable for backpatch. Reported-by: Nikita Glukhov Discussion: https://postgr.es/m/826f57ee-afc7-8977-c44c-6111d18b02ec%40postgrespro.ru Author: Nikita Glukhov Reviewed-by: Alexander Korotkov Backpatch-through: 9.4
2019-09-12Fix nbtree page split rmgr desc routine.Peter Geoghegan
Include newitemoff in rmgr desc output for nbtree page split records. In passing, correct an obsolete comment that claimed that newitemoff is only logged for _L variant nbtree page split WAL records. Both issues were oversights in commit 2c03216d831, which revamped the WAL format. Author: Peter Geoghegan Backpatch: 9.5-, where the WAL format was revamped.
2019-09-08Fix RelationIdGetRelation calls that weren't bothering with error checks.Tom Lane
Some of these are quite old, but that doesn't make them not bugs. We'd rather report a failure via elog than SIGSEGV. While at it, uniformly spell the error check as !RelationIsValid(rel) rather than a bare rel == NULL test. The machine code is the same but it seems better to be consistent. Coverity complained about this today, not sure why, because the mistake is in fact old.
2019-09-08Fix handling of NULL distances in KNN-GiSTAlexander Korotkov
In order to implement NULL LAST semantic GiST previously assumed distance to the NULL value to be Inf. However, our distance functions can return Inf and NaN for non-null values. In such cases, NULL LAST semantic appears to be broken. This commit fixes that by introducing separate array of null flags for distances. Backpatch to all supported versions. Discussion: https://postgr.es/m/CAPpHfdsNvNdA0DBS%2BwMpFrgwT6C3-q50sFVGLSiuWnV3FqOJuQ%40mail.gmail.com Author: Alexander Korotkov Backpatch-through: 9.4
2019-09-08Fix handling Inf and Nan values in GiST pairing heap comparatorAlexander Korotkov
Previously plain float comparison was used in GiST pairing heap. Such comparison doesn't provide proper ordering for value sets containing Inf and Nan values. This commit fixes that by usage of float8_cmp_internal(). Note, there is remaining problem with NULL distances, which are represented as Inf in pairing heap. It would be fixes in subsequent commit. Backpatch to all supported versions. Reported-by: Andrey Borodin Discussion: https://postgr.es/m/CAPpHfdsNvNdA0DBS%2BwMpFrgwT6C3-q50sFVGLSiuWnV3FqOJuQ%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Heikki Linnakangas Backpatch-through: 9.4
2019-08-28Fix overflow check and comment in GIN posting list encoding.Heikki Linnakangas
The comment did not match what the code actually did for integers with the 43rd bit set. You get an integer like that, if you have a posting list with two adjacent TIDs that are more than 2^31 blocks apart. According to the comment, we would store that in 6 bytes, with no continuation bit on the 6th byte, but in reality, the code encodes it using 7 bytes, with a continuation bit on the 6th byte as normal. The decoding routine also handled these 7-byte integers correctly, except for an overflow check that assumed that one integer needs at most 6 bytes. Fix the overflow check, and fix the comment to match what the code actually does. Also fix the comment that claimed that there are 17 unused bits in the 64-bit representation of an item pointer. In reality, there are 64-32-11=21. Fitting any item pointer into max 6 bytes was an important property when this was written, because in the old pre-9.4 format, item pointers were stored as plain arrays, with 6 bytes for every item pointer. The maximum of 6 bytes per integer in the new format guaranteed that we could convert any page from the old format to the new format after upgrade, so that the new format was never larger than the old format. But we hardly need to worry about that anymore, and running into that problem during upgrade, where an item pointer is expanded from 6 to 7 bytes such that the data doesn't fit on a page anymore, is implausible in practice anyway. Backpatch to all supported versions. This also includes a little test module to test these large distances between item pointers, without requiring a 16 TB table. It is not backpatched, I'm including it more for the benefit of future development of new posting list formats. Discussion: https://www.postgresql.org/message-id/33bfc20a-5c86-f50c-f5a5-58e9925d05ff%40iki.fi Reviewed-by: Masahiko Sawada, Alexander Korotkov
2019-08-20Fix bogus commentAlvaro Herrera
Author: Alexander Lakhin Discussion: https://postgr.es/m/20190819072244.GE18166@paquier.xyz
2019-08-07Fix predicate-locking of HOT updated rows.Heikki Linnakangas
In serializable mode, heap_hot_search_buffer() incorrectly acquired a predicate lock on the root tuple, not the returned tuple that satisfied the visibility checks. As explained in README-SSI, the predicate lock does not need to be copied or extended to other tuple versions, but for that to work, the correct, visible, tuple version must be locked in the first place. The original SSI commit had this bug in it, but it was fixed back in 2013, in commit 81fbbfe335. But unfortunately, it was reintroduced a few months later in commit b89e151054. Wising up from that, add a regression test to cover this, so that it doesn't get reintroduced again. Also, move the code that sets 't_self', so that it happens at the same time that the other HeapTuple fields are set, to make it more clear that all the code in the loop operate on the "current" tuple in the chain, not the root tuple. Bug spotted by Andres Freund, analysis and original fix by Thomas Munro, test case and some additional changes to the fix by Heikki Linnakangas. Backpatch to all supported versions (9.4). Discussion: https://www.postgresql.org/message-id/20190731210630.nqhszuktygwftjty%40alap3.anarazel.de
2019-07-10Fix variable initialization when using buffering build with GiSTMichael Paquier
This can cause valgrind to complain, as the flag marking a buffer as a temporary copy was not getting initialized. While on it, fill in with zeros newly-created buffer pages. This does not matter when loading a block from a temporary file, but it makes the push of an index tuple into a new buffer page safer. This has been introduced by 1d27dcf, so backpatch all the way down to 9.4. Author: Alexander Lakhin Discussion: https://postgr.es/m/15899-0d24fb273b3dd90c@postgresql.org Backpatch-through: 9.4
2019-06-18Avoid spurious deadlocks when upgrading a tuple lockAlvaro Herrera
This puts back reverted commit de87a084c0a5, with some bug fixes. When two (or more) transactions are waiting for transaction T1 to release a tuple-level lock, and transaction T1 upgrades its lock to a higher level, a spurious deadlock can be reported among the waiting transactions when T1 finishes. The simplest example case seems to be: T1: select id from job where name = 'a' for key share; Y: select id from job where name = 'a' for update; -- starts waiting for T1 Z: select id from job where name = 'a' for key share; T1: update job set name = 'b' where id = 1; Z: update job set name = 'c' where id = 1; -- starts waiting for T1 T1: rollback; At this point, transaction Y is rolled back on account of a deadlock: Y holds the heavyweight tuple lock and is waiting for the Xmax to be released, while Z holds part of the multixact and tries to acquire the heavyweight lock (per protocol) and goes to sleep; once T1 releases its part of the multixact, Z is awakened only to be put back to sleep on the heavyweight lock that Y is holding while sleeping. Kaboom. This can be avoided by having Z skip the heavyweight lock acquisition. As far as I can see, the biggest downside is that if there are multiple Z transactions, the order in which they resume after T1 finishes is not guaranteed. Backpatch to 9.6. The patch applies cleanly on 9.5, but the new tests don't work there (because isolationtester is not smart enough), so I'm not going to risk it. Author: Oleksii Kliukin Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com Discussion: https://postgr.es/m/2815.1560521451@sss.pgh.pa.us
2019-06-16Revert "Avoid spurious deadlocks when upgrading a tuple lock"Alvaro Herrera
This reverts commits 3da73d6839dc and de87a084c0a5. This code has some tricky corner cases that I'm not sure are correct and not properly tested anyway, so I'm reverting the whole thing for next week's releases (reintroducing the deadlock bug that we set to fix). I'll try again afterwards. Discussion: https://postgr.es/m/E1hbXKQ-0003g1-0C@gemulon.postgresql.org