summaryrefslogtreecommitdiff
path: root/src/backend/commands
AgeCommit message (Collapse)Author
2020-09-30Reword partitioning error messageAlvaro Herrera
The error message about columns in the primary key not including all of the partition key was unclear; reword it. Backpatch all the way to pg11, where it appeared. Reported-by: Nagaraj Raj <nagaraj.sf@yahoo.com> Discussion: https://postgr.es/m/64062533.78364.1601415362244@mail.yahoo.com
2020-09-29Fix progress reporting of REINDEX CONCURRENTLYMichael Paquier
This addresses a couple of issues with the so-said subject: - Report the correct parent relation with the index actually being rebuilt or validated. Previously, the command status remained set to the last index created for the progress of the index build and validation, which would be incorrect when working on a table that has more than one index. - Use the correct phase when waiting before the drop of the old indexes. Previously, this was reported with the same status as when waiting before the old indexes are marked as dead. Author: Matthias van de Meent, Michael Paquier Discussion: https://postgr.es/m/CAEze2WhqFgcwe1_tv=sFYhLWV2AdpfukumotJ6JNcAOQs3jufg@mail.gmail.com Backpatch-through: 12
2020-09-28Add for_each_from, to simplify loops starting from non-first list cells.Tom Lane
We have a dozen or so places that need to iterate over all but the first cell of a List. Prior to v13 this was typically written as for_each_cell(lc, lnext(list_head(list))) Commit 1cff1b95a changed these to for_each_cell(lc, list, list_second_cell(list)) This patch introduces a new macro for_each_from() which expresses the start point as a list index, allowing these to be written as for_each_from(lc, list, 1) This is marginally more efficient, since ForEachState.i can be initialized directly instead of backing into it from a ListCell address. It also seems clearer and less typo-prone. Some of the remaining uses of for_each_cell() look like they could profitably be changed to for_each_from(), but here I confined myself to changing uses of list_second_cell(). Also, fix for_each_cell_setup() and for_both_cell_setup() to const-ify their arguments; that's a simple oversight in 1cff1b95a. Back-patch into v13, on the grounds that (1) the const-ification is a minor bug fix, and (2) it's better for back-patching purposes if we only have two ways to write these loops rather than three. In HEAD, also remove list_third_cell() and list_fourth_cell(), which were also introduced in 1cff1b95a, and are unused as of cc99baa43. It seems unlikely that any third-party code would have started to use them already; anyone who has can be directed to list_nth_cell instead. Discussion: https://postgr.es/m/CAApHDvpo1zj9KhEpU2cCRZfSM3Q6XGdhzuAS2v79PH7WJBkYVA@mail.gmail.com
2020-09-27Minor mop-up for List improvements.Tom Lane
Fix a few places that were using written-out versions of the pg_list.h macros that commit cc99baa43 just improved, making them also use those macros so as to gain whatever performance improvement is to be had. Discussion: https://postgr.es/m/CAApHDvpo1zj9KhEpU2cCRZfSM3Q6XGdhzuAS2v79PH7WJBkYVA@mail.gmail.com
2020-09-26Revise RelationBuildRowSecurity() to avoid memory leaks.Tom Lane
This function leaked some memory while loading qual clauses for an RLS policy. While ordinarily negligible, that could build up in some repeated-reload cases, as reported by Konstantin Knizhnik. We can improve matters by borrowing the coding long used in RelationBuildRuleLock: build stringToNode's result directly in the target context, and remember to explicitly pfree the input string. This patch by no means completely guarantees zero leaks within this function, since we have no real guarantee that the catalog- reading subroutines it calls don't leak anything. However, practical tests suggest that this is enough to resolve the issue. In any case, any remaining leaks are similar to those risked by RelationBuildRuleLock and other relcache-loading subroutines. If we need to fix them, we should adopt a more global approach such as that used by the RECOVER_RELATION_BUILD_MEMORY hack. While here, let's remove the need for an expensive PG_TRY block by using MemoryContextSetParent to reparent an initially-short-lived context for the RLS data. Back-patch to all supported branches. Discussion: https://postgr.es/m/21356c12-8917-8249-b35f-1c447231922b@postgrespro.ru
2020-09-25Defer flushing of SLRU files.Thomas Munro
Previously, we called fsync() after writing out individual pg_xact, pg_multixact and pg_commit_ts pages due to cache pressure, leading to regular I/O stalls in user backends and recovery. Collapse requests for the same file into a single system call as part of the next checkpoint, as we already did for relation files, using the infrastructure developed by commit 3eb77eba. This can cause a significant improvement to recovery performance, especially when it's otherwise CPU-bound. Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer that it applies to all the SLRU mini-buffer-pools as well as the main buffer pool. Rearrange things so that data collected in CheckpointStats includes SLRU activity. Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions, because they were redundant after the shutdown checkpoint that immediately precedes them. (I'm not sure if they were ever needed, but they aren't now.) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts) Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
2020-09-23Improve error cursor positions for problems with partition bounds.Tom Lane
We failed to pass down the query string to check_new_partition_bound, so that its attempts to provide error cursor positions were for naught; one must have the query string to get parser_errposition to do anything. Adjust its API to require a ParseState to be passed down. Also, improve the logic inside check_new_partition_bound so that the cursor points at the partition bound for the specific column causing the issue, when one can be identified. That part is also for naught if we can't determine the query position of the column with the problem. Improve transformPartitionBoundValue so that it makes sure that const-simplified partition expressions will be properly labeled with positions. In passing, skip calling evaluate_expr if the value is already a Const, which is surely the most common case. Alexandra Wang, Ashwin Agrawal, Amit Langote; reviewed by Ashutosh Bapat Discussion: https://postgr.es/m/CACiyaSopZoqssfMzgHk6fAkp01cL6vnqBdmTw2C5_KJaFR_aMg@mail.gmail.com Discussion: https://postgr.es/m/CAJV4CdrZ5mKuaEsRSbLf2URQ3h6iMtKD=hik8MaF5WwdmC9uZw@mail.gmail.com
2020-09-17Remove support for postfix (right-unary) operators.Tom Lane
This feature has been a thorn in our sides for a long time, causing many grammatical ambiguity problems. It doesn't seem worth the pain to continue to support it, so remove it. There are some follow-on improvements we can make in the grammar, but this commit only removes the bare minimum number of productions, plus assorted backend support code. Note that pg_dump and psql continue to have full support, since they may be used against older servers. However, pg_dump warns about postfix operators. There is also a check in pg_upgrade. Documentation-wise, I (tgl) largely removed the "left unary" terminology in favor of saying "prefix operator", which is a more standard and IMO less confusing term. I included a catversion bump, although no initial catalog data changes here, to mark the boundary at which oprkind = 'r' stopped being valid in pg_operator. Mark Dilger, based on work by myself and Robert Haas; review by John Naylor Discussion: https://postgr.es/m/38ca86db-42ab-9b48-2902-337a0d6b8311@2ndquadrant.com
2020-09-16Don't fetch partition check expression during InitResultRelInfo.Tom Lane
Since there is only one place that actually needs the partition check expression, namely ExecPartitionCheck, it's better to fetch it from the relcache there. In this way we will never fetch it at all if the query never has use for it, and we still fetch it just once when we do need it. The reason for taking an interest in this is that if the relcache doesn't already have the check expression cached, fetching it requires obtaining AccessShareLock on the partition root. That means that operations that look like they should only touch the partition itself will also take a lock on the root. In particular we observed that TRUNCATE on a partition may take a lock on the partition's root, contributing to a deadlock situation in parallel pg_restore. As written, this patch does have a small cost, which is that we are microscopically reducing efficiency for the case where a partition has an empty check expression. ExecPartitionCheck will be called, and will go through the motions of setting up and checking an empty qual, where before it would not have been called at all. We could avoid that by adding a separate boolean flag to track whether there is a partition expression to test. However, this case only arises for a default partition with no siblings, which surely is not an interesting case in practice. Hence adding complexity for it does not seem like a good trade-off. Amit Langote, per a suggestion by me Discussion: https://postgr.es/m/VI1PR03MB31670CA1BD9625C3A8C5DD05EB230@VI1PR03MB3167.eurprd03.prod.outlook.com
2020-09-16Avoid unnecessary recursion to child tables in ALTER TABLE SET NOT NULL.Tom Lane
If a partitioned table's column is already marked NOT NULL, there is no need to examine its partitions, because we can rely on previous DDL to have enforced that the child columns are NOT NULL as well. (Unfortunately, the same cannot be said for traditional inheritance, so for now we have to restrict the optimization to partitioned tables.) Hence, we may skip recursing to child tables in this situation. The reason this case is worth worrying about is that when pg_dump dumps a partitioned table having a primary key, it will include the requisite NOT NULL markings in the CREATE TABLE commands, and then add the primary key as a separate step. The primary key addition generates a SET NOT NULL as a subcommand, just to be sure. So the situation where a SET NOT NULL is redundant does arise in the real world. Skipping the recursion does more than just save a few cycles: it means that a command such as "ALTER TABLE ONLY partition_parent ADD PRIMARY KEY" will take locks only on the partition parent table, not on the partitions. It turns out that parallel pg_restore is effectively assuming that that's true, and has little choice but to do so because the dependencies listed for such a TOC entry don't include the partitions. pg_restore could thus issue this ALTER while data restores on the partitions are still in progress. Taking unnecessary locks on the partitions not only hurts concurrency, but can lead to actual deadlock failures, as reported by Domagoj Smoljanovic. (A contributing factor in the deadlock is that TRUNCATE on a child partition wants a non-exclusive lock on the parent. This seems likewise unnecessary, but the fix for it is more invasive so we won't consider back-patching it. Fortunately, getting rid of one of these two poor behaviors is enough to remove the deadlock.) Although support for partitioned primary keys came in with v11, this patch is dependent on the SET NOT NULL refactoring done by commit f4a3fdfbd, so we can only patch back to v12. Patch by me; thanks to Alvaro Herrera and Amit Langote for review. Discussion: https://postgr.es/m/VI1PR03MB31670CA1BD9625C3A8C5DD05EB230@VI1PR03MB3167.eurprd03.prod.outlook.com
2020-09-15Fix use-after-free bug with event triggers in an extension scriptAlvaro Herrera
ALTER TABLE commands in an extension script are added to an event trigger command list; but starting with commit b5810de3f4 they do so in a memory context that's too short-lived, so when execution ends and time comes to use the entries, they've already been freed. (This would also be a problem with ALTER TABLE commands in a multi-command query string, but these serendipitously end in PortalContext -- which probably explains why it took so long for this to be reported.) Fix by using the memory context specifically set for that, instead. Backpatch to 13, where the aforementioned commit appeared. Reported-by: Philippe Beaudoin Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Discussion: https://postgr.es/m/20200902193715.6e0269d4@firost
2020-09-14Message fixes and style improvementsPeter Eisentraut
2020-09-08Add support for partitioned tables and indexes in REINDEXMichael Paquier
Until now, REINDEX was not able to work with partitioned tables and indexes, forcing users to reindex partitions one by one. This extends REINDEX INDEX and REINDEX TABLE so as they can accept a partitioned index and table in input, respectively, to reindex all the partitions assigned to them with physical storage (foreign tables, partitioned tables and indexes are then discarded). This shares some logic with schema and database REINDEX as each partition gets processed in its own transaction after building a list of relations to work on. This choice has the advantage to minimize the number of invalid indexes to one partition with REINDEX CONCURRENTLY in the event a cancellation or failure in-flight, as the only indexes handled at once in a single REINDEX CONCURRENTLY loop are the ones from the partition being working on. Isolation tests are added to emulate some cases I bumped into while developing this feature, particularly with the concurrent drop of a leaf partition reindexed. However, this is rather limited as LOCK would cause REINDEX to block in the first transaction building the list of partitions. Per its multi-transaction nature, this new flavor cannot run in a transaction block, similarly to REINDEX SCHEMA, SYSTEM and DATABASE. Author: Justin Pryzby, Michael Paquier Reviewed-by: Anastasia Lubennikova Discussion: https://postgr.es/m/db12e897-73ff-467e-94cb-4af03705435f.adger.lj@alibaba-inc.com
2020-09-05Yet more elimination of dead stores and useless initializations.Tom Lane
I'm not sure what tool Ranier was using, but the ones I contributed were found by using a newer version of scan-build than I tried before. Ranier Vilela and Tom Lane Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com
2020-09-05Switch to multi-inserts when registering dependencies for many code pathsMichael Paquier
This commit improves the dependency registrations by taking advantage of the preliminary work done in 63110c62, to group together the insertion of dependencies of the same type to pg_depend. With the current layer of routines available, and as only dependencies of the same type can be grouped, there are code paths still doing more than one multi-insert when it is necessary to register dependencies of multiple types (constraint and index creation are two cases doing that). While on it, this refactors some of the code to use ObjectAddressSet() when manipulating object addresses. Author: Daniel Gustafsson, Michael Paquier Reviewed-by: Andres Freund, Álvaro Herrera Discussion: https://postgr.es/m/20200807061619.GA23955@paquier.xyz
2020-09-04Remove variable "concurrent" from ReindexStmtMichael Paquier
This node already handles multiple options using a bitmask, so having a separate boolean flag is not necessary. This simplifies the code a bit with less arguments to give to the reindex routines, by replacing the boolean with an equivalent bitmask value. Reviewed-by: Julien Rouhaud Discussion: https://postgr.es/m/20200902110326.GA14963@paquier.xyz
2020-09-03Add support for streaming to built-in logical replication.Amit Kapila
To add support for streaming of in-progress transactions into the built-in logical replication, we need to do three things: * Extend the logical replication protocol, so identify in-progress transactions, and allow adding additional bits of information (e.g. XID of subtransactions). * Modify the output plugin (pgoutput) to implement the new stream API callbacks, by leveraging the extended replication protocol. * Modify the replication apply worker, to properly handle streamed in-progress transaction by spilling the data to disk and then replaying them on commit. We however must explicitly disable streaming replication during replication slot creation, even if the plugin supports it. We don't need to replicate the changes accumulated during this phase, and moreover we don't have a replication connection open so we don't have where to send the data anyway. Author: Tomas Vondra, Dilip Kumar and Amit Kapila Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-09-02Improve handling of dropped relations for REINDEX DATABASE/SCHEMA/SYSTEMMichael Paquier
When multiple relations are reindexed, a scan of pg_class is done first to build the list of relations to work on. However the REINDEX logic has never checked if a relation listed still exists when beginning the work on it, causing for example sudden cache lookup failures. This commit adds safeguards against dropped relations for REINDEX, similarly to VACUUM or CLUSTER where we try to open the relation, ignoring it if it is missing. A new option is added to the REINDEX routines to control if a missed relation is OK to ignore or not. An isolation test, based on REINDEX SCHEMA, is added for the concurrent and non-concurrent cases. Author: Michael Paquier Reviewed-by: Anastasia Lubennikova Discussion: https://postgr.es/m/20200813043805.GE11663@paquier.xyz
2020-09-01Set cutoff xmin more aggressively when vacuuming a temporary table.Tom Lane
Since other sessions aren't allowed to look into a temporary table of our own session, we do not need to worry about the global xmin horizon when setting the vacuum XID cutoff. Indeed, if we're not inside a transaction block, we may set oldestXmin to be the next XID, because there cannot be any in-doubt tuples in a temp table, nor any tuples that are dead but still visible to some snapshot of our transaction. (VACUUM, of course, is never inside a transaction block; but we need to test that because CLUSTER shares the same code.) This approach allows us to always clean out a temp table completely during VACUUM, independently of concurrent activity. Aside from being useful in its own right, that simplifies building reproducible test cases. Discussion: https://postgr.es/m/3490536.1598629609@sss.pgh.pa.us
2020-09-01Raise error on concurrent drop of partitioned indexAlvaro Herrera
We were already raising an error for DROP INDEX CONCURRENTLY on a partitioned table, albeit a different and confusing one: ERROR: DROP INDEX CONCURRENTLY must be first action in transaction Change that to throw a more comprehensible error: ERROR: cannot drop partitioned index \"%s\" concurrently Michael Paquier authored the test case for indexes on temporary partitioned tables. Backpatch to 11, where indexes on partitioned tables were added. Reported-by: Jan Mussler <jan.mussler@zalando.de> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/16594-d2956ca909585067@postgresql.org
2020-08-30Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.Tom Lane
Historically, we've considered the state with relpages and reltuples both zero as indicating that we do not know the table's tuple density. This is problematic because it's impossible to distinguish "never yet vacuumed" from "vacuumed and seen to be empty". In particular, a user cannot use VACUUM or ANALYZE to override the planner's normal heuristic that an empty table should not be believed to be empty because it is probably about to get populated. That heuristic is a good safety measure, so I don't care to abandon it, but there should be a way to override it if the table is indeed intended to stay empty. Hence, represent the initial state of ignorance by setting reltuples to -1 (relpages is still set to zero), and apply the minimum-ten-pages heuristic only when reltuples is still -1. If the table is empty, VACUUM or ANALYZE (but not CREATE INDEX) will override that to reltuples = relpages = 0, and then we'll plan on that basis. This requires a bunch of fiddly little changes, but we can get rid of some ugly kluges that were formerly needed to maintain the old definition. One notable point is that FDWs' GetForeignRelSize methods will see baserel->tuples = -1 when no ANALYZE has been done on the foreign table. That seems like a net improvement, since those methods were formerly also in the dark about what baserel->tuples = 0 really meant. Still, it is an API change. I bumped catversion because code predating this change would get confused by seeing reltuples = -1. Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com
2020-08-22Fix ALTER TABLE's scheduling rules for AT_AddConstraint subcommands.Tom Lane
Commit 1281a5c90 rearranged the logic in this area rather drastically, and it broke the case of adding a foreign key constraint in the same ALTER that adds the pkey or unique constraint it depends on. While self-referential fkeys are surely a pretty niche case, this used to work so we shouldn't break it. To fix, reorganize the scheduling rules in ATParseTransformCmd so that a transformed AT_AddConstraint subcommand will be delayed into a later pass in all cases, not only when it's been spit out as a side-effect of parsing some other command type. Also tweak the logic so that we won't run ATParseTransformCmd twice while doing this. It seems to work even without that, but it's surely wasting cycles to do so. Per bug #16589 from Jeremy Evans. Back-patch to v13 where the new code was introduced. Discussion: https://postgr.es/m/16589-31c8d981ca503896@postgresql.org
2020-08-21Fix handling of CREATE TABLE LIKE with inheritance.Tom Lane
If a CREATE TABLE command uses both LIKE and traditional inheritance, Vars in CHECK constraints and expression indexes that are absorbed from a LIKE parent table tended to get mis-numbered, resulting in wrong answers and/or bizarre error messages (though probably not any actual crashes, thanks to validation occurring in the executor). In v12 and up, the same could happen to Vars in GENERATED expressions, even in cases with no LIKE clause but multiple traditional-inheritance parents. The cause of the problem for LIKE is that parse_utilcmd.c supposed it could renumber such Vars correctly during transformCreateStmt(), which it cannot since we have not yet accounted for columns added via inheritance. Fix that by postponing processing of LIKE INCLUDING CONSTRAINTS, DEFAULTS, GENERATED, INDEXES till after we've performed DefineRelation(). The error with GENERATED and multiple inheritance is a simple oversight in MergeAttributes(); it knows it has to renumber Vars in inherited CHECK constraints, but forgot to apply the same processing to inherited GENERATED expressions (a/k/a defaults). Per bug #16272 from Tom Gottfried. The non-GENERATED variants of the issue are ancient, presumably dating right back to the addition of CREATE TABLE LIKE; hence back-patch to all supported branches. Discussion: https://postgr.es/m/16272-6e32da020e9a9381@postgresql.org
2020-08-21Rework EXPLAIN for planner's buffer usage.Fujii Masao
Commit ce77abe63c allowed EXPLAIN (BUFFERS) to report the information on buffer usage during planning phase. However three issues were reported regarding this feature. (1) Previously, EXPLAIN option BUFFERS required ANALYZE. So the query had to be actually executed by specifying ANALYZE even when we want to see only the planner's buffer usage. This was inconvenient especially when the query was write one like DELETE. (2) EXPLAIN included the planner's buffer usage in summary information. So SUMMARY option had to be enabled to report that. Also this format was confusing. (3) The output structure for planning information was not consistent between TEXT format and the others. For example, "Planning" tag was output in JSON format, but not in TEXT format. For (1), this commit allows us to perform EXPLAIN (BUFFERS) without ANALYZE to report the planner's buffer usage. For (2), this commit changed EXPLAIN output so that the planner's buffer usage is reported before summary information. For (3), this commit made the output structure for planning information more consistent between the formats. Back-patch to v13 where the planner's buffer usage was allowed to be reported in EXPLAIN. Reported-by: Pierre Giraud, David Rowley Author: Fujii Masao Reviewed-by: David Rowley, Julien Rouhaud, Pierre Giraud Discussion: https://postgr.es/m/07b226e6-fa49-687f-b110-b7c37572f69e@dalibo.com
2020-08-15Correct several behavior descriptions in comments.Noah Misch
Reuse cautionary language from src/test/ssl/README in src/test/kerberos/README. SLRUs have had access to six-character segments names since commit 73c986adde5d73a5e2555da9b5c8facedb146dcd, and recovery stopped calling HeapTupleHeaderAdvanceLatestRemovedXid() in commit 558a9165e081d1936573e5a7d576f5febd7fb55a. The other corrections are more self-evident.
2020-08-15Prevent concurrent SimpleLruTruncate() for any given SLRU.Noah Misch
The SimpleLruTruncate() header comment states the new coding rule. To achieve this, add locktype "frozenid" and two LWLocks. This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. Data loss is more likely in pg_multixact, due to released branches' thin margin between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild standbys of that primary regardless of symptoms. At less risk is a cluster having emitted "not accepting commands" errors or "must be vacuumed" warnings at some point. One can test a cluster for this data loss by running VACUUM FREEZE in every database. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
2020-08-14snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags.Andres Freund
Similar to the previous commit this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. As we now take care to not unnecessarily write to ProcGlobal->vacuumFlags, there should be very few modifications to the ProcGlobal->vacuumFlags array. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-14snapshot scalability: Introduce dense array of in-progress xids.Andres Freund
The new array contains the xids for all connected backends / in-use PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT arrays which can have unused entries interspersed). This improves performance because GetSnapshotData() always needs to scan the xids of all live procarray entries and now there's no need to go through the procArray->pgprocnos indirection anymore. As the set of running top-level xids changes rarely, compared to the number of snapshots taken, this substantially increases the likelihood of most data required for a snapshot being in l2 cache. In read-mostly workloads scanning the xids[] array will sufficient to build a snapshot, as most backends will not have an xid assigned. To keep the xid array dense ProcArrayRemove() needs to move entries behind the to-be-removed proc's one further up in the array. Obviously moving array entries cannot happen while a backend sets it xid. I.e. locking needs to prevent that array entries are moved while a backend modifies its xid. To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray entry is not a very frequent operation, even taking 2PC into account. Due to the above, the dense array entries can only be read or modified while holding ProcArrayLock and/or XidGenLock. This prevents a concurrent ProcArrayRemove() from shifting the dense array while it is accessed concurrently. While the new dense array is very good when needing to look at all xids it is less suitable when accessing a single backend's xid. In particular it would be problematic to have to acquire a lock to access a backend's own xid. Therefore a backend's xid is not just stored in the dense array, but also in PGPROC. This also allows a backend to only access the shared xid value when the backend had acquired an xid. The infrastructure added in this commit will be used for the remaining PGXACT fields in subsequent commits. They are kept separate to make review easier. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13snapshot scalability: Move PGXACT->xmin back to PGPROC.Andres Freund
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to unnecessary cacheline ping-pong to have it in PGXACT, as it is updated considerably more frequently than the other PGXACT members. After the changes in dc7420c2c92, this is a very straight-forward change. For highly concurrent, snapshot acquisition heavy, workloads this change alone can significantly increase scalability. E.g. plain pgbench on a smaller 2 socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench when submitting queries in batches of 100, and 2.85x for batches of 100 'SELECT';. The latter numbers are obviously not to be expected in the real-world, but micro-benchmark the snapshot computation scalability (previously spending ~80% of the time in GetSnapshotData()). Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12snapshot scalability: Don't compute global horizons while building snapshots.Andres Freund
To make GetSnapshotData() more scalable, it cannot not look at at each proc's xmin: While snapshot contents do not need to change whenever a read-only transaction commits or a snapshot is released, a proc's xmin is modified in those cases. The frequency of xmin modifications leads to, particularly on higher core count systems, many cache misses inside GetSnapshotData(), despite the data underlying a snapshot not changing. That is the most significant source of GetSnapshotData() scaling poorly on larger systems. Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons / thresholds as it has so far. But we don't really have to: The horizons don't actually change that much between GetSnapshotData() calls. Nor are the horizons actually used every time a snapshot is built. The trick this commit introduces is to delay computation of accurate horizons until there use and using horizon boundaries to determine whether accurate horizons need to be computed. The use of RecentGlobal[Data]Xmin to decide whether a row version could be removed has been replaces with new GlobalVisTest* functions. These use two thresholds to determine whether a row can be pruned: 1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed are definitely still visible. 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can definitely be removed GetSnapshotData() updates definitely_needed to be the xmin of the computed snapshot. When testing whether a row can be removed (with GlobalVisTestIsRemovableXid()) and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID < definitely_needed) the boundaries can be recomputed to be more accurate. As it is not cheap to compute accurate boundaries, we limit the number of times that happens in short succession. As the boundaries used by GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by GetSnapshotData()), it is likely that further test can benefit from an earlier computation of accurate horizons. To avoid regressing performance when old_snapshot_threshold is set (as that requires an accurate horizon to be computed), heap_page_prune_opt() doesn't unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the computation of the limited horizon, and the triggering of errors (with SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove tuples. This commit just removes the accesses to PGXACT->xmin from GetSnapshotData(), but other members of PGXACT residing in the same cache line are accessed. Therefore this in itself does not result in a significant improvement. Subsequent commits will take advantage of the fact that GetSnapshotData() now does not need to access xmins anymore. Note: This contains a workaround in heap_page_prune_opt() to keep the snapshot_too_old tests working. While that workaround is ugly, the tests currently are not meaningful, and it seems best to address them separately. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-10Replace remaining StrNCpy() by strlcpy()Peter Eisentraut
They are equivalent, except that StrNCpy() zero-fills the entire destination buffer instead of providing just one trailing zero. For all but a tiny number of callers, that's just overhead rather than being desirable. Remove StrNCpy() as it is now unused. In some cases, namestrcpy() is the more appropriate function to use. While we're here, simplify the API of namestrcpy(): Remove the return value, don't check for NULL input. Nothing was using that anyway. Also, remove a few unused name-related functions. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
2020-08-10Make contrib modules' installation scripts more secure.Tom Lane
Hostile objects located within the installation-time search_path could capture references in an extension's installation or upgrade script. If the extension is being installed with superuser privileges, this opens the door to privilege escalation. While such hazards have existed all along, their urgency increases with the v13 "trusted extensions" feature, because that lets a non-superuser control the installation path for a superuser-privileged script. Therefore, make a number of changes to make such situations more secure: * Tweak the construction of the installation-time search_path to ensure that references to objects in pg_catalog can't be subverted; and explicitly add pg_temp to the end of the path to prevent attacks using temporary objects. * Disable check_function_bodies within installation/upgrade scripts, so that any security gaps in SQL-language or PL-language function bodies cannot create a risk of unwanted installation-time code execution. * Adjust lookup of type input/receive functions and join estimator functions to complain if there are multiple candidate functions. This prevents capture of references to functions whose signature is not the first one checked; and it's arguably more user-friendly anyway. * Modify various contrib upgrade scripts to ensure that catalog modification queries are executed with secure search paths. (These are in-place modifications with no extension version changes, since it is the update process itself that is at issue, not the end result.) Extensions that depend on other extensions cannot be made fully secure by these methods alone; therefore, revert the "trusted" marking that commit eb67623c9 applied to earthdistance and hstore_plperl, pending some better solution to that set of issues. Also add documentation around these issues, to help extension authors write secure installation scripts. Patch by me, following an observation by Andres Freund; thanks to Noah Misch for review. Security: CVE-2020-14350
2020-08-07Remove PROC_IN_ANALYZE and derived flagsAlvaro Herrera
These flags are unused and always have been. Discussion: https://postgr.es/m/20200805235549.GA8118@alvherre.pgsql
2020-08-07Fix bogus EXPLAIN output for Hash AggregateDavid Rowley
9bdb300de modified the EXPLAIN output for Hash Aggregate to show details from parallel workers. However, it neglected to consider that a given parallel worker may not have assisted with the given Hash Aggregate. This can occur when workers fail to start or during Parallel Append with enable_partitionwise_join enabled when only a single worker is working on a non-parallel aware sub-plan. It could also happen if a worker simply wasn't fast enough to get any work done before other processes went and finished all the work. The bogus output came from the fact that ExplainOpenWorker() skipped showing any details for non-initialized workers but show_hashagg_info() did show details from the worker. This meant that the worker properties that were shown were not properly attributed to the worker that they belong to. In passing, we also now don't show Hash Aggregate properties for the leader process when it did not contribute any work to the Hash Aggregate. This can occur either during Parallel Append when only a parallel worker worked on a given sub plan or with parallel_leader_participation set to off. This aims to make the behavior of Hash Aggregate's EXPLAIN output more similar to Sort's. Reported-by: Justin Pryzby Discussion: https://postgr.es/m/20200805012105.GZ28072%40telsasoft.com Backpatch-through: 13, where the original breakage was introduced
2020-08-02Use int64 instead of long in incremental sort codeDavid Rowley
Windows 64bit has 4-byte long values which is not suitable for tracking disk space usage in the incremental sort code. Let's just make all these fields int64s. Author: James Coleman Discussion: https://postgr.es/m/CAApHDvpky%2BUhof8mryPf5i%3D6e6fib2dxHqBrhp0Qhu0NeBhLJw%40mail.gmail.com Backpatch-through: 13, where the incremental sort code was added
2020-08-01Invent "amadjustmembers" AM method for validating opclass members.Tom Lane
This allows AM-specific knowledge to be applied during creation of pg_amop and pg_amproc entries. Specifically, the AM knows better than core code which entries to consider as required or optional. Giving the latter entries the appropriate sort of dependency allows them to be dropped without taking out the whole opclass or opfamily; which is something we'd like to have to correct obsolescent entries in extensions. This callback also opens the door to performing AM-specific validity checks during opclass creation, rather than hoping than an opclass developer will remember to test with "amvalidate". For the most part I've not actually added any such checks yet; that can happen in a follow-on patch. (Note that we shouldn't remove any tests from "amvalidate", as those are still needed to cross-check manually constructed entries in the initdb data. So adding tests to "amadjustmembers" will be somewhat duplicative, but it seems like a good idea anyway.) Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and Anastasia Lubennikova. Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us
2020-07-31Fix oversight in ALTER TYPE: typmodin/typmodout must propagate to arrays.Tom Lane
If a base type supports typmods, its array type does too, with the same interpretation. Hence changes in pg_type.typmodin/typmodout must be propagated to the array type. While here, improve AlterTypeRecurse to not recurse to domains if there is nothing we'd need to change. Oversight in fe30e7ebf. Back-patch to v13 where that came in.
2020-07-31Use multi-inserts for pg_attribute and pg_shdependMichael Paquier
For pg_attribute, this allows to insert at once a full set of attributes for a relation (roughly 15% of WAL reduction in extreme cases). For pg_shdepend, this reduces the work done when creating new shared dependencies from a database template. The number of slots used for the insertion is capped at 64kB of data inserted for both, depending on the number of items to insert and the length of the rows involved. More can be done for other catalogs, like pg_depend. This part requires a different approach as the number of slots to use depends also on the number of entries discarded as pinned dependencies. This is also related to the rework or dependency handling for ALTER TABLE and CREATE TABLE, mainly. Author: Daniel Gustafsson Reviewed-by: Andres Freund, Michael Paquier Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
2020-07-29Make EXPLAIN ANALYZE of HashAgg more similar to Hash JoinDavid Rowley
There were various unnecessary differences between Hash Agg's EXPLAIN ANALYZE output and Hash Join's. Here we modify the Hash Agg output so that it's better aligned to Hash Join's. The following changes have been made: 1. Start batches counter at 1 instead of 0. 2. Always display the "Batches" property, even when we didn't spill to disk. 3. Use the text "Batches" instead of "HashAgg Batches" for text format. 4. Use the text "Memory Usage" instead of "Peak Memory Usage" for text format. 5. Include "Batches" before "Memory Usage" in both text and non-text formats. In passing also modify the "Planned Partitions" property so that we show it regardless of if the value is 0 or not for non-text EXPLAIN formats. This was pointed out by Justin Pryzby and probably should have been part of 40efbf870. Reviewed-by: Justin Pryzby, Jeff Davis Discussion: https://postgr.es/m/CAApHDvrshRnA6C0VFnu7Fb9TVvgGo80PUMm5+2DiaS1gEkPvtw@mail.gmail.com Backpatch-through: 13, where HashAgg batching was introduced
2020-07-25Improve performance of binary COPY FROM through better buffering.Tom Lane
At least on Linux and macOS, fread() turns out to have far higher per-call overhead than one could wish. Reading 64KB of data at a time and then parceling it out with our own memcpy logic makes binary COPY from a file significantly faster --- around 30% in simple testing for cases with narrow text columns (on Linux ... even more on macOS). In binary COPY from frontend, there's no per-call fread(), and this patch introduces an extra layer of memcpy'ing, but it still manages to eke out a small win. Apparently, the control-logic overhead in CopyGetData() is enough to be worth avoiding for small fetches. Bharath Rupireddy and Amit Langote, reviewed by Vignesh C, cosmetic tweaks by me Discussion: https://postgr.es/m/CALj2ACU5Bz06HWLwqSzNMN=Gupoj6Rcn_QVC+k070V4em9wu=A@mail.gmail.com
2020-07-20Add generic_plans and custom_plans fields into pg_prepared_statements.Fujii Masao
There was no easy way to find how many times generic and custom plans have been executed for a prepared statement. This commit exposes those numbers of times in pg_prepared_statements view. Author: Atsushi Torikoshi, Kyotaro Horiguchi Reviewed-by: Tatsuro Yamada, Masahiro Ikeda, Fujii Masao Discussion: https://postgr.es/m/CACZ0uYHZ4M=NZpofH6JuPHeX=__5xcDELF8hT8_2T+R55w4RQw@mail.gmail.com
2020-07-18Allow logical replication to transfer data in binary format.Tom Lane
This patch adds a "binary" option to CREATE/ALTER SUBSCRIPTION. When that's set, the publisher will send data using the data type's typsend function if any, rather than typoutput. This is generally faster, if slightly less robust. As committed, we won't try to transfer user-defined array or composite types in binary, for fear that type OIDs won't match at the subscriber. This might be changed later, but it seems like fit material for a follow-on patch. Dave Cramer, reviewed by Daniel Gustafsson, Petr Jelinek, and others; adjusted some by me Discussion: https://postgr.es/m/CADK3HH+R3xMn=8t3Ct+uD+qJ1KD=Hbif5NFMJ+d5DkoCzp6Vgw@mail.gmail.com
2020-07-15Eliminate cache lookup errors in SQL functions for object addressesMichael Paquier
When using the following functions, users could see various types of errors of the type "cache lookup failed for OID XXX" with elog(), that can only be used for internal errors: * pg_describe_object() * pg_identify_object() * pg_identify_object_as_address() The set of APIs managing object addresses for all object types are made smarter by gaining a new argument "missing_ok" that allows any caller to control if an error is raised or not on an undefined object. The SQL functions listed above are changed to handle the case where an object is missing. Regression tests are added for all object types for the cases where these are undefined. Before this commit, these cases failed with cache lookup errors, and now they basically return NULL (minus the name of the object type requested). Author: Michael Paquier Reviewed-by: Aleksander Alekseev, Dmitry Dolgov, Daniel Gustafsson, Álvaro Herrera, Kyotaro Horiguchi Discussion: https://postgr.es/m/CAB7nPqSZxrSmdHK-rny7z8mi=EAFXJ5J-0RbzDw6aus=wB5azQ@mail.gmail.com
2020-07-14Add comment to explain an unused function parameterDavid Rowley
Removing the unused 'miinfo' parameter has been raised a couple of times now. It was decided in the 2nd discussion below that we're going to leave it alone. It seems like it might be useful to add a comment to mention this fact so that nobody wastes any time in the future proposing its removal again. Discussion: https://postgr.es/m/CAApHDvpCf-qR5HC1rXskUM4ToV+3YDb4-n1meY=vpAHsRS_1PA@mail.gmail.com Discussion: https://postgr.es/m/CAE9k0P%3DFvcDswnSVtRpSyZMpcAWC%3DGp%3DifZ0HdfPaRQ%3D__LBtw%40mail.gmail.com
2020-07-14Fix timing issue with ALTER TABLE's validate constraintDavid Rowley
An ALTER TABLE to validate a foreign key in which another subcommand already caused a pending table rewrite could fail due to ALTER TABLE attempting to validate the foreign key before the actual table rewrite takes place. This situation could result in an error such as: ERROR: could not read block 0 in file "base/nnnnn/nnnnn": read only 0 of 8192 bytes The failure here was due to the SPI call which validates the foreign key trying to access an index which is yet to be rebuilt. Similarly, we also incorrectly tried to validate CHECK constraints before the heap had been rewritten. The fix for both is to delay constraint validation until phase 3, after the table has been rewritten. For CHECK constraints this means a slight behavioral change. Previously ALTER TABLE VALIDATE CONSTRAINT on inheritance tables would be validated from the bottom up. This was different from the order of evaluation when a new CHECK constraint was added. The changes made here aligns the VALIDATE CONSTRAINT evaluation order for inheritance tables to be the same as ADD CONSTRAINT, which is generally top-down. Reported-by: Nazli Ugur Koyluoglu, using SQLancer Discussion: https://postgr.es/m/CAApHDvp%3DZXv8wiRyk_0rWr00skhGkt8vXDrHJYXRMft3TjkxCA%40mail.gmail.com Backpatch-through: 9.5 (all supported versions)
2020-07-11Avoid useless buffer allocations during binary COPY FROM.Tom Lane
The raw_buf and line_buf buffers aren't used when reading binary format, so skip allocating them. raw_buf is 64K so that seems like a worthwhile savings. An unused line_buf only wastes 1K, but as long as we're checking it's free to avoid allocating that too. Bharath Rupireddy, tweaked a bit by me Discussion: https://postgr.es/m/CALj2ACXcCKaGPY0whowqrJ4OPJvDnTssgpGCzvuFQu5z0CXb-g@mail.gmail.com
2020-07-11Rename field "relkind" to "objtype" for CTAS and ALTER TABLE nodesMichael Paquier
"relkind" normally refers to the char field from pg_class. However, in the parse nodes AlterTableStmt and CreateTableAsStmt, "relkind" was used for a field of type enum ObjectType, that could refer to other object types than those possible for a relkind. Such fields being usually named "objtype", switch the name in both structures to make things more consistent. Note that this led to some confusion in functions that also operate on a RangeTableEntry object, which also has a field named "relkind". This naming goes back to commit 09d4e96, where only OBJECT_TABLE and OBJECT_INDEX were used. This got extended later to use as well OBJECT_TYPE with e440e12, not really a relation kind. Author: Mark Dilger Reviewed-by: Daniel Gustafsson, Álvaro Herrera, Michael Paquier Discussion: https://postgr.es/m/609181AE-E399-47C7-9221-856E0F96BF93@enterprisedb.com
2020-07-09Fix whitespace in HashAgg EXPLAIN ANALYZEDavid Rowley
The Sort node does not put a space between the number of kilobytes and the "kB" of memory or disk space used, but HashAgg does. Here we align HashAgg to do the same as Sort. Sort has been displaying this information for longer than HashAgg, so it makes sense to align HashAgg to Sort rather than the other way around. Reported-by: Justin Pryzby Discussion: https://postgr.es/m/20200708163021.GW4107@telsasoft.com Backpatch-through: 13, where the hashagg started showing these details
2020-07-08code: replace most remaining uses of 'master'.Andres Freund
Author: Andres Freund Reviewed-By: David Steele Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08code: replace 'master' with 'primary' where appropriate.Andres Freund
Also changed "in the primary" to "on the primary", and added a few "the" before "primary". Author: Andres Freund Reviewed-By: David Steele Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de