Age | Commit message (Collapse) | Author |
|
This is necessary so that REASSIGN OWNED does the right thing with
composite types, to wit, that it also alters ownership of the type's
pg_class entry -- previously, the pg_class entry remained owned by the
original user, which caused later other failures such as the new owner's
inability to use ALTER TYPE to rename an attribute of the affected
composite. Also, if the original owner is later dropped, the pg_class
entry becomes owned by a non-existant user which is bogus.
To fix, create a new routine AlterTypeOwner_oid which knows whether to
pass the request to ATExecChangeOwner or deal with it directly, and use
that in shdepReassignOwner rather than calling AlterTypeOwnerInternal
directly. AlterTypeOwnerInternal is now simpler in that it only
modifies the pg_type entry and recurses to handle a possible array type;
higher-level tasks are handled by either AlterTypeOwner directly or
AlterTypeOwner_oid.
I took the opportunity to add a few more objects to the test rig for
REASSIGN OWNED, so that more cases are exercised. Additional ones could
be added for superuser-only-ownable objects (such as FDWs and event
triggers) but I didn't want to push my luck by adding a new superuser to
the tests on a backpatchable bug fix.
Per bug #13666 reported by Chris Pacejo.
This is a backpatch of commit 756e7b4c9db1 to branches 9.1 -- 9.4.
|
|
When REASSIGN and ALTER OWNER TO are used, both the object owner and ACL
list should be changed from the old owner to the new owner. This patch
fixes types, foreign data wrappers, and foreign servers to change their
ACL list properly; they already changed owners properly.
Report by Alexey Bashtanov
This is a backpatch of commit 59367fdf97c (for bug #9923) by Bruce
Momjian to branches 9.1 - 9.4; it wasn't backpatched originally out of
concerns that it would create a backwards compatibility problem, but per
discussion related to bug #13666 that turns out to have been misguided.
(Therefore, the entry in the 9.5 release notes should be removed.)
Note that 9.1 didn't have privileges on types (which were introduced by
commit 729205571e81), so this commit only changes foreign-data related
objects in that branch.
Discussion: http://www.postgresql.org/message-id/20151216224004.GL2618@alvherre.pgsql
http://www.postgresql.org/message-id/10227.1450373793@sss.pgh.pa.us
|
|
Changing the tablespace of an unlogged relation did not WAL log the
creation and content of the init fork. Thus, after a standby is
promoted, unlogged relation cannot be accessed anymore, with errors
like:
ERROR: 58P01: could not open file "pg_tblspc/...": No such file or directory
Additionally the init fork was not synced to disk, independent of the
configured wal_level, a relatively small durability risk.
Investigation of that problem also brought to light that, even for
permanent relations, the creation of !main forks was not WAL logged,
i.e. no XLOG_SMGR_CREATE record were emitted. That mostly turns out not
to be a problem, because these files were created when the actual
relation data is copied; nonexistent files are not treated as an error
condition during replay. But that doesn't work for empty files, and
generally feels a bit haphazard. Luckily, outside init and main forks,
empty forks don't occur often or are not a problem.
Add the required WAL logging and syncing to disk.
Reported-By: Michael Paquier
Author: Michael Paquier and Andres Freund
Discussion: 20151210163230.GA11331@alap3.anarazel.de
Backpatch: 9.1, where unlogged relations were introduced
|
|
The previous way of reconstructing check constraints was to do a separate
"ALTER TABLE ONLY tab ADD CONSTRAINT" for each table in an inheritance
hierarchy. However, that way has no hope of reconstructing the check
constraints' own inheritance properties correctly, as pointed out in
bug #13779 from Jan Dirk Zijlstra. What we should do instead is to do
a regular "ALTER TABLE", allowing recursion, at the topmost table that
has a particular constraint, and then suppress the work queue entries
for inherited instances of the constraint.
Annoyingly, we'd tried to fix this behavior before, in commit 5ed6546cf,
but we failed to notice that it wasn't reconstructing the pg_constraint
field values correctly.
As long as I'm touching pg_get_constraintdef_worker anyway, tweak it to
always schema-qualify the target table name; this seems like useful backup
to the protections installed by commit 5f173040.
In HEAD/9.5, get rid of get_constraint_relation_oids, which is now unused.
(I could alternatively have modified it to also return conislocal, but that
seemed like a pretty single-purpose API, so let's not pretend it has some
other use.) It's unused in the back branches as well, but I left it in
place just in case some third-party code has decided to use it.
In HEAD/9.5, also rename pg_get_constraintdef_string to
pg_get_constraintdef_command, as the previous name did nothing to explain
what that entry point did differently from others (and its comment was
equally useless). Again, that change doesn't seem like material for
back-patching.
I did a bit of re-pgindenting in tablecmds.c in HEAD/9.5, as well.
Otherwise, back-patch to all supported branches.
|
|
If some existing listener is far behind, incoming new listener sessions
would start from that session's read pointer and then need to advance over
many already-committed notification messages, which they have no interest
in. This was expensive in itself and also thrashed the pg_notify SLRU
buffers a lot more than necessary. We can improve matters considerably
in typical scenarios, without much added cost, by starting from the
furthest-ahead read pointer, not the furthest-behind one. We do have to
consider only sessions in our own database when doing this, which requires
an extra field in the data structure, but that's a pretty small cost.
Back-patch to 9.0 where the current LISTEN/NOTIFY logic was introduced.
Matt Newell, slightly adjusted by me
|
|
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort. However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.
To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed. (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)
In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact. This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.
The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount. That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache. This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.
This has been broken since subtransactions were invented, so back-patch
to all supported branches.
Tom Lane and Michael Paquier
|
|
If a row that potentially violates a deferred exclusion constraint is
HOT-updated later in the same transaction, the exclusion constraint would
be reported as violated when the check finally occurs, even if the row(s)
the new row originally conflicted with have since been removed. This
happened because the wrong TID was passed to check_exclusion_constraint(),
causing the live HOT-updated row to be seen as a conflicting row rather
than recognized as the row-under-test.
Per bug #13148 from Evan Martin. It's been broken since exclusion
constraints were invented, so back-patch to all supported branches.
|
|
Slow functions in index expressions might cause this loop to take long
enough to make it worth being cancellable. Probably it would be enough
to call CHECK_FOR_INTERRUPTS here, but for consistency with other
per-sample-row loops in this file, let's use vacuum_delay_point.
Report and patch by Jeff Janes. Back-patch to all supported branches.
|
|
Up to now RecordTransactionCommit() waited for WAL to be flushed (if
synchronous_commit != off) and to be synchronously replicated (if
enabled), even if a transaction did not have a xid assigned. The primary
reason for that is that sequence's nextval() did not assign a xid, but
are worthwhile to wait for on commit.
This can be problematic because sometimes read only transactions do
write WAL, e.g. HOT page prune records. That then could lead to read only
transactions having to wait during commit. Not something people expect
in a read only transaction.
This lead to such strange symptoms as backends being seemingly stuck
during connection establishment when all synchronous replicas are
down. Especially annoying when said stuck connection is the standby
trying to reconnect to allow syncrep again...
This behavior also is involved in a rather complicated <= 9.4 bug where
the transaction started by catchup interrupt processing waited for
syncrep using latches, but didn't get the wakeup because it was already
running inside the same overloaded signal handler. Fix the issue here
doesn't properly solve that issue, merely papers over the problems. In
9.5 catchup interrupts aren't processed out of signal handlers anymore.
To fix all this, make nextval() acquire a top level xid, and only wait for
transaction commit if a transaction both acquired a xid and emitted WAL
records. If only a xid has been assigned we don't uselessly want to
wait just because of writes to temporary/unlogged tables; if only WAL
has been written we don't want to wait just because of HOT prunes.
The xid assignment in nextval() is unlikely to cause overhead in
real-world workloads. For one it only happens SEQ_LOG_VALS/32 values
anyway, for another only usage of nextval() without using the result in
an insert or similar is affected.
Discussion: 20150223165359.GF30784@awork2.anarazel.de,
369698E947874884A77849D8FE3680C2@maumau,
5CF4ABBA67674088B3941894E22A0D25@maumau
Per complaint from maumau and Thom Brown
Backpatch all the way back; 9.0 doesn't have syncrep, but it seems
better to be consistent behavior across all maintained branches.
|
|
Commit 9406884af19e2620a14059e64d4eb6ab430ab328 cleaned up
column-privilege related leaks in various error-message paths, but ended
up including a few more things than it should have in the back branches.
Specifically, there's no need for the GetModifiedColumns macro in
execMain.c as 9.1 and older didn't include the row in check constraint
violations. Further, the regression tests added to check those cases
aren't necessary.
This patch removes the GetModifiedColumns macro from execMain.c, removes
the comment which was added to trigger.c related to the duplicate macro
definition, and removes the check-constraint-related regression tests.
Pointed out by Robert.
Back-patched to 9.1 and 9.0.
|
|
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.
We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.
To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.
To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.
Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.
Security: CVE-2015-0244
|
|
Commit 804b6b6db4dcfc590a468e7be390738f9f7755fb added the build of a
range table in copy.c to initialize the EState es_range_table since it
can be needed in error paths. Unfortunately, that commit didn't
appreciate that some code paths might end up not initializing the rte
which is used to build the range table.
Fix that and clean up a couple others things along the way- build it
only once and don't explicitly set it on the !is_from path as it
doesn't make any sense there (cstate is palloc0'd, so this isn't an
issue from an initializing standpoint either).
The prior commit went back to 9.0, but this only goes back to 9.1 as
prior to that the range table build happens immediately after building
the RTE and therefore doesn't suffer from this issue.
Pointed out by Robert.
|
|
While building error messages to return to the user,
BuildIndexValueDescription and ri_ReportViolation would happily include
the entire key or entire row in the result returned to the user, even if
the user didn't have access to view all of the columns being included.
Instead, include only those columns which the user is providing or which
the user has select rights on. If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription.
Note that, for key cases, the user must have access to all of the
columns for the key to be shown; a partial key will not be returned.
Back-patch all the way, as column-level privileges are now in all
supported versions.
This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
|
|
The "pos" variable is modified within PG_TRY and then referenced
within PG_CATCH, so for strict POSIX conformance it must be marked
volatile. Superficially the code looked safe because pos's address
was taken, which was sufficient to force it into memory ... but it's
not sufficient to ensure that the compiler applies updates exactly
where the program text says to. The volatility marking has to extend
into a couple of subroutines too, but I think that's probably a good
thing because the risk of out-of-order updates is mostly in those
subroutines not asyncQueueReadAllNotifications() itself. In principle
the compiler could have re-ordered operations such that an error could
be thrown while "pos" had an incorrect value.
It's unclear how real the risk is here, but for safety back-patch
to all active branches.
|
|
vacuum()'s static variable handling makes it non-reentrant; an ensuing
null pointer deference crashed the backend. Back-patch to 9.0 (all
supported versions).
|
|
The previous coding assumed that we could just let buffers for the
database's old tablespace age out of the buffer arena naturally.
The folly of that is exposed by bug #11867 from Marc Munro: the user could
later move the database back to its original tablespace, after which any
still-surviving buffers would match lookups again and appear to contain
valid data. But they'd be missing any changes applied while the database
was in the new tablespace.
This has been broken since ALTER SET TABLESPACE was introduced, so
back-patch to all supported branches.
|
|
As noted by Noah Misch, my initial cut at fixing bug #11638 didn't cover
all cases where ANALYZE might be invoked in an unsafe context. We need to
test the result of IsInTransactionChain not IsTransactionBlock; which is
notationally a pain because IsInTransactionChain requires an isTopLevel
flag, which would have to be passed down through several levels of callers.
I chose to pass in_outer_xact (ie, the result of IsInTransactionChain)
rather than isTopLevel per se, as that seemed marginally more apropos
for the intermediate functions to know about.
|
|
VACUUM and ANALYZE update the target table's pg_class row in-place, that is
nontransactionally. This is OK, more or less, for the statistical columns,
which are mostly nontransactional anyhow. It's not so OK for the DDL hint
flags (relhasindex etc), which might get changed in response to
transactional changes that could still be rolled back. This isn't a
problem for VACUUM, since it can't be run inside a transaction block nor
in parallel with DDL on the table. However, we allow ANALYZE inside a
transaction block, so if the transaction had earlier removed the last
index, rule, or trigger from the table, and then we roll back the
transaction after ANALYZE, the table would be left in a corrupted state
with the hint flags not set though they should be.
To fix, suppress the hint-flag updates if we are InTransactionBlock().
This is safe enough because it's always OK to postpone hint maintenance
some more; the worst-case consequence is a few extra searches of pg_index
et al. There was discussion of instead using a transactional update,
but that would change the behavior in ways that are not all desirable:
in most scenarios we're better off keeping ANALYZE's statistical values
even if the ANALYZE itself rolls back. In any case we probably don't want
to change this behavior in back branches.
Per bug #11638 from Casey Shobe. This has been broken for a good long
time, so back-patch to all supported branches.
Tom Lane and Michael Paquier, initial diagnosis by Andres Freund
|
|
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.
That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.
This was reported in bug #10675 by Maxim Boguk.
Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.
Backpatch to 9.1 where unlogged tables were introduced.
|
|
Such cases are disallowed by the SQL spec, and even if we wanted to allow
them, the semantics seem ambiguous: how should the FK columns be matched up
with the columns of a unique index? (The matching could be significant in
the presence of opclasses with different notions of equality, so this issue
isn't just academic.) However, our code did not previously reject such
cases, but instead would either fail to match to any unique index, or
generate a bizarre opclass-lookup error because of sloppy thinking in the
index-matching code.
David Rowley
|
|
In commit 631dc390f49909a5c8ebd6002cfb2bcee5415a9d, we started to handle
simple numeric timezone offsets via the zic library instead of the old
CTimeZone/HasCTZSet kluge. However, we overlooked the fact that the zic
code will reject UTC offsets exceeding a week (which seems a bit arbitrary,
but not because it's too tight ...). This led to possibly setting
session_timezone to NULL, which results in crashes in most timezone-related
operations as of 9.4, and crashes in a small number of places even before
that. So check for NULL return from pg_tzset_offset() and report an
appropriate error message. Per bug #11014 from Duncan Gillis.
Back-patch to all supported branches, like the previous patch.
(Unfortunately, as of today that no longer includes 8.4.)
|
|
Trying to reassign objects owned by a user that had text search
dictionaries or configurations used to fail with:
ERROR: unexpected classid 3600
or
ERROR: unexpected classid 3602
Fix by adding cases for those object types in a switch in pg_shdepend.c.
Both REASSIGN OWNED and text search objects go back all the way to 8.1,
so backpatch to all supported branches. In 9.3 the alter-owner code was
made generic, so the required change in recent branches is pretty
simple; however, for 9.2 and older ones we need some additional
reshuffling to enable specifying objects by OID rather than name.
Text search templates and parsers are not owned objects, so there's no
change required for them.
Per bug #9749 reported by Michal Novotný
|
|
The syntax doesn't let you specify "WITH OIDS" for foreign tables, but it
was still possible with default_with_oids=true. But the rest of the system,
including pg_dump, isn't prepared to handle foreign tables with OIDs
properly.
Backpatch down to 9.1, where foreign tables were introduced. It's possible
that there are databases out there that already have foreign tables with
OIDs. There isn't much we can do about that, but at least we can prevent
them from being created in the future.
Patch by Etsuro Fujita, reviewed by Hadi Moshayedi.
|
|
This was not changed in HEAD, but will be done later as part of a
pgindent run. Future pgindent runs will also do this.
Report by Tom Lane
Backpatch through all supported branches, but not HEAD
|
|
The code segment that removes the old symlink (if present) wasn't clued
into the fact that on Windows, symlinks are junction points which have
to be removed with rmdir().
Backpatch to 9.0, where the failing code was introduced.
MauMau, reviewed by Muhammad Asif Naeem and Amit Kapila
|
|
Use TransactionIdIsInProgress, then TransactionIdDidCommit, to distinguish
whether a NOTIFY message's originating transaction is in progress,
committed, or aborted. The previous coding could accept a message from a
transaction that was still in-progress according to the PGPROC array;
if the client were fast enough at starting a new transaction, it might fail
to see table rows added/updated by the message-sending transaction. Which
of course would usually be the point of receiving the message. We noted
this type of race condition long ago in tqual.c, but async.c overlooked it.
The race condition probably cannot occur unless there are multiple NOTIFY
senders in action, since an individual backend doesn't send NOTIFY signals
until well after it's done committing. But if two senders commit in close
succession, it's certainly possible that we could see the second sender's
message within the race condition window while responding to the signal
from the first one.
Per bug #9557 from Marko Tiikkaja. This patch is slightly more invasive
than what he proposed, since it removes the now-redundant
TransactionIdDidAbort call.
Back-patch to 9.0, where the current NOTIFY implementation was introduced.
|
|
If the name lookups come to different conclusions due to concurrent
activity, we might perform some parts of the DDL on a different table
than other parts. At least in the case of CREATE INDEX, this can be
used to cause the permissions checks to be performed against a
different table than the index creation, allowing for a privilege
escalation attack.
This changes the calling convention for DefineIndex, CreateTrigger,
transformIndexStmt, transformAlterTableStmt, CheckIndexCompatible
(in 9.2 and newer), and AlterTable (in 9.1 and older). In addition,
CheckRelationOwnership is removed in 9.2 and newer and the calling
convention is changed in older branches. A field has also been added
to the Constraint node (FkConstraint in 8.4). Third-party code calling
these functions or using the Constraint node will require updating.
Report by Andres Freund. Patch by Robert Haas and Andres Freund,
reviewed by Tom Lane.
Security: CVE-2014-0062
|
|
The primary role of PL validators is to be called implicitly during
CREATE FUNCTION, but they are also normal functions that a user can call
explicitly. Add a permissions check to each validator to ensure that a
user cannot use explicit validator calls to achieve things he could not
otherwise achieve. Back-patch to 8.4 (all supported versions).
Non-core procedural language extensions ought to make the same two-line
change to their own validators.
Andres Freund, reviewed by Tom Lane and Noah Misch.
Security: CVE-2014-0061
|
|
Granting a role without ADMIN OPTION is supposed to prevent the grantee
from adding or removing members from the granted role. Issuing SET ROLE
before the GRANT bypassed that, because the role itself had an implicit
right to add or remove members. Plug that hole by recognizing that
implicit right only when the session user matches the current role.
Additionally, do not recognize it during a security-restricted operation
or during execution of a SECURITY DEFINER function. The restriction on
SECURITY DEFINER is not security-critical. However, it seems best for a
user testing his own SECURITY DEFINER function to see the same behavior
others will see. Back-patch to 8.4 (all supported versions).
The SQL standards do not conflate roles and users as PostgreSQL does;
only SQL roles have members, and only SQL users initiate sessions. An
application using PostgreSQL users and roles as SQL users and roles will
never attempt to grant membership in the role that is the session user,
so the implicit right to add or remove members will never arise.
The security impact was mostly that a role member could revoke access
from others, contrary to the wishes of his own grantor. Unapproved role
member additions are less notable, because the member can still largely
achieve that by creating a view or a SECURITY DEFINER function.
Reviewed by Andres Freund and Tom Lane. Reported, independently, by
Jonas Sundman and Noah Misch.
Security: CVE-2014-0060
|
|
Various places were supposing that errno could be expected to hold still
within an ereport() nest or similar contexts. This isn't true necessarily,
though in some cases it accidentally failed to fail depending on how the
compiler chanced to order the subexpressions. This class of thinko
explains recent reports of odd failures on clang-built versions, typically
missing or inappropriate HINT fields in messages.
Problem identified by Christian Kruse, who also submitted the patch this
commit is based on. (I fixed a few issues in his patch and found a couple
of additional places with the same disease.)
Back-patch as appropriate to all supported branches.
|
|
We've always allowed CREATE TABLE to create tables in the database's default
tablespace without checking for CREATE permissions on that tablespace.
Unfortunately, the original implementation of ALTER TABLE ... SET TABLESPACE
didn't pick up on that exception.
This changes ALTER TABLE ... SET TABLESPACE to allow the database's default
tablespace without checking for CREATE rights on that tablespace, just as
CREATE TABLE works today. Users could always do this through a series of
commands (CREATE TABLE ... AS SELECT * FROM ...; DROP TABLE ...; etc), so
let's fix the oversight in SET TABLESPACE's original implementation.
|
|
The standard typanalyze functions skip over values whose detoasted size
exceeds WIDTH_THRESHOLD (1024 bytes), so as to limit memory bloat during
ANALYZE. However, we (I think I, actually :-() failed to consider the
possibility that *every* non-null value in a column is too wide. While
compute_minimal_stats() seems to behave reasonably anyway in such a case,
compute_scalar_stats() just fell through and generated no pg_statistic
entry at all. That's unnecessarily pessimistic: we can still produce
valid stanullfrac and stawidth values in such cases, since we do include
too-wide values in the average-width calculation. Furthermore, since the
general assumption in this code is that too-wide values are probably all
distinct from each other, it seems reasonable to set stadistinct to -1
("all distinct").
Per complaint from Kadri Raudsepp. This has been like this since roughly
neolithic times, so back-patch to all supported branches.
|
|
This prevents a possible longjmp out of the signal handler if a timeout
or SIGINT occurs while something within the handler has transiently set
ImmediateInterruptOK. For safety we must hold off the timeout or cancel
error until we're back in mainline, or at least till we reach the end of
the signal handler when ImmediateInterruptOK was true at entry. This
syncs these functions with the logic now present in handle_sig_alarm.
AFAICT there is no live bug here in 9.0 and up, because I don't think we
currently can wait for any heavyweight lock inside these functions, and
there is no other code (except read-from-client) that will turn on
ImmediateInterruptOK. However, that was not true pre-9.0: in older
branches ProcessIncomingNotify might block trying to lock pg_listener, and
then a SIGINT could lead to undesirable control flow. It might be all
right anyway given the relatively narrow code ranges in which NOTIFY
interrupts are enabled, but for safety's sake I'm back-patching this.
|
|
Vacuum recognizes that it can update relfrozenxid by checking whether it has
processed all pages of a relation. Unfortunately it performed that check
after truncating the dead pages at the end of the relation, and used the new
number of pages to decide whether all pages have been scanned. If the new
number of pages happened to be smaller or equal to the number of pages
scanned, it incorrectly decided that all pages were scanned.
This can lead to relfrozenxid being updated, even though some pages were
skipped that still contain old XIDs. That can lead to data loss due to xid
wraparounds with some rows suddenly missing. This likely has escaped notice
so far because it takes a large number (~2^31) of xids being used to see the
effect, while a full-table vacuum before that would fix the issue.
The incorrect logic was introduced by commit
b4b6923e03f4d29636a94f6f4cc2f5cf6298b8c8. Backpatch this fix down to 8.4,
like that commit.
Andres Freund, with some modifications by me.
|
|
Formerly, when using a SQL-spec timezone setting with a fixed GMT offset
(called a "brute force" timezone in the code), the session_timezone
variable was not updated to match the nominal timezone; rather, all code
was expected to ignore session_timezone if HasCTZSet was true. This is
of course obviously fragile, though a search of the code finds only
timeofday() failing to honor the rule. A bigger problem was that
DetermineTimeZoneOffset() supposed that if its pg_tz parameter was
pointer-equal to session_timezone, then HasCTZSet should override the
parameter. This would cause datetime input containing an explicit zone
name to be treated as referencing the brute-force zone instead, if the
zone name happened to match the session timezone that had prevailed
before installing the brute-force zone setting (as reported in bug #8572).
The same malady could affect AT TIME ZONE operators.
To fix, set up session_timezone so that it matches the brute-force zone
specification, which we can do using the POSIX timezone definition syntax
"<abbrev>offset", and get rid of the bogus lookaside check in
DetermineTimeZoneOffset(). Aside from fixing the erroneous behavior in
datetime parsing and AT TIME ZONE, this will cause the timeofday() function
to print its result in the user-requested time zone rather than some
previously-set zone. It might also affect results in third-party
extensions, if there are any that make use of session_timezone without
considering HasCTZSet, but in all cases the new behavior should be saner
than before.
Back-patch to all supported branches.
|
|
There is a rare race condition, when a transaction that inserted a tuple
aborts while vacuum is processing the page containing the inserted tuple.
Vacuum prunes the page first, which normally removes any dead tuples, but
if the inserting transaction aborts right after that, the loop after
pruning will see a dead tuple and remove it instead. That's OK, but if the
page is on a table with no indexes, and the page becomes completely empty
after removing the dead tuple (or tuples) on it, it will be immediately
marked as all-visible. That's OK, but the sanity check in vacuum would
throw a warning because it thinks that the page contains dead tuples and
was nevertheless marked as all-visible, even though it just vacuumed away
the dead tuples and so it doesn't actually contain any.
Spotted this while reading the code. It's difficult to hit the race
condition otherwise, but can be done by putting a breakpoint after the
heap_page_prune() call.
Backpatch all the way to 8.4, where this code first appeared.
|
|
Refactoring as part of commit 8ceb24568054232696dddc1166a8563bc78c900a
had the unintended effect of making REINDEX TABLE and REINDEX DATABASE
no longer validate constraints enforced by the indexes in question;
REINDEX INDEX still did so. Indexes marked invalid remained so, and
constraint violations arising from data corruption went undetected.
Back-patch to 9.0, like the causative commit.
|
|
In most scenarios a portal without a ResourceOwner is dead and not subject
to any further execution, but a portal for a cursor WITH HOLD remains in
existence with no ResourceOwner after the creating transaction is over.
In this situation, if we attempt to "execute" the portal directly to fetch
data from it, we were setting CurrentResourceOwner to NULL, leading to a
segfault if the datatype output code did anything that required a resource
owner (such as trying to fetch system catalog entries that weren't already
cached). The case appears to be impossible to provoke with stock libpq,
but psqlODBC at least is able to cause it when working with held cursors.
Simplest fix is to just skip the assignment to CurrentResourceOwner, so
that any resources used by the data output operations will be managed by
the transaction-level resource owner instead. For consistency I changed
all the places that install a portal's resowner as current, even though
some of them are probably not reachable with a held cursor's portal.
Per report from Joshua Berry (with thanks to Hiroshi Inoue for developing
a self-contained test case). Back-patch to all supported versions.
|
|
The new message (and SQLSTATE) matches the corresponding error cases in
namespace.c.
This was thought to be a "can't happen" case when extension.c was written,
so we didn't think hard about how to report it. But it definitely can
happen in 9.2 and later, since we no longer require search_path to contain
any valid schema names. It's probably also possible in 9.1 if search_path
came from a noninteractive source. So, back-patch to all releases
containing this code.
Per report from Sean Chittenden, though this isn't exactly his patch.
|
|
Patch b19e4250b45e91c9cbdd18d35ea6391ab5961c8d attempted to
preserve existing behavior regarding statistics generation in the
case that a truncation attempt was canceled due to lock conflicts.
It failed to do this accurately in two regards: (1) autovacuum had
previously generated statistics if the truncate attempt failed to
initially get the lock rather than having started the attempt, and
(2) the VACUUM ANALYZE command had always generated statistics.
Both of these changes were unintended, and are reverted by this
patch. On review, there seems to be consensus that the previous
failure to generate statistics when the truncate was terminated
was more an unfortunate consequence of how that effort was
previously terminated than a feature we want to keep; so this
patch generates statistics even when an autovacuum truncation
attempt terminates early. Another unintended change which is kept
on the basis that it is an improvement is that when a VACUUM
command is truncating, it will the new heuristic for avoiding
blocking other processes, rather than keeping an
AccessExclusiveLock on the table for however long the truncation
takes.
Per multiple reports, with some renaming per patch by Jeff Janes.
Backpatch to 9.0, where problem was created.
|
|
There was a high probability of two or more concurrent C.I.C. commands
deadlocking just before completion, because each would wait for the others
to release their reference snapshots. Fix by releasing the snapshot
before waiting for other snapshots to go away.
Per report from Paul Hinze. Back-patch to all active branches.
|
|
|
|
Since a backend adds itself to the global listener array during
Exec_ListenPreCommit, it's inappropriate for it to remove itself during
Exec_UnlistenCommit or Exec_UnlistenAllCommit --- that leads to failure
when committing a transaction that did UNLISTEN then LISTEN, since we end
up not registered though we should be. (This leads to missing later
notifications, or to Assert failures in assert-enabled builds.) Instead
deal with deregistering at the bottom of AtCommit_Notify, when we know the
final state of the listenChannels list.
Also, simplify the representation of registration status by replacing the
transient backendHasExecutedInitialListen flag with an amRegisteredListener
flag.
Per report from Greg Sabino Mullane. Back-patch to 9.0, where the problem
was introduced during the LISTEN/NOTIFY rewrite.
|
|
The original code used freeze_min_age instead of freeze_table_age. The
main consequence of this mistake is that lowering freeze_min_age would
cause full-table scans to occur much more frequently, which causes
serious issues because the number of writes required is much larger.
That feature (freeze_min_age) is supposed to affect only how soon tuples
are frozen; some pages should still be skipped due to the visibility
map.
Backpatch to 8.4, where the freeze_table_age feature was introduced.
Report and patch from Andres Freund
|
|
In situations where there are over 8MB of empty pages at the end of
a table, the truncation work for trailing empty pages takes longer
than deadlock_timeout, and there is frequent access to the table by
processes other than autovacuum, there was a problem with the
autovacuum worker process being canceled by the deadlock checking
code. The truncation work done by autovacuum up that point was
lost, and the attempt tried again by a later autovacuum worker. The
attempts could continue indefinitely without making progress,
consuming resources and blocking other processes for up to
deadlock_timeout each time.
This patch has the autovacuum worker checking whether it is
blocking any other thread at 20ms intervals. If such a condition
develops, the autovacuum worker will persist the work it has done
so far, release its lock on the table, and sleep in 50ms intervals
for up to 5 seconds, hoping to be able to re-acquire the lock and
try again. If it is unable to get the lock in that time, it moves
on and a worker will try to continue later from the point this one
left off.
While this patch doesn't change the rules about when and what to
truncate, it does cause the truncation to occur sooner, with less
blocking, and with the consumption of fewer resources when there is
contention for the table's lock.
The only user-visible change other than improved performance is
that the table size during truncation may change incrementally
instead of just once.
Backpatched to 9.0 from initial master commit at
b19e4250b45e91c9cbdd18d35ea6391ab5961c8d -- before that the
differences are too large to be clearly safe.
Jan Wieck
|
|
Use of SnapshotNow is known to expose us to race conditions if the tuple(s)
being sought could be updated by concurrently-committing transactions.
CREATE DATABASE and DROP DATABASE are particularly exposed because they do
heavyweight filesystem operations during their scans of pg_tablespace,
so that the scans run for a very long time compared to most. Furthermore,
the potential consequences of a missed or twice-visited row are nastier
than average:
* createdb() could fail with a bogus "file already exists" error, or
silently fail to copy one or more tablespace's worth of files into the
new database.
* remove_dbtablespaces() could miss one or more tablespaces, thus failing
to free filesystem space for the dropped database.
* check_db_file_conflict() could likewise miss a tablespace, leading to an
OID conflict that could result in data loss either immediately or in
future operations. (This seems of very low probability, though, since a
duplicate database OID would be unlikely to start with.)
Hence, it seems worth fixing these three places to use MVCC snapshots, even
though this will someday be superseded by a generic solution to SnapshotNow
race conditions.
Back-patch to all active branches.
Stephen Frost and Tom Lane
|
|
If pg_extension_config_dump() is executed again for a table already listed
in the extension's extconfig, the code was blindly making a new array entry.
This does not seem useful. Fix it to replace the existing array entry
instead, so that it's possible for extension update scripts to alter the
filter conditions for configuration tables.
In addition, teach ALTER EXTENSION DROP TABLE to check for an extconfig
entry for the target table, and remove it if present. This is not a 100%
solution because it's allowed for an extension update script to just
summarily DROP a member table, and that code path doesn't go through
ExecAlterExtensionContentsStmt. We could probably make that case clean
things up if we had to, but it would involve sticking a very ugly wart
somewhere in the guts of dependency.c. Since on the whole it seems quite
unlikely that extension updates would want to remove pre-existing
configuration tables, making the case possible with an explicit command
seems sufficient.
Per bug #7756 from Regina Obe. Back-patch to 9.1 where extensions were
introduced.
|
|
During crash recovery, we remove disk files belonging to temporary tables,
but the system catalog entries for such tables are intentionally not
cleaned up right away. Instead, the first backend that uses a temp schema
is expected to clean out any leftover objects therein. This approach
requires that we be careful to ignore leftover temp tables (since any
actual access attempt would fail), *even if their BackendId matches our
session*, if we have not yet established use of the session's corresponding
temp schema. That worked fine in the past, but was broken by commit
debcec7dc31a992703911a9953e299c8d730c778 which incorrectly removed the
rd_islocaltemp relcache flag. Put it back, and undo various changes
that substituted tests like "rel->rd_backend == MyBackendId" for use
of a state-aware flag. Per trouble report from Heikki Linnakangas.
Back-patch to 9.1 where the erroneous change was made. In the back
branches, be careful to add rd_islocaltemp in a spot in the struct that
was alignment padding before, so as not to break existing add-on code.
|
|
During VACUUM if we pause to perform a cycle
of index cleanup we drop the vmbuffer pin,
so we should do the same thing when heap
scan completes. This avoids holding vmbuffer
pin across the main index cleanup in VACUUM,
which could be minutes or hours longer than
necessary for correctness.
Bug report and suggested fix from Pavan Deolasee
|
|
If we had not been holding buffer pin continuously since the tuple was
initially fetched by the UPDATE or DELETE query, it would be possible for
VACUUM or a page-prune operation to move the tuple while we're trying to
copy it. This would result in a garbage "old" tuple value being passed to
an AFTER ROW UPDATE or AFTER ROW DELETE trigger. The preconditions for
this are somewhat improbable, and the timing constraints are very tight;
so it's not so surprising that this hasn't been reported from the field,
even though the bug has been there a long time.
Problem found by Andres Freund. Back-patch to all active branches.
|