Age | Commit message (Collapse) | Author |
|
This reverts commits f07d18b6e94d, 82c83b337202, 3a3b309041b0, and
24c5f1a103ce.
This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute. There are
discussions on larger changes in this area for the next release.
|
|
Back in 3b02ea4f0780 I added some comments in various places to explain
how logical decoding and other things worked. Not all of the changes
were welcome, because they were misleading or wrong. This changes them
a little bit to make them more accurate.
Some other comments are also changed to be more accurate. Also, fix a
bunch of typos.
Author: Álvaro Herrera, Craig Ringer
Andres Freund reviewed some parts of this.
|
|
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
|
|
|
|
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
|
|
It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int". This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms. There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int. If we want to cope with PIDs that are wider
than int, this is not the place to start.
Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?
Per buildfarm. Back-patch to 9.5 which has same issue.
|
|
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.
Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.
Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
|
|
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.
Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.
(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)
Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
|
|
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.
Bump XLOG_PAGE_MAGIC because of new record type.
Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
Michael Paquier
|
|
Author: Konstantin Knizhnik
|
|
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline. This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.
This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.
Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
|
|
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.
Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
|
|
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.
Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
|
|
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.
Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability. The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.
Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
|
|
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes; especially on filesystems like xfs and ext4 when mounted
with data=writeback. To be certain that a rename() atomically replaces
the previous file contents in the face of crashes and different
filesystems, one has to fsync the old filename, rename the file, fsync
the new filename, fsync the containing directory. This sequence is not
generally adhered to currently; which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.
Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well. There seems little reason extend several copies of the
same logic, so centralize the link() callers.
This commit does not yet make use of the new functions; they're used in
a followup commit.
Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
|
|
Coverity and inspection for the issue addressed in fd45d16f found some
questionable code.
Specifically coverity noticed that the wrong length was added in
ReorderBufferSerializeChange() - without immediate negative consequences
as the variable isn't used afterwards. During code-review and testing I
noticed that a bit of space was wasted when allocating tuple bufs in
several places. Thirdly, the debug memset()s in
ReorderBufferGetTupleBuf() reduce the error checking valgrind can do.
Backpatch: 9.4, like c8f621c43.
|
|
In c8f621c43 I forgot to account for MAXALIGN when allocating a new
tuplebuf in ReorderBufferGetTupleBuf(). That happens to currently not
cause active problems on a number of platforms because the affected
pointer is already aligned, but others, like ppc and hppa, trigger this
in the regression test, due to a debug memset clearing memory.
Fix that.
Backpatch: 9.4, like the previous commit.
|
|
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
|
|
Somehow I managed to flip the order of restoring old & new tuples when
de-spooling a change in a large transaction from disk. This happens to
only take effect when a change is spooled to disk which has old/new
versions of the tuple. That only is the case for UPDATEs where he
primary key changed or where replica identity is changed to FULL.
The tests didn't catch this because either spooled updates, or updates
that changed primary keys, were tested; not both at the same time.
Found while adding tests for the following commit.
Backpatch: 9.4, where logical decoding was added
|
|
Logical decoding's reorderbuffer keeps transactions in an LSN ordered
list for efficiency. To make that's efficiently possible upper-level
xids are forced to be logged before nested subtransaction xids. That
only works though if these records are all looked at: Unfortunately we
didn't do so for e.g. row level locks, which are otherwise uninteresting
for logical decoding.
This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".
It's not sufficient to just look at row locking records, the xid could
appear first due to a lot of other types of records (which will trigger
the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.
Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
|
|
When adding replication origins in 5aa235042, I somehow managed to set
the timestamp of decoded transactions to InvalidXLogRecptr when decoding
one made without a replication origin. Fix that, and the wrong type of
the new commit_time variable.
This didn't trigger a regression test failure because we explicitly
don't show commit timestamps in the regression tests, as they obviously
are variable. Add a test that checks that a decoded commit's timestamp
is within minutes of NOW() from before the commit.
Reported-By: Weiping Qu
Diagnosed-By: Artur Zakirov
Discussion: 56D4197E.9050706@informatik.uni-kl.de,
56D42918.1010108@postgrespro.ru
Backpatch: 9.5, where 5aa235042 originates.
|
|
Previously, we had a mix of styles.
Amit Kapila
|
|
This makes the values more stable, which seems like a good thing for
anybody who needs to look at at them.
Alexander Korotkov and Amit Kapila
|
|
Author: Michael Paquier
|
|
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.
As requested by Michael Paquier, Alvaro Herrera and Andres Freund
|
|
A scan for missed proisstrict markings in the core code turned up
these functions:
brin_summarize_new_values
pg_stat_reset_single_table_counters
pg_stat_reset_single_function_counters
pg_create_logical_replication_slot
pg_create_physical_replication_slot
pg_drop_replication_slot
The first three of these take OID, so a null argument will normally look
like a zero to them, resulting in "ERROR: could not open relation with OID
0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX
functions. The other three will dump core on a null argument, though this
is mitigated by the fact that they won't do so until after checking that
the caller is superuser or has rolreplication privilege.
In addition, the pg_logical_slot_get/peek[_binary]_changes family was
intentionally marked nonstrict, but failed to make nullness checks on all
the arguments; so again a null-pointer-dereference crash is possible but
only for superusers and rolreplication users.
Add the missing ARGISNULL checks to the latter functions, and mark the
former functions as strict in pg_proc. Make that change in the back
branches too, even though we can't force initdb there, just so that
installations initdb'd in future won't have the issue. Since none of these
bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX
functions hardly misbehave at all), it seems sufficient to do this.
In addition, fix some order-of-operations oddities in the slot_get_changes
family, mostly cosmetic, but not the part that moves the function's last
few operations into the PG_TRY block. As it stood, there was significant
risk for an error to exit without clearing historical information from
the system caches.
The slot_get_changes bugs go back to 9.4 where that code was introduced.
Back-patch appropriate subsets of the pg_proc changes into all active
branches, as well.
|
|
Backpatch certain files through 9.1
|
|
This could result in the error context misidentifying where the error
actually occurred.
Craig Ringer
|
|
By accident the replication origin was not set properly in
DecodeCommit(). That's bad because the origin is passed to the output
plugins origin filter, and accessible from the output plugin via
ReorderBufferTXN->origin_id. Accessing the origin of individual changes
worked before the fix, which is why this wasn't notices earlier.
Reported-By: Craig Ringer
Author: Craig Ringer
Discussion: CAMsr+YFhBJLp=qfSz3-J+0P1zLkE8zNXM2otycn20QRMx380gw@mail.gmail.com
Backpatch: 9.5, where replication origins where introduced
|
|
Message style, plurals, quoting, spelling, consistency with similar
messages
|
|
Prior to commit 0709b7ee72e4bc71ad07b7120acd117265ab51d0, access to
variables within a spinlock-protected critical section had to be done
through a volatile pointer, but that should no longer be necessary.
This continues work begun in df4077cda2eae3eb4a5cf387da0c1e7616e73204
and 6ba4ecbf477e0b25dd7bde1b0c4e07fc2da19348.
Thomas Munro and Michael Paquier
|
|
It was introduced alongside replication origins, by commit
5aa2350426c, so backpatch to 9.5.
Pointed out by Fujii Masao
|
|
Oskari Saarenmaa. Backpatch to stable branches where applicable.
|
|
This fixes a bunch of somewhat pedantic warnings with new
compilers. Since by far the majority of other functions definitions use
the (void) style it just seems to be consistent to do so as well in the
remaining few places.
|
|
Trivial change to reduce exposure of rel.h.
|
|
When creating a physical slot it's often useful to immediately reserve
the current WAL position instead of only doing after the first feedback
message arrives. That e.g. allows slots to guarantee that all the WAL
for a base backup will be available afterwards.
Logical slots already have to reserve WAL during creation, so generalize
that logic into being usable for both physical and logical slots.
Catversion bump because of the new parameter.
Author: Gurjeet Singh
Reviewed-By: Andres Freund
Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com
|
|
These make the code a bit easier to read, and make it easier to add a
more explicit notion of a slot's type at some point in the future.
Author: Gurjeet Singh
Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com
|
|
Fix a bunch of typos, and remove two superflous includes.
Author: Gurjeet Singh
Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com
Backpatch: 9.4
|
|
Amit reviewed the replication origins patch and made some good
points. Address them. This fixes typos in error messages, docs and
comments and adds a missing error check (although in a
should-never-happen scenario).
Discussion: CAA4eK1JqUBVeWWKwUmBPryFaje4190ug0y-OAUHWQ6tD83V4xg@mail.gmail.com
Backpatch: 9.5, where replication origins were introduced.
|
|
Previously the message erroneously printed the same LSN twice as the
assignment to the start_lsn variable was before the message. Correct
that.
Reported-By: Marko Tiikkaja
Author: Marko Tiikkaja
Backpatch: 9.5, where logical decoding was introduced
|
|
When spilling transaction data to disk a simple typo caused the output
file to be closed and reopened for every serialized change. That happens
to not have a huge impact on linux, which is why it probably wasn't
noticed so far, but on windows that appears to trigger actual disk
writes after every change. Not fun.
The bug fortunately does not have any impact besides speed. A change
could end up being in the wrong segment (last instead of next), but
since we read all files to the end, that's just ugly, not really
problematic. It's not a problem to upgrade, since transaction spill
files do not persist across restarts.
Bug: #13484
Reported-By: Olivier Gosseaume
Discussion: 20150703090217.1190.63940@wrigleys.postgresql.org
Backpatch to 9.4, where logical decoding was added.
|
|
|
|
Patch by CharSyam, plus a few more I spotted with grep.
|
|
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.
Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.
Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
|
|
Dmitriy Olshevskiy
|
|
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
|
|
Author: Erik Rijkers <er@xs4all.nl>
|
|
Michael Paquier and myself.
|
|
We can't rely on UINT16_MAX being present, which is why we introduced
PG_UINT16_MAX...
Buildfarm animal bowerbird via Andrew Gierth.
|
|
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
|