| Age | Commit message (Collapse) | Author |
|
The maxoff field is not used in the new, compressed page format. Let's
reset it when converting an old-format page to the new format. The code
won't care either way, but this makes it possible to use the field for
something else in the future.
|
|
Spotted by Alexander Korotkov
|
|
When vacuuming a data leaf page, any compressed posting lists that are not
modified, are copied back to the buffer from a later location in the same
buffer rather than from a palloc'd copy. IOW, they are just moved
downwards in the same buffer. Because the source and destination addresses
can overlap, we must use memmove rather than memcpy.
Report and fix by Alexander Korotkov.
|
|
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has". Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64). So let's
start using "z" instead. To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".
Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead. This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them. I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.
It's possible that this change will result in format-string warnings from
pre-C99 compilers. We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.
Andres Freund, with a little additional work by me
|
|
The Sparc machines in the buildfarm are crashing because of misaligned
access to posting lists stored in entry tuples.
I accidentally removed a critical SHORTALIGN() from ginFormTuple, as part
of the packed posting lists patch. Perhaps I thought it was unnecessary,
because the index_form_tuple() call above the SHORTALIGN already aligned
the size, missing the fact that the null-category byte makes it misaligned
again (I think the SHORTALIGN is indeed unnecessary if there's no null-
category byte, but let's just play it safe...)
|
|
Not all compilers understand that elog(ERROR, ...) never returns.
|
|
gcc 4.8 was happy with having a duplicate typedef, but most compilers seem not
to be, per buildfarm.
|
|
GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.
To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.
This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might be come handy in the future, if we want to drop support for the
uncompressed format.
Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
|
|
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
|
|
Per "clang -Wmissing-variable-declarations" output, posted by Andres Freund.
I didn't silence all those warnings, though, only the most obvious cases.
|
|
This is the same trick we use when taking a full page image of a buffer
passed to XLogInsert.
|
|
Insertion to a non-leaf GIN page didn't make a full-page image of the page,
which is wrong. The code used to do it correctly, but was changed (commit
853d1c3103fa961ae6219f0281885b345593d101) because the redo-routine didn't
track incomplete splits correctly when the page was restored from a full
page image. Of course, that was not right way to fix it, the redo routine
should've been fixed instead. The redo-routine was surreptitiously fixed
in 2010 (commit 4016bdef8aded77b4903c457050622a5a1815c16), so all we need
to do now is revert the code that creates the record to its original form.
This doesn't change the format of the WAL record.
Backpatch to all supported versions.
|
|
Replace it with an approach similar to what GiST uses: when a page is split,
the left sibling is marked with a flag indicating that the parent hasn't been
updated yet. When the parent is updated, the flag is cleared. If an insertion
steps on a page with the flag set, it will finish split before proceeding
with the insertion.
The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.
This also divides the responsibility of WAL-logging more clearly between
the generic ginbtree.c code, and the parts specific to entry and posting
trees. There is now a common WAL record format for insertions and deletions,
which is written by ginbtree.c, followed by tree-specific payload, which is
returned by the placetopage- and split- callbacks.
|
|
Separate the insertion payload from the more static portions of GinBtree.
GinBtree now only contains information related to searching the tree, and
the information of what to insert is passed separately.
Add root block number to GinBtree, instead of passing it around all the
functions as argument.
Split off ginFinishSplit() from ginInsertValue(). ginFinishSplit is
responsible for finding the parent and inserting the downlink to it.
|
|
Split off the portion of ginInsertValue that inserts the tuple to current
level into a separate function, ginPlaceToPage. ginInsertValue's charter
is now to recurse up the tree to insert the downlink, when a page split is
required.
This is in preparation for a patch to change the way incomplete splits are
handled, which will need to do these operations separately. And IMHO makes
the code more readable anyway.
|
|
This creates a new gin-btree callback function for creating a downlink for
a page. Previously, ginxlog.c duplicated the logic used during normal
operation.
|
|
Merge some functions that were always called together. Makes the code
little bit more readable.
|
|
The root page is filled with as many items as fit, and the rest are inserted
using normal insertions. However, I fumbled the variable names, and the code
actually memcpy'd all the items on the page, overflowing the buffer. While
at it, rename the variable to make the distinction more clear.
Reported by Teodor Sigaev. This bug was introduced by my recent
refactorings, so no backpatching required.
|
|
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.
To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.
Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.
Backpatch to all supported versions.
|
|
Broken by my refactoring.
|
|
Not sure how I missed these in previous commit.
|
|
Merge the isEnoughSpace and placeToPage functions in the b-tree interface
into one function that tries to put a tuple on page, and returns false if
it doesn't fit.
Move createPostingTree function to gindatapage.c, and change its contract
so that it can be passed more items than fit on the root page. It's in a
better position than the callers to know how many items fit.
Move ginMergeItemPointers out of gindatapage.c, into a separate file.
These changes make no difference now, but reduce the footprint of Alexander
Korotkov's upcoming patch to pack item pointers more tightly.
|
|
It makes for cleaner code to have separate Get/Add functions for PostingItems
and ItemPointers. A few callsites that have to deal with both types need to
be duplicated because of this, but all the callers have to know which one
they're dealing with anyway. Overall, this reduces the amount of casting
required.
Extracted from Alexander Korotkov's larger patch to change the data page
format.
|
|
ginCompareItemPointers function is called heavily in gin index scans -
inlining it speeds up some kind of queries a lot.
|
|
Every other core buffer page consumer initializes the bytes it furnishes
to PageAddItem(). For consistency, do the same here. No back-patch;
regardless, we couldn't count on the fix so long as binary upgrade can
carry forward affected index builds.
|
|
This is the first run of the Perl-based pgindent script. Also update
pgindent instructions.
|
|
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.
Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.
Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
|
|
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
|
|
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter. The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway. The clarity of simpler expressions
makes the change worthwhile.
Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).
Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
|
|
This is necessary (but not sufficient) to have them compilable outside
of a backend environment.
|
|
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
|
|
The heapam XLog functions are used by other modules, not all of which
are interested in the rest of the heapam API. With this, we let them
get just the XLog stuff in which they are interested and not pollute
them with unrelated includes.
Also, since heapam.h no longer requires xlog.h, many files that do
include heapam.h no longer get xlog.h automatically, including a few
headers. This is useful because heapam.h is getting pulled in by
execnodes.h, which is in turn included by a lot of files.
|
|
The Solaris Studio compiler warns about these instances, unlike more
mainstream compilers such as gcc. But manual inspection showed that
the code is clearly not reachable, and we hope no worthy compiler will
complain about removing this code.
|
|
Make it clearer that the passed stack mustn't be empty, and that we
are not supposed to fall off the end of the stack in the main loop.
Tighten the loop that extracts the root block number, too.
Markus Wanner and Tom Lane
|
|
When I implemented the ginbuildempty() function as part of
implementing unlogged tables, I falsified the note in the header
comment for log_newpage. Although we could fix that up by changing
the comment, it seems cleaner to add a new function which is
specifically intended to handle this case. So do that.
|
|
Josh Kupershmidt
|
|
XLOG_GIN_UPDATE_META_PAGE and XLOG_GIN_DELETE_LISTPAGE records were printed
with a list link field labeled as "blkno", which was confusing, especially
when the link was empty (InvalidBlockNumber). Print the metapage block
number instead, since that's what's actually being updated. We could
include the link values too as a separate field, but not clear it's worth
the trouble.
Back-patch to 8.4 where the dubious code was added.
|
|
In commit 4016bdef8aded77b4903c457050622a5a1815c16 I fixed a bunch of
ginxlog.c bugs having to do with not handling XLogReadBuffer failures
correctly. However, in ginRedoUpdateMetapage and ginRedoDeleteListPages,
I unaccountably thought that failure to read the metapage would be
impossible and just put in an elog(PANIC) call. This is of course wrong:
failure is exactly what will happen if the index got dropped (or rebuilt)
between creation of the WAL record and the crash we're trying to recover
from. I believe this explains Nicholas Wilson's recent report of these
errors getting reached.
Also, fix memory leak in forgetIncompleteSplit. This wasn't of much
concern when the code was written, but in a long-running standby server
page split records could be expected to accumulate indefinitely.
Back-patch to 8.4 --- before that, GIN didn't have a metapage.
|
|
|
|
|
|
A simple thinko in ginRedoUpdateMetapage, namely failing to increment a
loop counter, led to inserting records into the last pending-list page in
the wrong order (the opposite of that intended). So far as I can tell,
this would not upset the code that eventually flushes pending items into
the main part of the GIN index. But it did break the code that searched
the pending list for matches, resulting in transient failure to find
matching entries during index lookups, as illustrated in bug #6307 from
Maksym Boguk.
Back-patch to 8.4 where the incorrect code was introduced.
|
|
walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.
This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
|
|
|
|
This lets us stop including rel.h into execnodes.h, which is a widely
used header.
|
|
|
|
Experimentation with contrib/btree_gist shows that the majority of the GIST
support functions potentially need collation information. Safest policy
seems to be to pass it to all of them, instead of making assumptions about
which ones could possibly need it.
|
|
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
|
|
|
|
Honor index column's collation spec if there is one, don't go to the
expense of calling get_typcollation when we can reasonably assume that
all GIN storage types will use default collation, and be sure to set
a collation for the comparePartialFn too.
|
|
I found actual bugs in GiST and plpgsql; the rest of this is cosmetic
but meant to decrease the odds of future bugs of omission.
|