summaryrefslogtreecommitdiff
path: root/src/backend/access
AgeCommit message (Collapse)Author
2012-10-02Split off functions related to timeline history files and XLOG archiving.Heikki Linnakangas
This is just refactoring, to make the functions accessible outside xlog.c. A followup patch will make use of that, to allow fetching timeline history files over streaming replication.
2012-09-27Fix btmarkpos/btrestrpos to handle array keys.Tom Lane
This fixes another error in commit 9e8da0f75731aaa7605cf4656c21ea09e84d2eb1. I neglected to make the mark/restore functionality save and restore the current set of array key values, which led to strange behavior if an IndexScan with ScalarArrayOpExpr quals was used as the inner side of a mergejoin. Per bug #7570 from Melese Tesfaye.
2012-09-19Put back AcceptInvalidationMessages calls in heap_openrv(_extended).Tom Lane
These calls were removed in commit 4240e429d0c2d889d0cda23c618f94e12c13ade7 as part of a general refactoring and improvement of DDL locking. However, there's a problem not solved by the rewrite, which is that GRANT/REVOKE update pg_class.relacl without taking any particular lock on the target table as such. If another backend fails to do AcceptInvalidationMessages, it won't notice a recently-committed change in ACLs. Bug #7557 from Piotr Czachur demonstrates that there's at least one code path in 9.2.0 in which a command fails to do any AcceptInvalidationMessages calls at all, if the current transaction already holds all the locks it will need. Since we're hard up against the release deadline for 9.2.1, fix this by putting back the AcceptInvalidationMessages calls in heap_openrv and heap_openrv_extended, thereby restoring the historical behavior in this area. We ought to look for a more elegant and perhaps more bulletproof solution, but there's no time for that right now.
2012-09-14Properly set relpersistence for fake relcache entries.Robert Haas
This can result in buffers failing to be properly flushed at checkpoint time, leading to data loss. Report, diagnosis, and patch by Jeff Davis.
2012-09-05Fix WAL file replacement during cascading replication on Windows.Heikki Linnakangas
When the startup process restores a WAL file from the archive, it deletes any old file with the same name and renames the new file in its place. On Windows, however, when a file is deleted, it still lingers as long as a process holds a file handle open on it. With cascading replication, a walsender process can hold the old file open, so the rename() in the startup process would fail. To fix that, rename the old file to a temporary name, to make the original file name available for reuse, before deleting the old file.
2012-09-05Fix inappropriate error messages for Hot Standby misconfiguration errors.Tom Lane
Give the correct name of the GUC parameter being complained of. Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE, not the default INTERNAL_ERROR). Gurjeet Singh, errcode adjustment by me
2012-09-05Trim spgist_private.h inclusionAlvaro Herrera
It doesn't really need rel.h; relcache.h is enough.
2012-09-04Fix compiler warnings about unused variables, caused by my previous commit.Heikki Linnakangas
Reported by Peter Eisentraut.
2012-09-04Fix bugs in cascading replication with recovery_target_timeline='latest'Heikki Linnakangas
The cascading replication code assumed that the current RecoveryTargetTLI never changes, but that's not true with recovery_target_timeline='latest'. The obvious upshot of that is that RecoveryTargetTLI in shared memory needs to be protected by a lock. A less obvious consequence is that when a cascading standby is connected, and the standby switches to a new target timeline after scanning the archive, it will continue to stream WAL to the cascading standby, but from a wrong file, ie. the file of the previous timeline. For example, if the standby is currently streaming from the middle of file 000000010000000000000005, and the timeline changes, the standby will continue to stream from that file. However, the WAL on the new timeline is in file 000000020000000000000005, so the standby sends garbage from 000000010000000000000005 to the cascading standby, instead of the correct WAL from file 000000020000000000000005. This also fixes a related bug where a partial WAL segment is restored from the archive and streamed to a cascading standby. The code assumed that when a WAL segment is copied from the archive, it can immediately be fully streamed to a cascading standby. However, if the segment is only partially filled, ie. has the right size, but only N first bytes contain valid WAL, that's not safe. That can happen if a partial WAL segment is manually copied to the archive, or if a partial WAL segment is archived because a server is started up on a new timeline within that segment. The cascading standby will get confused if the WAL it received is not valid, and will get stuck until it's restarted. This patch fixes that problem by not allowing WAL restored from the archive to be streamed to a cascading standby until it's been replayed, and thus validated.
2012-09-03Replace memcpy() calls in xlog.c critical sections with struct assignments.Tom Lane
This gets rid of a dangerous-looking use of the not-volatile XLogCtl pointer in a couple of spinlock-protected sections, where the normal coding rule is that you should only access shared memory through a pointer-to-volatile. I think the risk is only hypothetical not actual, since for there to be a bug the compiler would have to move the spinlock acquire or release across the memcpy() call, which one sincerely hopes it will not. Still, it looks cleaner this way. Per comment from Daniel Farina and subsequent discussion.
2012-08-30Improve coding of gistchoose and gistRelocateBuildBuffersOnSplit.Tom Lane
This is mostly cosmetic, but it does eliminate a speculative portability issue. The previous coding ignored the fact that sum_grow could easily overflow (in fact, it could be summing multiple IEEE float infinities). On a platform where that didn't guarantee to produce a positive result, the code would misbehave. In any case, it was less than readable.
2012-08-30Split tuple struct defs from htup.h to htup_details.hAlvaro Herrera
This reduces unnecessary exposure of other headers through htup.h, which is very widely included by many files. I have chosen to move the function prototypes to the new file as well, because that means htup.h no longer needs to include tupdesc.h. In itself this doesn't have much effect in indirect inclusion of tupdesc.h throughout the tree, because it's also required by execnodes.h; but it's something to explore in the future, and it seemed best to do the htup.h change now while I'm busy with it.
2012-08-30Fix logic bug in gistchoose and gistRelocateBuildBuffersOnSplit.Robert Haas
Every time the best-tuple-found-so-far changes, we need to reset all the penalty values in which_grow[] to the penalties for the new best tuple. The old code failed to do this, resulting in inferior index quality. The original patch from Alexander Korotkov was just two lines; I took the liberty of fleshing that out by adding a bunch of comments that I hope will make this logic easier for others to understand than it was for me.
2012-08-29Optimize SP-GiST insertions.Heikki Linnakangas
This includes two micro-optimizations to the tight inner loop in descending the SP-GiST tree: 1. avoid an extra function call to index_getprocinfo when calling user-defined choose function, and 2. avoid a useless palloc+pfree when node labels are not used.
2012-08-28Split heapam_xlog.h from heapam.hAlvaro Herrera
The heapam XLog functions are used by other modules, not all of which are interested in the rest of the heapam API. With this, we let them get just the XLog stuff in which they are interested and not pollute them with unrelated includes. Also, since heapam.h no longer requires xlog.h, many files that do include heapam.h no longer get xlog.h automatically, including a few headers. This is useful because heapam.h is getting pulled in by execnodes.h, which is in turn included by a lot of files.
2012-08-28Split resowner.hAlvaro Herrera
This lets files that are mere users of ResourceOwner not automatically include the headers for stuff that is managed by the resowner mechanism.
2012-08-21Avoid somewhat-theoretical overflow risks in RecordIsValid().Tom Lane
This improves on commit 51fed14d73ed3acd2282b531fb1396877e44e86a by eliminating the assumption that we can form <some pointer value> + <some offset> without overflow. The entire point of those tests is that we don't trust the offset value, so coding them in a way that could wrap around if the buffer happens to be near the top of memory doesn't seem sound. Instead, track the remaining space as a size_t variable and compare offsets against that. Also, improve comment about why we need the extra early check on xl_tot_len.
2012-08-20Don't get confused if a WAL partial record header has xl_tot_len == 0.Heikki Linnakangas
If a WAL record header was split across pages, but xl_tot_len was 0, we would get confused and conclude that we had already read the whole record, and proceed to CRC check it. That can lead to a crash in RecordIsValid(), which isn't careful to not read beyond end-of-record, as defined by xl_tot_len. Add an explicit sanity check for xl_tot_len <= SizeOfXlogRecord. Also, make RecordIsValid() more robust by checking in each step that it doesn't try to access memory beyond end of record, even if a length field in the record's or a backup block's header is bogus. Per report and analysis by Tom Lane.
2012-08-16Delete inaccurate C comment about FSM and adding pages, per Robert Haas.Bruce Momjian
2012-08-16Fix GiST buffering build bug, which caused "failed to re-find parent" errors.Heikki Linnakangas
We use a hash table to track the parents of inner pages, but when inserting to a leaf page, the caller of gistbufferinginserttuples() must pass a correct block number of the leaf's parent page. Before gistProcessItup() descends to a child page, it checks if the downlink needs to be adjusted to accommodate the new tuple, and updates the downlink if necessary. However, updating the downlink might require splitting the page, which might move the downlink to a page to the right. gistProcessItup() doesn't realize that, so when it descends to the leaf page, it might pass an out-of-date parent block number as a result. Fix that by returning the block a tuple was inserted to from gistbufferinginserttuples(). This fixes the bug reported by Zdeněk Jílovec.
2012-08-15Update C comment to NOTICE to reflect previous commit changing the errorBruce Momjian
level, per report from Tom.
2012-08-08Fix minor bug in XLogFileRead() that accidentally worked.Simon Riggs
Cascading replication copied the incoming file into pg_xlog but didn't set path correctly, so the first attempt to open file failed causing it to loop around and look for file in pg_xlog. So the earlier coding worked, but accidentally rather than by design. Spotted by Fujii Masao, fix by Fujii Masao and Simon Riggs
2012-08-08Fix TwoPhaseGetDummyBackendId().Tom Lane
This was broken in commit ed0b409d22346b1b027a4c2099ca66984d94b6dd, which revised the GlobalTransactionData struct to not include the associated PGPROC as its first member, but overlooked one place where a cast was used in reliance on that equivalence. The most effective way of fixing this seems to be to create a new function that looks up the GlobalTransactionData struct given the XID, and make both TwoPhaseGetDummyBackendId and TwoPhaseGetDummyProc rely on that. Per report from Robert Ross.
2012-08-07fsync backup_label after pg_start_backup()Simon Riggs
Dave Kerr
2012-08-03Improve underdocumented btree_xlog_delete_get_latestRemovedXid() code.Tom Lane
As noted by Noah Misch, btree_xlog_delete_get_latestRemovedXid is critically dependent on the assumption that it's examining a consistent state of the database. This was undocumented though, so the seemingly-unrelated check for no active HS sessions might be thought to be merely an optional optimization. Improve comments, and add an explicit check of reachedConsistency just to be sure. This function returns InvalidTransactionId (thereby killing all HS transactions) in several cases that are not nearly unlikely enough for my taste. This commit doesn't attempt to fix those deficiencies, just document them. Back-patch to 9.2, not from any real functional need but just to keep the branches more closely synced to simplify possible future back-patching.
2012-08-03In SPGiST replay, do conflict resolution before modifying the page.Tom Lane
In yesterday's commit 962e0cc71e839c58fb9125fa85511b8bbb8bdbee, I added the ResolveRecoveryConflictWithSnapshot call in the wrong place. I correctly put it before spgRedoVacuumRedirect itself would modify the index page --- but not before RestoreBkpBlocks, so replay of a record with a full-page image would modify the page before kicking off any conflicting HS transactions. Oops.
2012-08-02Fix race conditions associated with SPGiST redirection tuples.Tom Lane
The correct test for whether a redirection tuple is removable is whether tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding failed to protect index searches being done in concurrent transactions that have no XID. This mirrors the recent fix in btree's page recycling logic made in commit d3abbbebe52eb1e59e621c880ad57df9d40d13f2. Also, WAL-log the newest XID of any removed redirection tuple on an index page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL replay. This protects against concurrent Hot Standby transactions possibly needing to see the redirection tuple(s). Per my query of 2012-03-12 and subsequent discussion.
2012-07-18Fix management of pendingOpsTable in auxiliary processes.Tom Lane
mdinit() was misusing IsBootstrapProcessingMode() to decide whether to create an fsync pending-operations table in the current process. This led to creating a table not only in the startup and checkpointer processes as intended, but also in the bgwriter process, not to mention other auxiliary processes such as walwriter and walreceiver. Creation of the table in the bgwriter is fatal, because it absorbs fsync requests that should have gone to the checkpointer; instead they just sit in bgwriter local memory and are never acted on. So writes performed by the bgwriter were not being fsync'd which could result in data loss after an OS crash. I think there is no live bug with respect to walwriter and walreceiver because those never perform any writes of shared buffers; but the potential is there for future breakage in those processes too. To fix, make AuxiliaryProcessMain() export the current process's AuxProcType as a global variable, and then make mdinit() test directly for the types of aux process that should have a pendingOpsTable. Having done that, we might as well also get rid of the random bool flags such as am_walreceiver that some of the aux processes had grown. (Note that we could not have fixed the bug by examining those variables in mdinit(), because it's called from BaseInit() which is run by AuxiliaryProcessMain() before entering any of the process-type-specific code.) Back-patch to 9.2, where the problem was introduced by the split-up of bgwriter and checkpointer processes. The bogus pendingOpsTable exists in walwriter and walreceiver processes in earlier branches, but absent any evidence that it causes actual problems there, I'll leave the older branches alone.
2012-07-16Remove unreachable codePeter Eisentraut
The Solaris Studio compiler warns about these instances, unlike more mainstream compilers such as gcc. But manual inspection showed that the code is clearly not reachable, and we hope no worthy compiler will complain about removing this code.
2012-07-13Cosmetic cleanup of ginInsertValue().Tom Lane
Make it clearer that the passed stack mustn't be empty, and that we are not supposed to fall off the end of the stack in the main loop. Tighten the loop that extracts the root block number, too. Markus Wanner and Tom Lane
2012-07-02Fix a stupid bug I introduced into XLogFlush().Robert Haas
Commit f11e8be3e812cdbbc139c1b4e49141378b118dee broke this; it was right in Peter's original patch, but I messed it up before committing.
2012-07-02Fix position of WalSndWakeupRequest call.Robert Haas
This avoids discriminating against wal_sync_method = open_sync or open_datasync. Fujii Masao, reviewed by Andres Freund
2012-07-02Assorted message style improvementsPeter Eisentraut
2012-07-02Work a little harder on comments for walsender wakeup patch.Robert Haas
Per gripe from Tom Lane.
2012-07-02Make commit_delay much smarter.Robert Haas
Instead of letting every backend participating in a group commit wait independently, have the first one that becomes ready to flush WAL wait for the configured delay, and let all the others wait just long enough for that first process to complete its flush. This greatly increases the chances of being able to configure a commit_delay setting that actually improves performance. As a side consequence of this change, commit_delay now affects all WAL flushes, rather than just commits. There was some discussion on pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay, but in the absence of consensus I am leaving it alone for now. Peter Geoghegan, with some changes, mostly to the documentation, by me.
2012-07-02Make walsender more responsive.Robert Haas
Per testing by Andres Freund, this improves replication performance and reduces replication latency and latency jitter. I was a bit concerned about moving more work into XLogInsert, but testing seems to show that it's not a problem in practice. Along the way, improve comments for WaitLatchOrSocket. Andres Freund. Review and stylistic cleanup by me.
2012-06-30Validate xlog record header before enlarging the work area to store it.Heikki Linnakangas
If the record header is garbled, we're now quite likely to notice it before we try to make a bogus memory allocation and run out of memory. That can still happen, if the xlog record is split across pages (we cannot verify the record header until reading the next page in that scenario), but this reduces the chances. An out-of-memory is treated as a corrupt record anyway, so this isn't a correctness issue, just a case of giving a better error message. Per Amit Kapila's suggestion.
2012-06-29Initialize shared memory copy of ckptXidEpoch correctly when not in recovery.Heikki Linnakangas
This bug was introduced by commit 20d98ab6e4110087d1816cd105a40fcc8ce0a307, so backpatch this to 9.0-9.2 like that one. This fixes bug #6710, reported by Tarvi Pillessaar
2012-06-28Update outdated commit; xlp_rem_len field is in page header now.Heikki Linnakangas
Spotted by Amit Kapila
2012-06-27Fix two more neglected comments, still referring to log/seg.Heikki Linnakangas
Fujii Masao
2012-06-27I neglected many comments in the log+seg -> 64-bit segno patch. Fix.Heikki Linnakangas
Reported by Amit Kapila.
2012-06-26Cope with smaller-than-normal BLCKSZ setting in SPGiST indexes on text.Tom Lane
The original coding failed miserably for BLCKSZ of 4K or less, as reported by Josh Kupershmidt. With the present design for text indexes, a given inner tuple could have up to 256 labels (requiring either 3K or 4K bytes depending on MAXALIGN), which means that we can't positively guarantee no failures for smaller blocksizes. But we can at least make it behave sanely so long as there are few enough labels to fit on a page. Considering that btree is also more prone to "index tuple too large" failures when BLCKSZ is small, it's not clear that we should expend more work than this on this case.
2012-06-26Reduce use of heavyweight locking inside hash AM.Robert Haas
Avoid using LockPage(rel, 0, lockmode) to protect against changes to the bucket mapping. Instead, an exclusive buffer content lock is now viewed as sufficient permission to modify the metapage, and a shared buffer content lock is used when such modifications need to be prevented. This more relaxed locking regimen makes it possible that, when we're busy getting a heavyweight bucket on the bucket we intend to search or insert into, a bucket split might occur underneath us. To compenate for that possibility, we use a loop-and-retry system: release the metapage content lock, acquire the heavyweight lock on the target bucket, and then reacquire the metapage content lock and check that the bucket mapping has not changed. Normally it hasn't, and we're done. But if by chance it has, we simply unlock the metapage, release the heavyweight lock we acquired previously, lock the new bucket, and loop around again. Even in the worst case we cannot loop very many times here, since we don't split the same bucket again until we've split all the other buckets, and 2^N gets big pretty fast. This results in greatly improved concurrency, because we're effectively replacing two lwlock acquire-and-release cycles in exclusive mode (on one of the lock manager locks) with a single acquire-and-release cycle in shared mode (on the metapage buffer content lock). Testing shows that it's still not quite as good as btree; for that, we'd probably have to find some way of getting rid of the heavyweight bucket locks as well, which does not appear straightforward. Patch by me, review by Jeff Janes.
2012-06-25Tighten up includes in sinvaladt.h, twophase.h, proc.hAlvaro Herrera
Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in proc.h with xlogdefs.h.
2012-06-25Replace int2/int4 in C code with int16/int32Peter Eisentraut
The latter was already the dominant use, and it's preferable because in C the convention is that intXX means XX bits. Therefore, allowing mixed use of int2, int4, int8, int16, int32 is obviously confusing. Remove the typedefs for int2 and int4 for now. They don't seem to be widely used outside of the PostgreSQL source tree, and the few uses can probably be cleaned up by the time this ships.
2012-06-24Oops. Remove stray paren.Heikki Linnakangas
I didn't notice this on my laptop as I don't HAVE_FSYNC_WRITETHROUGH.
2012-06-24Replace XLogRecPtr struct with a 64-bit integer.Heikki Linnakangas
This simplifies code that needs to do arithmetic on XLogRecPtrs. To avoid changing on-disk format of data pages, the LSN on data pages is still stored in the old format. That should keep pg_upgrade happy. However, we have XLogRecPtrs embedded in the control file, and in the structs that are sent over the replication protocol, so this changes breaks compatibility of pg_basebackup and server. I didn't do anything about this in this patch, per discussion on -hackers, the right thing to do would to be to change the replication protocol to be architecture-independent, so that you could use a newer version of pg_receivexlog, for example, against an older server version.
2012-06-24Allow WAL record header to be split across pages.Heikki Linnakangas
This saves a few bytes of WAL space, but the real motivation is to make it predictable how much WAL space a record requires, as it no longer depends on whether we need to waste the last few bytes at end of WAL page because the header doesn't fit. The total length field of WAL record, xl_tot_len, is moved to the beginning of the WAL record header, so that it is still always found on the first page where a WAL record begins. Bump WAL version number again as this is an incompatible change.
2012-06-24Move WAL continuation record information to WAL page header.Heikki Linnakangas
The continuation record only contained one field, xl_rem_len, so it makes things simpler to just include it in the WAL page header. This wastes four bytes on pages that don't begin with a continuation from previos page, plus four bytes on every page, because of padding. The motivation of this is to make it easier to calculate how much space a WAL record needs. Before this patch, it depended on how many page boundaries the record crosses. The motivation of that, in turn, is to separate the allocation of space in the WAL from the copying of the record data to the allocated space. Keeping the calculation of space required simple helps to keep the critical section of allocating the space from WAL short. But that's not included in this patch yet. Bump WAL version number again, as this is an incompatible change.
2012-06-24Don't waste the last segment of each 4GB logical log file.Heikki Linnakangas
The comments claimed that wasting the last segment made it easier to do calculations with XLogRecPtrs, because you don't have problems representing last-byte-position-plus-1 that way. In my experience, however, it only made things more complicated, because the there was two ways to represent the boundary at the beginning of a logical log file: logid = n+1 and xrecoff = 0, or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were picky about which representation was used. Also, use a 64-bit segment number instead of the log/seg combination, to point to a certain WAL segment. We assume that all platforms have a working 64-bit integer type nowadays. This is an incompatible change in WAL format, so bumping WAL version number.