From 4a78cdeb6b598940e9d9adb92deca6494628802a Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Wed, 1 Aug 2007 22:45:09 +0000 Subject: Support an optional asynchronous commit mode, in which we don't flush WAL before reporting a transaction committed. Data consistency is still guaranteed (unlike setting fsync = off), but a crash may lose the effects of the last few transactions. Patch by Simon, some editorialization by Tom. --- doc/src/sgml/config.sgml | 41 ++++++++++- doc/src/sgml/wal.sgml | 187 ++++++++++++++++++++++++++++++++++++++++------- 2 files changed, 200 insertions(+), 28 deletions(-) (limited to 'doc/src') diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 0e49ba32178..37afd9845aa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1,4 +1,4 @@ - + Server Configuration @@ -1307,6 +1307,13 @@ SET ENABLE_SEQSCAN TO OFF; disabling fsync. + + In many situations, turning off + for noncritical transactions can provide much of the potential + performance benefit of turning off fsync, without + the attendant risks of data corruption. + + This parameter can only be set in the postgresql.conf file or on the server command line. @@ -1315,6 +1322,38 @@ SET ENABLE_SEQSCAN TO OFF; + + + synchronous_commit (boolean) + + synchronous_commit configuration parameter + + + + Specifies whether transaction commit will wait for WAL records + to be written to disk before the command returns a success + indication to the client. The default, and safe, setting is + on. When off, there can be a delay between + when success is reported to the client and when the transaction is + really guaranteed to be safe against a server crash. (The maximum + delay is three times .) Unlike + , setting this parameter to off + does not create any risk of database inconsistency: a crash might + result in some recent allegedly-committed transactions being lost, but + the database state will be just the same as if those transactions had + been aborted cleanly. So, turning synchronous_commit off + can be a useful alternative when performance is more important than + exact certainty about the durability of a transaction. For more + discussion see . + + + This parameter can be changed at any time; the behavior for any + one transaction is determined by the setting in effect when it + commits. It is therefore possible, and useful, to have some + transactions commit synchronously and others asynchronously. + + + wal_sync_method (string) diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index aaf1d0c71ef..d1fa700acee 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -1,10 +1,10 @@ - + Reliability and the Write-Ahead Log - This chapter explain how the Write-Ahead Log is used to obtain + This chapter explains how the Write-Ahead Log is used to obtain efficient, reliable operation. @@ -71,7 +71,7 @@ write caches. At the drive level, disable write-back caching if the drive cannot guarantee the data will be written before shutdown. - + Another risk of data loss is posed by the disk platter write operations themselves. Disk platters are divided into sectors, @@ -86,11 +86,11 @@ disk. By doing this, during crash recovery PostgreSQL can restore partially-written pages. If you have a battery-backed disk controller or file-system software that prevents partial page writes - (e.g., ReiserFS 4), you can turn off this page imaging by using the + (e.g., ReiserFS 4), you can turn off this page imaging by using the parameter. - + Write-Ahead Logging (<acronym>WAL</acronym>) @@ -105,12 +105,12 @@ Write-Ahead Logging (WAL) - is a standard approach to transaction logging. Its detailed + is a standard method for ensuring data integrity. A detailed description can be found in most (if not all) books about transaction processing. Briefly, WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, - that is, when log records describing the changes have been flushed + that is, after log records describing the changes have been flushed to permanent storage. If we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the @@ -120,17 +120,17 @@ - A major benefit of using WAL is a + Using WAL results in a significantly reduced number of disk writes, because only the log - file needs to be flushed to disk at the time of transaction - commit, rather than every data file changed by the transaction. - In multiuser environments, commits of many transactions - can be accomplished with a single fsync of - the log file. Furthermore, the log file is written sequentially, + file needs to be flushed to disk to guarantee that a transaction is + committed, rather than every data file changed by the transaction. + The log file is written sequentially, and so the cost of syncing the log is much less than the cost of - flushing the data pages. This is especially true for servers + flushing the data pages. This is especially true for servers handling many small transactions touching different parts of the data - store. + store. Furthermore, when the server is processing many small concurrent + transactions, one fsync of the log file may + suffice to commit many transactions. @@ -147,6 +147,139 @@ + + Asynchronous Commit + + + synchronous commit + + + + asynchronous commit + + + + Asynchronous commit is an option that allows transactions + to complete more quickly, at the cost that the most recent transactions may + be lost if the database should crash. In many applications this is an + acceptable tradeoff. + + + + As described in the previous section, transaction commit is normally + synchronous: the server waits for the transaction's + WAL records to be flushed to permanent storage + before returning a success indication to the client. The client is + therefore guaranteed that a transaction reported to be committed will + be preserved, even in the event of a server crash immediately after. + However, for short transactions this delay is a major component of the + total transaction time. Selecting asynchronous commit mode means that + the server returns success as soon as the transaction is logically + completed, before the WAL records it generated have + actually made their way to disk. This can provide a significant boost + in throughput for small transactions. + + + + Asynchronous commit introduces the risk of data loss. There is a short + time window between the report of transaction completion to the client + and the time that the transaction is truly committed (that is, it is + guaranteed not to be lost if the server crashes). Thus asynchronous + commit should not be used if the client will take external actions + relying on the assumption that the transaction will be remembered. + As an example, a bank would certainly not use asynchronous commit for + a transaction recording an ATM's dispensing of cash. But in many + scenarios, such as event logging, there is no need for a strong + guarantee of this kind. + + + + The risk that is taken by using asynchronous commit is of data loss, + not data corruption. If the database should crash, it will recover + by replaying WAL up to the last record that was + flushed. The database will therefore be restored to a self-consistent + state, but any transactions that were not yet flushed to disk will + not be reflected in that state. The net effect is therefore loss of + the last few transactions. Because the transactions are replayed in + commit order, no inconsistency can be introduced — for example, + if transaction B made changes relying on the effects of a previous + transaction A, it is not possible for A's effects to be lost while B's + effects are preserved. + + + + The user can select the commit mode of each transaction, so that + it is possible to have both synchronous and asynchronous commit + transactions running concurrently. This allows flexible tradeoffs + between performance and certainty of transaction durability. + The commit mode is controlled by the user-settable parameter + , which can be changed in any of + the ways that a configuration parameter can be set. The mode used for + any one transaction depends on the value of + synchronous_commit when transaction commit begins. + + + + Certain utility commands, for instance DROP TABLE, are + forced to commit synchronously regardless of the setting of + synchronous_commit. This is to ensure consistency + between the server's filesystem and the logical state of the database. + The commands supporting two-phase commit, such as PREPARE + TRANSACTION, are also always synchronous. + + + + If the database crashes during the risk window between an + asynchronous commit and the writing of the transaction's + WAL records, + then changes made during that transaction will be lost. + The duration of the + risk window is limited because a background process (the wal + writer) flushes unwritten WAL records to disk + every milliseconds. + The actual maximum duration of the risk window is three times + wal_writer_delay because the wal writer is + designed to favor writing whole pages at a time during busy periods. + + + + + An immediate-mode shutdown is equivalent to a server crash, and will + therefore cause loss of any unflushed asynchronous commits. + + + + + Asynchronous commit provides behavior different from setting + = off. + fsync is a server-wide + setting that will alter the behavior of all transactions. It disables + all logic within PostgreSQL that attempts to synchronize + writes to different portions of the database, and therefore a system + crash (that is, a hardware or operating system crash, not a failure of + PostgreSQL itself) could result in arbitrarily bad + corruption of the database state. In many scenarios, asynchronous + commit provides most of the performance improvement that could be + obtained by turning off fsync, but without the risk + of data corruption. + + + + also sounds very similar to + asynchronous commit, but it is actually a synchronous commit method + (in fact, commit_delay is ignored during an + asynchronous commit). commit_delay causes a delay + just before a synchronous commit attempts to flush + WAL to disk, in the hope that a single flush + executed by one such transaction can also serve other transactions + committing at about the same time. Setting commit_delay + can only help when there are many concurrently committing transactions, + and it is difficult to tune it to a value that actually helps rather + than hurting throughput. + + + + <acronym>WAL</acronym> Configuration @@ -188,13 +321,13 @@ checkpoint_timeout causes checkpoints to be done more often. This allows faster after-crash recovery (since less work will need to be redone). However, one must balance this against the - increased cost of flushing dirty data pages more often. If - is set (as is the default), there is - another factor to consider. To ensure data page consistency, - the first modification of a data page after each checkpoint results in + increased cost of flushing dirty data pages more often. If + is set (as is the default), there is + another factor to consider. To ensure data page consistency, + the first modification of a data page after each checkpoint results in logging the entire page content. In that case, a smaller checkpoint interval increases the volume of output to the WAL log, - partially negating the goal of using a smaller interval, + partially negating the goal of using a smaller interval, and in any case causing more disk I/O. @@ -206,8 +339,8 @@ don't happen too often. As a simple sanity check on your checkpointing parameters, you can set the parameter. If checkpoints happen closer together than - checkpoint_warning seconds, - a message will be output to the server log recommending increasing + checkpoint_warning seconds, + a message will be output to the server log recommending increasing checkpoint_segments. Occasional appearance of such a message is not cause for alarm, but if it appears often then the checkpoint control parameters should be increased. Bulk operations such @@ -280,9 +413,9 @@ modifying the configuration parameter . The default number of WAL buffers is 8. Increasing this value will - correspondingly increase shared memory usage. When - is set and the system is very busy, - setting this value higher will help smooth response times during the + correspondingly increase shared memory usage. When + is set and the system is very busy, + setting this value higher will help smooth response times during the period immediately following each checkpoint. @@ -307,7 +440,7 @@ The parameter determines how PostgreSQL will ask the kernel to force - WAL updates out to disk. + WAL updates out to disk. All the options should be the same as far as reliability goes, but it's quite platform-specific which one will be the fastest. Note that this parameter is irrelevant if fsync @@ -360,7 +493,7 @@ The aim of WAL, to ensure that the log is written before database records are altered, can be subverted by disk drivesdisk drive that falsely report a - successful write to the kernel, + successful write to the kernel, when in fact they have only cached the data and not yet stored it on the disk. A power failure in such a situation might still lead to irrecoverable data corruption. Administrators should try to ensure -- cgit v1.2.3