| 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
 | src/backend/storage/buffer/README
Notes About Shared Buffer Access Rules
======================================
There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer content locks.  (Actually,
there's a third level of access control: one must hold the appropriate kind
of lock on a relation before one can legally access any page belonging to
the relation.  Relation-level locks are not discussed here.)
Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it.  An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe.  Normally a pin is acquired via ReadBuffer and
released via ReleaseBuffer.  It is OK and indeed common for a single
backend to pin a page more than once concurrently; the buffer manager
handles this efficiently.  It is considered OK to hold a pin for long
intervals --- for example, sequential scans hold a pin on the current page
until done processing all the tuples on the page, which could be quite a
while if the scan is the outer scan of a join.  Similarly, a btree index
scan may hold a pin on the current index page.  This is OK because normal
operations never wait for a page's pin count to drop to zero.  (Anything
that might need to do such a wait is instead handled by waiting to obtain
the relation-level lock, which is why you'd better hold one first.)  Pins
may not be held across transaction boundaries, however.
Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock.  (These can alternatively be called READ
and WRITE locks.)  These locks are intended to be short-term: they should not
be held for long.  Buffer locks are acquired and released by LockBuffer().
It will *not* work for a single backend to try to acquire multiple locks on
the same buffer.  One must pin a buffer before trying to lock it.
Buffer access rules:
1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock.  To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.
2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the content lock, yet continue to access
the tuple's data for as long as one holds the buffer pin.  This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer.  Therefore the
tuple cannot go away while the pin is held (see rule #5).  Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.
3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive content lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple while they are doing visibility checks.
4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer.  This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later.  These four bits are only hints
(they cache the results of transaction status lookups in pg_xact), so no
great harm is done if they get reset to zero by conflicting updates.
Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID
and HEAP_XMIN_COMMITTED; this is a critical update and accordingly requires
an exclusive buffer lock (and it must also be WAL-logged).
5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin).  If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again.  Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive content lock.
Obtaining the lock needed under rule #5 is done by the bufmgr routines
LockBufferForCleanup() or ConditionalLockBufferForCleanup().  They first get
an exclusive lock and then check to see if the shared pin count is currently
1.  If not, ConditionalLockBufferForCleanup() releases the exclusive lock and
then returns false, while LockBufferForCleanup() releases the exclusive lock
(but not the caller's pin) and waits until signaled by another backend,
whereupon it tries again.  The signal will occur when UnpinBuffer decrements
the shared pin count to 1.  As indicated above, this operation might have to
wait a good while before it acquires the lock, but that shouldn't matter much
for concurrent VACUUM.  The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer.  This is enough for
VACUUM's use, since we don't allow multiple VACUUMs concurrently on a single
relation anyway.  Anyone wishing to obtain a cleanup lock outside of recovery
or a VACUUM must use the conditional variant of the function.
Buffer Manager's Internal Locking
---------------------------------
Before PostgreSQL 8.1, all operations of the shared buffer manager itself
were protected by a single system-wide lock, the BufMgrLock, which
unsurprisingly proved to be a source of contention.  The new locking scheme
avoids grabbing system-wide exclusive locks in common code paths.  It works
like this:
* There is a system-wide LWLock, the BufMappingLock, that notionally
protects the mapping from buffer tags (page identifiers) to buffers.
(Physically, it can be thought of as protecting the hash table maintained
by buf_table.c.)  To look up whether a buffer exists for a tag, it is
sufficient to obtain share lock on the BufMappingLock.  Note that one
must pin the found buffer, if any, before releasing the BufMappingLock.
To alter the page assignment of any buffer, one must hold exclusive lock
on the BufMappingLock.  This lock must be held across adjusting the buffer's
header fields and changing the buf_table hash table.  The only common
operation that needs exclusive lock is reading in a page that was not
in shared buffers already, which will require at least a kernel call
and usually a wait for I/O, so it will be slow anyway.
* As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
separate locks, each guarding a portion of the buffer tag space.  This allows
further reduction of contention in the normal code paths.  The partition
that a particular buffer tag belongs to is determined from the low-order
bits of the tag's hash value.  The rules stated above apply to each partition
independently.  If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
exclusion for operations that select buffers for replacement.  A spinlock is
used here rather than a lightweight lock for efficiency; no other locks of any
sort should be acquired while buffer_strategy_lock is held.  This is essential
to allow buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header.  This allows operations such as
ReleaseBuffer to make local state changes without taking any system-wide
lock.  We use a spinlock, not an LWLock, since there are no cases where
the lock needs to be held for more than a few instructions.
Note that a buffer header's spinlock does not control access to the data
held within the buffer.  Each buffer header also contains an LWLock, the
"buffer content lock", that *does* represent the right to access the data
in the buffer.  It is used per the rules above.
* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock).  The process starting a read or write sets the flag. When
the I/O is completed, be it by the process that initiated the I/O or by
another process, the flag is removed and the Buffer's condition variable is
signalled.  Processes that need to wait for the I/O to complete can wait for
asynchronous I/O by using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be
unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
To choose a victim buffer to recycle we use a simple clock-sweep algorithm.  It
works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned.  (This requires only the
buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
through all the available buffers.  nextVictimBuffer is protected by the
buffer_strategy_lock.
The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
nextVictimBuffer for next time. Release buffer_strategy_lock.
3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used.  Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
have to give up and try another buffer.  This however is not a concern
of the basic select-a-victim-buffer algorithm.)
Buffer Ring Replacement Strategy
---------------------------------
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
using the normal clock-sweep algorithm and those buffers are reused for the
whole scan.  This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient.  Even less would often be enough, but the ring must be big enough
to accommodate all pages in the scan that are pinned concurrently.  256KB
should also be enough to leave a small cache trail for other backends to
join in a synchronized seq scan.  If a ring buffer is dirtied and its LSN
updated, we would normally have to write and flush WAL before we could
re-use the buffer; in this case we instead discard the buffer from the ring
and (later) choose a replacement using the normal clock-sweep algorithm.
Hence this strategy works best for scans that are read-only (or at worst
update hint bits).  In a scan that modifies every page in the scan, like a
bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC.  Dirty pages are not removed
from the ring.  Instead, the WAL is flushed if needed to allow reuse of the
buffers.  Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
Bulk writes work similarly to VACUUM.  Currently this applies only to
COPY IN and CREATE TABLE AS SELECT.  (Might it be interesting to make
seqscan UPDATE and DELETE use the bulkwrite strategy?)  For bulk writes
we use a ring size of 16MB (but not more than 1/8th of shared_buffers).
Smaller sizes have been shown to result in the COPY blocking too often
for WAL flushes.  While it's okay for a background vacuum to be slowed by
doing its own WAL flushing, we'd prefer that COPY not be subject to that,
so we let it use up a bit more of the buffer arena.
Background Writer's Processing
------------------------------
The background writer is designed to write out pages that are likely to be
recycled soon, thereby offloading the writing work from active backends.
To do this, it scans forward circularly from the current position of
nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count.  It pins,
writes, and releases any such buffer.
If we can assume that reading nextVictimBuffer is an atomic action, then
the writer doesn't even need to take buffer_strategy_lock in order to look
for buffers to write; it needs only to spinlock each buffer header for long
enough to check the dirtybit.  Even without that assumption, the writer
only needs to take the lock long enough to read the variable value, not
while scanning the buffers.  (This is a very substantial improvement in
the contention cost of the writer compared to PG 8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
This ensures that the page image transferred to disk is reasonably consistent.
We might miss a hint-bit update or two but that isn't a problem, for the same
reasons mentioned under buffer access rules.
As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
 |