Database Page Layout
A description of the database file page format.
This section provides an overview of the page format used by
PostgreSQL tables and indexes.
Actually, index access methods need not use this page format.
All the existing index methods do use this basic format,
but the data kept on index metapages usually doesn't follow
the item layout rules.
TOAST tables and sequences are formatted just like a regular table.
In the following explanation, a byte is assumed to contain 8 bits. In
addition, the term item refers to an individual data value that is stored
on a page. In a table, an item is a row; in an index, an item is an index
entry.
Every table and index is stored as an array of pages of a
fixed size (usually 8K, although a different page size can be selected
when compiling the server). In a table, all the pages are logically
equivalent, so a particular item (row) can be stored in any page. In
indexes, the first page is generally reserved as a metapage
holding control information, and there may be different types of pages
within the index, depending on the index access method.
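Because every page has the same fixed size, page N of a file starts at byte offset N times the page size. The following is a minimal sketch of that arithmetic, not code from PostgreSQL itself. The helper name read_page and the 8192-byte default are assumptions made for illustration; the real page size is whatever was selected when the server was compiled.

```python
import io

PAGE_SIZE = 8192  # assumed default; the real value is fixed at compile time


def read_page(f, page_number, page_size=PAGE_SIZE):
    """Return the raw bytes of one page from a seekable binary file object.

    read_page is a hypothetical helper, not a PostgreSQL function.
    """
    f.seek(page_number * page_size)
    data = f.read(page_size)
    if len(data) != page_size:
        raise ValueError("page %d is missing or truncated" % page_number)
    return data


# Usage on an in-memory stand-in for a file that holds two pages.
fake_file = io.BytesIO(b"\x01" * PAGE_SIZE + b"\x02" * PAGE_SIZE)
second_page = read_page(fake_file, 1)
```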
The table "Overall Page Layout" below shows the overall layout of a page.
There are five parts to each page.
Overall Page Layout
Item: Description
PageHeaderData: 20 bytes long. Contains general information about the page, including free space pointers.
ItemIdData: Array of (offset,length) pairs pointing to the actual items. 4 bytes per item.
Free space: The unallocated space. New item identifiers are allocated from the start of this area, new items from the end.
Items: The actual items themselves.
Special space: Index access method specific data. Different methods store different data. Empty in ordinary tables.
The first 20 bytes of each page consist of a page header
(PageHeaderData). The first two fields track the most
recent WAL entry related to this page. They are followed by three 2-byte
integer fields (pd_lower, pd_upper, and pd_special). These contain byte
offsets from the page start to the start of unallocated space, to the end
of unallocated space, and to the start of the special space.
The last 2 bytes of the page header, pd_pagesize_version, store both the
page size and a version indicator. Beginning with PostgreSQL 8.0 the
version number is 2; PostgreSQL 7.3 and 7.4 used version number 1; prior
releases used version number 0. (The basic page layout and header format
have not changed in these versions, but the layout of heap row headers
has.) The page size is present mainly as a cross-check; there is no
support for having more than one page size in an installation.
All the details may be found in
src/include/storage/bufpage.h.
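As an illustration of the header description above, the sketch below decodes the 20-byte header from a raw page. It rests on several assumptions that are not stated in the text: the two WAL-related fields are treated as 12 opaque bytes (20 bytes minus the four 2-byte fields), the file is little-endian, and pd_pagesize_version keeps the page size in its high byte and the version in its low byte. The function name parse_page_header is hypothetical; bufpage.h remains the authoritative definition.

```python
import struct

# Assumed layout: 12 opaque bytes for the two WAL-related fields, then four
# unsigned 2-byte integers. "<" (little-endian) is an assumption; the real
# byte order is that of the machine that wrote the file.
HEADER_FORMAT = "<12sHHHH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 + 4 * 2 = 20 bytes


def parse_page_header(page):
    """Decode the page header fields described in the text (illustrative)."""
    wal, pd_lower, pd_upper, pd_special, size_version = struct.unpack_from(
        HEADER_FORMAT, page, 0
    )
    page_size = size_version & 0xFF00  # assumed: size lives in the high byte
    return {
        "wal_fields": wal,                  # left undecoded in this sketch
        "pd_lower": pd_lower,               # start of unallocated space
        "pd_upper": pd_upper,               # end of unallocated space
        "pd_special": pd_special,           # start of special space
        "page_size": page_size,
        "version": size_version & 0x00FF,   # assumed: version in the low byte
        "free_space": pd_upper - pd_lower,
        "has_special": pd_special != page_size,
    }


# Usage: a synthetic heap page with two item identifiers and no special space.
header = struct.pack(HEADER_FORMAT, bytes(12), 28, 8000, 8192, 8192 | 2)
page = header + bytes(8192 - HEADER_SIZE)
info = parse_page_header(page)
```

In this synthetic page pd_special equals the page size, which is the condition the text gives for a page without a special section.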
Following the page header are item identifiers
(ItemIdData), each requiring four bytes.
An item identifier contains a byte-offset to
the start of an item, its length in bytes, and a few attribute bits
which affect its interpretation.
New item identifiers are allocated
as needed from the beginning of the unallocated space.
The number of item identifiers present can be determined by looking at
pd_lower, which is increased to allocate a new identifier.
Because an item
identifier is never moved until it is freed, its index may be used on a
long-term basis to reference an item, even when the item itself is moved
around on the page to compact free space. In fact, every pointer to an
item (ItemPointer, also known as
CTID) created by
PostgreSQL consists of a page number and the
index of an item identifier.
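The relationship between pd_lower and the identifier array can be sketched as follows. The count follows directly from the sizes given in the text (a 20-byte header and 4 bytes per identifier). The bit packing inside each identifier is an assumption: a 15-bit offset, 2 flag bits, and a 15-bit length stored in a little-endian 32-bit word, which is how a C bit-field of that shape is commonly laid out but is compiler-dependent. The 1-based index follows the usual PostgreSQL numbering of item identifiers. All function names are hypothetical.

```python
import struct

HEADER_SIZE = 20   # page header size stated in the text
ITEM_ID_SIZE = 4   # bytes per item identifier


def item_id_count(pd_lower):
    """Number of item identifiers implied by pd_lower."""
    return (pd_lower - HEADER_SIZE) // ITEM_ID_SIZE


def decode_item_id(page, index):
    """Decode the identifier at a 1-based index (bit packing is assumed)."""
    position = HEADER_SIZE + (index - 1) * ITEM_ID_SIZE
    (raw,) = struct.unpack_from("<I", page, position)
    return {
        "offset": raw & 0x7FFF,          # byte offset of the item on the page
        "flags": (raw >> 15) & 0x3,      # attribute bits
        "length": (raw >> 17) & 0x7FFF,  # item length in bytes
    }


# Usage: one identifier pointing at a 92-byte item stored at offset 8100.
raw_id = 8100 | (1 << 15) | (92 << 17)
page = bytes(HEADER_SIZE) + struct.pack("<I", raw_id) + bytes(8192 - 24)
first = decode_item_id(page, 1)
ctid = (0, 1)  # (page number, identifier index), the shape of an ItemPointer
```

Because a CTID names the identifier rather than the item's byte offset, the item can move within the page without invalidating the pointer; only the offset stored in the identifier changes.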
The items themselves are stored in space allocated backwards from the end
of unallocated space. The exact structure varies depending on what the
table is to contain. Tables and sequences both use a structure named
HeapTupleHeaderData, described below.
The final section is the special section, which may contain anything the
access method wishes to store. For example, b-tree indexes store links to
the page's left and right siblings, as well as some other data relevant to
the index structure. Ordinary tables do not use a special section at all
(indicated by setting pd_special to equal the page size).
All table rows are structured in the same way. There is a fixed-size
header (occupying 27 bytes on most machines), followed by an optional null
bitmap, an optional object ID field, and the user data. The actual user
data (columns of the row) begins at the offset indicated by t_hoff, which
must always be a multiple of the MAXALIGN distance for the platform.
The null bitmap is
only present if the HEAP_HASNULL bit is set in
t_infomask. If it is present it begins just after
the fixed header and occupies enough bytes to have one bit per data column
(that is, t_natts bits altogether). In this list of bits, a 1 bit
indicates a non-null value and a 0 bit indicates a null. When the bitmap
is not present, all columns are assumed not-null.
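A small sketch of reading the null bitmap follows. It assumes that bits are numbered from the least significant bit of each byte, so that column 0 is bit 0 of the first byte; the function names and the 0-based column numbering are choices made for this example only.

```python
def bitmap_length(natts):
    """Bytes needed to hold one bit per column."""
    return (natts + 7) // 8


def is_null(bitmap, attnum):
    """True if the 0-based column attnum is null.

    Pass bitmap=None when HEAP_HASNULL is not set: no bitmap is stored and
    every column is treated as not-null.
    """
    if bitmap is None:
        return False
    # A set bit means "value present"; a clear bit means null.
    return not (bitmap[attnum >> 3] & (1 << (attnum & 7)))


# Usage: three columns, where only the middle one is null.
bitmap = bytes([0b00000101])
nulls = [is_null(bitmap, n) for n in range(3)]
```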
The object ID is only present if the HEAP_HASOID bit
is set in t_infomask. If present, it appears just
before the t_hoff boundary. Any padding needed to make
t_hoff a MAXALIGN multiple will appear between the null
bitmap and the object ID. (This in turn ensures that the object ID is
suitably aligned.)
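The arithmetic behind t_hoff can be illustrated with a short worked example. This is a sketch under stated assumptions rather than the server's own code: the fixed header is taken to be 27 bytes as quoted in the text, the object ID is assumed to be 4 bytes, and MAXALIGN is assumed to be 8 (it is 4 on some platforms). The function names are hypothetical.

```python
def max_align(length, alignment=8):
    """Round length up to the next multiple of alignment (a power of two)."""
    return (length + alignment - 1) & ~(alignment - 1)


def compute_t_hoff(natts, has_nulls, has_oid, fixed_header=27, alignment=8):
    """Offset at which user data starts, under the assumptions above."""
    length = fixed_header
    if has_nulls:
        length += (natts + 7) // 8   # one bit per column, rounded up to bytes
    if has_oid:
        length += 4                  # assumed size of the object ID
    return max_align(length, alignment)


# Usage: ten columns with nulls and an object ID.
t_hoff = compute_t_hoff(10, has_nulls=True, has_oid=True)
oid_start = t_hoff - 4           # the object ID sits just before t_hoff
bitmap_end = 27 + (10 + 7) // 8  # padding fills bitmap_end .. oid_start
```

In this example the bitmap ends at byte 29 and the object ID starts at byte 36, so the 7 bytes in between are padding, matching the placement described above.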
All the details may be found in
src/include/access/htup.h.
Interpreting the actual data can only be done with information obtained
from other tables, mostly pg_attribute. The
key values needed to identify field locations are
attlen and attalign.
There is no way to directly get a
particular attribute, except when there are only fixed width fields and no
NULLs. All this trickery is wrapped up in the functions
heap_getattr, fastgetattr
and heap_getsysattr.
To read the data you need to examine each attribute in turn. First check
whether the field is NULL according to the null bitmap. If it is, go to
the next attribute. Then make sure you have the right alignment. If the
field is a fixed width field, its bytes are simply stored in place. If
it's a variable length field (attlen = -1) then it's a bit more complicated.
All variable-length datatypes share the common header structure
varattrib, which includes the total length of the stored
value and some flag bits. Depending on the flags, the data may be either
inline or in another table (TOAST); it might be compressed, too.
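The attribute-by-attribute procedure above can be sketched as a loop. This is an illustration only and is no substitute for heap_getattr. It assumes little-endian data, the attalign codes c, s, i and d mapping to 1, 2, 4 and 8 bytes (the size for d is platform-dependent), and a 4-byte variable-length header whose low 30 bits hold the total length including the header and whose top 2 bits are flags. It returns raw bytes and does not follow TOAST pointers or decompress anything. The name walk_attributes is hypothetical.

```python
import struct

# Assumed mapping of pg_attribute.attalign codes to byte alignments.
ALIGN_BYTES = {"c": 1, "s": 2, "i": 4, "d": 8}


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def walk_attributes(data, attrs, null_bitmap=None):
    """Split user data (starting at t_hoff) into raw per-column byte strings.

    attrs is a list of (attlen, attalign) pairs taken from pg_attribute.
    Null columns yield None and occupy no space in the data area.
    """
    offset = 0
    values = []
    for attnum, (attlen, attalign) in enumerate(attrs):
        if null_bitmap is not None and not (
            null_bitmap[attnum >> 3] & (1 << (attnum & 7))
        ):
            values.append(None)
            continue
        offset = align_up(offset, ALIGN_BYTES[attalign])
        if attlen > 0:
            size = attlen                     # fixed-width field
        elif attlen == -1:
            (header,) = struct.unpack_from("<I", data, offset)
            size = header & 0x3FFFFFFF        # assumed: length in low 30 bits
        else:
            raise ValueError("unsupported attlen %d" % attlen)
        values.append(data[offset:offset + size])
        offset += size
    return values


# Usage: int4, a null int2, a variable-length value, then an int8.
attrs = [(4, "i"), (2, "s"), (-1, "i"), (8, "d")]
data = (struct.pack("<i", 7) + struct.pack("<I", 9) + b"hello"
        + bytes(3) + struct.pack("<q", 42))
values = walk_attributes(data, attrs, null_bitmap=bytes([0b00001101]))
```

The example also shows why a particular attribute generally cannot be located directly: the offset of the last column depends on the null in column 1, on the length stored in column 2, and on the 3 bytes of alignment padding that follow it.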