diff options
Diffstat (limited to 'Documentation/technical/bitmap-format.adoc')
-rw-r--r-- | Documentation/technical/bitmap-format.adoc | 398 |
1 files changed, 398 insertions, 0 deletions
diff --git a/Documentation/technical/bitmap-format.adoc b/Documentation/technical/bitmap-format.adoc new file mode 100644 index 0000000000..bfb0ec7beb --- /dev/null +++ b/Documentation/technical/bitmap-format.adoc @@ -0,0 +1,398 @@ +GIT bitmap v1 format +==================== + +== Pack and multi-pack bitmaps + +Bitmaps store reachability information about the set of objects in a packfile, +or a multi-pack index (MIDX). The former is defined obviously, and the latter is +defined as the union of objects in packs contained in the MIDX. + +A bitmap may belong to either one pack, or the repository's multi-pack index (if +it exists). A repository may have at most one bitmap. + +An object is uniquely described by its bit position within a bitmap: + + - If the bitmap belongs to a packfile, the __n__th bit corresponds to + the __n__th object in pack order. For a function `offset` which maps + objects to their byte offset within a pack, pack order is defined as + follows: + + o1 <= o2 <==> offset(o1) <= offset(o2) + + - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the + __n__th object in MIDX order. With an additional function `pack` which + maps objects to the pack they were selected from by the MIDX, MIDX order + is defined as follows: + + o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) ++ +The ordering between packs is done according to the MIDX's .rev file. +Notably, the preferred pack sorts ahead of all other packs. + +The on-disk representation (described below) of a bitmap is the same regardless +of whether or not that bitmap belongs to a packfile or a MIDX. The only +difference is the interpretation of the bits, which is described above. + +Certain bitmap extensions are supported (see: Appendix B). No extensions are +required for bitmaps corresponding to packfiles. For bitmaps that correspond to +MIDXs, both the bit-cache and rev-cache extensions are required. + +== On-disk format + + * A header appears at the beginning: + + 4-byte signature: :: {'B', 'I', 'T', 'M'} + + 2-byte version number (network byte order): :: + + The current implementation only supports version 1 + of the bitmap index (the same one as JGit). + + 2-byte flags (network byte order): :: + + The following flags are supported: + + ** {empty} + BITMAP_OPT_FULL_DAG (0x1) REQUIRED: ::: + + This flag must always be present. It implies that the + bitmap index has been generated for a packfile or + multi-pack index (MIDX) with full closure (i.e. where + every single object in the packfile/MIDX can find its + parent links inside the same packfile/MIDX). This is a + requirement for the bitmap index format, also present in + JGit, that greatly reduces the complexity of the + implementation. + + ** {empty} + BITMAP_OPT_HASH_CACHE (0x4): ::: + + If present, the end of the bitmap file contains + `N` 32-bit name-hash values, one per object in the + pack/MIDX. The format and meaning of the name-hash is + described below. + + ** {empty} + BITMAP_OPT_LOOKUP_TABLE (0x10): ::: + If present, the end of the bitmap file contains a table + containing a list of `N` <commit_pos, offset, xor_row> + triplets. The format and meaning of the table is described + below. ++ +NOTE: Unlike the xor_offset used to compress an individual bitmap, +`xor_row` stores an *absolute* index into the lookup table, not a location +relative to the current entry. + + 4-byte entry count (network byte order): :: + The total count of entries (bitmapped commits) in this bitmap index. + + 20-byte checksum: :: + The SHA1 checksum of the pack/MIDX this bitmap index + belongs to. + + * 4 EWAH bitmaps that act as type indexes ++ +Type indexes are serialized after the hash cache in the shape +of four EWAH bitmaps stored consecutively (see Appendix A for +the serialization format of an EWAH bitmap). ++ +There is a bitmap for each Git object type, stored in the following +order: ++ + - Commits + - Trees + - Blobs + - Tags + ++ +In each bitmap, the `n`th bit is set to true if the `n`th object +in the packfile or multi-pack index is of that type. ++ +The obvious consequence is that the OR of all 4 bitmaps will result +in a full set (all bits set), and the AND of all 4 bitmaps will +result in an empty bitmap (no bits set). + + * N entries with compressed bitmaps, one for each indexed commit ++ +Where `N` is the total number of entries in this bitmap index. +Each entry contains the following: + + ** {empty} + 4-byte object position (network byte order): :: + The position **in the index for the packfile or + multi-pack index** where the bitmap for this commit is + found. + + ** {empty} + 1-byte XOR-offset: :: + The xor offset used to compress this bitmap. For an entry + in position `x`, an XOR offset of `y` means that the actual + bitmap representing this commit is composed by XORing the + bitmap for this entry with the bitmap in entry `x-y` (i.e. + the bitmap `y` entries before this one). ++ +NOTE: This compression can be recursive. In order to +XOR this entry with a previous one, the previous entry needs +to be decompressed first, and so on. ++ +The hard-limit for this offset is 160 (an entry can only be +xor'ed against one of the 160 entries preceding it). This +number is always positive, and hence entries are always xor'ed +with **previous** bitmaps, not bitmaps that will come afterwards +in the index. + + ** {empty} + 1-byte flags for this bitmap: :: + At the moment the only available flag is `0x1`, which hints + that this bitmap can be re-used when rebuilding bitmap indexes + for the repository. + + ** The compressed bitmap itself, see Appendix A. + + * {empty} + TRAILER: :: + Trailing checksum of the preceding contents. + +== Appendix A: Serialization format for an EWAH bitmap + +Ewah bitmaps are serialized in the same protocol as the JAVAEWAH +library, making them backwards compatible with the JGit +implementation: + + - 4-byte number of bits of the resulting UNCOMPRESSED bitmap + + - 4-byte number of words of the COMPRESSED bitmap, when stored + + - N x 8-byte words, as specified by the previous field ++ +This is the actual content of the compressed bitmap. + + - 4-byte position of the current RLW for the compressed + bitmap + +All words are stored in network byte order for their corresponding +sizes. + +The compressed bitmap is stored in a form of run-length encoding, as +follows. It consists of a concatenation of an arbitrary number of +chunks. Each chunk consists of one or more 64-bit words + + H L_1 L_2 L_3 .... L_M + +H is called RLW (run length word). It consists of (from lower to higher +order bits): + + - 1 bit: the repeated bit B + + - 32 bits: repetition count K (unsigned) + + - 31 bits: literal word count M (unsigned) + +The bitstream represented by the above chunk is then: + + - K repetitions of B + + - The bits stored in `L_1` through `L_M`. Within a word, bits at + lower order come earlier in the stream than those at higher + order. + +The next word after `L_M` (if any) must again be a RLW, for the next +chunk. For efficient appending to the bitstream, the EWAH stores a +pointer to the last RLW in the stream. + + +== Appendix B: Optional Bitmap Sections + +These sections may or may not be present in the `.bitmap` file; their +presence is indicated by the header flags section described above. + +Name-hash cache +--------------- + +If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains +a cache of 32-bit values, one per object in the pack/MIDX. The value at +position `i` is the hash of the pathname at which the `i`th object +(counting in index or multi-pack index order) in the pack/MIDX can be found. +This can be fed into the delta heuristics to compare objects with similar +pathnames. + +The hash algorithm used is: + + hash = 0; + while ((c = *name++)) + if (!isspace(c)) + hash = (hash >> 2) + (c << 24); + +Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. +If implementations want to choose a different hashing scheme, they are +free to do so, but MUST allocate a new header flag (because comparing +hashes made under two different schemes would be pointless). + +Commit lookup table +------------------- + +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)` +bytes (preceding the name-hash cache and trailing hash) of the `.bitmap` +file contains a lookup table specifying the information needed to get +the desired bitmap from the entries without parsing previous unnecessary +bitmaps. + +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table +contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets +(sorted in the ascending order of `commit_pos`). The content of the i'th +triplet is - + + * {empty} + commit_pos (4 byte integer, network byte order): :: + It stores the object position of a commit (in the midx or pack + index). + + * {empty} + offset (8 byte integer, network byte order): :: + The offset from which that commit's bitmap can be read. + + * {empty} + xor_row (4 byte integer, network byte order): :: + The position of the triplet whose bitmap is used to compress + this one, or `0xffffffff` if no such bitmap exists. + +Pseudo-merge bitmaps +-------------------- + +If the `BITMAP_OPT_PSEUDO_MERGES` flag is set, a variable number of +bytes (preceding the name-hash cache, commit lookup table, and trailing +checksum) of the `.bitmap` file is used to store pseudo-merge bitmaps. + +For more information on what pseudo-merges are, why they are useful, and +how to configure them, see the information in linkgit:gitpacking[7]. + +=== File format + +If enabled, pseudo-merge bitmaps are stored in an optional section at +the end of a `.bitmap` file. The format is as follows: + +.... ++-------------------------------------------+ +| .bitmap File | ++-------------------------------------------+ +| | +| Pseudo-merge bitmaps (Variable Length) | +| +---------------------------+ | +| | commits_bitmap (EWAH) | | +| +---------------------------+ | +| | merge_bitmap (EWAH) | | +| +---------------------------+ | +| | ++-------------------------------------------+ +| | +| Lookup Table | +| +---------------------------+ | +| | commit_pos (4 bytes) | | +| +---------------------------+ | +| | offset (8 bytes) | | +| +------------+--------------+ | +| | +| Offset Cases: | +| ------------- | +| | +| 1. MSB Unset: single pseudo-merge bitmap | +| + offset to pseudo-merge bitmap | +| | +| 2. MSB Set: multiple pseudo-merges | +| + offset to extended lookup table | +| | ++-------------------------------------------+ +| | +| Extended Lookup Table (Optional) | +| +----+----------+----------+----------+ | +| | N | Offset 1 | .... | Offset N | | +| +----+----------+----------+----------+ | +| | | 8 bytes | .... | 8 bytes | | +| +----+----------+----------+----------+ | +| | ++-------------------------------------------+ +| | +| Pseudo-merge position table | +| +----+----------+----------+----------+ | +| | N | Offset 1 | .... | Offset N | | +| +----+----------+----------+----------+ | +| | | 8 bytes | .... | 8 bytes | | +| +----+----------+----------+----------+ | +| | ++-------------------------------------------+ +| | +| Pseudo-merge Metadata | +| +-----------------------------------+ | +| | # pseudo-merges (4 bytes) | | +| +-----------------------------------+ | +| | # commits (4 bytes) | | +| +-----------------------------------+ | +| | Lookup offset (8 bytes) | | +| +-----------------------------------+ | +| | Extension size (8 bytes) | | +| +-----------------------------------+ | +| | ++-------------------------------------------+ +.... + +* One or more pseudo-merge bitmaps, each containing: + + ** `commits_bitmap`, an EWAH-compressed bitmap describing the set of + commits included in the this psuedo-merge. + + ** `merge_bitmap`, an EWAH-compressed bitmap describing the union of + the set of objects reachable from all commits listed in the + `commits_bitmap`. + +* A lookup table, mapping pseudo-merged commits to the pseudo-merges + they belong to. Entries appear in increasing order of each commit's + bit position. Each entry is 12 bytes wide, and is comprised of the + following: + + ** `commit_pos`, a 4-byte unsigned value (in network byte-order) + containing the bit position for this commit. + + ** `offset`, an 8-byte unsigned value (also in network byte-order) + containing either one of two possible offsets, depending on whether or + not the most-significant bit is set. + + *** If unset (i.e. `offset & ((uint64_t)1<<63) == 0`), the offset + (relative to the beginning of the `.bitmap` file) at which the + pseudo-merge bitmap for this commit can be read. This indicates + only a single pseudo-merge bitmap contains this commit. + + *** If set (i.e. `offset & ((uint64_t)1<<63) != 0`), the offset + (again relative to the beginning of the `.bitmap` file) at which + the extended offset table can be located describing the set of + pseudo-merge bitmaps which contain this commit. This indicates + that multiple pseudo-merge bitmaps contain this commit. + +* An (optional) extended lookup table (written if and only if there is + at least one commit which appears in more than one pseudo-merge). + There are as many entries as commits which appear in multiple + pseudo-merges. Each entry contains the following: + + ** `N`, a 4-byte unsigned value equal to the number of pseudo-merges + which contain a given commit. + + ** An array of `N` 8-byte unsigned values, each of which is + interpreted as an offset (relative to the beginning of the + `.bitmap` file) at which a pseudo-merge bitmap for this commit can + be read. These values occur in no particular order. + +* Positions for all pseudo-merges, each stored as an 8-byte unsigned + value (in network byte-order) containing the offset (relative to the + beginning of the `.bitmap` file) of each consecutive pseudo-merge. + +* A 4-byte unsigned value (in network byte-order) equal to the number of + pseudo-merges. + +* A 4-byte unsigned value (in network byte-order) equal to the number of + unique commits which appear in any pseudo-merge. + +* An 8-byte unsigned value (in network byte-order) equal to the number + of bytes between the start of the pseudo-merge section and the + beginning of the lookup table. + +* An 8-byte unsigned value (in network byte-order) equal to the number + of bytes in the pseudo-merge section (including this field). |