<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/git.git/commit-graph.h, branch v2.49.0-rc2</title>
<subtitle>Git
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v2.49.0-rc2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v2.49.0-rc2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/'/>
<updated>2024-06-14T17:26:33Z</updated>
<entry>
<title>hash-ll: merge with "hash.h"</title>
<updated>2024-06-14T17:26:33Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-06-14T06:50:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=8a676bdc5c3a9377322834326605b3804c7c54bf'/>
<id>urn:sha1:8a676bdc5c3a9377322834326605b3804c7c54bf</id>
<content type='text'>
The "hash-ll.h" header was introduced via d1cbe1e6d8 (hash-ll.h: split
out of hash.h to remove dependency on repository.h, 2023-04-22) to make
explicit the split between hash-related functions that rely on the
global `the_repository`, and those that don't. This split is no longer
necessary now that we we have removed the reliance on `the_repository`.

Merge "hash-ll.h" back into "hash.h". This causes some code units to not
include "repository.h" anymore, which requires us to add some forward
declarations.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ps/do-not-trust-commit-graph-blindly-for-existence'</title>
<updated>2023-11-08T02:03:59Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-11-08T02:03:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=42b87f7ee60323f2e45a91233db80d44d3e33ad2'/>
<id>urn:sha1:42b87f7ee60323f2e45a91233db80d44d3e33ad2</id>
<content type='text'>
The codepath to traverse the commit-graph learned to notice that a
commit is missing (e.g., corrupt repository lost an object), even
though it knows something about the commit (like its parents) from
what is in commit-graph.

* ps/do-not-trust-commit-graph-blindly-for-existence:
  commit: detect commits that exist in commit-graph but not in the ODB
  commit-graph: introduce envvar to disable commit existence checks
</content>
</entry>
<entry>
<title>commit-graph: introduce envvar to disable commit existence checks</title>
<updated>2023-11-01T03:04:06Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2023-10-31T07:16:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=e04838ea828651cc122de505320e5ea85b43f1b1'/>
<id>urn:sha1:e04838ea828651cc122de505320e5ea85b43f1b1</id>
<content type='text'>
Our `lookup_commit_in_graph()` helper tries to look up commits from the
commit graph and, if it doesn't exist there, falls back to parsing it
from the object database instead. This is intended to speed up the
lookup of any such commit that exists in the database. There is an edge
case though where the commit exists in the graph, but not in the object
database. To avoid returning such stale commits the helper function thus
double checks that any such commit parsed from the graph also exists in
the object database. This makes the function safe to use even when
commit graphs aren't updated regularly.

We're about to introduce the same pattern into other parts of our code
base though, namely `repo_parse_commit_internal()`. Here the extra
sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was
a newly introduced helper, and as such there was no performance hit by
adding this sanity check. If we added `repo_parse_commit_internal()`
with that sanity check right from the beginning as well, this would
probably never have been an issue to begin with. But by retrofitting it
with this sanity check now we do add a performance regression to
preexisting code, and thus there is a desire to avoid this or at least
give an escape hatch.

In practice, there is no inherent reason why either of those functions
should have the sanity check whereas the other one does not: either both
of them are able to detect this issue or none of them should be. This
also means that the default of whether we do the check should likely be
the same for both. To err on the side of caution, we thus rather want to
make `repo_parse_commit_internal()` stricter than to loosen the checks
that we already have in `lookup_commit_in_graph()`.

The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA
environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is
the default, we will double check that commits looked up in the commit
graph via `lookup_commit_in_graph()` also exist in the object database.
This same check will also be added in `repo_parse_commit_internal()`.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: check bounds when accessing BDAT chunk</title>
<updated>2023-10-09T22:55:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-10-09T21:05:50Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=920f400e919c7c51f81adc6989cdd52630220783'/>
<id>urn:sha1:920f400e919c7c51f81adc6989cdd52630220783</id>
<content type='text'>
When loading Bloom filters from a commit-graph file, we use the offset
values in the BIDX chunk to index into the memory mapped for the BDAT
chunk. But since we don't record how big the BDAT chunk is, we just
trust that the BIDX offsets won't cause us to read outside of the chunk
memory. A corrupted or malicious commit-graph file will cause us to
segfault (in practice this isn't a very interesting attack, since
commit-graph files are local-only, and the worst case is an
out-of-bounds read).

We can't fix this by checking the chunk size during parsing, since the
data in the BDAT chunk doesn't have a fixed size (that's why we need the
BIDX in the first place). So we'll fix it in two parts:

  1. Record the BDAT chunk size during parsing, and then later check
     that the BIDX offsets we look up are within bounds.

  2. Because the offsets are relative to the end of the BDAT header, we
     must also make sure that the BDAT chunk is at least as large as the
     expected header size. Otherwise, we overflow when trying to move
     past the header, even for an offset of "0". We can check this
     early, during the parsing stage.

The error messages are rather verbose, but since this is not something
you'd expect to see outside of severe bugs or corruption, it makes sense
to err on the side of too many details. Sadly we can't mention the
filename during the chunk-parsing stage, as we haven't set g-&gt;filename
at this point, nor passed it down through the stack.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: bounds-check generation overflow chunk</title>
<updated>2023-10-09T22:55:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-10-09T21:05:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=ee6a7924124b2a9030da55c94a27829e410b3b64'/>
<id>urn:sha1:ee6a7924124b2a9030da55c94a27829e410b3b64</id>
<content type='text'>
If the generation entry in a commit-graph doesn't fit, we instead insert
an offset into a generation overflow chunk. But since we don't record
the size of the chunk, we may read outside the chunk if the offset we
find on disk is malicious or corrupted.

We can't check the size of the chunk up-front; it will vary based on how
many entries need overflow. So instead, we'll do a bounds-check before
accessing the chunk memory. Unfortunately there is no error-return from
this function, so we'll just have to die(), which is what it does for
other forms of corruption.

As with other cases, we can drop the st_mult() call, since we know our
bounds-checked value will fit within a size_t.

Before this patch, the test here actually "works" because we read
garbage data from the next chunk. And since that garbage data happens
not to provide a generation number which changes the output, it appears
to work. We could construct a case that actually segfaults or produces
wrong output, but it would be a bit tricky. For our purposes its
sufficient to check that we've detected the bounds error.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: bounds-check base graphs chunk</title>
<updated>2023-10-09T22:55:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-10-09T21:05:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=6cf61d0db55291c3b8406a6ba8f20fdfb9a4a344'/>
<id>urn:sha1:6cf61d0db55291c3b8406a6ba8f20fdfb9a4a344</id>
<content type='text'>
When we are loading a commit-graph chain, we check that each slice of the
chain points to the appropriate set of base graphs via its BASE chunk.
But since we don't record the size of the chunk, we may access
out-of-bounds memory if the file is corrupted.

Since we know the number of entries we expect to find (based on the
position within the commit-graph-chain file), we can just check the size
up front.

In theory this would also let us drop the st_mult() call a few lines
later when we actually access the memory, since we know that the
computed offset will fit in a size_t. But because the operands
"g-&gt;hash_len" and "n" have types "unsigned char" and "int", we'd have to
cast to size_t first. Leaving the st_mult() does that cast, and makes it
more obvious that we don't have an overflow problem.

Note that the test does not actually segfault before this patch, since
it just reads garbage from the chunk after BASE (and indeed, it even
rejects the file because that garbage does not have the expected hash
value). You could construct a file with BASE at the end that did
segfault, but corrupting the existing one is easy, and we can check
stderr for the expected message.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: detect out-of-bounds extra-edges pointers</title>
<updated>2023-10-09T22:55:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-10-09T21:05:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9622610e55c7d4f81a924387947884b2ac268934'/>
<id>urn:sha1:9622610e55c7d4f81a924387947884b2ac268934</id>
<content type='text'>
If an entry in a commit-graph file has more than 2 parents, the
fixed-size parent fields instead point to an offset within an "extra
edges" chunk. We blindly follow these, assuming that the chunk is
present and sufficiently large; this can lead to an out-of-bounds read
for a corrupt or malicious file.

We can fix this by recording the size of the chunk and adding a
bounds-check in fill_commit_in_graph(). There are a few tricky bits:

  1. We'll switch from working with a pointer to an offset. This makes
     some corner cases just fall out naturally:

      a. If we did not find an EDGE chunk at all, our size will
         correctly be zero (so everything is "out of bounds").

      b. Comparing "size / 4" lets us make sure we have at least 4 bytes
         to read, and we never compute a pointer more than one element
         past the end of the array (computing a larger pointer is
         probably OK in practice, but is technically undefined
         behavior).

      c. The current code casts to "uint32_t *". Replacing it with an
         offset avoids any comparison between different types of pointer
         (since the chunk is stored as "unsigned char *").

  2. This is the first case in which fill_commit_in_graph() may return
     anything but success. We need to make sure to roll back the
     "parsed" flag (and any parents we might have added before running
     out of buffer) so that the caller can cleanly fall back to
     loading the commit object itself.

     It's a little non-trivial to do this, and we might benefit from
     factoring it out. But we can wait on that until we actually see a
     second case where we return an error.

As a bonus, this lets us drop the st_mult() call. Since we've already
done a bounds check, we know there won't be any integer overflow (it
would imply our buffer is larger than a size_t can hold).

The included test does not actually segfault before this patch (though
you could construct a case where it does). Instead, it reads garbage
from the next chunk which results in it complaining about a bogus parent
id. This is sufficient for our needs, though (we care that the fallback
succeeds, and that stderr mentions the out-of-bounds read).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: report incomplete chains during verification</title>
<updated>2023-09-28T14:00:43Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-09-28T04:39:51Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=5f259197eea0a3acc48f46748778f33c935476cb'/>
<id>urn:sha1:5f259197eea0a3acc48f46748778f33c935476cb</id>
<content type='text'>
The load_commit_graph_chain_fd_st() function will stop loading chains
when it sees an error. But if it has loaded any graph slice at all, it
will return it. This is a good thing for normal use (we use what data we
can, and this is just an optimization). But it's a bad thing for
"commit-graph verify", which should be careful about finding any
irregularities. We do complain to stderr with a warning(), but the
verify command still exits with a successful return code.

The new tests here cover corruption of both the base and tip slices of
the chain. The corruption of the base file already works (it is the
first file we look at, so when we see the error we return NULL). The
"tip" case is what is fixed by this patch (it complains to stderr but
still returns the base slice).

Likewise the existing tests for corruption of the commit-graph-chain
file itself need to be updated. We already exited non-zero correctly for
the "base" case, but the "tip" case can now do so, too.

Note that this also causes us to adjust a test later in the file that
similarly corrupts a tip (though confusingly the test script calls this
"base"). It checks stderr but erroneously expects the whole "verify"
command to exit with a successful code.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: factor out chain opening function</title>
<updated>2023-09-28T14:00:43Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-09-28T04:38:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=7ed76b4eb25f03f3bec6abad219d4dda651fc30f'/>
<id>urn:sha1:7ed76b4eb25f03f3bec6abad219d4dda651fc30f</id>
<content type='text'>
The load_commit_graph_chain() function opens the chain file and all of
the slices of graph that it points to. If there is no chain file (which
is a totally normal condition), we return NULL. But if we run into
errors with the chain file or loading the actual graph data, we also
return NULL, and the caller cannot tell the difference.

The caller can check for ENOENT for the unremarkable "no such file"
case. But I'm hesitant to assume that the rest of the function would
never accidentally set errno to ENOENT itself, since it is opening the
slice files (and that would mean the caller fails to notice a real
error).

So let's break this into two functions: one to open the file, and one to
actually load it. This matches the interface we provide for the
non-chain graph file, which will also come in handy in a moment when we
fix some bugs in the "git commit-graph verify" code.

Some notes:

  - I've kept the "1 is good, 0 is bad" return convention (and the weird
    "fd" out-parameter) used by the matching open_commit_graph()
    function and other parts of the commit-graph code. This is unlike
    most of the rest of Git (which would just return the fd, with -1 for
    error), but it makes sense to stay consistent with the adjacent bits
    of the API here.

  - The existing chain loading function will quietly return if the file
    is too small to hold a single entry. I've retained that behavior
    (and explicitly set ENOENT in the opener function) for now, under
    the notion that it's probably valid (though I'd imagine unusual) to
    have an empty chain file.

There are two small behavior changes here, but I think both are strictly
positive:

  1. The original blindly did a stat() before checking if fopen()
     succeeded, meaning we were making a pointless extra stat call.

  2. We now use fstat() to check the file size. The previous code using
     a regular stat() on the pathname meant we could technically race
     with somebody updating the chain file, and end up with a size that
     does not match what we just opened with fopen(). I doubt anybody
     ever hit this in practice, but it may have caused an out-of-bounds
     read.

We'll retain the load_commit_graph_chain() function which does both the
open and reading steps (most existing callers do not care about seeing
errors anyway, since loading commit-graphs is optimistic).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object-store-ll.h: split this header out of object-store.h</title>
<updated>2023-06-21T20:39:54Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2023-05-16T06:34:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=a034e9106ff1a4cb6fcb6f2ea3a1a47b4d2ba173'/>
<id>urn:sha1:a034e9106ff1a4cb6fcb6f2ea3a1a47b4d2ba173</id>
<content type='text'>
The vast majority of files including object-store.h did not need dir.h
nor khash.h.  Split the header into two files, and let most just depend
upon object-store-ll.h, while letting the two callers that need it
depend on the full object-store.h.

After this patch:
    $ git grep -h include..object-store | sort | uniq -c
          2 #include "object-store.h"
        129 #include "object-store-ll.h"

Diff best viewed with `--color-moved`.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
