user/sven/git.git/commit-graph.h, branch v2.49.0-rc2

hash-ll: merge with "hash.h"

2024-06-14T17:26:33Z

The "hash-ll.h" header was introduced via d1cbe1e6d8 (hash-ll.h: split out of hash.h to remove dependency on repository.h, 2023-04-22) to make explicit the split between hash-related functions that rely on the global `the_repository`, and those that don't. This split is no longer necessary now that we we have removed the reliance on `the_repository`. Merge "hash-ll.h" back into "hash.h". This causes some code units to not include "repository.h" anymore, which requires us to add some forward declarations. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

Merge branch 'ps/do-not-trust-commit-graph-blindly-for-existence'

2023-11-08T02:03:59Z

The codepath to traverse the commit-graph learned to notice that a commit is missing (e.g., corrupt repository lost an object), even though it knows something about the commit (like its parents) from what is in commit-graph. * ps/do-not-trust-commit-graph-blindly-for-existence: commit: detect commits that exist in commit-graph but not in the ODB commit-graph: introduce envvar to disable commit existence checks

commit-graph: introduce envvar to disable commit existence checks

2023-11-01T03:04:06Z

Our `lookup_commit_in_graph()` helper tries to look up commits from the commit graph and, if it doesn't exist there, falls back to parsing it from the object database instead. This is intended to speed up the lookup of any such commit that exists in the database. There is an edge case though where the commit exists in the graph, but not in the object database. To avoid returning such stale commits the helper function thus double checks that any such commit parsed from the graph also exists in the object database. This makes the function safe to use even when commit graphs aren't updated regularly. We're about to introduce the same pattern into other parts of our code base though, namely `repo_parse_commit_internal()`. Here the extra sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was a newly introduced helper, and as such there was no performance hit by adding this sanity check. If we added `repo_parse_commit_internal()` with that sanity check right from the beginning as well, this would probably never have been an issue to begin with. But by retrofitting it with this sanity check now we do add a performance regression to preexisting code, and thus there is a desire to avoid this or at least give an escape hatch. In practice, there is no inherent reason why either of those functions should have the sanity check whereas the other one does not: either both of them are able to detect this issue or none of them should be. This also means that the default of whether we do the check should likely be the same for both. To err on the side of caution, we thus rather want to make `repo_parse_commit_internal()` stricter than to loosen the checks that we already have in `lookup_commit_in_graph()`. The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is the default, we will double check that commits looked up in the commit graph via `lookup_commit_in_graph()` also exist in the object database. This same check will also be added in `repo_parse_commit_internal()`. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

commit-graph: check bounds when accessing BDAT chunk

2023-10-09T22:55:01Z

When loading Bloom filters from a commit-graph file, we use the offset values in the BIDX chunk to index into the memory mapped for the BDAT chunk. But since we don't record how big the BDAT chunk is, we just trust that the BIDX offsets won't cause us to read outside of the chunk memory. A corrupted or malicious commit-graph file will cause us to segfault (in practice this isn't a very interesting attack, since commit-graph files are local-only, and the worst case is an out-of-bounds read). We can't fix this by checking the chunk size during parsing, since the data in the BDAT chunk doesn't have a fixed size (that's why we need the BIDX in the first place). So we'll fix it in two parts: 1. Record the BDAT chunk size during parsing, and then later check that the BIDX offsets we look up are within bounds. 2. Because the offsets are relative to the end of the BDAT header, we must also make sure that the BDAT chunk is at least as large as the expected header size. Otherwise, we overflow when trying to move past the header, even for an offset of "0". We can check this early, during the parsing stage. The error messages are rather verbose, but since this is not something you'd expect to see outside of severe bugs or corruption, it makes sense to err on the side of too many details. Sadly we can't mention the filename during the chunk-parsing stage, as we haven't set g->filename at this point, nor passed it down through the stack. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: bounds-check generation overflow chunk

2023-10-09T22:55:01Z

If the generation entry in a commit-graph doesn't fit, we instead insert an offset into a generation overflow chunk. But since we don't record the size of the chunk, we may read outside the chunk if the offset we find on disk is malicious or corrupted. We can't check the size of the chunk up-front; it will vary based on how many entries need overflow. So instead, we'll do a bounds-check before accessing the chunk memory. Unfortunately there is no error-return from this function, so we'll just have to die(), which is what it does for other forms of corruption. As with other cases, we can drop the st_mult() call, since we know our bounds-checked value will fit within a size_t. Before this patch, the test here actually "works" because we read garbage data from the next chunk. And since that garbage data happens not to provide a generation number which changes the output, it appears to work. We could construct a case that actually segfaults or produces wrong output, but it would be a bit tricky. For our purposes its sufficient to check that we've detected the bounds error. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: bounds-check base graphs chunk

2023-10-09T22:55:01Z

When we are loading a commit-graph chain, we check that each slice of the chain points to the appropriate set of base graphs via its BASE chunk. But since we don't record the size of the chunk, we may access out-of-bounds memory if the file is corrupted. Since we know the number of entries we expect to find (based on the position within the commit-graph-chain file), we can just check the size up front. In theory this would also let us drop the st_mult() call a few lines later when we actually access the memory, since we know that the computed offset will fit in a size_t. But because the operands "g->hash_len" and "n" have types "unsigned char" and "int", we'd have to cast to size_t first. Leaving the st_mult() does that cast, and makes it more obvious that we don't have an overflow problem. Note that the test does not actually segfault before this patch, since it just reads garbage from the chunk after BASE (and indeed, it even rejects the file because that garbage does not have the expected hash value). You could construct a file with BASE at the end that did segfault, but corrupting the existing one is easy, and we can check stderr for the expected message. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: detect out-of-bounds extra-edges pointers

2023-10-09T22:55:01Z

If an entry in a commit-graph file has more than 2 parents, the fixed-size parent fields instead point to an offset within an "extra edges" chunk. We blindly follow these, assuming that the chunk is present and sufficiently large; this can lead to an out-of-bounds read for a corrupt or malicious file. We can fix this by recording the size of the chunk and adding a bounds-check in fill_commit_in_graph(). There are a few tricky bits: 1. We'll switch from working with a pointer to an offset. This makes some corner cases just fall out naturally: a. If we did not find an EDGE chunk at all, our size will correctly be zero (so everything is "out of bounds"). b. Comparing "size / 4" lets us make sure we have at least 4 bytes to read, and we never compute a pointer more than one element past the end of the array (computing a larger pointer is probably OK in practice, but is technically undefined behavior). c. The current code casts to "uint32_t *". Replacing it with an offset avoids any comparison between different types of pointer (since the chunk is stored as "unsigned char *"). 2. This is the first case in which fill_commit_in_graph() may return anything but success. We need to make sure to roll back the "parsed" flag (and any parents we might have added before running out of buffer) so that the caller can cleanly fall back to loading the commit object itself. It's a little non-trivial to do this, and we might benefit from factoring it out. But we can wait on that until we actually see a second case where we return an error. As a bonus, this lets us drop the st_mult() call. Since we've already done a bounds check, we know there won't be any integer overflow (it would imply our buffer is larger than a size_t can hold). The included test does not actually segfault before this patch (though you could construct a case where it does). Instead, it reads garbage from the next chunk which results in it complaining about a bogus parent id. This is sufficient for our needs, though (we care that the fallback succeeds, and that stderr mentions the out-of-bounds read). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: report incomplete chains during verification

2023-09-28T14:00:43Z

The load_commit_graph_chain_fd_st() function will stop loading chains when it sees an error. But if it has loaded any graph slice at all, it will return it. This is a good thing for normal use (we use what data we can, and this is just an optimization). But it's a bad thing for "commit-graph verify", which should be careful about finding any irregularities. We do complain to stderr with a warning(), but the verify command still exits with a successful return code. The new tests here cover corruption of both the base and tip slices of the chain. The corruption of the base file already works (it is the first file we look at, so when we see the error we return NULL). The "tip" case is what is fixed by this patch (it complains to stderr but still returns the base slice). Likewise the existing tests for corruption of the commit-graph-chain file itself need to be updated. We already exited non-zero correctly for the "base" case, but the "tip" case can now do so, too. Note that this also causes us to adjust a test later in the file that similarly corrupts a tip (though confusingly the test script calls this "base"). It checks stderr but erroneously expects the whole "verify" command to exit with a successful code. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: factor out chain opening function

2023-09-28T14:00:43Z

The load_commit_graph_chain() function opens the chain file and all of the slices of graph that it points to. If there is no chain file (which is a totally normal condition), we return NULL. But if we run into errors with the chain file or loading the actual graph data, we also return NULL, and the caller cannot tell the difference. The caller can check for ENOENT for the unremarkable "no such file" case. But I'm hesitant to assume that the rest of the function would never accidentally set errno to ENOENT itself, since it is opening the slice files (and that would mean the caller fails to notice a real error). So let's break this into two functions: one to open the file, and one to actually load it. This matches the interface we provide for the non-chain graph file, which will also come in handy in a moment when we fix some bugs in the "git commit-graph verify" code. Some notes: - I've kept the "1 is good, 0 is bad" return convention (and the weird "fd" out-parameter) used by the matching open_commit_graph() function and other parts of the commit-graph code. This is unlike most of the rest of Git (which would just return the fd, with -1 for error), but it makes sense to stay consistent with the adjacent bits of the API here. - The existing chain loading function will quietly return if the file is too small to hold a single entry. I've retained that behavior (and explicitly set ENOENT in the opener function) for now, under the notion that it's probably valid (though I'd imagine unusual) to have an empty chain file. There are two small behavior changes here, but I think both are strictly positive: 1. The original blindly did a stat() before checking if fopen() succeeded, meaning we were making a pointless extra stat call. 2. We now use fstat() to check the file size. The previous code using a regular stat() on the pathname meant we could technically race with somebody updating the chain file, and end up with a size that does not match what we just opened with fopen(). I doubt anybody ever hit this in practice, but it may have caused an out-of-bounds read. We'll retain the load_commit_graph_chain() function which does both the open and reading steps (most existing callers do not care about seeing errors anyway, since loading commit-graphs is optimistic). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

object-store-ll.h: split this header out of object-store.h

2023-06-21T20:39:54Z

The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch: $ git grep -h include..object-store | sort | uniq -c 2 #include "object-store.h" 129 #include "object-store-ll.h" Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano