summaryrefslogtreecommitdiff
path: root/commit-graph.c
AgeCommit message (Collapse)Author
13 daysconfig: move Git config parsing into "environment.c"Patrick Steinhardt
In "config.c" we host both the business logic to read and write config files as well as the logic to parse specific Git-related variables. On the one hand this is mixing concerns, but even more importantly it means that we cannot easily remove the dependency on `the_repository` in our config parsing logic. Move the logic into "environment.c". This file is a grab bag of all kinds of global state already, so it is quite a good fit. Furthermore, it also hosts most of the global variables that we're parsing the config values into, making this an even better fit. Note that there is one hidden change: in `parse_fsync_components()` we use an `int` to iterate through `ARRAY_SIZE(fsync_component_names)`. But as -Wsign-compare warnings are enabled in this file this causes a compiler warning. The issue is fixed by using a `size_t` instead. This change allows us to drop the `USE_THE_REPOSITORY_VARIABLE` declaration. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01odb: rename `has_object()`Patrick Steinhardt
Rename `has_object()` to `odb_has_object()` to match other functions related to the object database and our modern coding guidelines. Introduce a compatibility wrapper so that any in-flight topics will continue to compile. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01odb: rename `oid_object_info()`Patrick Steinhardt
Rename `oid_object_info()` to `odb_read_object_info()` as well as their `_extended()` variant to match other functions related to the object database and our modern coding guidelines. Introduce compatibility wrappers so that any in-flight topics will continue to compile. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01odb: get rid of `the_repository` when handling alternatesPatrick Steinhardt
The functions to manage alternates all depend on `the_repository`. Refactor them to accept an object database as a parameter and adjust all callers. The functions are renamed accordingly. Note that right now the situation is still somewhat weird because we end up using the object store path provided by the object store's repository anyway. Consequently, we could have instead passed in a pointer to the repository instead of passing in the pointer to the object store. This will be addressed in subsequent commits though, where we will start to use the path owned by the object store itself. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01object-store: rename files to "odb.{c,h}"Patrick Steinhardt
In the preceding commits we have renamed the structures contained in "object-store.h" to `struct object_database` and `struct odb_backend`. As such, the code files "object-store.{c,h}" are confusingly named now. Rename them to "odb.{c,h}" accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01object-store: rename `object_directory` to `odb_source`Patrick Steinhardt
The `object_directory` structure is used as an access point for a single object directory like ".git/objects". While the structure isn't yet fully self-contained, the intent is for it to eventually contain all information required to access objects in one specific location. While the name "object directory" is a good fit for now, this will change over time as we continue with the agenda to make pluggable object databases a thing. Eventually, objects may not be accessed via any kind of directory at all anymore, but they could instead be backed by any kind of durable storage mechanism. While it seems quite far-fetched for now, it is thinkable that eventually this might even be some form of a database, for example. As such, the current name of this structure will become worse over time as we evolve into the direction of pluggable ODBs. Immediate next steps will start to carve out proper self-contained object directories, which requires us to pass in these object directories as parameters. Based on our modern naming schema this means that those functions should then be named after their subsystem, which means that we would start to bake the current name into the codebase more and more. Let's preempt this by renaming the structure. There have been a couple alternatives that were discussed: - `odb_backend` was discarded because it led to the association that one object database has a single backend, but the model is that one alternate has one backend. Furthermore, "backend" is more about the actual backing implementation and less about the high-level concept. - `odb_alternate` was discarded because it is a bit of a stretch to also call the main object directory an "alternate". Instead, pick `odb_source` as the new name. It makes it sufficiently clear that there can be multiple sources and does not cause confusion when mixed with the already-existing "alternate" terminology. In the future, this change allows us to easily introduce for example a `odb_files_source` and other format-specific implementations. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01object-store: rename `raw_object_store` to `object_database`Patrick Steinhardt
The `raw_object_store` structure is the central entry point for reading and writing objects in a repository. The main purpose of this structure is to manage object directories and provide an interface to access and write objects in those object directories. Right now, many of the functions associated with the raw object store implicitly rely on `the_repository` to get access to its `objects` pointer, which is the `raw_object_store`. As we want to generally get rid of using `the_repository` across our codebase we will have to convert this implicit dependency on this global variable into an explicit parameter. This conversion can be done by simply passing in an explicit pointer to a repository and then using its `->objects` pointer. But there is a second effort underway, which is to make the object subsystem more selfcontained so that we can eventually have pluggable object backends. As such, passing in a repository wouldn't make a ton of sense, and the goal is to convert the object store interfaces such that we always pass in a reference to the `raw_object_store` instead. This will expose the `raw_object_store` type to a lot more callers though, which surfaces that this type is named somewhat awkwardly. The "raw_" prefix makes readers wonder whether there is a non-raw variant of the object store, but there isn't. Furthermore, we nowadays want to name functions in a way that they can be clearly attributed to a specific subsystem, but calling them e.g. `raw_object_store_has_object()` is just too unwieldy, even when dropping the "raw_" prefix. Instead, rename the structure to `object_database`. This term is already used a lot throughout our codebase, and it cannot easily be mistaken for "object directories", either. Furthermore, its acronym ODB is already well-known and works well as part of a function's name, like for example `odb_has_object()`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-27Merge branch 'js/misc-fixes'Junio C Hamano
Assorted fixes for issues found with CodeQL. * js/misc-fixes: sequencer: stop pretending that an assignment is a condition bundle-uri: avoid using undefined output of `sscanf()` commit-graph: avoid using stale stack addresses trace2: avoid "futile conditional" Avoid redundant conditions fetch: avoid unnecessary work when there is no current branch has_dir_name(): make code more obvious upload-pack: rename `enum` to reflect the operation commit-graph: avoid malloc'ing a local variable fetch: carefully clear local variable's address after use commit: simplify code
2025-05-27Merge branch 'ly/commit-graph-fill-oids-leakfix'Junio C Hamano
Leakfix. * ly/commit-graph-fill-oids-leakfix: commit-graph: fix memory leak when `fill_oids_from_packs()` fails
2025-05-15commit-graph: fix memory leak when `fill_oids_from_packs()` failsLidong Yan
In commit-graph.c:fill_oids_from_packs, if open_pack_index failed, memory allocated and returned by add_packed_git will leak. Simply add close_pack and free(p) will solve this problem. Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-15commit-graph: avoid using stale stack addressesJohannes Schindelin
The code is a bit too hard to reason about to fully assess whether the `fill_commit_graph_info()` function is called at all after `write_commit_graph()` returns (and hence the stack variable `topo_levels` goes out of context). Let's simply make sure that the stack address is no longer used at that stage, thereby making the code quite a bit easier to reason about. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-15commit-graph: avoid malloc'ing a local variableJohannes Schindelin
We do need a context to write the commit graph, but that context is only needed during the life time of `commit_graph_write()`, therefore it can easily be a stack variable. This also helps CodeQL recognize that it is safe to assign the address of other local variables to the context's fields. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-12raw_object_store: drop extra pointer to replace_mapJeff King
We store the replacement data in an oidmap, which is itself a pointer in the raw_object_store struct. But there's no need for an extra pointer indirection here. It is always allocated and initialized along with the containing struct, and we never check it for NULL-ness. Let's embed the map directly in the struct, which is simpler and avoids extra pointer chasing. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-12oidmap: add size functionJeff King
Callers which want to know how many items are in an oidmap have to look at the underlying hashmap struct, leaking an implementation detail. Let's provide a type-appropriate wrapper and use it. Note in the call from lookup_replace_object(), the caller was actually looking at the hashmap's tablesize parameter (the allocated size of the table) rather than hashmap_get_size(), the number of items in the table. This probably should have been checking the number of items all along, but the two are functionally equivalent here since we only add to the map and never remove anything. Thus if there was any allocation, it was because there is at least one item. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-store: merge "object-store-ll.h" and "object-store.h"Patrick Steinhardt
The "object-store-ll.h" header has been introduced to keep transitive header dependendcies and compile times at bay. Now that we have created a new "object-store.c" file though we can easily move the last remaining additional bit of "object-store.h", the `odb_path_map`, out of the header. Do so. As the "object-store.h" header is now equivalent to its low-level alternative we drop the latter and inline it into the former. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `git_open_cloexec()` to "compat/open.c"Patrick Steinhardt
The `git_open_cloexec()` wrapper function provides the ability to open a file with `O_CLOEXEC` in a platform-agnostic way. This function is provided by "object-file.c" even though it is not specific to the object subsystem at all. Move the file into "compat/open.c". This file already exists before this commit, but has only been compiled conditionally depending on whether or not open(3p) may return EINTR. With this change we now unconditionally compile the object, but wrap `git_open_with_retry()` in an ifdef. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15object-file: move `safe_create_leading_directories()` into "path.c"Patrick Steinhardt
The `safe_create_leading_directories()` function and its relatives are located in "object-file.c", which is not a good fit as they provide generic functionality not related to objects at all. Move them into "path.c", which already hosts `safe_create_dir()` and its relative `safe_create_dir_in_gitdir()`. "path.c" is free of `the_repository`, but the moved functions depend on `the_repository` to read the "core.sharedRepository" config. Adapt the function signature to accept a repository as argument to fix the issue and adjust callers accordingly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10csum-file: stop depending on `the_repository`Patrick Steinhardt
There are multiple sites in "csum-file.c" where we use the global `the_repository` variable, either explicitly or implicitly by using `the_hash_algo`. Refactor the code to stop using `the_repository` by adapting functions to receive required data as parameters. Adapt callsites accordingly by either using `the_repository->hash_algo`, or by using a context-provided hash algorithm in case the subsystem already got rid of its dependency on `the_repository`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-28path: adjust last remaining users of `the_repository`Patrick Steinhardt
With the preceding refactorings we now only have a couple of implicit users of `the_repository` left in the "path" subsystem, all of which depend on global state via `calc_shared_perm()`. Make the dependency on `the_repository` explicit by passing the repo as a parameter instead and adjust callers accordingly. Note that this change bubbles up into a couple of subsystems that were previously declared as free from `the_repository`. Instead of marking all of them as `the_repository`-dependent again, we instead use the repository that is available in the calling context. There are three exceptions though with "copy.c", "pack-write.c" and "tempfile.c". Adjusting these would require us to adapt callsites all over the place, so this is left for a future iteration. Mark "path.c" as free from `the_repository`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18progress: stop using `the_repository`Patrick Steinhardt
Stop using `the_repository` in the "progress" subsystem by passing in a repository when initializing `struct progress`. Furthermore, store a pointer to the repository in that struct so that we can pass it to the trace2 API when logging information. Adjust callers accordingly by using `the_repository`. While there may be some callers that have a repository available in their context, this trivial conversion allows for easier verification and bubbles up the use of `the_repository` by one level. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18Merge branch 'ps/build-sign-compare' into ps/the-repositoryJunio C Hamano
* ps/build-sign-compare: t/helper: don't depend on implicit wraparound scalar: address -Wsign-compare warnings builtin/patch-id: fix type of `get_one_patchid()` builtin/blame: fix type of `length` variable when emitting object ID gpg-interface: address -Wsign-comparison warnings daemon: fix type of `max_connections` daemon: fix loops that have mismatching integer types global: trivial conversions to fix `-Wsign-compare` warnings pkt-line: fix -Wsign-compare warning on 32 bit platform csum-file: fix -Wsign-compare warning on 32-bit platform diff.h: fix index used to loop through unsigned integer config.mak.dev: drop `-Wno-sign-compare` global: mark code units that generate warnings with `-Wsign-compare` compat/win32: fix -Wsign-compare warning in "wWinMain()" compat/regex: explicitly ignore "-Wsign-compare" warnings git-compat-util: introduce macros to disable "-Wsign-compare" warnings
2024-12-06global: mark code units that generate warnings with `-Wsign-compare`Patrick Steinhardt
Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04packfile: pass down repository to `for_each_packed_object`Karthik Nayak
The function `for_each_packed_object` currently relies on the global variable `the_repository`. To eliminate global variable usage in `packfile.c`, we should progressively shift the dependency on the_repository to higher layers. Let's remove its usage from this function and closely related function `is_promisor_object`. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-04packfile: add repository to struct `packed_git`Karthik Nayak
The struct `packed_git` holds information regarding a packed object file. Let's add the repository variable to this object, to represent the repository that this packfile belongs to. This helps remove dependency on the global `the_repository` object in `packfile.c` by simply using repository information now readily available in the struct. We do need to consider that a packfile could be part of the alternates of a repository, but considering that we only have one repository struct and also that we currently anyways use 'the_repository', we should be OK with this change. We also modify `alloc_packed_git` to ensure that the repository is added to newly created `packed_git` structs. This requires modifying the function and all its callee to pass the repository object down the levels. Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-23commit-graph: remove unnecessary UNLEAKRené Scharfe
When f4dbdfc4d5 (commit-graph: clean up leaked memory during write, 2018-10-03) added the UNLEAK, it was right before a call to die_errno(). e103f7276f (commit-graph: return with errors during write, 2019-06-12) made it unnecessary, as it was then followed by a free() call for the allocated string. The code moved to write_commit_graph_file() in the meantime and the string pointer is now part of a struct, but the function's only caller still cleans up the allocation. Drop the superfluous UNLEAK. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-09refs: add referent to each_ref_fnJohn Cai
Add a parameter to each_ref_fn so that callers to the ref APIs that use this function as a callback can have acess to the unresolved value of a symbolic ref. Signed-off-by: John Cai <johncai86@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-07-08Merge branch 'tb/path-filter-fix'Junio C Hamano
The Bloom filter used for path limited history traversal was broken on systems whose "char" is unsigned; update the implementation and bump the format version to 2. * tb/path-filter-fix: bloom: introduce `deinit_bloom_filters()` commit-graph: reuse existing Bloom filters where possible object.h: fix mis-aligned flag bits table commit-graph: new Bloom filter version that fixes murmur3 commit-graph: unconditionally load Bloom filters bloom: prepare to discard incompatible Bloom filters bloom: annotate filters with hash version repo-settings: introduce commitgraph.changedPathsVersion t4216: test changed path filters with high bit paths t/helper/test-read-graph: implement `bloom-filters` mode bloom.h: make `load_bloom_filter_from_graph()` public t/helper/test-read-graph.c: extract `dump_graph_info()` gitformat-commit-graph: describe version 2 of BDAT commit-graph: ensure Bloom filters are read with consistent settings revision.c: consult Bloom filters for root commits t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
2024-07-02Merge branch 'ps/use-the-repository'Junio C Hamano
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help transition the codebase to rely less on the availability of the singleton the_repository instance. * ps/use-the-repository: hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE` t/helper: remove dependency on `the_repository` in "proc-receive" t/helper: fix segfault in "oid-array" command without repository t/helper: use correct object hash in partial-clone helper compat/fsmonitor: fix socket path in networked SHA256 repos replace-object: use hash algorithm from passed-in repository protocol-caps: use hash algorithm from passed-in repository oidset: pass hash algorithm when parsing file http-fetch: don't crash when parsing packfile without a repo hash-ll: merge with "hash.h" refs: avoid include cycle with "repository.h" global: introduce `USE_THE_REPOSITORY_VARIABLE` macro hash: require hash algorithm in `empty_tree_oid_hex()` hash: require hash algorithm in `is_empty_{blob,tree}_oid()` hash: make `is_null_oid()` independent of `the_repository` hash: convert `oidcmp()` and `oideq()` to compare whole hash global: ensure that object IDs are always padded hash: require hash algorithm in `oidread()` and `oidclr()` hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()` hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
2024-06-25bloom: introduce `deinit_bloom_filters()`Taylor Blau
After we are done using Bloom filters, we do not currently clean up any memory allocated by the commit slab used to store those filters in the first place. Besides the bloom_filter structures themselves, there is mostly nothing to free() in the first place, since in the read-only path all Bloom filter's `data` members point to a memory mapped region in the commit-graph file itself. But when generating Bloom filters from scratch (or initializing truncated filters) we allocate additional memory to store the filter's data. Keep track of when we need to free() this additional chunk of memory by using an extra pointer `to_free`. Most of the time this will be NULL (indicating that we are representing an existing Bloom filter stored in a memory mapped region). When it is non-NULL, free it before discarding the Bloom filters slab. Suggested-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-25commit-graph: reuse existing Bloom filters where possibleTaylor Blau
In an earlier commit, a bug was described where it's possible for Git to produce non-murmur3 hashes when the platform's "char" type is signed, and there are paths with characters whose highest bit is set (i.e. all characters >= 0x80). That patch allows the caller to control which version of Bloom filters are read and written. However, even on platforms with a signed "char" type, it is possible to reuse existing Bloom filters if and only if there are no changed paths in any commit's first parent tree-diff whose characters have their highest bit set. When this is the case, we can reuse the existing filter without having to compute a new one. This is done by marking trees which are known to have (or not have) any such paths. When a commit's root tree is verified to not have any such paths, we mark it as such and declare that the commit's Bloom filter is reusable. Note that this heuristic only goes in one direction. If neither a commit nor its first parent have any paths in their trees with non-ASCII characters, then we know for certain that a path with non-ASCII characters will not appear in a tree-diff against that commit's first parent. The reverse isn't necessarily true: just because the tree-diff doesn't contain any such paths does not imply that no such paths exist in either tree. So we end up recomputing some Bloom filters that we don't strictly have to (i.e. their bits are the same no matter which version of murmur3 we use). But culling these out is impossible, since we'd have to perform the full tree-diff, which is the same effort as computing the Bloom filter from scratch. But because we can cache our results in each tree's flag bits, we can often avoid recomputing many filters, thereby reducing the time it takes to run $ git commit-graph write --changed-paths --reachable when upgrading from v1 to v2 Bloom filters. To benchmark this, let's generate a commit-graph in linux.git with v1 changed-paths in generation order[^1]: $ git clone git@github.com:torvalds/linux.git $ cd linux $ git commit-graph write --reachable --changed-paths $ graph=".git/objects/info/commit-graph" $ mv $graph{,.bak} Then let's time how long it takes to go from v1 to v2 filters (with and without the upgrade path enabled), resetting the state of the commit-graph each time: $ git config commitGraph.changedPathsVersion 2 $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \ 'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths' On linux.git (where there aren't any non-ASCII paths), the timings indicate that this patch represents a speed-up over recomputing all Bloom filters from scratch: Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s] Range (min … max): 124.621 s … 125.227 s 3 runs Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s] Range (min … max): 79.112 s … 79.437 s 3 runs Summary 'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran 1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths' On git.git, we do have some non-ASCII paths, giving us a more modest improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up. On my machine, the stats for git.git are: - 8,285 Bloom filters computed from scratch - 10 Bloom filters generated as empty - 4 Bloom filters generated as truncated due to too many changed paths - 65,114 Bloom filters were reused when transitioning from v1 to v2. [^1]: Note that this is is important, since `--stdin-packs` or `--stdin-commits` orders commits in the commit-graph by their pack position (with `--stdin-packs`) or in the raw input (with `--stdin-commits`). Since we compute Bloom filters in the same order that commits appear in the graph, we must see a commit's (first) parent before we process the commit itself. This is only guaranteed to happen when sorting commits by their generation number. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-25commit-graph: new Bloom filter version that fixes murmur3Taylor Blau
The murmur3 implementation in bloom.c has a bug when converting series of 4 bytes into network-order integers when char is signed (which is controllable by a compiler option, and the default signedness of char is platform-specific). When a string contains characters with the high bit set, this bug causes results that, although internally consistent within Git, does not accord with other implementations of murmur3 (thus, the changed path filters wouldn't be readable by other off-the-shelf implementatios of murmur3) and even with Git binaries that were compiled with different signedness of char. This bug affects both how Git writes changed path filters to disk and how Git interprets changed path filters on disk. Therefore, introduce a new version (2) of changed path filters that corrects this problem. The existing version (1) is still supported and is still the default, but users should migrate away from it as soon as possible. Because this bug only manifests with characters that have the high bit set, it may be possible that some (or all) commits in a given repo would have the same changed path filter both before and after this fix is applied. However, in order to determine whether this is the case, the changed paths would first have to be computed, at which point it is not much more expensive to just compute a new changed path filter. So this patch does not include any mechanism to "salvage" changed path filters from repositories. There is also no "mixed" mode - for each invocation of Git, reading and writing changed path filters are done with the same version number; this version number may be explicitly stated (typically if the user knows which version they need) or automatically determined from the version of the existing changed path filters in the repository. There is a change in write_commit_graph(). graph_read_bloom_data() makes it possible for chunk_bloom_data to be non-NULL but bloom_filter_settings to be NULL, which causes a segfault later on. I produced such a segfault while developing this patch, but couldn't find a way to reproduce it neither after this complete patch (or before), but in any case it seemed like a good thing to include that might help future patch authors. The value in t0095 was obtained from another murmur3 implementation using the following Go source code: package main import "fmt" import "github.com/spaolacci/murmur3" func main() { fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!"))) fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff})) } Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-25commit-graph: unconditionally load Bloom filtersTaylor Blau
In an earlier commit, we began ignoring the Bloom data ("BDAT") chunk for commit-graphs whose Bloom filters were computed using a hash version incompatible with the value of `commitGraph.changedPathVersion`. Now that the Bloom API has been hardened to discard these incompatible filters (with the exception of low-level APIs), we can safely load these Bloom filters unconditionally. We no longer want to return early from `graph_read_bloom_data()`, and similarly do not want to set the bloom_settings' `hash_version` field as a side-effect. The latter is because we want to wait until we know which Bloom settings we're using (either the defaults, from the GIT_TEST variables, or from the previous commit-graph layer) before deciding what hash_version to use. If we detect an existing BDAT chunk, we'll infer the rest of the settings (e.g., number of hashes, bits per entry, and maximum number of changed paths) from the earlier graph layer. The hash_version will be inferred from the previous layer as well, unless one has already been specified via configuration. Once all of that is done, we normalize the value of the hash_version to either "1" or "2". Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-25repo-settings: introduce commitgraph.changedPathsVersionTaylor Blau
A subsequent commit will introduce another version of the changed-path filter in the commit graph file. In order to control which version to write (and read), a config variable is needed. Therefore, introduce this config variable. For forwards compatibility, teach Git to not read commit graphs when the config variable is set to an unsupported version. Because we teach Git this, commitgraph.readChangedPaths is now redundant, so deprecate it and define its behavior in terms of the config variable we introduce. This commit does not change the behavior of writing (Git writes changed path filters when explicitly instructed regardless of any config variable), but a subsequent commit will restrict Git such that it will only write when commitgraph.changedPathsVersion is a recognized value. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-25commit-graph: ensure Bloom filters are read with consistent settingsTaylor Blau
The changed-path Bloom filter mechanism is parameterized by a couple of variables, notably the number of bits per hash (typically "m" in Bloom filter literature) and the number of hashes themselves (typically "k"). It is critically important that filters are read with the Bloom filter settings that they were written with. Failing to do so would mean that each query is liable to compute different fingerprints, meaning that the filter itself could return a false negative. This goes against a basic assumption of using Bloom filters (that they may return false positives, but never false negatives) and can lead to incorrect results. We have some existing logic to carry forward existing Bloom filter settings from one layer to the next. In `write_commit_graph()`, we have something like: if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) { struct commit_graph *g = ctx->r->objects->commit_graph; /* We have changed-paths already. Keep them in the next graph */ if (g && g->chunk_bloom_data) { ctx->changed_paths = 1; ctx->bloom_settings = g->bloom_filter_settings; } } , which drags forward Bloom filter settings across adjacent layers. This doesn't quite address all cases, however, since it is possible for intermediate layers to contain no Bloom filters at all. For example, suppose we have two layers in a commit-graph chain, say, {G1, G2}. If G1 contains Bloom filters, but G2 doesn't, a new G3 (whose base graph is G2) may be written with arbitrary Bloom filter settings, because we only check the immediately adjacent layer's settings for compatibility. This behavior has existed since the introduction of changed-path Bloom filters. But in practice, this is not such a big deal, since the only way up until this point to modify the Bloom filter settings at write time is with the undocumented environment variables: - GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY - GIT_TEST_BLOOM_SETTINGS_NUM_HASHES - GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS (it is still possible to tweak MAX_CHANGED_PATHS between layers, but this does not affect reads, so is allowed to differ across multiple graph layers). But in future commits, we will introduce another parameter to change the hash algorithm used to compute Bloom fingerprints itself. This will be exposed via a configuration setting, making this foot-gun easier to use. To prevent this potential issue, validate that all layers of a split commit-graph have compatible settings with the newest layer which contains Bloom filters. Reported-by: SZEDER Gábor <szeder.dev@gmail.com> Original-test-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-24Merge branch 'tb/commit-graph-use-tempfile'Junio C Hamano
"git update-server-info" and "git commit-graph --write" have been updated to use the tempfile API to avoid leaving cruft after failing. * tb/commit-graph-use-tempfile: server-info.c: remove temporary info files on exit commit-graph.c: remove temporary graph layers on exit
2024-06-20Merge branch 'ds/ahead-behind-fix'Junio C Hamano
Fix for a progress bar. * ds/ahead-behind-fix: commit-graph: increment progress indicator
2024-06-14global: introduce `USE_THE_REPOSITORY_VARIABLE` macroPatrick Steinhardt
Use of the `the_repository` variable is deprecated nowadays, and we slowly but steadily convert the codebase to not use it anymore. Instead, callers should be passing down the repository to work on via parameters. It is hard though to prove that a given code unit does not use this variable anymore. The most trivial case, merely demonstrating that there is no direct use of `the_repository`, is already a bit of a pain during code reviews as the reviewer needs to manually verify claims made by the patch author. The bigger problem though is that we have many interfaces that implicitly rely on `the_repository`. Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code units to opt into usage of `the_repository`. The intent of this macro is to demonstrate that a certain code unit does not use this variable anymore, and to keep it from new dependencies on it in future changes, be it explicit or implicit For now, the macro only guards `the_repository` itself as well as `the_hash_algo`. There are many more known interfaces where we have an implicit dependency on `the_repository`, but those are not guarded at the current point in time. Over time though, we should start to add guards as required (or even better, just remove them). Define the macro as required in our code units. As expected, most of our code still relies on the global variable. Nearly all of our builtins rely on the variable as there is no way yet to pass `the_repository` to their entry point. For now, declare the macro in "biultin.h" to keep the required changes at least a little bit more contained. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14hash: require hash algorithm in `oidread()` and `oidclr()`Patrick Steinhardt
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash function that shall be used. Require callers to pass in the hash algorithm to get rid of this implicit dependency. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`Patrick Steinhardt
Many of our hash functions have two variants, one receiving a `struct git_hash_algo` and one that derives it via `the_repository`. Adapt all of those functions to always require the hash algorithm as input and drop the variants that do not accept one. As those functions are now independent of `the_repository`, we can move them from "hash.h" to "hash-ll.h". Note that both in this and subsequent commits in this series we always just pass `the_repository->hash_algo` as input even if it is obvious that there is a repository in the context that we should be using the hash from instead. This is done to be on the safe side and not introduce any regressions. All callsites should eventually be amended to use a repo passed via parameters, but this is outside the scope of this patch series. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-12commit-graph: increment progress indicatorDerrick Stolee
This fixes a bug that was introduced by 368d19b0b7 (commit-graph: refactor compute_topological_levels(), 2023-03-20): Previously, the progress indicator was updated from `i + 1` where `i` is the loop variable of the enclosing `for` loop. After this patch, the update used `info->progress_cnt + 1` instead, however, unlike `i`, the `progress_cnt` attribute was not incremented. Let's increment it. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> [jc: squashed in a test update from Patrick Steinhardt] Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07commit-graph.c: remove temporary graph layers on exitTaylor Blau
Since the introduction of split commit graph layers in 92b1ea66b9a (Merge branch 'ds/commit-graph-incremental', 2019-07-19), the function write_commit_graph_file() has done the following when writing an incremental commit-graph layer: - used a lock_file to control access to the commit-graph-chain file - used an auxiliary file (whose descriptor was stored in 'fd') to write the new commit-graph layer itself Using a lock_file to control access to the commit-graph-chain is sensible, since only one writer may modify it at a time. Likewise, when the commit-graph machinery is writing out a single layer, the lock_file structure is used to modify the commit-graph itself. This is also sensible, since the non-incremental commit-graph may also have at most one writer. However, using an auxiliary temporary file without using the tempfile.h API means that writes that fail after the temporary graph layer has been created will leave around a file in $GIT_DIR/objects/info/commit-graphs/tmp_graph_XXXXXX The commit-graph-chain file and non-incremental commit-graph do not suffer from this problem as the lockfile.h API uses the tempfile.h API transparently, so processes that died before moving those finals into their final location cleaned up after themselves. Ensure that the temporary file used to write incremental commit-graphs is also managed with the tempfile.h API, to ensure that we do not ever leave tmp_graph_XXXXXX files laying around. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-17refs: pass repo when peeling objectsPatrick Steinhardt
Both `peel_object()` and `peel_iterated_oid()` implicitly rely on `the_repository` to look up objects. Despite the fact that we want to get rid of `the_repository`, it also leads to some restrictions in our ref iterators when trying to retrieve the peeled value for a repository other than `the_repository`. Refactor these functions such that both take a repository as argument and remove the now-unnecessary restrictions. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-07cocci: apply rules to rewrite callers of "refs" interfacesPatrick Steinhardt
Apply the rules that rewrite callers of "refs" interfaces to explicitly pass `struct ref_store`. The resulting patch has been applied with the `--whitespace=fix` option. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-26Merge branch 'ps/commit-graph-write-leakfix'Junio C Hamano
Leakfix. * ps/commit-graph-write-leakfix: commit-graph: fix memory leak when not writing graph
2024-01-16Merge branch 'jk/commit-graph-slab-clear-fix'Junio C Hamano
Clearing in-core repository (happens during e.g., "git fetch --recurse-submodules" with commit graph enabled) made in-core commit object in an inconsistent state by discarding the necessary data from commit-graph too early, which has been corrected. * jk/commit-graph-slab-clear-fix: commit-graph: retain commit slab when closing NULL commit_graph
2024-01-15commit-graph: fix memory leak when not writing graphPatrick Steinhardt
When `write_commit_graph()` bails out writing a split commit-graph early then it may happen that we have already gathered the set of existing commit-graph file names without yet determining the new merged set of files. This can result in a memory leak though because we only clear the preimage of files when we have collected the postimage. Fix this issue by dropping the condition altogether so that we always try to free both preimage and postimage filenames. As the context structure is zero-initialized this simplification is safe to do. Signed-off-by: Patrick Steinhardt <ps@pks.im> Acked-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-08Merge branch 'en/header-cleanup'Junio C Hamano
Remove unused header "#include". * en/header-cleanup: treewide: remove unnecessary includes in source files treewide: add direct includes currently only pulled in transitively trace2/tr2_tls.h: remove unnecessary include submodule-config.h: remove unnecessary include pkt-line.h: remove unnecessary include line-log.h: remove unnecessary include http.h: remove unnecessary include fsmonitor--daemon.h: remove unnecessary includes blame.h: remove unnecessary includes archive.h: remove unnecessary include treewide: remove unnecessary includes in source files treewide: remove unnecessary includes from header files
2024-01-05commit-graph: retain commit slab when closing NULL commit_graphJeff King
This fixes a regression introduced in ac6d45d11f (commit-graph: move slab-clearing to close_commit_graph(), 2023-10-03), in which running: git -c fetch.writeCommitGraph=true fetch --recurse-submodules multiple times in a freshly cloned repository causes a segfault. What happens in the second (and subsequent) runs is this: 1. We make a "struct commit" for any ref tips which we're storing (even if we already have them, they still go into FETCH_HEAD). Because the first run will have created a commit graph, we'll find those commits in the graph. The commit struct is therefore created with a NULL "maybe_tree" entry, because we can load its oid from the graph later. But to do that we need to remember that we got the commit from the graph, which is recorded in a global commit_graph_data_slab object. 2. Because we're using --recurse-submodules, we'll try to fetch each of the possible submodules. That implies creating a separate "struct repository" in-process for each submodule, which will require a later call to repo_clear(). The call to repo_clear() calls raw_object_store_clear(), which in turn calls close_object_store(), which in turn calls close_commit_graph(). And the latter frees the commit graph data slab. 3. Later, when trying to write out a new commit graph, we'll ask for their tree oid via get_commit_tree_oid(), which will see that the object is parsed but with a NULL maybe_tree field. We'd then usually pull it from the graph file, but because the slab was cleared, we don't realize that we can do so! We end up returning NULL and segfaulting. (It seems questionable that we'd write a graph entry for such a commit anyway, since we know we already have one. I didn't double-check, but that may simply be another side effect of having cleared the slab). The bug is in step (2) above. We should not be clearing the slab when cleaning up the submodule repository structs. Prior to ac6d45d11f, we did not do so because it was done inside a helper function that returned early when it saw NULL. So the behavior change from that commit is that we'll now _always_ clear the slab via repo_clear(), even if the repository being closed did not have a commit graph (and thus would have a NULL commit_graph struct). The most immediate fix is to add in a NULL check in close_commit_graph(), making it a true noop when passed in an object_store with a NULL commit_graph (it's OK to just return early, since the rest of its code is already a noop when passed NULL). That restores the pre-ac6d45d11f behavior. And that's what this patch does, along with a test that exercises it (we already have a test that uses submodules along with fetch.writeCommitGraph, but the bug only triggers when there is a subsequent fetch and when that fetch uses --recurse-submodules). So that fixes the regression in the least-risky way possible. I do think there's some fragility here that we might want to follow up on. We have a global commit_graph_data_slab that contains graph positions, and our global commit structs depend on the that slab remaining valid. But close_commit_graph() is just about closing _one_ object store's graph. So it's dangerous to call that function and clear the slab without also throwing away any "struct commit" we might have parsed that depends on it. Which at first glance seems like a bug we could already trigger. In the situation described here, there is no commit graph in the submodule repository, so our commit graph is NULL (in fact, in our test script there is no submodule repo at all, so we immediately return from repo_init() and call repo_clear() only to free up memory). But what would happen if there was one? Wouldn't we see a non-NULL commit_graph entry, and then clear the global slab anyway? The answer is "no", but for very bizarre reasons. Remember that repo_clear() calls raw_object_store_clear(), which then calls close_object_store() and thus close_commit_graph(). But before it does so, raw_object_store_clear() does something else: it frees the commit graph and sets it to NULL! So by this code path we'll _never_ see a non-NULL commit_graph struct, and thus never clear the slab. So it happens to work out. But it still seems questionable to me that we would clear a global slab (which might still be in use) when closing the commit graph. This clearing comes from 957ba814bf (commit-graph: when closing the graph, also release the slab, 2021-09-08), and was fixing a case where we really did need it to be closed (and in that case we presumably call close_object_store() more directly). So I suspect there may still be a bug waiting to happen there, as any object loaded before the call to close_object_store() may be stranded with a bogus maybe_tree entry (and thus looking at it after the call might cause an error). But I'm not sure how to trigger it, nor what the fix should look like (you probably would need to "unparse" any objects pulled from the graph). And so this patch punts on that for now in favor of fixing the recent regression in the most direct way, which should not have any other fallouts. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: add direct includes currently only pulled in transitivelyElijah Newren
The next commit will remove a bunch of unnecessary includes, but to do so, we need some of the lower level direct includes that files rely on to be explicitly specified. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>