user/sven/git.git/list-objects.c, branch v2.26.0-rc2

Merge branch 'jk/list-objects-optim-wo-trees'

2019-10-07T02:32:56Z

The object traversal machinery has been optimized not to load tree objects when we are only interested in commit history. * jk/list-objects-optim-wo-trees: list-objects: don't queue root trees unless revs->tree_objects is set

list-objects: don't queue root trees unless revs->tree_objects is set

2019-09-12T18:47:30Z

When traverse_commit_list() processes each commit, it queues the commit's root tree in the pending array. Then, after all commits are processed, it calls traverse_trees_and_blobs() to walk over the pending list, calling process_tree() on each. But if revs->tree_objects is not set, process_tree() just exists immediately! We can save ourselves some work by not even bothering to queue these trees in the first place. There are a few subtle points to make: - we also detect commits with a NULL tree pointer here. But this isn't an interesting check for broken commits, since the lookup_tree() we'd have done during commit parsing doesn't actually check that we have the tree on disk. So we're not losing any robustness. - besides queueing, we also set the NOT_USER_GIVEN flag on the tree object. This is used by the traverse_commit_list_filtered() variant. But if we're not exploring trees, then we won't actually care about this flag, which is used only inside process_tree() code-paths. - queueing trees eventually leads to us queueing blobs, too. But we don't need to check revs->blob_objects here. Even in the current code, we still wouldn't find those blobs, because we'd never open up the tree objects to list their contents. - the user-visible impact to the caller is minimal. The pending trees are all cleared by the time the function returns anyway, by traverse_trees_and_blobs(). We do call a show_commit() callback, which technically could be looking at revs->pending during the callback. But it seems like a rather unlikely thing to do (if you want the tree of the current commit, then accessing the tree struct member is a lot simpler). So this should be safe to do. Let's look at the benefits: [before] Benchmark #1: git -C linux rev-list HEAD >/dev/null Time (mean ± σ): 7.651 s ± 0.021 s [User: 7.399 s, System: 0.252 s] Range (min … max): 7.607 s … 7.683 s 10 runs [after] Benchmark #1: git -C linux rev-list HEAD >/dev/null Time (mean ± σ): 7.593 s ± 0.023 s [User: 7.329 s, System: 0.264 s] Range (min … max): 7.565 s … 7.634 s 10 runs Not too impressive, but then we're really just avoiding sticking a pointer into a growable array. But still, I'll take a free 0.75% speedup. Let's try it after running "git commit-graph write": [before] Benchmark #1: git -C linux rev-list HEAD >/dev/null Time (mean ± σ): 1.458 s ± 0.011 s [User: 1.199 s, System: 0.259 s] Range (min … max): 1.447 s … 1.481 s 10 runs [after] Benchmark #1: git -C linux rev-list HEAD >/dev/null Time (mean ± σ): 1.126 s ± 0.023 s [User: 896.5 ms, System: 229.0 ms] Range (min … max): 1.106 s … 1.181 s 10 runs Now that's more like it. We saved over 22% of the total time. Part of that is because the runtime is shorter overall, but the absolute improvement is also much larger. What's going on? When we fill in a commit struct using the commit graph, we don't bother to set the tree pointer, and instead lazy-load it when somebody calls get_commit_tree(). So we're not only skipping the pointer write to the pending queue, but we're skipping the lazy-load of the tree entirely. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

list-objects-filter: encapsulate filter components

2019-06-28T15:41:53Z

Encapsulate filter_fn, filter_free_fn, and filter_data into their own opaque struct. Due to opaqueness, filter_fn and filter_free_fn can no longer be accessed directly by users. Currently, all usages of filter_fn are guarded by a necessary check: (obj->flags & NOT_USER_GIVEN) && filter_fn Take the opportunity to include this check into the new function list_objects_filter__filter_object(), so that we no longer need to write this check at every caller of the filter function. Also, the init functions in list-objects-filter.c no longer need to confusingly return the filter constituents in various places (filter_fn and filter_free_fn as out parameters, and filter_data as the function's return value); they can just initialize the "struct filter" passed in. Helped-by: Jeff Hostetler Helped-by: Jonathan Tan Helped-by: Junio C Hamano Signed-off-by: Matthew DeVore Signed-off-by: Junio C Hamano

rev-list: detect broken root trees

2019-04-10T03:59:39Z

When the traversal machinery sees a commit without a root tree, it assumes that the tree was part of a BOUNDARY commit, and quietly ignores the tree. But it could also be caused by a commit whose root tree is broken or missing. Instead, let's die() when we see a NULL root tree. We can differentiate it from the BOUNDARY case by seeing if the commit was actually parsed. This covers that case, plus future-proofs us against any others where we might try to show an unparsed commit. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

list-objects.c: handle unexpected non-tree entries

2019-04-10T03:59:39Z

Apply similar treatment as the previous commit for non-tree entries, too. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

list-objects.c: handle unexpected non-blob entries

2019-04-10T03:59:39Z

Fix one of the cases described in the previous commit where a tree-entry that is promised to a blob is in fact a non-blob. When 'lookup_blob()' returns NULL, it is because Git has cached the requested object as a non-blob. In this case, prevent a SIGSEGV by 'die()'-ing immediately before attempting to dereference the result. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

Merge branch 'ds/push-sparse-tree-walk'

2019-02-07T06:05:25Z

"git pack-objects" learned another algorithm to compute the set of objects to send, that trades the resulting packfile off to save traversal cost to favor small pushes. * ds/push-sparse-tree-walk: pack-objects: create GIT_TEST_PACK_SPARSE pack-objects: create pack.useSparse setting revision: implement sparse algorithm list-objects: consume sparse tree walk revision: add mark_tree_uninteresting_sparse

Merge branch 'bc/tree-walk-oid'

2019-01-29T20:47:56Z

The code to walk tree objects has been taught that we may be working with object names that are not computed with SHA-1. * bc/tree-walk-oid: cache: make oidcpy always copy GIT_MAX_RAWSZ bytes tree-walk: store object_id in a separate member match-trees: use hashcpy to splice trees match-trees: compute buffer offset correctly when splicing tree-walk: copy object ID before use

list-objects: consume sparse tree walk

2019-01-17T21:44:39Z

When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. This commit walk is not changing during this series. The mark_edges_uninteresting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly different way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. Add a '--sparse' flag in 'git pack-objects' to call this new logic. Add a new test script t/t5322-pack-objects-sparse.sh that tests this option. The tests currently demonstrate that the resulting object list is the same as the old algorithm. This includes a case where both algorithms pack an object that is not needed by a remote due to limits on the explored set of trees. When the sparse algorithm is changed in a later commit, we will add a test that demonstrates a change of behavior in some cases. Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

tree-walk: store object_id in a separate member

2019-01-15T17:57:41Z

When parsing a tree, we read the object ID directly out of the tree buffer. This is normally fine, but such an object ID cannot be used with oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1, there may not be that many bytes to copy. Instead, store the object ID in a separate struct member. Since we can no longer efficiently compute the path length, store that information as well in struct name_entry. Ensure we only copy the object ID into the new buffer if the path length is nonzero, as some callers will pass us an empty path with no object ID following it, and we will not want to read past the end of the buffer. Signed-off-by: brian m. carlson Signed-off-by: Junio C Hamano