<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/git.git/pack-objects.c, branch v2.26.0-rc2</title>
<subtitle>Git
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v2.26.0-rc2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v2.26.0-rc2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/'/>
<updated>2020-02-24T20:55:52Z</updated>
<entry>
<title>pack-objects: convert oe_set_delta_ext() to use object_id</title>
<updated>2020-02-24T20:55:52Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-02-24T04:30:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=a93c141ddef25dc999fff73c590b42d3af606ff3'/>
<id>urn:sha1:a93c141ddef25dc999fff73c590b42d3af606ff3</id>
<content type='text'>
We already store an object_id internally, and now our sole caller also
has one. Let's stop passing around the internal hash array, which adds a
bit of type safety.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jk/optim-in-pack-idx-conversion'</title>
<updated>2019-12-01T17:04:38Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2019-12-01T17:04:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=8faff3899e1fdefbdd143eaf5ce3b96532205bc7'/>
<id>urn:sha1:8faff3899e1fdefbdd143eaf5ce3b96532205bc7</id>
<content type='text'>
Code clean-up.

* jk/optim-in-pack-idx-conversion:
  pack-objects: avoid pointless oe_map_new_pack() calls
</content>
</entry>
<entry>
<title>pack-objects: avoid pointless oe_map_new_pack() calls</title>
<updated>2019-11-12T04:36:36Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-11-11T11:12:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=f66e0401abaa884aa91100b4c155c4d042c47e0d'/>
<id>urn:sha1:f66e0401abaa884aa91100b4c155c4d042c47e0d</id>
<content type='text'>
This patch fixes an extreme slowdown in pack-objects when you have more
than 1023 packs. See below for numbers.

Since 43fa44fa3b (pack-objects: move in_pack out of struct object_entry,
2018-04-14), we use a complicated system to save some per-object memory.

Each object_entry structs gets a 10-bit field to store the index of the
pack it's in. We map those indices into pointers using
packing_data-&gt;in_pack_by_idx, which we initialize at the start of the
program. If we have 2^10 or more packs, then we instead create an array
of pack pointers, one per object. This is packing_data-&gt;in_pack.

So far so good. But there's one other tricky case: if a new pack arrives
after we've initialized in_pack_by_idx, it won't have an index yet. We
solve that by calling oe_map_new_pack(), which just switches on the fly
to the less-optimal in_pack mechanism, allocating the array and
back-filling it for already-seen objects.

But that logic kicks in even when we've switched to it already (whether
because we really did see a new pack, or because we had too many packs
in the first place). The result doesn't produce a wrong outcome, but
it's very slow. What happens is this:

  - imagine you have a repo with 500k objects and 2000 packs that you
    want to repack.

  - before looking at any objects, we call prepare_in_pack_by_idx(). It
    starts allocating an index for each pack. On the 1024th pack, it
    sees there are too many, so it bails, leaving in_pack_by_idx as
    NULL.

  - while actually adding objects to the packing list, we call
    oe_set_in_pack(), which checks whether the pack already has an
    index. If it's one of the packs after the first 1023, then it
    doesn't have one, and we'll call oe_map_new_pack().

    But there's no useful work for that function to do. We're already
    using in_pack, so it just uselessly walks over the complete list of
    objects, trying to backfill in_pack.

    And we end up doing this for almost 1000 packs (each of which may be
    triggered by more than one object). And each time it triggers, we
    may iterate over up to 500k objects. So in the absolute worst case,
    this is quadratic in the number of objects.

The solution is simple: we don't need to bother checking whether the
pack has an index if we've already converted to using in_pack, since by
definition we're not going to use it. So we can just push the "does the
pack have a valid index" check down into that half of the conditional,
where we know we're going to use it.

The current test in p5303 sadly doesn't notice this problem, since it
maxes out at 1000 packs. If we add a new test to it at 2000 packs, it
does show the improvement:

  Test                      HEAD^               HEAD
  ----------------------------------------------------------------------
  5303.12: repack (2000)    26.72(39.68+0.67)   15.70(28.70+0.66) -41.2%

However, these many-pack test cases are rather expensive to run, so
adding larger and larger numbers isn't appealing. Instead, we can show
it off more easily by using GIT_TEST_FULL_IN_PACK_ARRAY, which forces us
into the absolute worst case: no pack has an index, so we'll trigger
oe_map_new_pack() pointlessly for every single object, making it truly
quadratic.

Here are the numbers (on git.git) with the included change to p5303:

  Test                      HEAD^               HEAD
  ----------------------------------------------------------------------
  5303.3: rev-list (1)      2.05(1.98+0.06)     2.06(1.99+0.06) +0.5%
  5303.4: repack (1)        33.45(33.46+0.19)   2.75(2.73+0.22) -91.8%
  5303.6: rev-list (50)     2.07(2.01+0.06)     2.06(2.01+0.05) -0.5%
  5303.7: repack (50)       34.21(35.18+0.16)   3.49(4.50+0.12) -89.8%
  5303.9: rev-list (1000)   2.87(2.78+0.08)     2.88(2.80+0.07) +0.3%
  5303.10: repack (1000)    41.26(51.30+0.47)   10.75(20.75+0.44) -73.9%

Again, those improvements aren't realistic for the 1-pack case (because
in the real world, the full-array solution doesn't kick in), but it's
more useful to be testing the more-complicated code path.

While we're looking at this issue, we'll tweak one more thing: in
oe_map_new_pack(), we call REALLOC_ARRAY(pack-&gt;in_pack). But we'd never
expect to get here unless we're back-filling it for the first time, in
which case it would be NULL. So let's switch that to ALLOC_ARRAY() for
clarity, and add a BUG() to document the expectation. Unfortunately this
code isn't well-covered in the test suite because it's inherently racy
(it only kicks in if somebody else adds a new pack while we're in the
middle of repacking).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: drop packlist index_pos optimization</title>
<updated>2019-09-06T18:03:42Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-09-06T01:36:05Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=3a37876b5dca4c18bda67bcdead9c1d79a59933d'/>
<id>urn:sha1:3a37876b5dca4c18bda67bcdead9c1d79a59933d</id>
<content type='text'>
Once upon a time, the code to add an object to our packing list in
pack-objects all lived in a single function. It computed the position
within the hash table once, then used it to check if the object was
already present, and if not, to add it.

Later, in 2834bc27c1 (pack-objects: refactor the packing list,
2013-10-24), this was split into two functions: packlist_find() and
packlist_alloc(). We ended up with an "index_pos" variable that gets
passed through several functions to make it from one to the other.

The resulting code is rather confusing to follow. The "index_pos"
variable is sometimes undefined, if we don't yet have a hash table. This
works out in practice because in that case packlist_alloc() won't use it
at all, since it will have to create/grow the hash table. But it's hard
to verify that, and it does cause gcc 9.2.1's -Wmaybe-uninitialized to
complain when compiled with "-flto -O3" (rightfully, since we do pass
the uninitialized value as a function parameter, even if nobody ends up
using it).

All of this is to save computing the hash index again when we're
inserting into the hash table, which I found doesn't make a measurable
difference in the program runtime (which is not surprising, since we're
doing all kinds of other heavyweight things for each object).

Let's just drop this index_pos variable entirely, simplifying the code
(and pleasing the compiler).

We might be better still refactoring this custom hash table to use one
of our existing implementations (an oidmap, or a kh_oid_map). I stopped
short of that here, but this would be the likely first step towards that
anyway.

Reported-by: Stephan Beyer &lt;s-beyer@gmx.net&gt;
Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: use object_id in packlist_alloc()</title>
<updated>2019-09-06T18:03:39Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-09-05T22:52:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=f1cbd033e201a18c7175bc6509b48d6243e79739'/>
<id>urn:sha1:f1cbd033e201a18c7175bc6509b48d6243e79739</id>
<content type='text'>
The only caller of packlist_alloc() already has a "struct object_id",
and we immediately copy the hash they pass us into our own object_id.
Let's avoid the unnecessary round-trip to a raw sha1 pointer.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>hashmap: convert sha1hash() to oidhash()</title>
<updated>2019-06-20T17:44:22Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-06-20T07:41:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=d40abc8e95f75b529feb140178b69a3783c2d108'/>
<id>urn:sha1:d40abc8e95f75b529feb140178b69a3783c2d108</id>
<content type='text'>
There are no callers left of sha1hash() that do not simply pass the
"hash" member of a "struct object_id". Let's get rid of the outdated
sha1-specific function and provide one that operates on the whole struct
(even though the technique, taking the first few bytes of the hash, will
remain the same).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: convert locate_object_entry_hash() to object_id</title>
<updated>2019-06-20T17:03:32Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-06-20T07:41:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=5e7ac6802823ec23759b1b986315bd65b0881ff9'/>
<id>urn:sha1:5e7ac6802823ec23759b1b986315bd65b0881ff9</id>
<content type='text'>
There are no callers of locate_object_entry_hash() that aren't just
passing us the "hash" member of a "struct object_id". Let's take the
whole struct, which gets us closer to removing all raw sha1 variables.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: convert packlist_find() to use object_id</title>
<updated>2019-06-20T16:54:58Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-06-20T07:41:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=3df28caefb2193fb7bbc87a427a620d96d508c8d'/>
<id>urn:sha1:3df28caefb2193fb7bbc87a427a620d96d508c8d</id>
<content type='text'>
We take a raw hash pointer, but most of our callers have a "struct
object_id" already. Let's switch to taking the full struct, which will
let us continue removing uses of raw sha1 buffers.

There are two callers that do need special attention:

  - in rebuild_existing_bitmaps(), we need to switch to
    nth_packed_object_oid(). This incurs an extra hash copy over
    pointing straight to the mmap'd sha1, but it shouldn't be measurable
    compared to the rest of the operation.

  - in can_reuse_delta() we already spent the effort to copy the sha1
    into a "struct object_id", but now we just have to do so a little
    earlier in the function (we can't easily convert that function's
    callers because they may be pointing at mmap'd REF_DELTA blocks).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: drop unused parameter from oe_map_new_pack()</title>
<updated>2019-02-14T23:26:15Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-02-14T05:50:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=c409d108b857799ae699654d2fc33b063c9aef9d'/>
<id>urn:sha1:c409d108b857799ae699654d2fc33b063c9aef9d</id>
<content type='text'>
Since 43fa44fa3b (pack-objects: move in_pack out of struct object_entry,
2018-04-14), we store the source pack for each object as a small index
rather than as a pointer. When we see a new pack that has no allocated
index, we fall back to generating an array of pointers by calling
oe_map_new_pack().

Perhaps counter-intuitively, that function does not need to actually see
our new index-less pack. It only allocates and populates the array with
the existing packs, after which oe_set_in_pack() actually adds the new
pack to the array.

Let's drop the unused "struct packed_git" argument to oe_map_new_pack()
to avoid confusion.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ph/pack-objects-mutex-fix'</title>
<updated>2019-02-05T22:26:16Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2019-02-05T22:26:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=d243a323a545da68b87149e885f2e440f0b13725'/>
<id>urn:sha1:d243a323a545da68b87149e885f2e440f0b13725</id>
<content type='text'>
"git pack-objects" incorrectly used uninitialized mutex, which has
been corrected.

* ph/pack-objects-mutex-fix:
  pack-objects: merge read_lock and lock in packing_data struct
  pack-objects: move read mutex to packing_data struct
</content>
</entry>
</feed>
