<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/git.git/xdiff, branch origin/maint</title>
<subtitle>Git
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=origin%2Fmaint</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=origin%2Fmaint'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/'/>
<updated>2025-12-08T22:54:55Z</updated>
<entry>
<title>Merge branch 'yc/xdiff-patience-optim'</title>
<updated>2025-12-08T22:54:55Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2025-12-08T22:54:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=7fc0b33b5d9ced74819132a094528154f83f4a6a'/>
<id>urn:sha1:7fc0b33b5d9ced74819132a094528154f83f4a6a</id>
<content type='text'>
The way patience diff finds LCS has been optimized.

* yc/xdiff-patience-optim:
  xdiff: optimize patience diff's LCS search
</content>
</entry>
<entry>
<title>Merge branch 'en/xdiff-cleanup-2'</title>
<updated>2025-12-05T05:49:56Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2025-12-05T05:49:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=0c6707687f818fa43dca7c9381c55e611ba6c51e'/>
<id>urn:sha1:0c6707687f818fa43dca7c9381c55e611ba6c51e</id>
<content type='text'>
Code clean-up.

* en/xdiff-cleanup-2:
  xdiff: rename rindex -&gt; reference_index
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: use size_t for xrecord_t.size
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use ptrdiff_t for dstart/dend
  doc: define unambiguous type mappings across C and Rust
</content>
</entry>
<entry>
<title>xdiff: optimize patience diff's LCS search</title>
<updated>2025-11-28T03:11:41Z</updated>
<author>
<name>Yee Cheng Chin</name>
<email>ychin.git@gmail.com</email>
</author>
<published>2025-11-27T02:16:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=c7e3b8085bb2f74371f5017f42c58b0acf01b915'/>
<id>urn:sha1:c7e3b8085bb2f74371f5017f42c58b0acf01b915</id>
<content type='text'>
The find_longest_common_sequence() function in patience diff is
inefficient as it calls binary_search() for every unique line it
encounters when deciding where to put it in the sequence. From
instrumentation (using xctrace) on popular repositories, binary_search()
takes up 50-60% of the run time within patience_diff() when performing a
diff.

To optimize this, add a boundary condition check before binary_search()
is called to see if the encountered unique line is located after the
entire currently tracked longest subsequence. If so, skip the
unnecessary binary search and simply append the entry to the end of
sequence. Given that most files compared in a diff are usually quite
similar to each other, this condition is very common, and should be hit
much more frequently than the binary search.

Below are some end-to-end performance results by timing `git log
--shortstat --oneline -500 --patience` on different repositories with
the old and new code. Generally speaking this seems to give at least
8-10% speed up. The "binary search hit %" column describes how often the
algorithm enters the binary search path instead of the new faster path.
Even in the WebKit case we can see that it's quite rare (1.46%).

| Repo     | Speed difference | binary search hit % |
|----------|------------------|---------------------|
| vim      | 1.27x            | 0.01%               |
| pytorch  | 1.16x            | 0.02%               |
| cpython  | 1.14x            | 0.06%               |
| ripgrep  | 1.14x            | 0.03%               |
| git      | 1.13x            | 0.12%               |
| vscode   | 1.09x            | 0.10%               |
| WebKit   | 1.08x            | 1.46%               |

The benchmarks were done using hyperfine, on an Apple M1 Max laptop,
with git compiled with `-O3 -flto`.

Signed-off-by: Yee Cheng Chin &lt;ychin.git@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: rename rindex -&gt; reference_index</title>
<updated>2025-11-18T22:53:11Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:22Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=22ce0cb6397d3d15c21c217696f262c4b8eb44b3'/>
<id>urn:sha1:22ce0cb6397d3d15c21c217696f262c4b8eb44b3</id>
<content type='text'>
The classic diff adds only the lines that it's going to consider,
during the diff, to an array. A mapping between the compacted
array, and the lines of the file that they reference, is
facilitated by this array.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: change rindex from long to size_t in xdfile_t</title>
<updated>2025-11-18T22:53:11Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=5004a8da14e2aa80b5697b0a3a60e594af1c8292'/>
<id>urn:sha1:5004a8da14e2aa80b5697b0a3a60e594af1c8292</id>
<content type='text'>
The field rindex describes an index offset for other arrays. Change it
to size_t.

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: make xdfile_t.nreff a size_t instead of long</title>
<updated>2025-11-18T22:53:11Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=e35877eadbd9bee473936577c82abca9c8333abd'/>
<id>urn:sha1:e35877eadbd9bee473936577c82abca9c8333abd</id>
<content type='text'>
size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: make xdfile_t.nrec a size_t instead of long</title>
<updated>2025-11-18T22:53:10Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=016538780e9f6e83a1d9c7b0ec771fb6c5583c0f'/>
<id>urn:sha1:016538780e9f6e83a1d9c7b0ec771fb6c5583c0f</id>
<content type='text'>
size_t is used because nrec describes the number of elements for both
recs, and for 'changed' + 2.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash</title>
<updated>2025-11-18T22:53:10Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=6a26019c81faa07ba811541b4cf35be9e8ee1ead'/>
<id>urn:sha1:6a26019c81faa07ba811541b4cf35be9e8ee1ead</id>
<content type='text'>
The ha field is serving two different purposes, which makes the code
harder to read. At first glance, it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: a straightforward hash of a line, independent of any
    external context. Its type is uint64_t, as it comes from a fixed
    width hash function.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files. A size_t is used here because it's meant to be used to
    index an array. This also avoids ` as usize` casts on the Rust
    side when using it to index a slice.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: use unambiguous types in xdl_hash_record()</title>
<updated>2025-11-18T22:53:10Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=b0d4ae30f5a23fa9da87e9396b78e6442b351ddc'/>
<id>urn:sha1:b0d4ae30f5a23fa9da87e9396b78e6442b351ddc</id>
<content type='text'>
Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long to uint64_t so that the hash output is consistent across
platforms. `flags` was changed from long to uint64_t to ensure the
high order bits are not dropped on platforms that treat long as 32
bits.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>xdiff: use size_t for xrecord_t.size</title>
<updated>2025-11-18T22:53:10Z</updated>
<author>
<name>Ezekiel Newren</name>
<email>ezekielnewren@gmail.com</email>
</author>
<published>2025-11-18T22:34:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9bd193253c5e590203fc566ad7cff8f891ec0493'/>
<id>urn:sha1:9bd193253c5e590203fc566ad7cff8f891ec0493</id>
<content type='text'>
size_t is the appropriate type because size is describing the number of
elements, bytes in this case, in memory.

Signed-off-by: Ezekiel Newren &lt;ezekielnewren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
