user/sven/git.git/diffcore-delta.c, branch v1.8.0.2

Fix diff -B/--dirstat miscounting of newly added contents

2009-12-05T18:54:17Z

What used to happen is that diffcore_count_changes() simply ignored any hashes in the destination that didn't match hashes in the source. EXCEPT if the source hash didn't exist at all, in which case it would count _one_ destination hash that happened to have the "next" hash value.

As a consequence, newly added material was often undercounted, making output from --dirstat and "complete rewrite" detection used by -B unreliable.

This changes it so that:

 - whenever it bypasses a destination hash (because it doesn't match a source), it counts the bytes associated with that as "literal added"

 - at the end (once we have used up all the source hashes), we do the same thing with the remaining destination hashes.

 - when hashes do match, and we use the difference in counts as a value, we also use up that destination hash entry (the 'd++').

Signed-off-by: Linus Torvalds
Signed-off-by: Junio C Hamano
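The three rules in this message can be sketched as a merge over two hash arrays. This is a minimal illustration, not the code from the patch: it assumes both arrays are sorted by hash value and end with a zero-count entry, and the names `struct spanhash`, `count_changes`, `src_copied` and `literal_added` are chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative entry: a chunk hash and the number of bytes that hashed to it. */
struct spanhash {
    unsigned int hashval;
    unsigned int cnt;   /* 0 marks the end of the array */
};

/*
 * Walk two arrays sorted by hashval, each terminated by a cnt == 0 entry,
 * and report how many bytes were copied from the source and how many
 * were literally added in the destination.
 */
void count_changes(const struct spanhash *s, const struct spanhash *d,
                   unsigned long *src_copied, unsigned long *literal_added)
{
    unsigned long sc = 0, la = 0;

    assert(s && d && src_copied && literal_added);

    while (s->cnt) {
        unsigned int dst_cnt = 0;

        /* Bypassed destination hashes have no source match: count as added. */
        while (d->cnt && d->hashval < s->hashval) {
            la += d->cnt;
            d++;
        }
        /* On a match, take the count and use up the destination entry. */
        if (d->cnt && d->hashval == s->hashval) {
            dst_cnt = d->cnt;
            d++;
        }
        if (s->cnt < dst_cnt) {
            la += dst_cnt - s->cnt;
            sc += s->cnt;
        } else {
            sc += dst_cnt;
        }
        s++;
    }
    /* Source exhausted: everything left in the destination is new. */
    while (d->cnt) {
        la += d->cnt;
        d++;
    }
    *src_copied = sc;
    *literal_added = la;
}
```

Without the inner skip loop and the trailing loop, destination-only hashes would contribute nothing to the "added" total, which is the undercounting described above.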

optimize diffcore-delta by sorting hash entries.

2007-10-04T07:05:36Z

Here's a test-patch. I don't guarantee anything, except that when I did the timings I also did a "wc" on the result, and they matched.

Before:

    [torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. | wc
       7104   28574  438020

    real    0m10.526s
    user    0m10.401s
    sys     0m0.136s

After:

    [torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. | wc
       7104   28574  438020

    real    0m8.876s
    user    0m8.761s
    sys     0m0.128s

but the diff is fairly simple, so if somebody will go over it and say whether it's likely to be *correct* too, that 15% may well be worth it.

[ Side note, without rename detection, that diff takes just under three seconds for me, so in that sense the improvement to the rename detection itself is larger than the overall 15% - it brings the cost of just rename detection from 7.5s to 5.9s, which would be on the order of just over a 20% performance improvement. ]

Hmm. The patch depends on half-way subtle issues like the fact that the hashtables are guaranteed to not be full => we're guaranteed to have zero counts at the end => we don't need to do any steenking iterator count in the loop. A few comments might be in order.

Linus
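The "half-way subtle" point can be illustrated with a small sketch. It assumes a table of hash entries in which unused slots have a zero count, sorts used entries by hash value with the unused slots last, and then walks the result with no index bound, relying on an empty slot as the terminator. The names `struct spanhash`, `spanhash_cmp` and `sort_and_count` are illustrative choices for this sketch, not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative entry: cnt == 0 means the slot is unused. */
struct spanhash {
    unsigned int hashval;
    unsigned int cnt;
};

/* Order used entries by hashval and push unused (cnt == 0) slots to the end. */
int spanhash_cmp(const void *a_, const void *b_)
{
    const struct spanhash *a = a_;
    const struct spanhash *b = b_;

    if (!a->cnt)
        return !b->cnt ? 0 : 1;
    if (!b->cnt)
        return -1;
    return a->hashval < b->hashval ? -1 : a->hashval > b->hashval;
}

/*
 * Sort the table, then count used entries by walking to the first
 * zero-count slot. The caller must guarantee the table is not full.
 */
size_t sort_and_count(struct spanhash *table, size_t nslots)
{
    size_t n = 0;

    qsort(table, nslots, sizeof(*table), spanhash_cmp);
    /* After sorting, a not-full table must have an empty last slot. */
    assert(nslots > 0 && table[nslots - 1].cnt == 0);

    while (table[n].cnt)   /* no bounds check: the empty slot ends the walk */
        n++;
    return n;
}
```

The walk is only safe because of the not-full guarantee; a completely full table would run off the end, which is why the message asks for comments in the code.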

Introduce diff_filespec_is_binary()

2007-07-06T07:21:41Z

This replaces an explicit initialization of the filespec->is_binary field used for rename/break, followed by direct access to that field, with a wrapper function that lazily initializes and accesses the field. We would add more attribute accesses for the use of diff routines, and it would be better to make this abstraction earlier. Signed-off-by: Junio C Hamano
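The lazy-initialization pattern described here can be sketched as follows. This is an illustration under stated assumptions: `struct filespec`, `looks_binary` and `filespec_is_binary` are stand-ins invented for this sketch, the "not yet determined" marker of -1 is an assumption, and the NUL-byte check is a placeholder heuristic rather than the logic git actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a filespec; is_binary == -1 means "not determined yet". */
struct filespec {
    const char *data;
    size_t size;
    int is_binary;
};

/* Placeholder heuristic for this sketch: a NUL byte near the start means binary. */
static int looks_binary(const char *buf, size_t len)
{
    if (!buf || !len)
        return 0;
    if (len > 8000)
        len = 8000;
    return memchr(buf, 0, len) != NULL;
}

/* Callers go through this wrapper instead of reading the field directly. */
int filespec_is_binary(struct filespec *fs)
{
    assert(fs != NULL);
    if (fs->is_binary == -1)
        fs->is_binary = looks_binary(fs->data, fs->size);
    return fs->is_binary;
}
```

Because every caller uses the wrapper, the way the answer is computed can change later (for example by consulting attributes) without touching the call sites.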

diffcore-delta.c: Ignore CR in CRLF for text files

2007-07-01T03:51:31Z

This ignores the CR byte in a CRLF sequence in text files when computing the similarity of two blobs. Usually this should not matter, as nobody sane would be checking in a file with CRLF line endings to the repository (they would use autocrlf so that the repository copy would have LF line endings). Signed-off-by: Junio C Hamano
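A rough sketch of the idea: when a buffer is treated as text, a CR that is immediately followed by LF is skipped before the byte reaches the chunk hash, so CRLF and LF versions of the same content hash identically. The function name `chunk_checksum`, the multiply-by-31 hash, and folding all chunk hashes into one sum are simplifications assumed for this example, not git's actual hashing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fold a buffer into a sum of per-chunk hashes. A chunk ends at LF or
 * after 64 bytes. The hash is a toy one, used only for illustration.
 */
unsigned int chunk_checksum(const unsigned char *buf, size_t len, int is_text)
{
    unsigned int sum = 0, h = 0;
    size_t n = 0;   /* bytes hashed into the current chunk */
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        unsigned int c = buf[i];

        /* For text, drop the CR of a CRLF pair before hashing. */
        if (is_text && c == '\r' && i + 1 < len && buf[i + 1] == '\n')
            continue;

        h = h * 31 + c;
        if (++n < 64 && c != '\n')
            continue;

        /* Chunk complete: fold it in and start a new one. */
        sum += h;
        h = 0;
        n = 0;
    }
    if (n)
        sum += h;   /* trailing chunk without a final LF */
    return sum;
}
```

A lone CR that is not followed by LF is still hashed, and binary buffers (`is_text == 0`) are left untouched.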

diffcore-delta.c: update the comment on the algorithm.

2007-07-01T03:51:31Z

The comment at the top of the file described an old algorithm that was neutral to text/binary differences (it hashed a sliding window of N-byte sequences and counted overlaps), but a long time ago we switched to a new heuristic that is better suited to line-oriented (read: text) files and is much faster. Signed-off-by: Junio C Hamano

diffcore_count_changes: pass diffcore_filespec

2007-07-01T03:51:31Z

We may want to use richer information on the data we are dealing with in this function, so instead of passing a buffer address and length, just pass the diffcore_filespec structure. Existing callers always call this function with parameters taken from a filespec anyway, so there are no functional changes. Signed-off-by: Junio C Hamano

diffcore-delta: 64-byte-or-EOL ultrafast replacement (hash fix).

2006-03-15T21:19:27Z

The rotating 64-bit number was not really rotating, and worse yet ulong was longer than 32-bit on 64-bit architectures X-<. Signed-off-by: Junio C Hamano

diffcore-delta: 64-byte-or-EOL ultrafast replacement.

2006-03-15T08:37:57Z

Signed-off-by: Junio C Hamano

diffcore-delta: tweak hashbase value.

2006-03-13T04:42:12Z

This tweaks the maximum hash value we hash the string into, without letting the maximum size of the hashtable grow beyond the current limit. With this, rename detection becomes a bit more precise without incurring additional paging cost. Signed-off-by: Junio C Hamano

diffcore-delta: make the hash a bit denser.

2006-03-13T01:26:32Z

To reduce wasted memory, wait until the hash fills up more densely before we rehash. This reduces the working set size a bit further. Signed-off-by: Junio C Hamano