<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/git.git/diffcore-rename.c, branch v1.5.4.4</title>
<subtitle>Git
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v1.5.4.4</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v1.5.4.4'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/'/>
<updated>2007-11-30T23:49:17Z</updated>
<entry>
<title>Fix a pathological case in git detecting proper renames</title>
<updated>2007-11-30T23:49:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-11-30T00:41:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9ae8fcb36ac9fde8e048a304cc3717f2c7914e78'/>
<id>urn:sha1:9ae8fcb36ac9fde8e048a304cc3717f2c7914e78</id>
<content type='text'>
On Thu, 29 Nov 2007, Jeff King wrote:
&gt;
&gt; I think it will get worse, because you are simultaneously calculating
&gt; all of the similarity scores bit by bit rather than doing a loop. Though
&gt; perhaps you mean at the end you will end up with a list of src/dst pairs
&gt; sorted by score, and you can loop over that.

Well, after thinking about this a bit, I think there's a solution that may
work well with the current thing too: instead of looping just *once* over
the list of rename pairs, loop twice - and simply refuse to do copies on
the first loop.

This trivial patch does that, and turns Kumar's test-case into a perfect
rename list.

It's not pretty, it's not smart, but it seems to work. There's something
to be said for keeping it simple and stupid.

And it should not be nearly as expensive as it may _look_. Yes, the loop
is "(i = 0; i &lt; num_create * num_src; i++)", but the important part is
that the whole array is sorted by rename score, and we have a

	if (mx[i].score &lt; minimum_score)
		break;

in it, so uthe loop actually would tend to terminate rather quickly.

Anyway, Kumar, the thing to take away from this is:

 - git really doesn't even *care* about the whole "rename detection"
   internally, and any commits you have done with renames are totally
   independent of the heuristics we then use to *show* the renames.

 - the rename detection really is for just two reasons: (a) keep humans
   happy, and keep the diffs small and (b) help automatic merging across
   renames. So getting renames right is certainly good, but it's more of a
   "politeness" issue than a "correctness" issue, although the merge
   portion of it does matter a lot sometimes.

 - the important thing here is that you can commit your changes and not
   worry about them being somehow "corrupted" by lack of rename detection,
   even if you commit them with a version of git that doesn't do rename
   detection the way you expected it. The rename detection is an
   "after-the-fact" thing, not something that actually gets saved in the
   repository, which is why we can change the heuristics _after_ seeing
   examples, and the examples magically correct themselves!

 - try out the two patches I've posted, and see if they work for you. They
   pass the test-suite, and the output for your example commit looks sane,
   but hey, if you have other test-cases, try them out.

Here's Kumar's pretty diffstat with both my patches:

	 Makefile                                         |    6 +++---
	 board/{cds =&gt; freescale}/common/cadmus.c         |    0
	 board/{cds =&gt; freescale}/common/cadmus.h         |    0
	 board/{cds =&gt; freescale}/common/eeprom.c         |    0
	 board/{cds =&gt; freescale}/common/eeprom.h         |    0
	 board/{cds =&gt; freescale}/common/ft_board.c       |    0
	 board/{cds =&gt; freescale}/common/via.c            |    0
	 board/{cds =&gt; freescale}/common/via.h            |    0
	 board/{cds =&gt; freescale}/mpc8541cds/Makefile     |    0
	 board/{cds =&gt; freescale}/mpc8541cds/config.mk    |    0
	 board/{cds =&gt; freescale}/mpc8541cds/init.S       |    0
	 board/{cds =&gt; freescale}/mpc8541cds/mpc8541cds.c |    0
	 board/{cds =&gt; freescale}/mpc8541cds/u-boot.lds   |    4 ++--
	 board/{cds =&gt; freescale}/mpc8548cds/Makefile     |    0
	 board/{cds =&gt; freescale}/mpc8548cds/config.mk    |    0
	 board/{cds =&gt; freescale}/mpc8548cds/init.S       |    0
	 board/{cds =&gt; freescale}/mpc8548cds/mpc8548cds.c |    0
	 board/{cds =&gt; freescale}/mpc8548cds/u-boot.lds   |    4 ++--
	 board/{cds =&gt; freescale}/mpc8555cds/Makefile     |    0
	 board/{cds =&gt; freescale}/mpc8555cds/config.mk    |    0
	 board/{cds =&gt; freescale}/mpc8555cds/init.S       |    0
	 board/{cds =&gt; freescale}/mpc8555cds/mpc8555cds.c |    0
	 board/{cds =&gt; freescale}/mpc8555cds/u-boot.lds   |    4 ++--
	 23 files changed, 9 insertions(+), 9 deletions(-)

and here it is before:

	 Makefile                                           |    6 +-
	 board/cds/mpc8548cds/Makefile                      |   60 -----
	 board/cds/mpc8555cds/Makefile                      |   60 -----
	 board/cds/mpc8555cds/init.S                        |  255 --------------------
	 board/cds/mpc8555cds/u-boot.lds                    |  150 ------------
	 board/{cds =&gt; freescale}/common/cadmus.c           |    0
	 board/{cds =&gt; freescale}/common/cadmus.h           |    0
	 board/{cds =&gt; freescale}/common/eeprom.c           |    0
	 board/{cds =&gt; freescale}/common/eeprom.h           |    0
	 board/{cds =&gt; freescale}/common/ft_board.c         |    0
	 board/{cds =&gt; freescale}/common/via.c              |    0
	 board/{cds =&gt; freescale}/common/via.h              |    0
	 board/{cds =&gt; freescale}/mpc8541cds/Makefile       |    0
	 board/{cds =&gt; freescale}/mpc8541cds/config.mk      |    0
	 board/{cds =&gt; freescale}/mpc8541cds/init.S         |    0
	 board/{cds =&gt; freescale}/mpc8541cds/mpc8541cds.c   |    0
	 board/{cds =&gt; freescale}/mpc8541cds/u-boot.lds     |    4 +-
	 .../mpc8541cds =&gt; freescale/mpc8548cds}/Makefile   |    0
	 board/{cds =&gt; freescale}/mpc8548cds/config.mk      |    0
	 board/{cds =&gt; freescale}/mpc8548cds/init.S         |    0
	 board/{cds =&gt; freescale}/mpc8548cds/mpc8548cds.c   |    0
	 board/{cds =&gt; freescale}/mpc8548cds/u-boot.lds     |    4 +-
	 .../mpc8541cds =&gt; freescale/mpc8555cds}/Makefile   |    0
	 board/{cds =&gt; freescale}/mpc8555cds/config.mk      |    0
	 .../mpc8541cds =&gt; freescale/mpc8555cds}/init.S     |    0
	 board/{cds =&gt; freescale}/mpc8555cds/mpc8555cds.c   |    0
	 .../mpc8541cds =&gt; freescale/mpc8555cds}/u-boot.lds |    4 +-
	 27 files changed, 9 insertions(+), 534 deletions(-)

so it certainly makes the diffs prettier.

		Linus

Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Fix a pathological case in git detecting proper renames</title>
<updated>2007-11-30T23:49:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-11-29T21:30:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=32d75d29f99cca8e0874b1bdf94ded48b576c906'/>
<id>urn:sha1:32d75d29f99cca8e0874b1bdf94ded48b576c906</id>
<content type='text'>
Kumar Gala had a case in the u-boot archive with multiple renames of files
with identical contents, and git would turn those into multiple "copy"
operations of one of the sources, and just deleting the other sources.

This patch makes the git exact rename detection prefer to spread out the
renames over the multiple sources, rather than do multiple copies of one
source.

NOTE! The changes are a bit larger than required, because I also renamed
the variables named "one" and "two" to "target" and "source" respectively.
That makes the logic easier to follow, especially as the "one" was
illogically the target and not the soruce, for purely historical reasons
(this piece of code used to traverse over sources and targets in the wrong
order, and when we fixed that, we didn't fix the names back then. So I
fixed them now).

The important part of this change is just the trivial score calculations
for when files have identical contents:

	/* Give higher scores to sources that haven't been used already */
	score = !source-&gt;rename_used;
	score += basename_same(source, target);

and when we have multiple choices we'll now pick the choice that gets the
best rename score, rather than only looking at whether the basename
matched.

It's worth noting a few gotchas:

 - this scoring is currently only done for the "exact match" case.

   In particular, in Kumar's example, even after this patch, the inexact
   match case is still done as a copy+delete rather than as two renames:

	 delete mode 100644 board/cds/mpc8555cds/u-boot.lds
	 copy board/{cds =&gt; freescale}/mpc8541cds/u-boot.lds (97%)
	 rename board/{cds/mpc8541cds =&gt; freescale/mpc8555cds}/u-boot.lds (97%)

   because apparently the "cds/mpc8541cds/u-boot.lds" copy looked
   a bit more similar to both end results. That said, I *suspect* we just
   have the exact same issue there - the similarity analysis just gave
   identical (or at least very _close_ to identical) similarity points,
   and we do not have any logic to prefer multiple renames over a
   copy/delete there.

   That is a separate patch.

 - When you have identical contents and identical basenames, the actual
   entry that is chosen is still picked fairly "at random" for the first
   one (but the subsequent ones will prefer entries that haven't already
   been used).

   It's not actually really random, in that it actually depends on the
   relative alphabetical order of the files (which in turn will have
   impacted the order that the entries got hashed!), so it gives
   consistent results that can be explained. But I wanted to point it out
   as an issue for when anybody actually does cross-renames.

   In Kumar's case the choice is the right one (and for a single normal
   directory rename it should always be, since the relative alphabetical
   sorting of the files will be identical), and we now get:

	 rename board/{cds =&gt; freescale}/mpc8541cds/init.S (100%)
	 rename board/{cds =&gt; freescale}/mpc8548cds/init.S (100%)

   which is the "expected" answer. However, it might still be better to
   change the pedantic "exact same basename" on/off choice into a more
   graduated "how similar are the pathnames" scoring situation, in order
   to be more likely to get the exact rename choice that people *expect*
   to see, rather than other alternatives that may *technically* be
   equally good, but are surprising to a human.

It's also unclear whether we should consider "basenames are equal" or
"have already used this as a source" to be more important. This gives them
equal weight, but I suspect we might want to just multiple the "basenames
are equal" weight by two, or something, to prefer equal basenames even if
that causes a copy/delete pair. I dunno.

Anyway, what I'm just saying in a really long-winded manner is that I
think this is right as-is, but it's not the complete solution, and it may
want some further tweaking in the future.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Do the fuzzy rename detection limits with the exact renames removed</title>
<updated>2007-10-27T06:18:06Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-26T23:56:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=42899ac898f2b539fc762c8452da2c981ccbd815'/>
<id>urn:sha1:42899ac898f2b539fc762c8452da2c981ccbd815</id>
<content type='text'>
When we do the fuzzy rename detection, we don't care about the
destinations that we already handled with the exact rename detector.
And, in fact, the code already knew that - but the rename limiter, which
used to run *before* exact renames were detected, did not.

This fixes it so that the rename detection limiter now bases its
decisions on the *remaining* rename counts, rather than the original
ones.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Fix ugly magic special case in exact rename detection</title>
<updated>2007-10-27T06:18:06Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-26T23:51:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=81ac051d6ac9deef9a34de496fb981469aae77f0'/>
<id>urn:sha1:81ac051d6ac9deef9a34de496fb981469aae77f0</id>
<content type='text'>
For historical reasons, the exact rename detection had populated the
filespecs for the entries it compared, and the rest of the similarity
analysis depended on that.  I hadn't even bothered to debug why that was
the case when I re-did the rename detection, I just made the new one
have the same broken behaviour, with a note about this special case.

This fixes that fixme.  The reason the exact rename detector needed to
fill in the file sizes of the files it checked was that the _inexact_
rename detector was broken, and started comparing file sizes before it
filled them in.

Fixing that allows the exact phase to do the sane thing of never even
caring (since all *it* cares about is really just the SHA1 itself, not
the size nor the contents).

It turns out that this also indirectly fixes a bug: trying to populate
all the filespecs will run out of virtual memory if there is tons and
tons of possible rename options.  The fuzzy similarity analysis does the
right thing in this regard, and free's the blob info after it has
generated the hash tables, so the special case code caused more trouble
than just some extra illogical code.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Do exact rename detection regardless of rename limits</title>
<updated>2007-10-27T06:18:06Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-25T18:24:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=17559a643ecef94834d930790498c6babe3e89a8'/>
<id>urn:sha1:17559a643ecef94834d930790498c6babe3e89a8</id>
<content type='text'>
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.

In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.

With the new code, that resulted in

	[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
	400006

	real    0m2.071s
	user    0m1.520s
	sys     0m0.576s

ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:

	diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
	similarity index 100%
	rename from really-big-dir/file-1-1-1-1-1
	rename to moved-big-dir/file-1-1-1-1-1

and the extra six lines is from a one-liner commit message and all the
commit information and spacing).

Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.

With the old code, this wouldn't have been practically possible.  Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive.  In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:

	[torvalds@woody big-rename]$ time git show -C | wc -l
	1400006

	real    0m12.337s
	user    0m12.285s
	sys     0m0.192s

ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Do linear-time/space rename logic for exact renames</title>
<updated>2007-10-27T06:18:06Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-25T18:23:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9027f53cb5051bf83a0254e7f8aeb5d1a206de0b'/>
<id>urn:sha1:9027f53cb5051bf83a0254e7f8aeb5d1a206de0b</id>
<content type='text'>
This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collissions, which we just cull aggressively).

Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case none-the-less. A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.

In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too.  In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>copy vs rename detection: avoid unnecessary O(n*m) loops</title>
<updated>2007-10-27T06:18:06Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-25T18:20:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=644797119d7a3b7a043a51a9cccd8758f8451f91'/>
<id>urn:sha1:644797119d7a3b7a043a51a9cccd8758f8451f91</id>
<content type='text'>
The core rename detection had some rather stupid code to check if a
pathname was used by a later modification or rename, which basically
walked the whole pathname space for all renames for each rename, in
order to tell whether it was a pure rename (no remaining users) or
should be considered a copy (other users of the source file remaining).

That's really silly, since we can just keep a count of users around, and
replace all those complex and expensive loops with just testing that
simple counter (but this all depends on the previous commit that shared
the diff_filespec data structure by using a separate reference count).

Note that the reference count is not the same as the rename count: they
behave otherwise rather similarly, but the reference count is tied to
the allocation (and decremented at de-allocation, so that when it turns
zero we can get rid of the memory), while the rename count is tied to
the renames and is decremented when we find a rename (so that when it
turns zero we know that it was a rename, not a copy).

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Ref-count the filespecs used by diffcore</title>
<updated>2007-10-27T06:18:05Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-25T18:19:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9fb88419ba85e641006c80db53620423f37f1c93'/>
<id>urn:sha1:9fb88419ba85e641006c80db53620423f37f1c93</id>
<content type='text'>
Rather than copy the filespecs when introducing new versions of them
(for rename or copy detection), use a refcount and increment the count
when reusing the diff_filespec.

This avoids unnecessary allocations, but the real reason behind this is
a future enhancement: we will want to track shared data across the
copy/rename detection.  In order to efficiently notice when a filespec
is used by a rename, the rename machinery wants to keep track of a
rename usage count which is shared across all different users of the
filespec.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Split out "exact content match" phase of rename detection</title>
<updated>2007-10-27T06:18:05Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-25T18:17:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=cb1491b6bff20748532c9e50afc7f9d6896167a8'/>
<id>urn:sha1:cb1491b6bff20748532c9e50afc7f9d6896167a8</id>
<content type='text'>
This makes the exact content match a separate function of its own.
Partly to cut down a bit on the size of the diffcore_rename() function
(which is too complex as it is), and partly because there are smarter
ways to do this than an O(m*n) loop over it all, and that function
should be rewritten to take that into account.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>rename diff_free_filespec_data_large() to diff_free_filespec_blob()</title>
<updated>2007-10-03T04:02:09Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2007-10-03T04:01:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=8ae92e63895000ff9b12046325ae381f3c17d414'/>
<id>urn:sha1:8ae92e63895000ff9b12046325ae381f3c17d414</id>
<content type='text'>
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
