user/sven/git.git/diffcore-rename.c, branch v1.5.4.4

Fix a pathological case in git detecting proper renames

2007-11-30T23:49:17Z

On Thu, 29 Nov 2007, Jeff King wrote: > > I think it will get worse, because you are simultaneously calculating > all of the similarity scores bit by bit rather than doing a loop. Though > perhaps you mean at the end you will end up with a list of src/dst pairs > sorted by score, and you can loop over that. Well, after thinking about this a bit, I think there's a solution that may work well with the current thing too: instead of looping just *once* over the list of rename pairs, loop twice - and simply refuse to do copies on the first loop. This trivial patch does that, and turns Kumar's test-case into a perfect rename list. It's not pretty, it's not smart, but it seems to work. There's something to be said for keeping it simple and stupid. And it should not be nearly as expensive as it may _look_. Yes, the loop is "(i = 0; i < num_create * num_src; i++)", but the important part is that the whole array is sorted by rename score, and we have a if (mx[i].score < minimum_score) break; in it, so uthe loop actually would tend to terminate rather quickly. Anyway, Kumar, the thing to take away from this is: - git really doesn't even *care* about the whole "rename detection" internally, and any commits you have done with renames are totally independent of the heuristics we then use to *show* the renames. - the rename detection really is for just two reasons: (a) keep humans happy, and keep the diffs small and (b) help automatic merging across renames. So getting renames right is certainly good, but it's more of a "politeness" issue than a "correctness" issue, although the merge portion of it does matter a lot sometimes. - the important thing here is that you can commit your changes and not worry about them being somehow "corrupted" by lack of rename detection, even if you commit them with a version of git that doesn't do rename detection the way you expected it. The rename detection is an "after-the-fact" thing, not something that actually gets saved in the repository, which is why we can change the heuristics _after_ seeing examples, and the examples magically correct themselves! - try out the two patches I've posted, and see if they work for you. They pass the test-suite, and the output for your example commit looks sane, but hey, if you have other test-cases, try them out. Here's Kumar's pretty diffstat with both my patches: Makefile | 6 +++--- board/{cds => freescale}/common/cadmus.c | 0 board/{cds => freescale}/common/cadmus.h | 0 board/{cds => freescale}/common/eeprom.c | 0 board/{cds => freescale}/common/eeprom.h | 0 board/{cds => freescale}/common/ft_board.c | 0 board/{cds => freescale}/common/via.c | 0 board/{cds => freescale}/common/via.h | 0 board/{cds => freescale}/mpc8541cds/Makefile | 0 board/{cds => freescale}/mpc8541cds/config.mk | 0 board/{cds => freescale}/mpc8541cds/init.S | 0 board/{cds => freescale}/mpc8541cds/mpc8541cds.c | 0 board/{cds => freescale}/mpc8541cds/u-boot.lds | 4 ++-- board/{cds => freescale}/mpc8548cds/Makefile | 0 board/{cds => freescale}/mpc8548cds/config.mk | 0 board/{cds => freescale}/mpc8548cds/init.S | 0 board/{cds => freescale}/mpc8548cds/mpc8548cds.c | 0 board/{cds => freescale}/mpc8548cds/u-boot.lds | 4 ++-- board/{cds => freescale}/mpc8555cds/Makefile | 0 board/{cds => freescale}/mpc8555cds/config.mk | 0 board/{cds => freescale}/mpc8555cds/init.S | 0 board/{cds => freescale}/mpc8555cds/mpc8555cds.c | 0 board/{cds => freescale}/mpc8555cds/u-boot.lds | 4 ++-- 23 files changed, 9 insertions(+), 9 deletions(-) and here it is before: Makefile | 6 +- board/cds/mpc8548cds/Makefile | 60 ----- board/cds/mpc8555cds/Makefile | 60 ----- board/cds/mpc8555cds/init.S | 255 -------------------- board/cds/mpc8555cds/u-boot.lds | 150 ------------ board/{cds => freescale}/common/cadmus.c | 0 board/{cds => freescale}/common/cadmus.h | 0 board/{cds => freescale}/common/eeprom.c | 0 board/{cds => freescale}/common/eeprom.h | 0 board/{cds => freescale}/common/ft_board.c | 0 board/{cds => freescale}/common/via.c | 0 board/{cds => freescale}/common/via.h | 0 board/{cds => freescale}/mpc8541cds/Makefile | 0 board/{cds => freescale}/mpc8541cds/config.mk | 0 board/{cds => freescale}/mpc8541cds/init.S | 0 board/{cds => freescale}/mpc8541cds/mpc8541cds.c | 0 board/{cds => freescale}/mpc8541cds/u-boot.lds | 4 +- .../mpc8541cds => freescale/mpc8548cds}/Makefile | 0 board/{cds => freescale}/mpc8548cds/config.mk | 0 board/{cds => freescale}/mpc8548cds/init.S | 0 board/{cds => freescale}/mpc8548cds/mpc8548cds.c | 0 board/{cds => freescale}/mpc8548cds/u-boot.lds | 4 +- .../mpc8541cds => freescale/mpc8555cds}/Makefile | 0 board/{cds => freescale}/mpc8555cds/config.mk | 0 .../mpc8541cds => freescale/mpc8555cds}/init.S | 0 board/{cds => freescale}/mpc8555cds/mpc8555cds.c | 0 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds | 4 +- 27 files changed, 9 insertions(+), 534 deletions(-) so it certainly makes the diffs prettier. Linus Signed-off-by: Junio C Hamano

Fix a pathological case in git detecting proper renames

2007-11-30T23:49:17Z

Kumar Gala had a case in the u-boot archive with multiple renames of files with identical contents, and git would turn those into multiple "copy" operations of one of the sources, and just deleting the other sources. This patch makes the git exact rename detection prefer to spread out the renames over the multiple sources, rather than do multiple copies of one source. NOTE! The changes are a bit larger than required, because I also renamed the variables named "one" and "two" to "target" and "source" respectively. That makes the logic easier to follow, especially as the "one" was illogically the target and not the soruce, for purely historical reasons (this piece of code used to traverse over sources and targets in the wrong order, and when we fixed that, we didn't fix the names back then. So I fixed them now). The important part of this change is just the trivial score calculations for when files have identical contents: /* Give higher scores to sources that haven't been used already */ score = !source->rename_used; score += basename_same(source, target); and when we have multiple choices we'll now pick the choice that gets the best rename score, rather than only looking at whether the basename matched. It's worth noting a few gotchas: - this scoring is currently only done for the "exact match" case. In particular, in Kumar's example, even after this patch, the inexact match case is still done as a copy+delete rather than as two renames: delete mode 100644 board/cds/mpc8555cds/u-boot.lds copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%) rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%) because apparently the "cds/mpc8541cds/u-boot.lds" copy looked a bit more similar to both end results. That said, I *suspect* we just have the exact same issue there - the similarity analysis just gave identical (or at least very _close_ to identical) similarity points, and we do not have any logic to prefer multiple renames over a copy/delete there. That is a separate patch. - When you have identical contents and identical basenames, the actual entry that is chosen is still picked fairly "at random" for the first one (but the subsequent ones will prefer entries that haven't already been used). It's not actually really random, in that it actually depends on the relative alphabetical order of the files (which in turn will have impacted the order that the entries got hashed!), so it gives consistent results that can be explained. But I wanted to point it out as an issue for when anybody actually does cross-renames. In Kumar's case the choice is the right one (and for a single normal directory rename it should always be, since the relative alphabetical sorting of the files will be identical), and we now get: rename board/{cds => freescale}/mpc8541cds/init.S (100%) rename board/{cds => freescale}/mpc8548cds/init.S (100%) which is the "expected" answer. However, it might still be better to change the pedantic "exact same basename" on/off choice into a more graduated "how similar are the pathnames" scoring situation, in order to be more likely to get the exact rename choice that people *expect* to see, rather than other alternatives that may *technically* be equally good, but are surprising to a human. It's also unclear whether we should consider "basenames are equal" or "have already used this as a source" to be more important. This gives them equal weight, but I suspect we might want to just multiple the "basenames are equal" weight by two, or something, to prefer equal basenames even if that causes a copy/delete pair. I dunno. Anyway, what I'm just saying in a really long-winded manner is that I think this is right as-is, but it's not the complete solution, and it may want some further tweaking in the future. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Do the fuzzy rename detection limits with the exact renames removed

2007-10-27T06:18:06Z

When we do the fuzzy rename detection, we don't care about the destinations that we already handled with the exact rename detector. And, in fact, the code already knew that - but the rename limiter, which used to run *before* exact renames were detected, did not. This fixes it so that the rename detection limiter now bases its decisions on the *remaining* rename counts, rather than the original ones. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Fix ugly magic special case in exact rename detection

2007-10-27T06:18:06Z

For historical reasons, the exact rename detection had populated the filespecs for the entries it compared, and the rest of the similarity analysis depended on that. I hadn't even bothered to debug why that was the case when I re-did the rename detection, I just made the new one have the same broken behaviour, with a note about this special case. This fixes that fixme. The reason the exact rename detector needed to fill in the file sizes of the files it checked was that the _inexact_ rename detector was broken, and started comparing file sizes before it filled them in. Fixing that allows the exact phase to do the sane thing of never even caring (since all *it* cares about is really just the SHA1 itself, not the size nor the contents). It turns out that this also indirectly fixes a bug: trying to populate all the filespecs will run out of virtual memory if there is tons and tons of possible rename options. The fuzzy similarity analysis does the right thing in this regard, and free's the blob info after it has generated the hash tables, so the special case code caused more trouble than just some extra illogical code. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Do exact rename detection regardless of rename limits

2007-10-27T06:18:06Z

Now that the exact rename detection is linear-time (with a very small constant factor to boot), there is no longer any reason to limit it by the number of files involved. In some trivial testing, I created a repository with a directory that had a hundred thousand files in it (all with different contents), and then moved that directory to show the effects of renaming 100,000 files. With the new code, that resulted in [torvalds@woody big-rename]$ time ~/git/git show -C | wc -l 400006 real 0m2.071s user 0m1.520s sys 0m0.576s ie the code can correctly detect the hundred thousand renames in about 2 seconds (the number "400006" comes from four lines for each rename: diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1 similarity index 100% rename from really-big-dir/file-1-1-1-1-1 rename to moved-big-dir/file-1-1-1-1-1 and the extra six lines is from a one-liner commit message and all the commit information and spacing). Most of those two seconds weren't even really the rename detection, it's really all the other stuff needed to get there. With the old code, this wouldn't have been practically possible. Doing a pairwise check of the ten billion possible pairs would have been prohibitively expensive. In fact, even with the rename limiter in place, the old code would waste a lot of time just on the diff_filespec checks, and despite not even trying to find renames, it used to look like: [torvalds@woody big-rename]$ time git show -C | wc -l 1400006 real 0m12.337s user 0m12.285s sys 0m0.192s ie we used to take 12 seconds for this load and not even do any rename detection! (The number 1400006 comes from fourteen lines per file moved: seven lines each for the delete and the create of a one-liner file, and the same extra six lines of commit information). Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Do linear-time/space rename logic for exact renames

2007-10-27T06:18:06Z

This implements a smarter rename detector for exact renames, which rather than doing a pairwise comparison (time O(m*n)) will just hash the files into a hash-table (size O(n+m)), and only do pairwise comparisons to renames that have the same hash (time O(n+m) except for unrealistic hash collissions, which we just cull aggressively). Admittedly the exact rename case is not nearly as interesting as the generic case, but it's an important case none-the-less. A similar general approach should work for the generic case too, but even then you do need to handle the exact renames/copies separately (to avoid the inevitable added cost factor that comes from the _size_ of the file), so this is worth doing. In the expectation that we will indeed do the same hashing trick for the general rename case, this code uses a generic hash-table implementation that can be used for other things too. In fact, we might be able to consolidate some of our existing hash tables with the new generic code in hash.[ch]. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

copy vs rename detection: avoid unnecessary O(n*m) loops

2007-10-27T06:18:06Z

The core rename detection had some rather stupid code to check if a pathname was used by a later modification or rename, which basically walked the whole pathname space for all renames for each rename, in order to tell whether it was a pure rename (no remaining users) or should be considered a copy (other users of the source file remaining). That's really silly, since we can just keep a count of users around, and replace all those complex and expensive loops with just testing that simple counter (but this all depends on the previous commit that shared the diff_filespec data structure by using a separate reference count). Note that the reference count is not the same as the rename count: they behave otherwise rather similarly, but the reference count is tied to the allocation (and decremented at de-allocation, so that when it turns zero we can get rid of the memory), while the rename count is tied to the renames and is decremented when we find a rename (so that when it turns zero we know that it was a rename, not a copy). Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Ref-count the filespecs used by diffcore

2007-10-27T06:18:05Z

Rather than copy the filespecs when introducing new versions of them (for rename or copy detection), use a refcount and increment the count when reusing the diff_filespec. This avoids unnecessary allocations, but the real reason behind this is a future enhancement: we will want to track shared data across the copy/rename detection. In order to efficiently notice when a filespec is used by a rename, the rename machinery wants to keep track of a rename usage count which is shared across all different users of the filespec. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Split out "exact content match" phase of rename detection

2007-10-27T06:18:05Z

This makes the exact content match a separate function of its own. Partly to cut down a bit on the size of the diffcore_rename() function (which is too complex as it is), and partly because there are smarter ways to do this than an O(m*n) loop over it all, and that function should be rewritten to take that into account. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

rename diff_free_filespec_data_large() to diff_free_filespec_blob()

2007-10-03T04:02:09Z

Signed-off-by: Junio C Hamano