user/sven/git.git/diffcore-rename.c, branch v1.6.4.5

Fix typos / spelling in comments

2009-04-23T02:02:12Z

Signed-off-by: Mike Ralphson Signed-off-by: Junio C Hamano

Rename detection: Avoid repeated filespec population

2009-01-21T08:14:12Z

In diffcore_rename, we assume that the blob contents in the filespec aren't required anymore after estimate_similarity has been called and thus we free it. But estimate_similarity might return early when the file sizes differ too much. In that case, cnt_data is never set and the next call to estimate_similarity will populate the filespec again, eventually rereading the same blob over and over again. To fix that, we first get the blob sizes and only when the blob contents are actually required, and when cnt_data will be set, the full filespec is populated, once. Signed-off-by: Björn Steinbrink Signed-off-by: Junio C Hamano

Add file delete/create info when we overflow rename_limit

2008-10-28T15:58:42Z

When we refuse to do rename detection due to having too many files created or deleted, let the user know the numbers. That way there is a reasonable starting point for setting the diff.renamelimit option. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

diff: make "too many files" rename warning optional

2008-05-03T20:40:43Z

In many cases, the warning ends up as clutter, because the diff is being done "behind the scenes" from the user (e.g., when generating a commit diffstat), and whether we show renames or not is not particularly interesting to the user. However, in the case of a merge (which is what motivated the warning in the first place), it is a useful hint as to why a merge with renames might have failed. This patch makes the warning optional based on the code calling into diffcore. We default to not showing the warning, but turn it on for merges. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

Merge branch 'jc/rename'

2008-04-09T08:09:12Z

* 'jc/rename' (early part): Optimize rename detection for a huge diff

rename: warn user when we have turned off rename detection

2008-03-01T09:30:15Z

Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

Optimize rename detection for a huge diff

2008-02-13T23:44:20Z

When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each of the pair is, and picked the pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage that are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lessor match. This matters to a change introduced between 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused matches first and then falls back to using already used matches. This instead allocates and keeps only a handful rename source candidates per new files in the postimage. I.e. it makes the memory requirement from O(N x M) to O(M). For each dst, we compute similarlity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep handful best src candidates for each dst. Signed-off-by: Junio C Hamano

Fix a pathological case in git detecting proper renames

2007-11-30T23:49:17Z

On Thu, 29 Nov 2007, Jeff King wrote: > > I think it will get worse, because you are simultaneously calculating > all of the similarity scores bit by bit rather than doing a loop. Though > perhaps you mean at the end you will end up with a list of src/dst pairs > sorted by score, and you can loop over that. Well, after thinking about this a bit, I think there's a solution that may work well with the current thing too: instead of looping just *once* over the list of rename pairs, loop twice - and simply refuse to do copies on the first loop. This trivial patch does that, and turns Kumar's test-case into a perfect rename list. It's not pretty, it's not smart, but it seems to work. There's something to be said for keeping it simple and stupid. And it should not be nearly as expensive as it may _look_. Yes, the loop is "(i = 0; i < num_create * num_src; i++)", but the important part is that the whole array is sorted by rename score, and we have a if (mx[i].score < minimum_score) break; in it, so uthe loop actually would tend to terminate rather quickly. Anyway, Kumar, the thing to take away from this is: - git really doesn't even *care* about the whole "rename detection" internally, and any commits you have done with renames are totally independent of the heuristics we then use to *show* the renames. - the rename detection really is for just two reasons: (a) keep humans happy, and keep the diffs small and (b) help automatic merging across renames. So getting renames right is certainly good, but it's more of a "politeness" issue than a "correctness" issue, although the merge portion of it does matter a lot sometimes. - the important thing here is that you can commit your changes and not worry about them being somehow "corrupted" by lack of rename detection, even if you commit them with a version of git that doesn't do rename detection the way you expected it. The rename detection is an "after-the-fact" thing, not something that actually gets saved in the repository, which is why we can change the heuristics _after_ seeing examples, and the examples magically correct themselves! - try out the two patches I've posted, and see if they work for you. They pass the test-suite, and the output for your example commit looks sane, but hey, if you have other test-cases, try them out. Here's Kumar's pretty diffstat with both my patches: Makefile | 6 +++--- board/{cds => freescale}/common/cadmus.c | 0 board/{cds => freescale}/common/cadmus.h | 0 board/{cds => freescale}/common/eeprom.c | 0 board/{cds => freescale}/common/eeprom.h | 0 board/{cds => freescale}/common/ft_board.c | 0 board/{cds => freescale}/common/via.c | 0 board/{cds => freescale}/common/via.h | 0 board/{cds => freescale}/mpc8541cds/Makefile | 0 board/{cds => freescale}/mpc8541cds/config.mk | 0 board/{cds => freescale}/mpc8541cds/init.S | 0 board/{cds => freescale}/mpc8541cds/mpc8541cds.c | 0 board/{cds => freescale}/mpc8541cds/u-boot.lds | 4 ++-- board/{cds => freescale}/mpc8548cds/Makefile | 0 board/{cds => freescale}/mpc8548cds/config.mk | 0 board/{cds => freescale}/mpc8548cds/init.S | 0 board/{cds => freescale}/mpc8548cds/mpc8548cds.c | 0 board/{cds => freescale}/mpc8548cds/u-boot.lds | 4 ++-- board/{cds => freescale}/mpc8555cds/Makefile | 0 board/{cds => freescale}/mpc8555cds/config.mk | 0 board/{cds => freescale}/mpc8555cds/init.S | 0 board/{cds => freescale}/mpc8555cds/mpc8555cds.c | 0 board/{cds => freescale}/mpc8555cds/u-boot.lds | 4 ++-- 23 files changed, 9 insertions(+), 9 deletions(-) and here it is before: Makefile | 6 +- board/cds/mpc8548cds/Makefile | 60 ----- board/cds/mpc8555cds/Makefile | 60 ----- board/cds/mpc8555cds/init.S | 255 -------------------- board/cds/mpc8555cds/u-boot.lds | 150 ------------ board/{cds => freescale}/common/cadmus.c | 0 board/{cds => freescale}/common/cadmus.h | 0 board/{cds => freescale}/common/eeprom.c | 0 board/{cds => freescale}/common/eeprom.h | 0 board/{cds => freescale}/common/ft_board.c | 0 board/{cds => freescale}/common/via.c | 0 board/{cds => freescale}/common/via.h | 0 board/{cds => freescale}/mpc8541cds/Makefile | 0 board/{cds => freescale}/mpc8541cds/config.mk | 0 board/{cds => freescale}/mpc8541cds/init.S | 0 board/{cds => freescale}/mpc8541cds/mpc8541cds.c | 0 board/{cds => freescale}/mpc8541cds/u-boot.lds | 4 +- .../mpc8541cds => freescale/mpc8548cds}/Makefile | 0 board/{cds => freescale}/mpc8548cds/config.mk | 0 board/{cds => freescale}/mpc8548cds/init.S | 0 board/{cds => freescale}/mpc8548cds/mpc8548cds.c | 0 board/{cds => freescale}/mpc8548cds/u-boot.lds | 4 +- .../mpc8541cds => freescale/mpc8555cds}/Makefile | 0 board/{cds => freescale}/mpc8555cds/config.mk | 0 .../mpc8541cds => freescale/mpc8555cds}/init.S | 0 board/{cds => freescale}/mpc8555cds/mpc8555cds.c | 0 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds | 4 +- 27 files changed, 9 insertions(+), 534 deletions(-) so it certainly makes the diffs prettier. Linus Signed-off-by: Junio C Hamano

Fix a pathological case in git detecting proper renames

2007-11-30T23:49:17Z

Kumar Gala had a case in the u-boot archive with multiple renames of files with identical contents, and git would turn those into multiple "copy" operations of one of the sources, and just deleting the other sources. This patch makes the git exact rename detection prefer to spread out the renames over the multiple sources, rather than do multiple copies of one source. NOTE! The changes are a bit larger than required, because I also renamed the variables named "one" and "two" to "target" and "source" respectively. That makes the logic easier to follow, especially as the "one" was illogically the target and not the soruce, for purely historical reasons (this piece of code used to traverse over sources and targets in the wrong order, and when we fixed that, we didn't fix the names back then. So I fixed them now). The important part of this change is just the trivial score calculations for when files have identical contents: /* Give higher scores to sources that haven't been used already */ score = !source->rename_used; score += basename_same(source, target); and when we have multiple choices we'll now pick the choice that gets the best rename score, rather than only looking at whether the basename matched. It's worth noting a few gotchas: - this scoring is currently only done for the "exact match" case. In particular, in Kumar's example, even after this patch, the inexact match case is still done as a copy+delete rather than as two renames: delete mode 100644 board/cds/mpc8555cds/u-boot.lds copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%) rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%) because apparently the "cds/mpc8541cds/u-boot.lds" copy looked a bit more similar to both end results. That said, I *suspect* we just have the exact same issue there - the similarity analysis just gave identical (or at least very _close_ to identical) similarity points, and we do not have any logic to prefer multiple renames over a copy/delete there. That is a separate patch. - When you have identical contents and identical basenames, the actual entry that is chosen is still picked fairly "at random" for the first one (but the subsequent ones will prefer entries that haven't already been used). It's not actually really random, in that it actually depends on the relative alphabetical order of the files (which in turn will have impacted the order that the entries got hashed!), so it gives consistent results that can be explained. But I wanted to point it out as an issue for when anybody actually does cross-renames. In Kumar's case the choice is the right one (and for a single normal directory rename it should always be, since the relative alphabetical sorting of the files will be identical), and we now get: rename board/{cds => freescale}/mpc8541cds/init.S (100%) rename board/{cds => freescale}/mpc8548cds/init.S (100%) which is the "expected" answer. However, it might still be better to change the pedantic "exact same basename" on/off choice into a more graduated "how similar are the pathnames" scoring situation, in order to be more likely to get the exact rename choice that people *expect* to see, rather than other alternatives that may *technically* be equally good, but are surprising to a human. It's also unclear whether we should consider "basenames are equal" or "have already used this as a source" to be more important. This gives them equal weight, but I suspect we might want to just multiple the "basenames are equal" weight by two, or something, to prefer equal basenames even if that causes a copy/delete pair. I dunno. Anyway, what I'm just saying in a really long-winded manner is that I think this is right as-is, but it's not the complete solution, and it may want some further tweaking in the future. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano

Do the fuzzy rename detection limits with the exact renames removed

2007-10-27T06:18:06Z

When we do the fuzzy rename detection, we don't care about the destinations that we already handled with the exact rename detector. And, in fact, the code already knew that - but the rename limiter, which used to run *before* exact renames were detected, did not. This fixes it so that the rename detection limiter now bases its decisions on the *remaining* rename counts, rather than the original ones. Signed-off-by: Linus Torvalds Signed-off-by: Junio C Hamano