user/sven/git.git/diffcore-break.c, branch v1.7.7.4

Add a macro DIFF_QUEUE_CLEAR.

2010-05-07T16:34:27Z

Refactor the diff_queue_struct code, this macro help to reset the structure. Signed-off-by: Bo Yang Signed-off-by: Junio C Hamano

diffcore-break: save cnt_data for other phases

2009-11-16T21:21:12Z

The "break" phase works by counting changes between two blobs with the same path. We do this by splitting the file into chunks (or lines for text oriented files) and then keeping a count of chunk hashes. The "rename" phase counts changes between blobs at two different paths. However, it uses the exact same set of chunk hashes (which are immutable for a given sha1). The rename phase can therefore use the same hash data as break. Unfortunately, we were throwing this data away after computing it in the break phase. This patch instead attaches it to the filespec and lets it live through the rename phase, working under the assumption that most of the time that breaks are being computed, renames will be too. We only do this optimization for files which have actually been broken, as those ones will be candidates for rename detection (and it is a time-space tradeoff, so we don't want to waste space keeping useless data). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

diffcore-break: free filespec data as we go

2009-11-16T21:21:11Z

As we look at each changed file and consider breaking it, we load the blob data and make a decision about whether to break, which is independent of any other blobs that might have changed. However, we keep the data in memory while we consider breaking all of the other files. Which means that both versions of every file you are diffing are in memory at the same time. This patch instead frees the blob data as we finish with each file pair, leading to much lower memory usage. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

Remove unused function scope local variables

2009-03-08T04:52:17Z

These variables were unused and can be removed safely: builtin-clone.c::cmd_clone(): use_local_hardlinks, use_separate_remote builtin-fetch-pack.c::find_common(): len builtin-remote.c::mv(): symref diff.c::show_stats():show_stats(): total diffcore-break.c::should_break(): base_size fast-import.c::validate_raw_date(): date, sign fsck.c::fsck_tree(): o_sha1, sha1 xdiff-interface.c::parse_num(): read_some Signed-off-by: Benjamin Kramer Signed-off-by: Junio C Hamano

rename: Break filepairs with different types.

2007-12-02T10:24:46Z

When we consider if a path has been totally rewritten, we did not touch changes from symlinks to files or vice versa. But a change that modifies even the type of a blob surely should count as a complete rewrite. While we are at it, modernise diffcore-break to be aware of gitlinks (we do not want to touch them). Signed-off-by: Junio C Hamano

Fix diffcore-break total breakage

2007-10-21T05:59:42Z

Ok, so on the kernel list, some people noticed that "git log --follow" doesn't work too well with some files in the x86 merge, because a lot of files got renamed in very special ways. In particular, there was a pattern of doing single commits with renames that looked basically like - rename "filename.h" -> "filename_64.h" - create new "filename.c" that includes "filename_32.h" or "filename_64.h" depending on whether we're 32-bit or 64-bit. which was preparatory for smushing the two trees together. Now, there's two issues here: - "filename.c" *remained*. Yes, it was a rename, but there was a new file created with the old name in the same commit. This was important, because we wanted each commit to compile properly, so that it was bisectable, so splitting the rename into one commit and the "create helper file" into another was *not* an option. So we need to break associations where the contents change too much. Fine. We have the -B flag for that. When we break things up, then the rename detection will be able to figure out whether there are better alternatives. - "git log --follow" didn't with with -B. Now, the second case was really simple: we use a different "diffopt" structure for the rename detection than the basic one (which we use for showing the diffs). So that second case is trivially fixed by a trivial one-liner that just copies the break_opt values from the "real" diffopts to the one used for rename following. So now "git log -B --follow" works fine: diff --git a/tree-diff.c b/tree-diff.c index 26bdbdd..7c261fd 100644 --- a/tree-diff.c +++ b/tree-diff.c @@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co diff_opts.detect_rename = DIFF_DETECT_RENAME; diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT; diff_opts.single_follow = opt->paths[0]; + diff_opts.break_opt = opt->break_opt; paths[0] = NULL; diff_tree_setup_paths(paths, &diff_opts); if (diff_setup_done(&diff_opts) < 0) however, the end result does *not* work. Because our diffcore-break.c logic is totally bogus! In particular: - it used to do if (base_size < MINIMUM_BREAK_SIZE) return 0; /* we do not break too small filepair */ which basically says "don't bother to break small files". But that "base_size" is the *smaller* of the two sizes, which means that if some large file was rewritten into one that just includes another file, we would look at the (small) result, and decide that it's smaller than the break size, so it cannot be worth it to break it up! Even if the other side was ten times bigger and looked *nothing* like the samell file! That's clearly bogus. I replaced "base_size" with "max_size", so that we compare the *bigger* of the filepair with the break size. - It calculated a "merge_score", which was the score needed to merge it back together if nothing else wanted it. But even if it was *so* different that we would never want to merge it back, we wouldn't consider it a break! That makes no sense. So I added if (*merge_score_p > break_score) return 1; to make it clear that if we wouldn't want to merge it at the end, it was *definitely* a break. - It compared the whole "extent of damage", counting all inserts and deletes, but it based this score on the "base_size", and generated the damage score with delta_size = src_removed + literal_added; damage_score = delta_size * MAX_SCORE / base_size; but that makes no sense either, since quite often, this will result in a number that is *bigger* than MAX_SCORE! Why? Because base_size is (again) the smaller of the two files we compare, and when you start out from a small file and add a lot (or start out from a large file and remove a lot), the base_size is going to be much smaller than the damage! Again, the fix was to replace "base_size" with "max_size", at which point the damage actually becomes a sane percentage of the whole. With these changes in place, not only does "git log -B --follow" work for the case that triggered this in the first place, ie now git log -B --follow arch/x86/kernel/vmlinux_64.lds.S actually gives reasonable results. But I also wanted to verify it in general, by doing a full-history git log --stat -B -C on my kernel tree with the old code and the new code. There's some tweaking to be done, but generally, the new code generates much better results wrt breaking up files (and then finding better rename candidates). Here's a few examples of the "--stat" output: - This: include/asm-x86/Kbuild | 2 - include/asm-x86/debugreg.h | 79 +++++++++++++++++++++++++++++++++++------ include/asm-x86/debugreg_32.h | 64 --------------------------------- include/asm-x86/debugreg_64.h | 65 --------------------------------- 4 files changed, 68 insertions(+), 142 deletions(-) Becomes: include/asm-x86/Kbuild | 2 - include/asm-x86/{debugreg_64.h => debugreg.h} | 9 +++- include/asm-x86/debugreg_32.h | 64 ------------------------- 3 files changed, 7 insertions(+), 68 deletions(-) - This: include/asm-x86/bug.h | 41 +++++++++++++++++++++++++++++++++++++++-- include/asm-x86/bug_32.h | 37 ------------------------------------- include/asm-x86/bug_64.h | 34 ---------------------------------- 3 files changed, 39 insertions(+), 73 deletions(-) Becomes include/asm-x86/{bug_64.h => bug.h} | 20 +++++++++++++----- include/asm-x86/bug_32.h | 37 ----------------------------------- 2 files changed, 14 insertions(+), 43 deletions(-) Now, in some other cases, it does actually turn a rename into a real "delete+create" pair, and then the diff is usually bigger, so truth in advertizing: it doesn't always generate a nicer diff. But for what -B was meant for, I think this is a big improvement, and I suspect those cases where it generates a bigger diff are tweakable. So I think this diff fixes a real bug, but we might still want to tweak the default values and perhaps the exact rules for when a break happens. Signed-off-by: Linus Torvalds Signed-off-by: Shawn O. Pearce

diffcore_count_changes: pass diffcore_filespec

2007-07-01T03:51:31Z

We may want to use richer information on the data we are dealing with in this function, so instead of passing a buffer address and length, just pass the diffcore_filespec structure. Existing callers always call this function with parameters taken from a filespec anyway, so there is no functionality changes. Signed-off-by: Junio C Hamano

Cast 64 bit off_t to 32 bit size_t

2007-03-07T19:15:26Z

Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4. This implies that we are able to access and work on files whose maximum length is around 2^63-1 bytes, but we can only malloc or mmap somewhat less than 2^32-1 bytes of memory. On such a system an implicit conversion of off_t to size_t can cause the size_t to wrap, resulting in unexpected and exciting behavior. Right now we are working around all gcc warnings generated by the -Wshorten-64-to-32 option by passing the off_t through xsize_t(). In the future we should make xsize_t on such problematic platforms detect the wrapping and die if such a file is accessed. Signed-off-by: Shawn O. Pearce Signed-off-by: Junio C Hamano

Do not use memcmp(sha1_1, sha1_2, 20) with hardcoded length.

2006-08-17T21:23:53Z

Introduces global inline: hashcmp(const unsigned char *sha1, const unsigned char *sha2) Uses memcmp for comparison and returns the result based on the length of the hash name (a future runtime decision). Acked-by: Alex Riesen Signed-off-by: David Rientjes Signed-off-by: Junio C Hamano

diffcore-rename: somewhat optimized.

2006-03-12T11:22:10Z

This changes diffcore-rename to reuse statistics information gathered during similarity estimation, and updates the hashtable implementation used to keep track of the statistics to be denser. This seems to give better performance. Signed-off-by: Junio C Hamano