<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/git.git/diffcore-break.c, branch v1.7.7.4</title>
<subtitle>Git
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v1.7.7.4</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v1.7.7.4'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/'/>
<updated>2010-05-07T16:34:27Z</updated>
<entry>
<title>Add a macro DIFF_QUEUE_CLEAR.</title>
<updated>2010-05-07T16:34:27Z</updated>
<author>
<name>Bo Yang</name>
<email>struggleyb.nku@gmail.com</email>
</author>
<published>2010-05-07T04:52:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9ca5df90615aa3c6b60e1bc8f03db6cae98e816c'/>
<id>urn:sha1:9ca5df90615aa3c6b60e1bc8f03db6cae98e816c</id>
<content type='text'>
Refactor the diff_queue_struct code; this macro helps
to reset the structure.

Signed-off-by: Bo Yang &lt;struggleyb.nku@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diffcore-break: save cnt_data for other phases</title>
<updated>2009-11-16T21:21:12Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2009-11-16T16:02:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=8282de94bc76360e0bf76da4076755696b049d23'/>
<id>urn:sha1:8282de94bc76360e0bf76da4076755696b049d23</id>
<content type='text'>
The "break" phase works by counting changes between two
blobs with the same path. We do this by splitting the file
into chunks (or lines for text oriented files) and then
keeping a count of chunk hashes.

The "rename" phase counts changes between blobs at two
different paths. However, it uses the exact same set of
chunk hashes (which are immutable for a given sha1).

The rename phase can therefore use the same hash data as
break. Unfortunately, we were throwing this data away after
computing it in the break phase. This patch instead attaches
it to the filespec and lets it live through the rename
phase, working under the assumption that most of the time
when breaks are being computed, renames will be too.

We only do this optimization for files which have actually
been broken, as those ones will be candidates for rename
detection (and it is a time-space tradeoff, so we don't want
to waste space keeping useless data).
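
A minimal sketch of the caching idea, written in Python rather than
git's C. The names FileSpec, chunk_counts and count_changes are
hypothetical, and line splitting plus Python's built-in hash() merely
stand in for git's real chunking and hashing; only the cnt_data field
name comes from this patch.

```python
from collections import Counter


class FileSpec:
    """Hypothetical stand-in for a filespec: blob data plus a cache slot."""

    def __init__(self, data):
        self.data = data
        self.cnt_data = None  # cached chunk-hash counts, filled on first use


def chunk_counts(spec):
    """Count chunk hashes for a blob, reusing the cached result if present."""
    if spec.cnt_data is None:
        # Assumption: one chunk per line; git's actual chunking differs.
        lines = spec.data.splitlines(keepends=True)
        spec.cnt_data = Counter(hash(line) for line in lines)
    return spec.cnt_data


def count_changes(src, dst):
    """Return (chunks_copied, chunks_added) when going from src to dst."""
    src_cnt = chunk_counts(src)
    dst_cnt = chunk_counts(dst)
    # A chunk counts as copied as often as it occurs on both sides.
    copied = sum(min(n, src_cnt[h]) for h, n in dst_cnt.items())
    added = sum(dst_cnt.values()) - copied
    return copied, added
```

Because the counts live on the FileSpec object, a later comparison of
the same blob against a different path reuses them instead of hashing
the blob again.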

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diffcore-break: free filespec data as we go</title>
<updated>2009-11-16T21:21:11Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2009-11-16T15:56:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=f4f19fb63449e1beee02b0ec845319f7115fa9d0'/>
<id>urn:sha1:f4f19fb63449e1beee02b0ec845319f7115fa9d0</id>
<content type='text'>
As we look at each changed file and consider breaking it, we
load the blob data and make a decision about whether to
break, which is independent of any other blobs that might
have changed. However, we keep the data in memory while we
consider breaking all of the other files, which means that
both versions of every file you are diffing are in memory at
the same time.

This patch instead frees the blob data as we finish with
each file pair, leading to much lower memory usage.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Remove unused function scope local variables</title>
<updated>2009-03-08T04:52:17Z</updated>
<author>
<name>Benjamin Kramer</name>
<email>benny.kra@googlemail.com</email>
</author>
<published>2009-03-07T20:02:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=eb3a9dd3279fe4b05f286665986ebf6d43a6ccc0'/>
<id>urn:sha1:eb3a9dd3279fe4b05f286665986ebf6d43a6ccc0</id>
<content type='text'>
These variables were unused and can be removed safely:

  builtin-clone.c::cmd_clone(): use_local_hardlinks, use_separate_remote
  builtin-fetch-pack.c::find_common(): len
  builtin-remote.c::mv(): symref
  diff.c::show_stats(): total
  diffcore-break.c::should_break(): base_size
  fast-import.c::validate_raw_date(): date, sign
  fsck.c::fsck_tree(): o_sha1, sha1
  xdiff-interface.c::parse_num(): read_some

Signed-off-by: Benjamin Kramer &lt;benny.kra@googlemail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>rename: Break filepairs with different types.</title>
<updated>2007-12-02T10:24:46Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2007-12-01T06:22:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=b45563a229f5150271837cf487a91ddd8224fbd3'/>
<id>urn:sha1:b45563a229f5150271837cf487a91ddd8224fbd3</id>
<content type='text'>
When we consider if a path has been totally rewritten, we did not
touch changes from symlinks to files or vice versa.  But a change
that modifies even the type of a blob surely should count as a
complete rewrite.

While we are at it, modernise diffcore-break to be aware of gitlinks (we
do not want to touch them).

Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Fix diffcore-break total breakage</title>
<updated>2007-10-21T05:59:42Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2007-10-20T19:31:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=6dd4b66fdecc2ffdc68758b6c4e059fcaaca512b'/>
<id>urn:sha1:6dd4b66fdecc2ffdc68758b6c4e059fcaaca512b</id>
<content type='text'>
Ok, so on the kernel list, some people noticed that "git log --follow"
doesn't work too well with some files in the x86 merge, because a lot of
files got renamed in very special ways.

In particular, there was a pattern of doing single commits with renames
that looked basically like

 - rename "filename.h" -&gt; "filename_64.h"
 - create new "filename.c" that includes "filename_32.h" or
   "filename_64.h" depending on whether we're 32-bit or 64-bit.

which was preparatory for smushing the two trees together.

Now, there's two issues here:

 - "filename.c" *remained*. Yes, it was a rename, but there was a new file
   created with the old name in the same commit. This was important,
   because we wanted each commit to compile properly, so that it was
   bisectable, so splitting the rename into one commit and the "create
   helper file" into another was *not* an option.

   So we need to break associations where the contents change too much.
   Fine. We have the -B flag for that. When we break things up, then the
   rename detection will be able to figure out whether there are better
   alternatives.

 - "git log --follow" didn't work with -B.

Now, the second case was really simple: we use a different "diffopt"
structure for the rename detection than the basic one (which we use for
showing the diffs). So that second case is trivially fixed by a trivial
one-liner that just copies the break_opt values from the "real" diffopts
to the one used for rename following. So now "git log -B --follow" works
fine:

	diff --git a/tree-diff.c b/tree-diff.c
	index 26bdbdd..7c261fd 100644
	--- a/tree-diff.c
	+++ b/tree-diff.c
	@@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
	 	diff_opts.detect_rename = DIFF_DETECT_RENAME;
	 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
	 	diff_opts.single_follow = opt-&gt;paths[0];
	+	diff_opts.break_opt = opt-&gt;break_opt;
	 	paths[0] = NULL;
	 	diff_tree_setup_paths(paths, &amp;diff_opts);
	 	if (diff_setup_done(&amp;diff_opts) &lt; 0)

however, the end result does *not* work. Because our diffcore-break.c
logic is totally bogus!

In particular:

 - it used to do

	if (base_size &lt; MINIMUM_BREAK_SIZE)
		return 0; /* we do not break too small filepair */

   which basically says "don't bother to break small files". But that
   "base_size" is the *smaller* of the two sizes, which means that if some
   large file was rewritten into one that just includes another file, we
   would look at the (small) result, and decide that it's smaller than the
   break size, so it cannot be worth it to break it up! Even if the other
   side was ten times bigger and looked *nothing* like the small file!

   That's clearly bogus. I replaced "base_size" with "max_size", so that
   we compare the *bigger* of the filepair with the break size.

 - It calculated a "merge_score", which was the score needed to merge it
   back together if nothing else wanted it. But even if it was *so*
   different that we would never want to merge it back, we wouldn't
   consider it a break! That makes no sense. So I added

	if (*merge_score_p &gt; break_score)
		return 1;

   to make it clear that if we wouldn't want to merge it at the end, it
   was *definitely* a break.

 - It compared the whole "extent of damage", counting all inserts and
   deletes, but it based this score on the "base_size", and generated the
   damage score with

	delta_size = src_removed + literal_added;
	damage_score = delta_size * MAX_SCORE / base_size;

   but that makes no sense either, since quite often, this will result in
   a number that is *bigger* than MAX_SCORE! Why? Because base_size is
   (again) the smaller of the two files we compare, and when you start out
   from a small file and add a lot (or start out from a large file and
   remove a lot), the base_size is going to be much smaller than the
   damage!

   Again, the fix was to replace "base_size" with "max_size", at which
   point the damage actually becomes a sane percentage of the whole.
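
The three fixes above can be sketched as follows. This is a simplified
Python model, not the C code in diffcore-break.c: the function
signature and the constant values are assumptions chosen for
illustration, and any further checks the real function performs are
left out. Only the names max_size, merge_score, break_score,
src_removed, literal_added, MAX_SCORE and MINIMUM_BREAK_SIZE are taken
from the text.

```python
MAX_SCORE = 60000          # assumed scale on which scores are expressed
MINIMUM_BREAK_SIZE = 400   # assumed size threshold, in bytes


def should_break(src_size, dst_size, src_removed, literal_added,
                 merge_score, break_score):
    """Return True when the filepair looks like a complete rewrite."""
    # Fix 1: compare the *bigger* side with the minimum break size.
    max_size = max(src_size, dst_size)
    if MINIMUM_BREAK_SIZE > max_size:
        return False  # both sides are too small to bother breaking

    # Fix 2: a pair we would never merge back is definitely a break.
    if merge_score > break_score:
        return True

    # Fix 3: express the damage relative to the bigger side.
    delta_size = src_removed + literal_added
    damage_score = delta_size * MAX_SCORE // max_size
    return damage_score >= break_score
```

With these assumed values, a 10000-byte file rewritten into a 100-byte
stub is no longer skipped by the size check, because max_size is 10000
rather than 100.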

With these changes in place, not only does "git log -B --follow" work for
the case that triggered this in the first place, ie now

	git log -B --follow arch/x86/kernel/vmlinux_64.lds.S

actually gives reasonable results. But I also wanted to verify it in
general, by doing a full-history

	git log --stat -B -C

on my kernel tree with the old code and the new code.

There's some tweaking to be done, but generally, the new code generates
much better results wrt breaking up files (and then finding better rename
candidates). Here's a few examples of the "--stat" output:

 - This:
	include/asm-x86/Kbuild        |    2 -
	include/asm-x86/debugreg.h    |   79 +++++++++++++++++++++++++++++++++++------
	include/asm-x86/debugreg_32.h |   64 ---------------------------------
	include/asm-x86/debugreg_64.h |   65 ---------------------------------
	4 files changed, 68 insertions(+), 142 deletions(-)

      Becomes:

	include/asm-x86/Kbuild                        |    2 -
	include/asm-x86/{debugreg_64.h =&gt; debugreg.h} |    9 +++-
	include/asm-x86/debugreg_32.h                 |   64 -------------------------
	3 files changed, 7 insertions(+), 68 deletions(-)

 - This:
	include/asm-x86/bug.h    |   41 +++++++++++++++++++++++++++++++++++++++--
	include/asm-x86/bug_32.h |   37 -------------------------------------
	include/asm-x86/bug_64.h |   34 ----------------------------------
	3 files changed, 39 insertions(+), 73 deletions(-)

      Becomes:

	include/asm-x86/{bug_64.h =&gt; bug.h} |   20 +++++++++++++-----
	include/asm-x86/bug_32.h            |   37 -----------------------------------
	2 files changed, 14 insertions(+), 43 deletions(-)

Now, in some other cases, it does actually turn a rename into a real
"delete+create" pair, and then the diff is usually bigger, so truth in
advertising: it doesn't always generate a nicer diff. But for what -B was
meant for, I think this is a big improvement, and I suspect those cases
where it generates a bigger diff are tweakable.

So I think this diff fixes a real bug, but we might still want to tweak
the default values and perhaps the exact rules for when a break happens.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Shawn O. Pearce &lt;spearce@spearce.org&gt;
</content>
</entry>
<entry>
<title>diffcore_count_changes: pass diffcore_filespec</title>
<updated>2007-07-01T03:51:31Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2007-06-29T05:54:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=d8c3d03a0b7f10977dd508a5a965a417b7f1b065'/>
<id>urn:sha1:d8c3d03a0b7f10977dd508a5a965a417b7f1b065</id>
<content type='text'>
We may want to use richer information on the data we are dealing
with in this function, so instead of passing a buffer address
and length, just pass the diffcore_filespec structure.  Existing
callers always call this function with parameters taken from a
filespec anyway, so there is no change in functionality.

Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Cast 64 bit off_t to 32 bit size_t</title>
<updated>2007-03-07T19:15:26Z</updated>
<author>
<name>Shawn O. Pearce</name>
<email>spearce@spearce.org</email>
</author>
<published>2007-03-07T01:44:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=dc49cd769b5fa6b7e0114b051c34a849828a7603'/>
<id>urn:sha1:dc49cd769b5fa6b7e0114b051c34a849828a7603</id>
<content type='text'>
Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4.
This implies that we are able to access and work on files whose
maximum length is around 2^63-1 bytes, but we can only malloc or
mmap somewhat less than 2^32-1 bytes of memory.

On such a system an implicit conversion of off_t to size_t can cause
the size_t to wrap, resulting in unexpected and exciting behavior.
Right now we are working around all gcc warnings generated by the
-Wshorten-64-to-32 option by passing the off_t through xsize_t().

In the future we should make xsize_t on such problematic platforms
detect the wrapping and die if such a file is accessed.
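
A sketch of the wrap detection proposed here, modelled in Python with
an explicit bit width because Python integers do not overflow. The
32-bit SIZE_MAX constant, the wrapping_cast helper and the use of
OverflowError in place of die() are assumptions for illustration.

```python
SIZE_MAX = 2**32 - 1  # assumed: a platform with a 32-bit size_t


def wrapping_cast(length):
    """Model an unchecked conversion: only the low 32 bits survive."""
    return length % (SIZE_MAX + 1)


def xsize_t(length):
    """Convert a 64-bit file length, refusing values that would wrap."""
    if length > SIZE_MAX or 0 > length:
        raise OverflowError("length %d does not fit in size_t" % length)
    return length
```

A file of 2**32 + 5 bytes would silently become 5 bytes under
wrapping_cast, while xsize_t stops with an error instead.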

Signed-off-by: Shawn O. Pearce &lt;spearce@spearce.org&gt;
Signed-off-by: Junio C Hamano &lt;junkio@cox.net&gt;
</content>
</entry>
<entry>
<title>Do not use memcmp(sha1_1, sha1_2, 20) with hardcoded length.</title>
<updated>2006-08-17T21:23:53Z</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2006-08-17T18:54:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=a89fccd28197fa179828c8596791ff16e2268d20'/>
<id>urn:sha1:a89fccd28197fa179828c8596791ff16e2268d20</id>
<content type='text'>
Introduces global inline:

	hashcmp(const unsigned char *sha1, const unsigned char *sha2)

Uses memcmp for comparison and returns the result based on the length of
the hash name (a future runtime decision).

Acked-by: Alex Riesen &lt;raa.lkml@gmail.com&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Junio C Hamano &lt;junkio@cox.net&gt;
</content>
</entry>
<entry>
<title>diffcore-rename: somewhat optimized.</title>
<updated>2006-03-12T11:22:10Z</updated>
<author>
<name>Junio C Hamano</name>
<email>junkio@cox.net</email>
</author>
<published>2006-03-12T11:22:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=c06c79667c9514aed00d29bcd80bd0cee7cc5a25'/>
<id>urn:sha1:c06c79667c9514aed00d29bcd80bd0cee7cc5a25</id>
<content type='text'>
This changes diffcore-rename to reuse statistics information
gathered during similarity estimation, and updates the hashtable
implementation used to keep track of the statistics to be
denser.  This seems to give better performance.

Signed-off-by: Junio C Hamano &lt;junkio@cox.net&gt;
</content>
</entry>
</feed>
