<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/git.git/utf8.c, branch v2.26.0-rc2</title>
<subtitle>Git
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v2.26.0-rc2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/atom?h=v2.26.0-rc2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/'/>
<updated>2019-11-10T07:04:36Z</updated>
<entry>
<title>utf8: use skip_iprefix() in same_utf_encoding()</title>
<updated>2019-11-10T07:04:36Z</updated>
<author>
<name>René Scharfe</name>
<email>l.s.r@web.de</email>
</author>
<published>2019-11-08T20:25:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=89f8cabaf35f8a5f7e893f190764597ad5c44ef9'/>
<id>urn:sha1:89f8cabaf35f8a5f7e893f190764597ad5c44ef9</id>
<content type='text'>
Get rid of magic numbers by using skip_iprefix() and skip_prefix() for
parsing the leading "[uU][tT][fF]-?" of both strings instead of checking
with istarts_with() and an explicit comparison.

Signed-off-by: René Scharfe &lt;l.s.r@web.de&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>utf8: use ARRAY_SIZE() in git_wcwidth()</title>
<updated>2019-10-12T01:57:39Z</updated>
<author>
<name>Beat Bolli</name>
<email>dev+git@drbeat.li</email>
</author>
<published>2019-10-11T18:41:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=fa364ad790ae0a778ba14b5e5fc9fe910ab4b67e'/>
<id>urn:sha1:fa364ad790ae0a778ba14b5e5fc9fe910ab4b67e</id>
<content type='text'>
This macro has been available globally since b4f2a6ac92 ("Use #define
ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))", 2006-03-09), so let's use it.

Signed-off-by: Beat Bolli &lt;dev+git@drbeat.li&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>utf8: handle systems that don't write BOM for UTF-16</title>
<updated>2019-02-12T02:20:07Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2019-02-12T00:52:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=79444c92943048f9ac62e9311038ebe43f5f0982'/>
<id>urn:sha1:79444c92943048f9ac62e9311038ebe43f5f0982</id>
<content type='text'>
When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.

Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.

Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.

Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.

Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Support working-tree-encoding "UTF-16LE-BOM"</title>
<updated>2019-01-31T18:27:52Z</updated>
<author>
<name>Torsten Bögershausen</name>
<email>tboegi@web.de</email>
</author>
<published>2019-01-30T15:01:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=aab2a1ae48ff65781a5379a01a4abb4f75e5641d'/>
<id>urn:sha1:aab2a1ae48ff65781a5379a01a4abb4f75e5641d</id>
<content type='text'>
Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

The unicode standard itself defines 3 allowed ways how to encode UTF-16.
The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:

a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

Git uses libiconv to convert from UTF-8 in the index into ITF-16 in the
working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
in the version (c) above.
This is what iconv generates, more details follow below.

iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:

d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
0000000  376 377  \0   g  \0   i  \0   t

e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
0000000    g  \0   i  \0   t  \0

f)  UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
0000000   \0   g  \0   i  \0   t

There is no way to generate version (b) from above in a Git working tree,
but that is what some applications need.
(All fully unicode aware applications should be able to read all 3 variants,
but in practise we are not there yet).

When producing UTF-16 as an output, iconv generates the big endian version
with a BOM. (big endian is probably chosen for historical reasons).

iconv can produce UTF-16 files with little endianess by using "UTF-16LE"
as encoding, and that file does not have a BOM.

Not all users (especially under Windows) are happy with this.
Some tools are not fully unicode aware and can only handle version (b).

Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will
be used in all future iconv versions (for compatibility reasons).

Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv can not handle the encoding, so Git pick it up, handles the BOM
and uses libiconv to convert the rest of the stream.
(UTF-16BE-BOM is added for consistency)

Rported-by: Adrián Gimeno Balaguer &lt;adrigibal@gmail.com&gt;
Signed-off-by: Torsten Bögershausen &lt;tboegi@web.de&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jk/size-t'</title>
<updated>2018-08-15T22:08:25Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2018-08-15T22:08:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=7d020f5a78a1ae7ff47d65a2adc79171c0cf9cb7'/>
<id>urn:sha1:7d020f5a78a1ae7ff47d65a2adc79171c0cf9cb7</id>
<content type='text'>
Code clean-up to use size_t/ssize_t when they are the right type.

* jk/size-t:
  strbuf_humanise: use unsigned variables
  pass st.st_size as hint for strbuf_readlink()
  strbuf_readlink: use ssize_t
  strbuf: use size_t for length in intermediate variables
  reencode_string: use size_t for string lengths
  reencode_string: use st_add/st_mult helpers
</content>
</entry>
<entry>
<title>reencode_string: use size_t for string lengths</title>
<updated>2018-07-24T17:19:29Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2018-07-24T10:50:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=c7d017d7e1cca37ca20f73c11fa9f1b319a2c3a5'/>
<id>urn:sha1:c7d017d7e1cca37ca20f73c11fa9f1b319a2c3a5</id>
<content type='text'>
The iconv interface takes a size_t, which is the appropriate
type for an in-memory buffer. But our reencode_string_*
functions use integers, meaning we may get confusing results
when the sizes exceed INT_MAX. Let's use size_t
consistently.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reencode_string: use st_add/st_mult helpers</title>
<updated>2018-07-24T17:19:29Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2018-07-24T10:50:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=77aa03d6c7f07db4a5d34afe8f5b3a55e801057c'/>
<id>urn:sha1:77aa03d6c7f07db4a5d34afe8f5b3a55e801057c</id>
<content type='text'>
When converting a string with iconv, if the output buffer
isn't big enough, we grow it. But our growth is done without
any concern for integer overflow. So when we add:

  outalloc = sofar + insz * 2 + 32;

we may end up wrapping outalloc (which is a size_t), and
allocating a too-small buffer. We then manipulate it
further:

  outsz = outalloc - sofar - 1;

and feed outsz back to iconv. If outalloc is wrapped and
smaller than sofar, we'll end up with a small allocation but
feed a very large outsz to iconv, which could result in it
overflowing the buffer.

Can we use this to construct an attack wherein the victim
clones a repository with a very large commit object with an
encoding header, and running "git log" reencodes it into
utf8, causing an overflow?

An attack of this sort is likely impossible in practice.
"sofar" is how many output bytes we've written total, and
"insz" is the number of input bytes remaining. Imagine our
input doubles in size as we output it (which is easy to do
by converting latin1 to utf8, for example), and that we
start with N input bytes. Our initial output buffer also
starts at N bytes, so after the first call we'd have N/2
input bytes remaining (insz), and have written N bytes
(sofar). That means our next allocation will be
(N + N/2 * 2 + 32) bytes, or (2N + 32).

We can therefore overflow a 32-bit size_t with a commit
message that's just under 2^31 bytes, assuming it consists
mostly of "doubling" sequences (e.g., latin1 0xe1 which
becomes utf8 0xc3 0xa1).

But we'll never make it that far with such a message. We'll
be spending 2^31 bytes on the original string. And our
initial output buffer will also be 2^31 bytes. Which is not
going to succeed on a system with a 32-bit size_t, since
there will be other things using the address space, too. The
initial malloc will fail.

If we imagine instead that we can triple the size when
converting, then our second allocation becomes
(N + 2/3N * 2 + 32), or (7/3N + 32). That still requires two
allocations of 3/7 of our address space (6/7 of the total)
to succeed.

If we imagine we can quadruple, it becomes (5/2N + 32); we
need to be able to allocate 4/5 of the address space to
succeed.

This might start to get plausible. But is it possible to get
a 4-to-1 increase in size? Probably if you're converting to
some obscure encoding. But since git defaults to utf8 for
its output, that's the likely destination encoding for an
attack. And while there are 4-character utf8 sequences, it's
unlikely that you'd be able find a single-byte source
sequence in any encoding.

So this is certainly buggy code which should be fixed, but
it is probably not a useful attack vector.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>utf8.c: avoid char overflow</title>
<updated>2018-07-09T21:38:12Z</updated>
<author>
<name>Beat Bolli</name>
<email>dev+git@drbeat.li</email>
</author>
<published>2018-07-09T19:25:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=2b647a05d7223beacba076e45b3920a8621d28e7'/>
<id>urn:sha1:2b647a05d7223beacba076e45b3920a8621d28e7</id>
<content type='text'>
In ISO C, char constants must be in the range -128..127. Change the BOM
constants to char literals to avoid overflow.

Signed-off-by: Beat Bolli &lt;dev+git@drbeat.li&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Sync with Git 2.17.1</title>
<updated>2018-05-29T08:10:05Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2018-05-29T08:09:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=7913f53b5628997165e075008d6142da1c04271a'/>
<id>urn:sha1:7913f53b5628997165e075008d6142da1c04271a</id>
<content type='text'>
* maint: (25 commits)
  Git 2.17.1
  Git 2.16.4
  Git 2.15.2
  Git 2.14.4
  Git 2.13.7
  fsck: complain when .gitmodules is a symlink
  index-pack: check .gitmodules files with --strict
  unpack-objects: call fsck_finish() after fscking objects
  fsck: call fsck_finish() after fscking objects
  fsck: check .gitmodules content
  fsck: handle promisor objects in .gitmodules check
  fsck: detect gitmodules files
  fsck: actually fsck blob data
  fsck: simplify ".git" check
  index-pack: make fsck error message more specific
  verify_path: disallow symlinks in .gitmodules
  update-index: stat updated files earlier
  verify_dotfile: mention case-insensitivity in comment
  verify_path: drop clever fallthrough
  skip_prefix: add case-insensitive variant
  ...
</content>
</entry>
<entry>
<title>Sync with Git 2.14.4</title>
<updated>2018-05-22T05:15:14Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2018-05-22T05:15:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/git.git/commit/?id=9e0f06d55df5855178ff41342937943604f6e97c'/>
<id>urn:sha1:9e0f06d55df5855178ff41342937943604f6e97c</id>
<content type='text'>
* maint-2.14:
  Git 2.14.4
  Git 2.13.7
  verify_path: disallow symlinks in .gitmodules
  update-index: stat updated files earlier
  verify_dotfile: mention case-insensitivity in comment
  verify_path: drop clever fallthrough
  skip_prefix: add case-insensitive variant
  is_{hfs,ntfs}_dotgitmodules: add tests
  is_ntfs_dotgit: match other .git files
  is_hfs_dotgit: match other .git files
  is_ntfs_dotgit: use a size_t for traversing string
  submodule-config: verify submodule names as paths
</content>
</entry>
</feed>
