| author | John Naylor <john.naylor@postgresql.org> | 2021-10-19 16:43:14 -0400 |
|---|---|---|
| committer | John Naylor <john.naylor@postgresql.org> | 2021-12-20 10:07:29 -0400 |
| commit | 911588a3f816d875261d8f7d89e2517978831cd5 (patch) | |
| tree | d85f5458e9a2e0b6f2ebf5ebb27c7db22561ce06 /src/include | |
| parent | e2c52beecdea152ca680a22ef35c6a7da55aa30f (diff) | |
Add fast path for validating UTF-8 text
Our previous validator used a traditional algorithm that performed
comparison and branching one byte at a time. It's useful in that
we always know exactly how many bytes we have validated, but that
precision comes at a cost. Input validation can show up prominently
in profiles of COPY FROM, and future improvements to COPY FROM such
as parallelism or faster line parsing will put more pressure on input
validation. Hence, add fast paths for both ASCII and multibyte UTF-8:
Use bitwise operations to check 16 bytes at a time for ASCII. If
that fails, use a "shift-based" DFA on those bytes to handle the
general case, including multibyte. These paths are relatively free
of branches and thus robust against all kinds of byte patterns. With
these algorithms, UTF-8 validation is several times faster, depending
on platform and the input byte distribution.
The previous coding in pg_utf8_verifystr() is retained for short
strings and for when the fast path returns an error.
Review, performance testing, and additional hacking by: Heikki
Linnakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and
Greg Stark
Discussion:
https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com
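To make the two-tier structure described above concrete, here is an illustrative, self-contained sketch. It is not code from this commit: the chunk test is a scalar stand-in for the word-at-a-time is_valid_ascii() shown in the diff below, the per-byte fallback is deliberately naive (unlike PostgreSQL's verifier it accepts overlong and surrogate encodings), and the function names and 16-byte stride are assumptions. The real pg_utf8_verifystr() additionally runs a shift-based DFA over non-ASCII chunks rather than dropping straight to one byte at a time; that code lives outside src/include and is not part of the diff shown here.

```c
/*
 * Illustrative sketch only (not code from this commit): a chunked ASCII
 * fast path layered over a per-byte fallback.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_LEN 16			/* assumed fast-path stride */

/* Scalar stand-in for is_valid_ascii(): no zero bytes, no high bits. */
static bool
chunk_is_ascii(const unsigned char *s, int len)
{
	for (int i = 0; i < len; i++)
		if (s[i] == 0 || s[i] >= 0x80)
			return false;
	return true;
}

/*
 * Naive structural check of one UTF-8 character; returns its length or -1.
 * Unlike PostgreSQL's verifier, this does not reject overlong or surrogate
 * encodings.
 */
static int
verify_one_char(const unsigned char *s, int len)
{
	int			l;

	if (s[0] == 0)
		return -1;
	if (s[0] < 0x80)
		return 1;
	if ((s[0] & 0xE0) == 0xC0)
		l = 2;
	else if ((s[0] & 0xF0) == 0xE0)
		l = 3;
	else if ((s[0] & 0xF8) == 0xF0)
		l = 4;
	else
		return -1;
	if (l > len)
		return -1;
	for (int i = 1; i < l; i++)
		if ((s[i] & 0xC0) != 0x80)
			return -1;
	return l;
}

/* Returns how many leading bytes of s form valid input. */
static int
verify_sketch(const unsigned char *s, int len)
{
	const unsigned char *start = s;

	/* Fast path: swallow whole chunks for as long as they are pure ASCII. */
	while (len >= CHUNK_LEN && chunk_is_ascii(s, CHUNK_LEN))
	{
		s += CHUNK_LEN;
		len -= CHUNK_LEN;
	}

	/* Slow path: short tails, multibyte sequences, or an actual error. */
	while (len > 0)
	{
		int			l = verify_one_char(s, len);

		if (l < 0)
			break;
		s += l;
		len -= l;
	}

	return (int) (s - start);
}

int
main(void)
{
	const char *ok = "plain ascii text, then caf\xc3\xa9";
	const char *bad = "valid prefix, then \xc3(broken sequence)";

	printf("%d of %zu bytes valid\n",
		   verify_sketch((const unsigned char *) ok, (int) strlen(ok)), strlen(ok));
	printf("%d of %zu bytes valid\n",
		   verify_sketch((const unsigned char *) bad, (int) strlen(bad)), strlen(bad));
	return 0;
}
```

The first input exercises both tiers (one all-ASCII chunk, then a multibyte tail); the second stops at the truncated sequence and reports only the valid prefix length.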
Diffstat (limited to 'src/include')
| -rw-r--r-- | src/include/mb/pg_wchar.h | 53 |
1 file changed, 53 insertions(+), 0 deletions(-)
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d93ccac2633..6bd996b3d0c 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -699,4 +699,57 @@ extern int mic2latin_with_table(const unsigned char *mic, unsigned char *p,
 extern WCHAR *pgwin32_message_to_UTF16(const char *str, int len, int *utf16len);
 #endif
 
+
+/*
+ * Verify a chunk of bytes for valid ASCII.
+ *
+ * Returns false if the input contains any zero bytes or bytes with the
+ * high-bit set. Input len must be a multiple of 8.
+ */
+static inline bool
+is_valid_ascii(const unsigned char *s, int len)
+{
+	uint64		chunk,
+				highbit_cum = UINT64CONST(0),
+				zero_cum = UINT64CONST(0x8080808080808080);
+
+	Assert(len % sizeof(chunk) == 0);
+
+	while (len > 0)
+	{
+		memcpy(&chunk, s, sizeof(chunk));
+
+		/*
+		 * Capture any zero bytes in this chunk.
+		 *
+		 * First, add 0x7f to each byte. This sets the high bit in each byte,
+		 * unless it was a zero. If any resulting high bits are zero, the
+		 * corresponding high bits in the zero accumulator will be cleared.
+		 *
+		 * If none of the bytes in the chunk had the high bit set, the max
+		 * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
+		 * and we don't need to worry about carrying over to the next byte.
+		 * If any input bytes did have the high bit set, it doesn't matter
+		 * because we check for those separately.
+		 */
+		zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
+
+		/* Capture any set bits in this chunk. */
+		highbit_cum |= chunk;
+
+		s += sizeof(chunk);
+		len -= sizeof(chunk);
+	}
+
+	/* Check if any high bits in the high bit accumulator got set. */
+	if (highbit_cum & UINT64CONST(0x8080808080808080))
+		return false;
+
+	/* Check if any high bits in the zero accumulator got cleared. */
+	if (zero_cum != UINT64CONST(0x8080808080808080))
+		return false;
+
+	return true;
+}
+
 #endif							/* PG_WCHAR_H */
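To see the accumulator trick from is_valid_ascii() in isolation, here is a small hypothetical harness that reimplements the same logic with <stdint.h> types standing in for PostgreSQL's uint64, UINT64CONST, and Assert, and prints the verdict for an all-ASCII chunk, a chunk containing a high-bit byte, and a chunk containing a zero byte.

```c
/*
 * Hypothetical standalone harness: the same accumulator trick as
 * is_valid_ascii(), for experimentation outside the PostgreSQL tree.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static bool
ascii_chunk_ok(const unsigned char *s, int len)
{
	uint64_t	chunk;
	uint64_t	highbit_cum = 0;
	uint64_t	zero_cum = UINT64_C(0x8080808080808080);

	assert(len % sizeof(chunk) == 0);

	while (len > 0)
	{
		memcpy(&chunk, s, sizeof(chunk));

		/*
		 * After adding 0x7f, a byte's high bit ends up clear only if the
		 * byte was zero (bytes that already had their high bit set are
		 * caught by the other accumulator).
		 */
		zero_cum &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);

		/* Any byte with its own high bit set marks the chunk as non-ASCII. */
		highbit_cum |= chunk;

		s += sizeof(chunk);
		len -= (int) sizeof(chunk);
	}

	return (highbit_cum & UINT64_C(0x8080808080808080)) == 0 &&
		zero_cum == UINT64_C(0x8080808080808080);
}

int
main(void)
{
	unsigned char buf[16];

	memset(buf, 'a', sizeof(buf));
	printf("all ASCII:     %d\n", ascii_chunk_ok(buf, (int) sizeof(buf)));	/* 1 */

	buf[5] = 0xC3;				/* a byte with the high bit set */
	printf("high-bit byte: %d\n", ascii_chunk_ok(buf, (int) sizeof(buf)));	/* 0 */

	buf[5] = 0;					/* an embedded zero byte instead */
	printf("zero byte:     %d\n", ascii_chunk_ok(buf, (int) sizeof(buf)));	/* 0 */

	return 0;
}
```

Built with any C99 compiler, the three calls should print 1, 0, and 0, matching the function's contract of rejecting both zero bytes and bytes with the high bit set.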
