Avoid doing encoding conversions by double-conversion via MULE_INTERNAL.

Previously, we did many conversions for Cyrillic and Central European single-byte encodings by converting to a related MULE_INTERNAL coding scheme before converting to the destination. This seems unnecessarily inefficient. Moreover, if the conversion encounters an untranslatable character, the error message will confusingly complain about failure to convert to or from MULE_INTERNAL, rather than the user-visible encodings. Worse still, this approach results in some completely unnecessary conversion failures; there are cases where the chosen MULE subset lacks characters that exist in both of the user-visible encodings, causing a conversion failure that need not occur.

This patch fixes the first two of those deficiencies by introducing a new local2local() conversion support subroutine for direct conversion between any two single-byte character sets, and adding new conversion tables where needed. However, I generated the new conversion tables by testing PG 9.5's behavior, so that the actual conversion behavior is bug-compatible with previous releases; the only user-visible behavior change is that the error messages for conversion failures are saner. Changes in the conversion behavior will probably ensue after discussion.

Interestingly, although this approach requires more tables, the .so files actually end up smaller (at least on my x86_64 machine); the tables are smaller than the management code needed for double conversion.

Per a complaint from Albe Laurenz.
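
As a rough illustration of the direct table-driven conversion described above, here is a minimal standalone sketch. It is not the committed code: the name sketch_local2local, the return codes, the bad_offset parameter, and the contents of toy_tab are all assumptions made for this example, and the table values are invented rather than taken from any real encoding. The 128-entry layout (indexed from 0x80, with 0 meaning "no equivalent") follows the comment on local2local() in the diff below.

```c
#include <assert.h>
#include <stddef.h>

#define HIGHBIT 0x80
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & HIGHBIT)

/*
 * Hypothetical 128-entry table for source bytes 0x80..0xFF.
 * The two mappings are made up for illustration; every other
 * entry is 0, meaning "no equivalent in the target charset".
 */
static const unsigned char toy_tab[128] = {
	[0xC1 - HIGHBIT] = 0xE1,
	[0xC2 - HIGHBIT] = 0xE2
};

/*
 * Convert len bytes from src into dst (dst needs len + 1 bytes).
 * Returns 0 on success, -1 on an embedded NUL byte, or -2 on a
 * byte with no table entry; on failure *bad_offset is set to the
 * position of the offending byte.  (Assumed error convention; the
 * real function reports errors through the backend instead.)
 */
static int
sketch_local2local(const unsigned char *src, unsigned char *dst, int len,
				   const unsigned char *tab, int *bad_offset)
{
	int			i;

	assert(src != NULL && dst != NULL && tab != NULL && bad_offset != NULL);

	for (i = 0; i < len; i++)
	{
		unsigned char c1 = src[i];
		unsigned char c2;

		if (c1 == 0)
		{
			/* NUL is never valid inside the source string */
			*bad_offset = i;
			return -1;
		}

		/* ASCII passes through; high-bit bytes go through the table */
		c2 = IS_HIGHBIT_SET(c1) ? tab[c1 - HIGHBIT] : c1;

		if (c2 == 0)
		{
			/* table entry of 0: no equivalent in the target */
			*bad_offset = i;
			return -2;
		}
		dst[i] = c2;
	}
	dst[len] = '\0';
	return 0;
}
```

With this toy table, the input bytes 'a', 0xC1, 'b', 0xC2 convert to 'a', 0xE1, 'b', 0xE2, while an input containing 0xFF fails with -2. The committed function differs in that it calls report_invalid_encoding() and report_untranslatable_char() with the source and destination encoding identifiers, which is what lets the error message name the user-visible encodings.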
author: Tom Lane <tgl@sss.pgh.pa.us> 2015-11-28 13:42:27 -0500
committer: Tom Lane <tgl@sss.pgh.pa.us> 2015-11-28 13:42:27 -0500
commit: 8d32717b6bfaeda5b88b338dae728b47da19f4bb (patch)
tree: 27adac700e3928b0eee4648b33c86bc00cc34f1c /src/backend/utils/mb/conv.c
parent: 5afdfc9cbb29ffc6f6b557a06495672d3c09f688 (diff)
1 files changed, 50 insertions, 5 deletions
diff --git a/src/backend/utils/mb/conv.c b/src/backend/utils/mb/conv.c
index f957b6efd32..9757dbabec5 100644
--- a/src/backend/utils/mb/conv.c
+++ b/src/backend/utils/mb/conv.c
@@ -15,6 +15,51 @@
 
 
 /*
+ * local2local: a generic single byte charset encoding
+ * conversion between two ASCII-superset encodings.
+ *
+ * l points to the source string of length len
+ * p is the output area (must be large enough!)
+ * src_encoding is the PG identifier for the source encoding
+ * dest_encoding is the PG identifier for the target encoding
+ * tab holds conversion entries for the source charset
+ * starting from 128 (0x80). each entry in the table holds the corresponding
+ * code point for the target charset, or 0 if there is no equivalent code.
+ */
+void
+local2local(const unsigned char *l,
+			unsigned char *p,
+			int len,
+			int src_encoding,
+			int dest_encoding,
+			const unsigned char *tab)
+{
+	unsigned char c1,
+				c2;
+
+	while (len > 0)
+	{
+		c1 = *l;
+		if (c1 == 0)
+			report_invalid_encoding(src_encoding, (const char *) l, len);
+		if (!IS_HIGHBIT_SET(c1))
+			*p++ = c1;
+		else
+		{
+			c2 = tab[c1 - HIGHBIT];
+			if (c2)
+				*p++ = c2;
+			else
+				report_untranslatable_char(src_encoding, dest_encoding,
+										   (const char *) l, len);
+		}
+		l++;
+		len--;
+	}
+	*p = '\0';
+}
+
+/*
  * LATINn ---> MIC when the charset's local codes map directly to MIC
  *
  * l points to the source string of length len
@@ -141,8 +186,8 @@ pg_mic2ascii(const unsigned char *mic, unsigned char *p, int len)
  * lc is the mule character set id for the local encoding
  * encoding is the PG identifier for the local encoding
  * tab holds conversion entries for the local charset
- * starting from 128 (0x80). each entry in the table
- * holds the corresponding code point for the mule internal code.
+ * starting from 128 (0x80). each entry in the table holds the corresponding
+ * code point for the mule encoding, or 0 if there is no equivalent code.
  */
 void
 latin2mic_with_table(const unsigned char *l,
@@ -188,9 +233,9 @@ latin2mic_with_table(const unsigned char *l,
  * p is the output area (must be large enough!)
  * lc is the mule character set id for the local encoding
  * encoding is the PG identifier for the local encoding
- * tab holds conversion entries for the mule internal code's
- * second byte, starting from 128 (0x80). each entry in the table
- * holds the corresponding code point for the local charset.
+ * tab holds conversion entries for the mule internal code's second byte,
+ * starting from 128 (0x80). each entry in the table holds the corresponding
+ * code point for the local charset, or 0 if there is no equivalent code.
  */
 void
 mic2latin_with_table(const unsigned char *mic,