author | Tom Lane <tgl@sss.pgh.pa.us> | 2015-11-28 13:42:27 -0500 |
---|---|---|
committer | Tom Lane <tgl@sss.pgh.pa.us> | 2015-11-28 13:42:27 -0500 |
commit | 8d32717b6bfaeda5b88b338dae728b47da19f4bb | |
tree | 27adac700e3928b0eee4648b33c86bc00cc34f1c /src/backend/utils/mb/conv.c | |
parent | 5afdfc9cbb29ffc6f6b557a06495672d3c09f688 |
Avoid doing encoding conversions by double-conversion via MULE_INTERNAL.

Previously, we did many conversions for Cyrillic and Central European
single-byte encodings by converting to a related MULE_INTERNAL coding
scheme before converting to the destination. This seems unnecessarily
inefficient. Moreover, if the conversion encounters an untranslatable
character, the error message will confusingly complain about failure
to convert to or from MULE_INTERNAL, rather than the user-visible
encodings. Worse still, this approach results in some completely
unnecessary conversion failures; there are cases where the chosen
MULE subset lacks characters that exist in both of the user-visible
encodings, causing a conversion failure that need not occur.

This patch fixes the first two of those deficiencies by introducing
a new local2local() conversion support subroutine for direct conversion
between any two single-byte character sets, and adding new conversion
tables where needed. However, I generated the new conversion tables by
testing PG 9.5's behavior, so that the actual conversion behavior is
bug-compatible with previous releases; the only user-visible behavior
change is that the error messages for conversion failures are saner.
Changes in the conversion behavior will probably ensue after discussion.

Interestingly, although this approach requires more tables, the .so files
actually end up smaller (at least on my x86_64 machine); the tables are
smaller than the management code needed for double conversion.

Per a complaint from Albe Laurenz.
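
To picture how a single direct table can reproduce the result of the old two-step conversion, here is a minimal sketch, not taken from the patch. It assumes a simplified one-byte intermediate form (real MULE_INTERNAL prefixes each non-ASCII character with a charset id byte), and the function name compose_tables and both input tables are hypothetical. The commit message says the actual tables were generated by testing PG 9.5's behavior, which is a different procedure.

```c
#include <assert.h>
#include <stddef.h>

#define HIGHBIT    0x80     /* same value as the HIGHBIT used in conv.c */
#define TABLE_SIZE 128      /* one entry per byte in 0x80..0xFF */

/*
 * Hypothetical helper: fold "source -> intermediate" and
 * "intermediate -> destination" tables into one direct table.
 *
 * Each table is indexed by (byte - HIGHBIT); an entry of 0 means
 * "no equivalent code", matching the convention in the patch.
 */
static void
compose_tables(const unsigned char *src_to_mid,
               const unsigned char *mid_to_dest,
               unsigned char *direct)
{
    assert(src_to_mid != NULL && mid_to_dest != NULL && direct != NULL);

    for (size_t i = 0; i < TABLE_SIZE; i++)
    {
        unsigned char mid = src_to_mid[i];

        if (mid == 0)
            direct[i] = 0;                            /* first step fails */
        else if (mid < HIGHBIT)
            direct[i] = mid;                          /* ASCII passes through */
        else
            direct[i] = mid_to_dest[mid - HIGHBIT];   /* may still be 0 */
    }
}
```

A direct table built this way fails on exactly the same input bytes as the two-step path, which is the sense in which the new tables are bug-compatible: the failures remain, but they can now be reported against the user-visible encodings.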
Diffstat (limited to 'src/backend/utils/mb/conv.c')
-rw-r--r-- | src/backend/utils/mb/conv.c | 55 |
1 files changed, 50 insertions, 5 deletions
diff --git a/src/backend/utils/mb/conv.c b/src/backend/utils/mb/conv.c
index f957b6efd32..9757dbabec5 100644
--- a/src/backend/utils/mb/conv.c
+++ b/src/backend/utils/mb/conv.c
@@ -15,6 +15,51 @@
 /*
+ * local2local: a generic single byte charset encoding
+ * conversion between two ASCII-superset encodings.
+ *
+ * l points to the source string of length len
+ * p is the output area (must be large enough!)
+ * src_encoding is the PG identifier for the source encoding
+ * dest_encoding is the PG identifier for the target encoding
+ * tab holds conversion entries for the source charset
+ * starting from 128 (0x80). each entry in the table holds the corresponding
+ * code point for the target charset, or 0 if there is no equivalent code.
+ */
+void
+local2local(const unsigned char *l,
+            unsigned char *p,
+            int len,
+            int src_encoding,
+            int dest_encoding,
+            const unsigned char *tab)
+{
+    unsigned char c1,
+                c2;
+
+    while (len > 0)
+    {
+        c1 = *l;
+        if (c1 == 0)
+            report_invalid_encoding(src_encoding, (const char *) l, len);
+        if (!IS_HIGHBIT_SET(c1))
+            *p++ = c1;
+        else
+        {
+            c2 = tab[c1 - HIGHBIT];
+            if (c2)
+                *p++ = c2;
+            else
+                report_untranslatable_char(src_encoding, dest_encoding,
+                                           (const char *) l, len);
+        }
+        l++;
+        len--;
+    }
+    *p = '\0';
+}
+
+/*
  * LATINn ---> MIC when the charset's local codes map directly to MIC
  *
  * l points to the source string of length len
@@ -141,8 +186,8 @@ pg_mic2ascii(const unsigned char *mic, unsigned char *p, int len)
  * lc is the mule character set id for the local encoding
  * encoding is the PG identifier for the local encoding
  * tab holds conversion entries for the local charset
- * starting from 128 (0x80). each entry in the table
- * holds the corresponding code point for the mule internal code.
+ * starting from 128 (0x80). each entry in the table holds the corresponding
+ * code point for the mule encoding, or 0 if there is no equivalent code.
  */
 void
 latin2mic_with_table(const unsigned char *l,
@@ -188,9 +233,9 @@ latin2mic_with_table(const unsigned char *l,
  * p is the output area (must be large enough!)
  * lc is the mule character set id for the local encoding
  * encoding is the PG identifier for the local encoding
- * tab holds conversion entries for the mule internal code's
- * second byte, starting from 128 (0x80). each entry in the table
- * holds the corresponding code point for the local charset.
+ * tab holds conversion entries for the mule internal code's second byte,
+ * starting from 128 (0x80). each entry in the table holds the corresponding
+ * code point for the local charset, or 0 if there is no equivalent code.
  */
 void
 mic2latin_with_table(const unsigned char *mic,
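
The conversion loop added in the first hunk can be tried outside the backend with a small standalone adaptation. This is a sketch under stated assumptions, not the PostgreSQL function: the name local2local_sketch, the status codes, and the bad_offset parameter are hypothetical, and they stand in for report_invalid_encoding() and report_untranslatable_char(), which raise errors in the backend rather than returning.

```c
#include <assert.h>
#include <stddef.h>

#define HIGHBIT 0x80    /* same value as the HIGHBIT used in conv.c */

/* Hypothetical status codes; the real function reports errors instead. */
#define CONV_OK              0
#define CONV_INVALID        -1   /* embedded NUL byte in the source */
#define CONV_UNTRANSLATABLE -2   /* table entry is 0 for this byte */

/*
 * Convert len bytes from src into dst using tab, which is indexed by
 * (byte - HIGHBIT) and holds 0 where no equivalent exists. dst must have
 * room for len + 1 bytes. On failure, *bad_offset is set to the index of
 * the offending source byte.
 */
static int
local2local_sketch(const unsigned char *src, unsigned char *dst, size_t len,
                   const unsigned char *tab, size_t *bad_offset)
{
    assert(src != NULL && dst != NULL && tab != NULL && bad_offset != NULL);

    for (size_t i = 0; i < len; i++)
    {
        unsigned char c1 = src[i];
        unsigned char c2;

        if (c1 == 0)
        {
            *bad_offset = i;
            return CONV_INVALID;
        }
        if (c1 < HIGHBIT)
            c2 = c1;                    /* ASCII is copied unchanged */
        else
        {
            c2 = tab[c1 - HIGHBIT];     /* one lookup, no intermediate form */
            if (c2 == 0)
            {
                *bad_offset = i;
                return CONV_UNTRANSLATABLE;
            }
        }
        dst[i] = c2;
    }
    dst[len] = '\0';
    return CONV_OK;
}
```

Because the output is always the same length as the input for single-byte encodings, a caller only needs len + 1 bytes of output space, and a failure can name both user-visible encodings directly since no intermediate encoding is involved.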