| field | value | date |
|---|---|---|
| author | Jeff Davis <jdavis@postgresql.org> | 2025-01-17 15:56:30 -0800 |
| committer | Jeff Davis <jdavis@postgresql.org> | 2025-01-17 15:56:30 -0800 |
| commit | d3d0983169130a9b81e3fe48d5c2ca4931480956 | |
| tree | 75e680ff03b4af3fd21a36be49515367133e6d02 /doc/src | |
| parent | 286a365b9c25479f8ad82043ed136748733adfa6 | |

Support PG_UNICODE_FAST locale in the builtin collation provider.

The PG_UNICODE_FAST locale uses code point sort order (fast,
memcmp-based) combined with Unicode character semantics. The character
semantics are based on Unicode full case mapping.

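
The link between code point sort order and a memcmp-based comparison can be checked outside the server: in UTF-8, comparing encoded bytes gives the same order as comparing code points. The sketch below is only an illustration in Python, not PostgreSQL code; the sample words are arbitrary, and sorting on the encoded bytes stands in for memcmp.

```python
# Illustrative only: Python stand-in for a memcmp-style comparison of UTF-8 text.
# The sample words are arbitrary and not taken from the commit.
words = ["zebra", "Zebra", "apple", "äpfel", "Ärger", "ωμέγα", "😀", "\uffee"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encodings byte by byte is what memcmp would do.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Both orders agree, which is why code point collation needs no lookup tables.
print(by_code_point == by_utf8_bytes)
```

Note that this order is not a natural-language order: uppercase ASCII letters sort before lowercase ones, and accented letters sort after all of ASCII.
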
Full case mapping can map a single code point to multiple code points,
such as "ß" uppercasing to "SS". Additionally, it handles
context-sensitive mappings like the "final sigma", and it uses
titlecase mappings such as "Dž" when titlecasing (rather than plain
uppercase mappings).

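
The three behaviors named above can be observed with Python's built-in string methods, which also apply Unicode full case mapping. This is a sketch for illustration only; it does not exercise the PostgreSQL implementation, and the sample strings are assumptions chosen for the demonstration.

```python
# Illustrative only: Python's built-in str methods, not PostgreSQL code.

# One code point can map to several: U+00DF (sharp s) uppercases to "SS".
sharp_s_upper = "straße".upper()

# Context-sensitive mapping: capital sigma lowercases to final sigma (U+03C2)
# at the end of a word and to U+03C3 elsewhere.
sigma_lower = "ΟΔΟΣ".lower()

# Titlecase differs from uppercase for digraph letters such as U+01C6.
dz_upper = "\u01c6".upper()  # U+01C4, both parts uppercase
dz_title = "\u01c6".title()  # U+01C5, only the first part uppercase

print(sharp_s_upper)
print([hex(ord(c)) for c in sigma_lower])
print(hex(ord(dz_upper)), hex(ord(dz_title)))
```
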
Importantly, the uppercasing of "ß" as "SS" is specifically mentioned
by the SQL standard. In Postgres, UCS_BASIC uses plain ASCII semantics
for case mapping and pattern matching, so if we changed it to use the
PG_UNICODE_FAST locale, it would offer better compliance with the
standard. For now, though, do not change the behavior of UCS_BASIC.

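
To make the contrast concrete, the sketch below compares an ASCII-only uppercase with a full Unicode uppercase. It is written in Python for illustration; `ascii_upper` is a hypothetical helper that approximates "plain ASCII semantics" and is not how UCS_BASIC is implemented.

```python
# Illustrative only: ascii_upper is a hypothetical helper, not PostgreSQL code.
def ascii_upper(text: str) -> str:
    """Uppercase only a-z and leave every other code point untouched."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)

word = "straße"

ascii_result = ascii_upper(word)  # sharp s is outside ASCII, so it is unchanged
full_result = word.upper()        # full case mapping expands sharp s to "SS"

print(ascii(ascii_result))
print(full_result)
```

Only the second result matches the "SS" uppercasing that the SQL standard mentions.
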
Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite

Diffstat (limited to 'doc/src')

| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | doc/src/sgml/charset.sgml | 29 |
| -rw-r--r-- | doc/src/sgml/ref/create_collation.sgml | 3 |
| -rw-r--r-- | doc/src/sgml/ref/create_database.sgml | 6 |
| -rw-r--r-- | doc/src/sgml/ref/initdb.sgml | 4 |

4 files changed, 35 insertions, 7 deletions

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6c633678790..99f01990004 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,8 +377,9 @@ initdb --locale-provider=icu --icu-locale=en
  <listitem>
  <para>
  The <literal>builtin</literal> provider uses built-in operations. Only
- the <literal>C</literal> and <literal>C.UTF-8</literal> locales are
- supported for this provider.
+ the <literal>C</literal>, <literal>C.UTF-8</literal>, and
+ <literal>PG_UNICODE_FAST</literal> locales are supported for this
+ provider.
  </para>
  <para>
  The <literal>C</literal> locale behavior is identical to the
@@ -392,6 +393,13 @@ initdb --locale-provider=icu --icu-locale=en
  regular expression character classes are based on the "POSIX
  Compatible" semantics, and the case mapping is the "simple" variant.
  </para>
+ <para>
+ The <literal>PG_UNICODE_FAST</literal> locale is available only when
+ the database encoding is <literal>UTF-8</literal>, and the behavior is
+ based on Unicode. The collation uses the code point values only. The
+ regular expression character classes are based on the "Standard"
+ semantics, and the case mapping is the "full" variant.
+ </para>
  </listitem>
  </varlistentry>
 
@@ -887,6 +895,23 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
  </varlistentry>
 
  <varlistentry>
+ <term><literal>pg_unicode_fast</literal></term>
+ <listitem>
+ <para>
+ This collation sorts by Unicode code point values rather than natural
+ language order. For the functions <function>lower</function>,
+ <function>initcap</function>, and <function>upper</function> it uses
+ Unicode full case mapping. For pattern matching (including regular
+ expressions), it uses the Standard variant of Unicode <ulink
+ url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
+ Properties</ulink>. Behavior is efficient and stable within a
+ <productname>Postgres</productname> major version. It is only
+ available for encoding <literal>UTF8</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
  <term><literal>pg_c_utf8</literal></term>
  <listitem>
  <para>
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index e34bfc97c3d..4af1836ae30 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -99,7 +99,8 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
  <para>
  If <replaceable>provider</replaceable> is <literal>builtin</literal>,
  then <replaceable>locale</replaceable> must be specified and set to
- either <literal>C</literal> or <literal>C.UTF-8</literal>.
+ either <literal>C</literal>, <literal>C.UTF-8</literal> or
+ <literal>PG_UNICODE_FAST</literal>.
  </para>
  </listitem>
  </varlistentry>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 7653cb902ee..a4b052ba08b 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -168,7 +168,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
  If <xref linkend="create-database-locale-provider"/> is
  <literal>builtin</literal>, then <replaceable>locale</replaceable> or
  <replaceable>builtin_locale</replaceable> must be specified and set to
- either <literal>C</literal> or <literal>C.UTF-8</literal>.
+ either <literal>C</literal>, <literal>C.UTF-8</literal>, or
+ <literal>PG_UNICODE_FAST</literal>.
  </para>
  <tip>
  <para>
@@ -233,7 +234,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
  </para>
  <para>
  The locales available for the <literal>builtin</literal> provider are
- <literal>C</literal> and <literal>C.UTF-8</literal>.
+ <literal>C</literal>, <literal>C.UTF-8</literal> and
+ <literal>PG_UNICODE_FAST</literal>.
  </para>
  </listitem>
  </varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0c32114cf70..0026318485a 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -295,8 +295,8 @@ PostgreSQL documentation
  <para>
  If <option>--locale-provider</option> is <literal>builtin</literal>,
  <option>--locale</option> or <option>--builtin-locale</option> must be
- specified and set to <literal>C</literal> or
- <literal>C.UTF-8</literal>.
+ specified and set to <literal>C</literal>, <literal>C.UTF-8</literal>
+ or <literal>PG_UNICODE_FAST</literal>.
  </para>
  </listitem>
  </varlistentry>
