Change initdb and CREATE DATABASE to actively reject attempts to create
databases with encodings that are incompatible with the server's LC_CTYPE
locale, when we can determine that (which we can on most modern platforms,
I believe). C/POSIX locale is compatible with all encodings, of course, so
there is still some usefulness to CREATE DATABASE's ENCODING option, but
this will insulate us against all sorts of recurring complaints caused by
mismatched settings. I moved initdb's existing LC_CTYPE-to-encoding mapping
knowledge into a new src/port/ file so it could be shared by CREATE DATABASE.
author: Tom Lane <tgl@sss.pgh.pa.us> 2007-09-28 22:25:49 +0000
committer: Tom Lane <tgl@sss.pgh.pa.us> 2007-09-28 22:25:49 +0000
commit: 70b9b9b788ceb8d16479fb3e6c5a4a5784a45766 (patch)
tree: 38f09e2adecd5159ac0a0b36524844b8c9f2abd8 /doc/src
parent: ae0b90f223b5cddce80353793340e77c58e215c1 (diff)
2 files changed, 53 insertions, 24 deletions
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index cca029ea565..f54201fd268 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/charset.sgml,v 2.83 2007/04/15 10:56:25 ishii Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/charset.sgml,v 2.84 2007/09/28 22:25:49 tgl Exp $ -->
 
 <chapter id="charset">
  <title>Localization</>
@@ -249,7 +249,7 @@ initdb --locale=sv_SE
    <title>Problems</>
 
    <para>
-    If locale support doesn't work in spite of the explanation above,
+    If locale support doesn't work according to the explanation above,
     check that the locale support in your operating system is
     correctly configured.  To check what locales are installed on your
     system, you can use the command <literal>locale -a</literal> if
@@ -301,7 +301,8 @@ initdb --locale=sv_SE
 
   <para>
    The character set support in <productname>PostgreSQL</productname>
-   allows you to store text in a variety of character sets, including
+   allows you to store text in a variety of character sets (also called
+   encodings), including
    single-byte character sets such as the ISO 8859 series and
    multiple-byte character sets such as <acronym>EUC</> (Extended Unix
    Code), UTF-8, and Mule internal code.  All supported character sets
@@ -314,6 +315,20 @@ initdb --locale=sv_SE
    databases each with a different character set.
   </para>
 
+  <para>
+   An important restriction, however, is that each database character set
+   must be compatible with the server's <envar>LC_CTYPE</> setting.
+   When <envar>LC_CTYPE</> is <literal>C</> or <literal>POSIX</>, any
+   character set is allowed, but for other settings of <envar>LC_CTYPE</>
+   there is only one character set that will work correctly.
+   Since the <envar>LC_CTYPE</> setting is frozen by <command>initdb</>, the
+   apparent flexibility to use different encodings in different databases
+   of a cluster is more theoretical than real, except when you select
+   <literal>C</> or <literal>POSIX</> locale (thus disabling any real locale
+   awareness).  It is likely that these mechanisms will be revisited in future
+   versions of <productname>PostgreSQL</productname>.
+  </para>
+
    <sect2 id="multibyte-charset-supported">
     <title>Supported Character Sets</title>
 
@@ -716,7 +731,8 @@ initdb -E EUC_JP
     </para>
 
     <para>
-     You can create a database with a different character set:
+     If you have selected <literal>C</> or <literal>POSIX</> locale,
+     you can create a database with a different character set:
 
 <screen>
 createdb -E EUC_KR korean
@@ -731,7 +747,7 @@ CREATE DATABASE korean WITH ENCODING 'EUC_KR';
 </programlisting>
 
      The encoding for a database is stored in the system catalog
-     <literal>pg_database</literal>.  You can see that by using the
+     <literal>pg_database</literal>.  You can see it by using the
      <option>-l</option> option or the <command>\l</command> command
      of <command>psql</command>.
 
@@ -756,26 +772,23 @@ $ <userinput>psql -l</userinput>
 
     <important>
      <para>
-      Although you can specify any encoding you want for a database, it is
-      unwise to choose an encoding that is not what is expected by the locale
-      you have selected.  The <literal>LC_COLLATE</literal> and
-      <literal>LC_CTYPE</literal> settings imply a particular encoding,
-      and locale-dependent operations (such as sorting) are likely to
-      misinterpret data that is in an incompatible encoding.
-     </para>
-
-     <para>
-      Since these locale settings are frozen by <command>initdb</>, the
-      apparent flexibility to use different encodings in different databases
-      of a cluster is more theoretical than real.  It is likely that these
-      mechanisms will be revisited in future versions of
-      <productname>PostgreSQL</productname>.
+      On most modern operating systems, <productname>PostgreSQL</productname>
+      can determine which character set is implied by an <envar>LC_CTYPE</>
+      setting, and it will enforce that only the correct database encoding is
+      used.  On older systems it is your responsibility to ensure that you use
+      the encoding expected by the locale you have selected.  A mistake in
+      this area is likely to lead to strange misbehavior of locale-dependent
+      operations such as sorting.
      </para>
 
      <para>
-      One way to use multiple encodings safely is to set the locale to
-      <literal>C</> or <literal>POSIX</> during <command>initdb</>, thus
-      disabling any real locale awareness.
+      <productname>PostgreSQL</productname> will allow superusers to create
+      databases with <literal>SQL_ASCII</> encoding even when
+      <envar>LC_CTYPE</> is not <literal>C</> or <literal>POSIX</>.  As noted
+      above, <literal>SQL_ASCII</> does not enforce that the data stored in
+      the database has any particular encoding, and so this choice poses risks
+      of locale-dependent misbehavior.  Using this combination of settings is
+      deprecated and may someday be forbidden altogether.
      </para>
     </important>
    </sect2>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index d4301a73f6a..b1b13332456 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -1,5 +1,5 @@
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/ref/create_database.sgml,v 1.47 2007/01/31 23:26:03 momjian Exp $
+$PostgreSQL: pgsql/doc/src/sgml/ref/create_database.sgml,v 1.48 2007/09/28 22:25:49 tgl Exp $
 PostgreSQL documentation
 -->
 
@@ -107,7 +107,8 @@ CREATE DATABASE <replaceable class="PARAMETER">name</replaceable>
         to use the default encoding (namely, the encoding of the
         template database). The character sets supported by the
         <productname>PostgreSQL</productname> server are described in
-        <xref linkend="multibyte-charset-supported">.
+        <xref linkend="multibyte-charset-supported">. See below for
+        additional restrictions.
        </para>
       </listitem>
      </varlistentry>
@@ -179,6 +180,21 @@ CREATE DATABASE <replaceable class="PARAMETER">name</replaceable>
   </para>
 
   <para>
+   Any character set encoding specified for the new database must be
+   compatible with the server's <envar>LC_CTYPE</> locale setting.
+   If <envar>LC_CTYPE</> is <literal>C</> (or equivalently
+   <literal>POSIX</>), then all encodings are allowed, but for other
+   locale settings there is only one encoding that will work properly,
+   and so the apparent freedom to specify an encoding is illusory if
+   you didn't initialize the database cluster in <literal>C</> locale.
+   <command>CREATE DATABASE</> will allow superusers to specify
+   <literal>SQL_ASCII</> encoding regardless of the locale setting,
+   but this choice is deprecated and may result in misbehavior of
+   character-string functions if data that is not encoding-compatible
+   with the locale is stored in the database.
+  </para>
+
+  <para>
    The <literal>CONNECTION LIMIT</> option is only enforced approximately;
    if two new sessions start at about the same time when just one
    connection <quote>slot</> remains for the database, it is possible that
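
The compatibility rule described in the commit message and in the documentation changes above can be summarized in a few lines. The following is only an illustrative sketch in Python, not the server's actual code: the function name encoding_allowed and its parameters are hypothetical, and the encoding implied by the locale is passed in by the caller rather than looked up, since that mapping is platform-dependent.

```python
from typing import Optional


def encoding_allowed(lc_ctype: str,
                     locale_encoding: Optional[str],
                     requested: str,
                     is_superuser: bool = False) -> bool:
    """Hypothetical model of the check described in the commit.

    lc_ctype        -- the server's LC_CTYPE setting, frozen by initdb
    locale_encoding -- encoding implied by lc_ctype, or None when the
                       platform cannot determine it (assumed input)
    requested       -- encoding requested for the new database
    """
    # C/POSIX locale is compatible with all encodings.
    if lc_ctype in ("C", "POSIX"):
        return True
    # On platforms where the implied encoding cannot be determined,
    # nothing can be enforced; the choice is the user's responsibility.
    if locale_encoding is None:
        return True
    # For any other locale, only the implied encoding works correctly.
    if requested == locale_encoding:
        return True
    # Superusers may still choose SQL_ASCII (deprecated, risky).
    if requested == "SQL_ASCII" and is_superuser:
        return True
    return False


if __name__ == "__main__":
    # The locale/encoding pairing below is an assumed example.
    print(encoding_allowed("C", None, "EUC_KR"))
    print(encoding_allowed("sv_SE", "LATIN1", "EUC_KR"))
```

Under these assumptions, a cluster initialized in C locale accepts the EUC_KR database from the createdb example, while a cluster whose locale implies a different encoding rejects it.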