From 7bcc6d98fb5c3bda2787ae085ef3ff3dbb65ae42 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Wed, 5 Feb 2003 17:41:33 +0000 Subject: Replace regular expression package with Henry Spencer's latest version (extracted from Tcl 8.4.1 release, as Henry still hasn't got round to making it a separate library). This solves a performance problem for multibyte, as well as upgrading our regexp support to match recent Tcl and nearly match recent Perl. --- doc/src/sgml/func.sgml | 1124 ++++++++++++++++++++++++++++++++++++++++----- doc/src/sgml/release.sgml | 3 +- 2 files changed, 1001 insertions(+), 126 deletions(-) (limited to 'doc/src') diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b3de02ef067..baeef816181 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1,5 +1,5 @@ @@ -424,7 +424,7 @@ PostgreSQL documentation & binary AND - 91 & 15 + 91 & 15 11 @@ -471,7 +471,7 @@ PostgreSQL documentation The binary operators are also available for the bit string types BIT and BIT VARYING, as shown in . - Bit string arguments to &, |, + Bit string arguments to &, |, and # must be of equal length. When bit shifting, the original length of the string is preserved, as shown in the table. @@ -490,7 +490,7 @@ PostgreSQL documentation - B'10001' & B'01101' + B'10001' & B'01101' 00001 @@ -2629,7 +2629,7 @@ SUBSTRING('foobar' FROM '#"o_b#"%' FOR '#') NULL @@ -2640,110 +2640,319 @@ SUBSTRING('foobar' FROM 'o(.)b') o - + + PostgreSQL's regular expressions are implemented + using a package written by Henry Spencer. Much of + the description of regular expressions below is copied verbatim from his + manual entry. + + + + + + Regular Expression Details + Regular expressions (REs), as defined in - POSIX - 1003.2, come in two forms: modern REs (roughly those of - egrep; 1003.2 calls these - extended REs) and obsolete REs (roughly those of - ed; 1003.2 basic REs). - PostgreSQL implements the modern form. + POSIX 1003.2, come in two forms: + extended REs or EREs + (roughly those of egrep), and + basic REs or BREs + (roughly those of ed). + PostgreSQL supports both forms, and + also implements some extensions + that are not in the POSIX standard, but have become widely used anyway + due to their availability in programming languages such as Perl and Tcl. + REs using these non-POSIX extensions are called + advanced REs or AREs + in this documentation. We first describe the ERE/ARE flavor and then + mention the restrictions of the BRE form. - A (modern) RE is one or more non-empty + A regular expression is defined as one or more branches, separated by |. It matches anything that matches one of the branches. - A branch is one or more pieces, - concatenated. It matches a match for the first, followed by a - match for the second, etc. + A branch is zero or more quantified atoms or + constraints, concatenated. + It matches a match for the first, followed by a match for the second, etc; + an empty branch matches the empty string. - A piece is an atom possibly followed by a - single *, +, - ?, or bound. An atom - followed by * matches a sequence of 0 or more - matches of the atom. An atom followed by + - matches a sequence of 1 or more matches of the atom. An atom - followed by ? matches a sequence of 0 or 1 - matches of the atom. + A quantified atom is an atom possibly followed + by a single quantifier. + Without a quantifier, it matches a match for the atom. + With a quantifier, it can match some number of matches of the atom. + An atom can be any of the possibilities + shown in . + The possible quantifiers and their meanings are shown in + . - A bound is { followed by - an unsigned decimal integer, possibly followed by - , possibly followed by another unsigned decimal - integer, always followed by }. The integers - must lie between 0 and RE_DUP_MAX (255) - inclusive, and if there are two of them, the first may not exceed - the second. An atom followed by a bound containing one integer - i and no comma matches a sequence of - exactly i matches of the atom. An atom - followed by a bound containing one integer - i and a comma matches a sequence of - i or more matches of the atom. An atom - followed by a bound containing two integers - i and j - matches a sequence of i through - j (inclusive) matches of the atom. + A constraint matches an empty string, but matches only when + specific conditions are met. A constraint can be used where an atom + could be used, except it may not be followed by a quantifier. + The simple constraints are shown in + ; + some more constraints are described later. + + + + + Regular Expression Atoms + + + + + Atom + Description + + + + + + (re) + (where re is any regular expression) + matches a match for + re, with the match noted for possible reporting + + + + (?:re) + as above, but the match is not noted for reporting + (a non-capturing set of parentheses) + (AREs only) + + + + . + matches any single character + + + + [chars] + a bracket expression, + matching any one of the chars (see + for more detail) + + + + \k + (where k is a non-alphanumeric character) + matches that character taken as an ordinary character, + e.g. \\ matches a backslash character + + + + \c + where c is alphanumeric + (possibly followed by other characters) + is an escape, see + (AREs only; in EREs and BREs, this matches c) + + + + { + when followed by a character other than a digit, + matches the left-brace character {; + when followed by a digit, it is the beginning of a + bound (see below) + + + + x + where x is a single character with no other + significance, matches that character + + + +
+ + + An RE may not end with \. - A repetition operator (?, - *, +, or bounds) cannot - follow another repetition operator. A repetition operator cannot + Remember that the backslash (\) already has a special + meaning in PostgreSQL string literals. + To write a pattern constant that contains a backslash, + you must write two backslashes in the query. + + + + + Regular Expression Quantifiers + + + + + Quantifier + Matches + + + + + + * + a sequence of 0 or more matches of the atom + + + + + + a sequence of 1 or more matches of the atom + + + + ? + a sequence of 0 or 1 matches of the atom + + + + {m} + a sequence of exactly m matches of the atom + + + + {m,} + a sequence of m or more matches of the atom + + + + + {m,n} + a sequence of m through n + (inclusive) matches of the atom; m may not exceed + n + + + + *? + non-greedy version of * + + + + +? + non-greedy version of + + + + + ?? + non-greedy version of ? + + + + {m}? + non-greedy version of {m} + + + + {m,}? + non-greedy version of {m,} + + + + + {m,n}? + non-greedy version of {m,n} + + + +
+ + + The forms using {...} + are known as bounds. + The numbers m and n within a bound are + unsigned decimal integers with permissible values from 0 to 255 inclusive. + + + + Non-greedy quantifiers (available in AREs only) match the + same possibilities as their corresponding normal (greedy) + counterparts, but prefer the smallest number rather than the largest + number of matches. + See for more detail. + + + + + A quantifier cannot immediately follow another quantifier. + A quantifier cannot begin an expression or subexpression or follow ^ or |. - - An atom is a regular expression enclosed in - () (matching a match for the regular - expression), an empty set of () (matching the - null string), a bracket expression (see - below), . (matching any single character), - ^ (matching the null string at the beginning of the - input string), $ (matching the null string at the end - of the input string), a \ followed by one of the - characters ^.[$()|*+?{\ (matching that - character taken as an ordinary character), a \ - followed by any other character (matching that character taken as - an ordinary character, as if the \ had not been - present), or a single character with no other significance - (matching that character). A { followed by a - character other than a digit is an ordinary character, not the - beginning of a bound. It is illegal to end an RE with - \. - + + Regular Expression Constraints + + + + + Constraint + Description + + + + + + ^ + matches at the beginning of the string + + + + $ + matches at the end of the string + + + + (?=re) + positive lookahead matches at any point + where a substring matching re begins + (AREs only) + + + + (?!re) + negative lookahead matches at any point + where no substring matching re begins + (AREs only) + + + +
- Note that the backslash (\) already has a special - meaning in string - literals, so to write a pattern constant that contains a backslash - you must write two backslashes in the query. + Lookahead constraints may not contain back references + (see ), + and all parentheses within them are considered non-capturing. +
+ + + Bracket Expressions A bracket expression is a list of characters enclosed in []. It normally matches any single character from the list (but see below). If the list begins with ^, it matches any single character - (but see below) not from the rest of the list. If two characters + not from the rest of the list. + If two characters in the list are separated by -, this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. [0-9] in ASCII matches any decimal digit. It is illegal for two ranges to share an endpoint, e.g. a-c-e. Ranges are very - collating-sequence-dependent, and portable programs should avoid + collating-sequence-dependent, so portable programs should avoid relying on them. @@ -2754,11 +2963,13 @@ SUBSTRING('foobar' FROM 'o(.)b') o character, or the second endpoint of a range. To use a literal - as the first endpoint of a range, enclose it in [. and .] to make it a - collating element (see below). With the exception of these and - some combinations using [ (see next - paragraphs), all other special characters, including - \, lose their special significance within a - bracket expression. + collating element (see below). With the exception of these characters, + some combinations using [ + (see next paragraphs), and escapes (AREs only), all other special + characters lose their special significance within a bracket expression. + In particular, \ is not special when following + ERE or BRE rules, though it is special (as introducing an escape) + in AREs. @@ -2775,6 +2986,13 @@ SUBSTRING('foobar' FROM 'o(.)b') o chchcc. + + + PostgreSQL currently has no multi-character collating + elements. This information describes possible future behavior. + + + Within a bracket expression, a collating element enclosed in [= and =] is an equivalence @@ -2809,76 +3027,732 @@ SUBSTRING('foobar' FROM 'o(.)b') o There are two special cases of bracket expressions: the bracket expressions [[:<:]] and - [[:>:]] match the null string at the beginning + [[:>:]] are constraints, + matching empty strings at the beginning and end of a word respectively. A word is defined as a sequence - of word characters which is neither preceded nor followed by word - characters. A word character is an alnum character (as defined by + of word characters that is neither preceded nor followed by word + characters. A word character is an alnum character (as + defined by ctype3) or an underscore. This is an extension, compatible with but not - specified by POSIX 1003.2, and should be used with caution in - software intended to be portable to other systems. + specified by POSIX 1003.2, and should be used with + caution in software intended to be portable to other systems. + The constraint escapes described below are usually preferable (they + are no more standard, but are certainly easier to type). + + + + + Regular Expression Escapes + + + Escapes are special sequences beginning with \ + followed by an alphanumeric character. Escapes come in several varieties: + character entry, class shorthands, constraint escapes, and back references. + A \ followed by an alphanumeric character but not constituting + a valid escape is illegal in AREs. + In EREs, there are no escapes: outside a bracket expression, + a \ followed by an alphanumeric character merely stands for + that character as an ordinary character, and inside a bracket expression, + \ is an ordinary character. + (The latter is the one actual incompatibility between EREs and AREs.) + + + + Character-entry escapes exist to make it easier to specify + non-printing and otherwise inconvenient characters in REs. They are + shown in . + + + + Class-shorthand escapes provide shorthands for certain + commonly-used character classes. They are + shown in . + + + + A constraint escape is a constraint, + matching the empty string if specific conditions are met, + written as an escape. They are + shown in . + + + + A back reference (\n) matches the + same string matched by the previous parenthesized subexpression specified + by the number n + (see ). For example, + ([bc])\1 matches bb or cc + but not bc or cb. + The subexpression must entirely precede the back reference in the RE. + Subexpressions are numbered in the order of their leading parentheses. + Non-capturing parentheses do not define subexpressions. + + + + + Keep in mind that an escape's leading \ will need to be + doubled when entering the pattern as an SQL string constant. + + + + + Regular Expression Character-Entry Escapes + + + + + Escape + Description + + + + + + \a + alert (bell) character, as in C + + + + \b + backspace, as in C + + + + \B + synonym for \ to help reduce the need for backslash + doubling + + + + \cX + (where X is any character) the character whose + low-order 5 bits are the same as those of + X, and whose other bits are all zero + + + + \e + the character whose collating-sequence name + is ESC, + or failing that, the character with octal value 033 + + + + \f + formfeed, as in C + + + + \n + newline, as in C + + + + \r + carriage return, as in C + + + + \t + horizontal tab, as in C + + + + \uwxyz + (where wxyz is exactly four hexadecimal digits) + the Unicode character U+wxyz + in the local byte ordering + + + + \Ustuvwxyz + (where stuvwxyz is exactly eight hexadecimal + digits) + reserved for a somewhat-hypothetical Unicode extension to 32 bits + + + + + \v + vertical tab, as in C + + + + \xhhh + (where hhh is any sequence of hexadecimal + digits) + the character whose hexadecimal value is + 0xhhh + (a single character no matter how many hexadecimal digits are used) + + + + + \0 + the character whose value is 0 + + + + \xy + (where xy is exactly two octal digits, + and is not a back reference) + the character whose octal value is + 0xy + + + + \xyz + (where xyz is exactly three octal digits, + and is not a back reference) + the character whose octal value is + 0xyz + + + +
+ + + Hexadecimal digits are 0-9, + a-f, and A-F. + Octal digits are 0-7. + + + + The character-entry escapes are always taken as ordinary characters. + For example, \135 is ] in ASCII, but + \135 does not terminate a bracket expression. + + Regular Expression Class-Shorthand Escapes + + + + + Escape + Description + + + + + + \d + [[:digit:]] + + + + \s + [[:space:]] + + + + \w + [[:alnum:]_] + (note underscore is included) + + + + \D + [^[:digit:]] + + + + \S + [^[:space:]] + + + + \W + [^[:alnum:]_] + (note underscore is included) + + + +
+ - In the event that an RE could match more than one substring of a - given string, the RE matches the one starting earliest in the - string. If the RE could match more than one substring starting at - that point, it matches the longest. Subexpressions also match the - longest possible substrings, subject to the constraint that the - whole match be as long as possible, with subexpressions starting - earlier in the RE taking priority over ones starting later. Note - that higher-level subexpressions thus take priority over their - lower-level component subexpressions. + Within bracket expressions, \d, \s, + and \w lose their outer brackets, + and \D, \S, and \W are illegal. + (So, for example, [a-c\d] is equivalent to + [a-c[:digit:]]. + Also, [a-c\D], which is equivalent to + [a-c^[:digit:]], is illegal.) + + Regular Expression Constraint Escapes + + + + + Escape + Description + + + + + + \A + matches only at the beginning of the string + (see for how this differs from + ^) + + + + \m + matches only at the beginning of a word + + + + \M + matches only at the end of a word + + + + \y + matches only at the beginning or end of a word + + + + \Y + matches only at a point that is not the beginning or end of a + word + + + + \Z + matches only at the end of the string + (see for how this differs from + $) + + + +
+ + + A word is defined as in the specification of + [[:<:]] and [[:>:]] above. + Constraint escapes are illegal within bracket expressions. + + + + Regular Expression Back References + + + + + Escape + Description + + + + + + \m + (where m is a nonzero digit) + a back reference to the m'th subexpression + + + + \mnn + (where m is a nonzero digit, and + nn is some more digits, and the decimal value + mnn is not greater than the number of closing capturing + parentheses seen so far) + a back reference to the mnn'th subexpression + + + +
+ + + + There is an inherent historical ambiguity between octal character-entry + escapes and back references, which is resolved by heuristics, + as hinted at above. + A leading zero always indicates an octal escape. + A single non-zero digit, not followed by another digit, + is always taken as a back reference. + A multi-digit sequence not starting with a zero is taken as a back + reference if it comes after a suitable subexpression + (i.e. the number is in the legal range for a back reference), + and otherwise is taken as octal. + + +
+ + + Regular Expression Metasyntax + - Match lengths are measured in characters, not collating - elements. A null string is considered longer than no match at - all. For example, bb* matches the three middle - characters of abbbc, - (wee|week)(knights|nights) matches all ten - characters of weeknights, when - (.*).* is matched against - abc the parenthesized subexpression matches all - three characters, and when (a*)* is matched - against bc both the whole RE and the - parenthesized subexpression match the null string. + In addition to the main syntax described above, there are some special + forms and miscellaneous syntactic facilities available. - If case-independent matching is specified, the effect is much as - if all case distinctions had vanished from the alphabet. When an - alphabetic that exists in multiple cases appears as an ordinary - character outside a bracket expression, it is effectively + Normally the flavor of RE being used is specified by + application-dependent means. + However, this can be overridden by a director. + If an RE of any flavor begins with ***:, + the rest of the RE is an ARE. + If an RE of any flavor begins with ***=, + the rest of the RE is taken to be a literal string, + with all characters considered ordinary characters. + + + + An ARE may begin with embedded options: + a sequence (?xyz) + (where xyz is one or more alphabetic characters) + specifies options affecting the rest of the RE. + These supplement, and can override, + any options specified externally. + The available option letters are + shown in . + + + + ARE Embedded-Option Letters + + + + + Option + Description + + + + + + b + rest of RE is a BRE + + + + c + case-sensitive matching (usual default) + + + + e + rest of RE is an ERE + + + + i + case-insensitive matching (see + ) + + + + m + historical synonym for n + + + + n + newline-sensitive matching (see + ) + + + + p + partial newline-sensitive matching (see + ) + + + + q + rest of RE is a literal (quoted) string, all ordinary + characters + + + + s + non-newline-sensitive matching (usual default) + + + + t + tight syntax (usual default; see below) + + + + w + inverse partial newline-sensitive (weird) matching + (see ) + + + + x + expanded syntax (see below) + + + +
+ + + Embedded options take effect at the ) terminating the sequence. + They are available only at the start of an ARE, + and may not be used later within it. + + + + In addition to the usual (tight) RE syntax, in which all + characters are significant, there is an expanded syntax, + available by specifying the embedded x option. + In the expanded syntax, + white-space characters in the RE are ignored, as are + all characters between a # + and the following newline (or the end of the RE). This + permits paragraphing and commenting a complex RE. + There are three exceptions to that basic rule: + + + + + a white-space character or # preceded by \ is + retained + + + + + white space or # within a bracket expression is retained + + + + + white space and comments are illegal within multi-character symbols, + like the ARE (?: or the BRE \( + + + + + Expanded-syntax white-space characters are blank, tab, newline, and + any character that belongs to the space character class. + + + + Finally, in an ARE, outside bracket expressions, the sequence + (?#ttt) + (where ttt is any text not containing a )) + is a comment, completely ignored. + Again, this is not allowed between the characters of + multi-character symbols, like (?:. + Such comments are more a historical artifact than a useful facility, + and their use is deprecated; use the expanded syntax instead. + + + + None of these metasyntax extensions is available if + an initial ***= director + has specified that the user's input be treated as a literal string + rather than as an RE. + +
+ + + Regular Expression Matching Rules + + + In the event that an RE could match more than one substring of a given + string, the RE matches the one starting earliest in the string. + If the RE could match more than one substring starting at that point, + its choice is determined by its preference: + either the longest substring, or the shortest. + + + + Most atoms, and all constraints, have no preference. + A parenthesized RE has the same preference (possibly none) as the RE. + A quantified atom with quantifier + {m} + or + {m}? + has the same preference (possibly none) as the atom itself. + A quantified atom with other normal quantifiers (including + {m,n} + with m equal to n) + prefers longest match. + A quantified atom with other non-greedy quantifiers (including + {m,n}? + with m equal to n) + prefers shortest match. + A branch has the same preference as the first quantified atom in it + which has a preference. + An RE consisting of two or more branches connected by the + | operator prefers longest match. + + + + Subject to the constraints imposed by the rules for matching the whole RE, + subexpressions also match the longest or shortest possible substrings, + based on their preferences, + with subexpressions starting earlier in the RE taking priority over + ones starting later. + Note that outer subexpressions thus take priority over + their component subexpressions. + + + + The quantifiers {1,1} and {1,1}? + can be used to force longest and shortest preference, respectively, + on a subexpression or a whole RE. + + + + Match lengths are measured in characters, not collating elements. + An empty string is considered longer than no match at all. + For example: + bb* + matches the three middle characters of abbbc; + (week|wee)(night|knights) + matches all ten characters of weeknights; + when (.*).* + is matched against abc the parenthesized subexpression + matches all three characters; and when + (a*)* is matched against bc + both the whole RE and the parenthesized + subexpression match an empty string. + + + + If case-independent matching is specified, + the effect is much as if all case distinctions had vanished from the + alphabet. + When an alphabetic that exists in multiple cases appears as an + ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, - e.g. x becomes [xX]. When - it appears inside a bracket expression, all case counterparts of - it are added to the bracket expression, so that (e.g.) - [x] becomes [xX] and - [^x] becomes [^xX]. + e.g. x becomes [xX]. + When it appears inside a bracket expression, all case counterparts + of it are added to the bracket expression, e.g. + [x] becomes [xX] + and [^x] becomes [^xX]. - There is no particular limit on the length of REs, except insofar - as memory is limited. Memory usage is approximately linear in RE - size, and largely insensitive to RE complexity, except for bounded - repetitions. Bounded repetitions are implemented by macro - expansion, which is costly in time and space if counts are large - or bounded repetitions are nested. An RE like, say, - ((((a{1,100}){1,100}){1,100}){1,100}){1,100} - will (eventually) run almost any existing machine out of swap - space. - - - This was written in 1994, mind you. The - numbers have probably changed, but the problem - persists. - - + If newline-sensitive matching is specified, . + and bracket expressions using ^ + will never match the newline character + (so that matches will never cross newlines unless the RE + explicitly arranges it) + and ^and $ + will match the empty string after and before a newline + respectively, in addition to matching at beginning and end of string + respectively. + But the ARE escapes \A and \Z + continue to match beginning or end of string only. - - + + If partial newline-sensitive matching is specified, + this affects . and bracket expressions + as with newline-sensitive matching, but not ^ + and $. + + + + If inverse partial newline-sensitive matching is specified, + this affects ^ and $ + as with newline-sensitive matching, but not . + and bracket expressions. + This isn't very useful but is provided for symmetry. + + + + + Limits and Compatibility + + + No particular limit is imposed on the length of REs in this + implementation. However, + programs intended to be highly portable should not employ REs longer + than 256 bytes, + as a POSIX-compliant implementation can refuse to accept such REs. + + + + The only feature of AREs that is actually incompatible with + POSIX EREs is that \ does not lose its special + significance inside bracket expressions. + All other ARE features use syntax which is illegal or has + undefined or unspecified effects in POSIX EREs; + the *** syntax of directors likewise is outside the POSIX + syntax for both BREs and EREs. + + + + Many of the ARE extensions are borrowed from Perl, but some have + been changed to clean them up, and a few Perl extensions are not present. + Incompatibilities of note include \b, \B, + the lack of special treatment for a trailing newline, + the addition of complemented bracket expressions to the things + affected by newline-sensitive matching, + the restrictions on parentheses and back references in lookahead + constraints, and the longest/shortest-match (rather than first-match) + matching semantics. + + + + Two significant incompatibilites exist between AREs and the ERE syntax + recognized by pre-7.4 releases of PostgreSQL: + + + + + In AREs, \ followed by an alphanumeric character is either + an escape or an error, while in previous releases, it was just another + way of writing the alphanumeric. + This should not be much of a problem because there was no reason to + write such a sequence in earlier releases. + + + + + In AREs, \ remains a special character within + [], so a literal \ within a bracket + expression must be written \\. + + + + + + + + Basic Regular Expressions + + + BREs differ from EREs in several respects. + |, +, and ? + are ordinary characters and there is no equivalent + for their functionality. + The delimiters for bounds are + \{ and \}, + with { and } + by themselves ordinary characters. + The parentheses for nested subexpressions are + \( and \), + with ( and ) by themselves ordinary characters. + ^ is an ordinary character except at the beginning of the + RE or the beginning of a parenthesized subexpression, + $ is an ordinary character except at the end of the + RE or the end of a parenthesized subexpression, + and * is an ordinary character if it appears at the beginning + of the RE or the beginning of a parenthesized subexpression + (after a possible leading ^). + Finally, single-digit back references are available, and + \< and \> + are synonyms for + [[:<:]] and [[:>:]] + respectively; no other escapes are available. + + + + + + diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml index 354b70cc073..b4eabbcb777 100644 --- a/doc/src/sgml/release.sgml +++ b/doc/src/sgml/release.sgml @@ -1,5 +1,5 @@ @@ -24,6 +24,7 @@ CDATA means the content is "SGML-free", so you can write without worries about funny characters. -->