Implement lookbehind constraints in our regular-expression engine.

A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use.

Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one.

I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases.

The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
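
The suspend-and-resume idea can be sketched with a toy example. This is not the engine's matchuntil(); every name below (toy_step, toy_resume, toy_matchuntil, toy_steps) is hypothetical, and the hard-wired DFA for ".*ab" merely stands in for a compiled lookbehind constraint such as "(?<=ab)". The point is only that remembering where the last scan stopped, and in which state, lets successive probes share one pass over the string.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical toy DFA for ".*ab": state 2 means "the text scanned so far
 * ends in ab", state 1 means it ends in "a", state 0 is everything else.
 */
static int
toy_step(int state, char c)
{
	if (c == 'a')
		return 1;
	if (c == 'b' && state == 1)
		return 2;
	return 0;
}

/* Saved scan position: an assumption standing in for the real bookkeeping */
struct toy_resume
{
	const char *lastpos;		/* where the previous scan stopped, or NULL */
	int			laststate;		/* DFA state reached at lastpos */
};

static long toy_steps = 0;		/* counts DFA steps, for illustration only */

/*
 * Does a match of the toy constraint end at "probe"?  If the previous call
 * stopped at or before probe, resume from there instead of rescanning from
 * the start of the string.
 */
static bool
toy_matchuntil(struct toy_resume *r, const char *start, const char *probe)
{
	const char *cp = start;
	int			state = 0;

	if (r->lastpos != NULL && r->lastpos <= probe)
	{
		cp = r->lastpos;
		state = r->laststate;
	}
	for (; cp < probe; cp++)
	{
		state = toy_step(state, *cp);
		toy_steps++;
	}
	r->lastpos = probe;
	r->laststate = state;
	return state == 2;
}
```

Probing every position of an N-character string in increasing order with one shared toy_resume costs N steps in total, whereas restarting from the string start on each probe would cost on the order of N^2.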
author: Tom Lane <tgl@sss.pgh.pa.us> 2015-10-30 19:14:19 -0400
committer: Tom Lane <tgl@sss.pgh.pa.us> 2015-10-30 19:14:19 -0400
commit: 12c9a04008870c283931d6b3b648ee21bbc2cfda (patch)
tree: 2afd1e048b3681e5a93b7d8b3c37968e71b2532d /src/backend/regex/regc_nfa.c
parent: c5057b2b34813ca114bc808cb56b7a7fcde64393 (diff)
1 files changed, 43 insertions, 0 deletions
diff --git a/src/backend/regex/regc_nfa.c b/src/backend/regex/regc_nfa.c
index 6f04321cd35..cd9a3239bd3 100644
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -1349,6 +1349,49 @@ cleartraverse(struct nfa * nfa,
 }
 
 /*
+ * single_color_transition - does getting from s1 to s2 cross one PLAIN arc?
+ *
+ * If traversing from s1 to s2 requires a single PLAIN match (possibly of any
+ * of a set of colors), return a state whose outarc list contains only PLAIN
+ * arcs of those color(s).  Otherwise return NULL.
+ *
+ * This is used before optimizing the NFA, so there may be EMPTY arcs, which
+ * we should ignore; the possibility of an EMPTY is why the result state could
+ * be different from s1.
+ *
+ * It's worth troubling to handle multiple parallel PLAIN arcs here because a
+ * bracket construct such as [abc] might yield either one or several parallel
+ * PLAIN arcs depending on earlier atoms in the expression.  We'd rather that
+ * that implementation detail not create user-visible performance differences.
+ */
+static struct state *
+single_color_transition(struct state * s1, struct state * s2)
+{
+	struct arc *a;
+
+	/* Ignore leading EMPTY arc, if any */
+	if (s1->nouts == 1 && s1->outs->type == EMPTY)
+		s1 = s1->outs->to;
+	/* Likewise for any trailing EMPTY arc */
+	if (s2->nins == 1 && s2->ins->type == EMPTY)
+		s2 = s2->ins->from;
+	/* Perhaps we could have a single-state loop in between, if so reject */
+	if (s1 == s2)
+		return NULL;
+	/* s1 must have at least one outarc... */
+	if (s1->outs == NULL)
+		return NULL;
+	/* ... and they must all be PLAIN arcs to s2 */
+	for (a = s1->outs; a != NULL; a = a->outchain)
+	{
+		if (a->type != PLAIN || a->to != s2)
+			return NULL;
+	}
+	/* OK, return s1 as the possessor of the relevant outarcs */
+	return s1;
+}
+
+/*
  * specialcolors - fill in special colors for an NFA
  */
 static void
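
To see what single_color_transition() accepts and rejects, here is a self-contained sketch that mirrors its checks on stripped-down structures. The toy_ types and the toy_link() helper are hypothetical simplifications, not the engine's struct state and struct arc: they keep only the fields the checks consult, and "ins" simply points at the most recently added inarc.

```c
#include <stddef.h>

enum toy_arctype { TOY_PLAIN, TOY_EMPTY };

struct toy_state;

struct toy_arc
{
	enum toy_arctype type;
	struct toy_state *from;
	struct toy_state *to;
	struct toy_arc *outchain;	/* next arc leaving "from" */
};

struct toy_state
{
	int			nins;			/* number of inarcs */
	int			nouts;			/* number of outarcs */
	struct toy_arc *ins;		/* most recently added inarc (simplified) */
	struct toy_arc *outs;		/* head of the outarc chain */
};

/* Hypothetical helper: wire arc "a" of the given type from "from" to "to" */
static void
toy_link(struct toy_arc *a, enum toy_arctype type,
		 struct toy_state *from, struct toy_state *to)
{
	a->type = type;
	a->from = from;
	a->to = to;
	a->outchain = from->outs;
	from->outs = a;
	from->nouts++;
	to->ins = a;
	to->nins++;
}

/* Same sequence of checks as in the patch, applied to the toy structures */
static struct toy_state *
toy_single_color_transition(struct toy_state *s1, struct toy_state *s2)
{
	struct toy_arc *a;

	/* Skip one leading and one trailing EMPTY arc, if present */
	if (s1->nouts == 1 && s1->outs->type == TOY_EMPTY)
		s1 = s1->outs->to;
	if (s2->nins == 1 && s2->ins->type == TOY_EMPTY)
		s2 = s2->ins->from;
	/* Reject a single-state loop, and a state with no outarcs */
	if (s1 == s2 || s1->outs == NULL)
		return NULL;
	/* Every outarc must be a PLAIN arc leading to s2 */
	for (a = s1->outs; a != NULL; a = a->outchain)
	{
		if (a->type != TOY_PLAIN || a->to != s2)
			return NULL;
	}
	return s1;
}
```

For a chain a -EMPTY-> b =PLAIN,PLAIN=> c -EMPTY-> d, calling toy_single_color_transition(&a, &d) returns &b, the state that owns the PLAIN arcs. Adding any non-PLAIN arc out of b makes it return NULL.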