> search a document for a pattern and it takes a second. search one a hundred times larger and it...

zahlman • yesterday at 8:31 PM • 4 replies

> search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours.

Most of this is about quadratic time find-all operations where a search operation is linear. But it's also still possible to get quadratic behaviour out of a single search without catastrophic backtracking, more easily than you might expect. In late January to early February, Tim Peters was talking about an example of this on the Python forums (see e.g. https://discuss.python.org/t/add-re-prefixmatch-deprecate-re...) and also related the experience of trying to diagnose the issue with AI (see https://discuss.python.org/t/claude-code-how-much-hype-how-m... and onward). Peters' example was:

  \d+\s+

On a string containing only digits, a prefix match takes O(n) time: it considers every possible end position for the run of digits, and at each one immediately sees no following whitespace. But the search is quadratic because it has to repeat that O(n) work at every position; the regex engine can't track the fact that it's already examined the string and found no whitespace, so it retries each digit match length from every starting point.
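To make the counting concrete, here is a rough sketch that models (rather than measures) the work a naive backtracking search would do for `\d+\s+`. The function name `model_search_steps` and the step-counting scheme are my own assumptions for illustration; this is not how Python's `re` is actually implemented, and it only recognises ASCII digits.

```python
DIGITS = "0123456789"


def model_search_steps(text: str) -> int:
    """Count character inspections in a toy backtracking search for \\d+\\s+.

    Hypothetical model: at every start position, greedily consume digits,
    then try each digit-run length (longest first) looking for whitespace.
    """
    steps = 0
    n = len(text)
    for start in range(n):
        # Greedy \d+: walk forward over the digit run.
        end = start
        while end < n and text[end] in DIGITS:
            steps += 1
            end += 1
        # Backtrack over the run, checking for whitespace after each length.
        for stop in range(end, start, -1):
            steps += 1
            if stop < n and text[stop].isspace():
                return steps  # found a match, stop counting
    return steps


for n in (100, 200, 400):
    print(n, model_search_steps("9" * n))
```

Under this model, doubling the length of an all-digit string roughly quadruples the step count, while a string that ends in whitespace is matched on the first attempt.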

(This is arguably "backtracking" since it tries the longest match first, but clearly not in a catastrophic way; if you use `\d+?` instead then of course it only searches forward but is still O(n). It actually is slower in my testing in the Python implementation; I don't exactly know why. As noted in the discussion, the possessive quantifier `\d++` is considerably faster, and of course doesn't backtrack, but still causes O(n^2) searching. The repeated attempts to match `\s+` aren't the problem; the problem is repeatedly looking for digits in places where digits were already found and rejected.)

The way to fix this proposed in the discussion is to use a negative lookbehind assertion before the digits: `(?<!\d)\d+\s+`. This way, the regex engine can bail out early when it's in the middle of a digit string; if the previous character was a digit, then either `\d+\s+` doesn't match here, or it would have matched there.
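As a quick sanity check that the lookbehind version finds the same matches, something like the following should do. The sample strings and the helper `first_span` are made up for illustration; it says nothing about timing, only about equivalence of results on these inputs.

```python
import re

PLAIN = re.compile(r"\d+\s+")
GUARDED = re.compile(r"(?<!\d)\d+\s+")  # only start at the beginning of a digit run


def first_span(pattern, text):
    """Return the (start, end) of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.span() if m else None


samples = ["abc 123  def 45\t6", "12a34 5", "999", ""]
for text in samples:
    print(repr(text), first_span(PLAIN, text), first_span(GUARDED, text))
```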

A simpler idea is to just search for `\d\s+`, or even `\d\s` — since these will be present if and only if `\d+\s+` is. This way, though, you still need to do extra work with the partial match to identify the start and end of the full match. My first idea was to use positive lookbehind for the digits, since the lookbehind match doesn't need to backtrack. In fact lookbehinds require a fixed-length pattern, so this is really just a more complicated way to do the `\d\s+` simplification.
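Roughly what I mean by the "extra work", as a sketch: search for the short `\d\s` anchor, then widen the hit by hand. The function name `search_via_anchor` is hypothetical, and I'm assuming `str.isdecimal()` agrees with `\d` for `str` patterns (both are based on Unicode category Nd).

```python
import re

ANCHOR = re.compile(r"\d\s")       # last digit of the run plus first whitespace
TRAILING_WS = re.compile(r"\s+")
FULL = re.compile(r"\d+\s+")       # reference pattern, for comparison only


def search_via_anchor(text):
    """Find the first \\d+\\s+ match by locating \\d\\s and widening it."""
    hit = ANCHOR.search(text)
    if hit is None:
        return None
    # Widen left over the rest of the digit run.
    start = hit.start()
    while start > 0 and text[start - 1].isdecimal():
        start -= 1
    # Widen right over the whitespace run (at least one char is guaranteed).
    end = TRAILING_WS.match(text, hit.start() + 1).end()
    return (start, end)


print(search_via_anchor("abc 123  def"))
print(search_via_anchor("9" * 20))
```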

----

> Hyperscan (and its fork Vectorscan) is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one.

Is this not just equivalent to forcing "reluctant" quantifiers (`\d+?`) everywhere?

Replies

ieviev • yesterday at 9:40 PM

with all-matches semantics it returns a significantly higher number of matches than leftmost greedy.

eg. /abc*/ and abccccc will return you matches at ab|c|c|c|c|c|
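To illustrate the difference from reluctant quantifiers, here is a brute-force sketch in Python that lists every end offset at which some match could be reported. The helper `all_match_ends` is hypothetical and deliberately quadratic; it only mimics the reported end offsets and is not how Hyperscan or any linear-time engine works internally.

```python
import re


def all_match_ends(pattern: str, text: str) -> list[int]:
    """Return every end offset where some substring ending there matches."""
    compiled = re.compile(pattern)
    ends = []
    for end in range(len(text) + 1):
        # Brute force: try every possible start for this end offset.
        if any(compiled.fullmatch(text, start, end) for start in range(end + 1)):
            ends.append(end)
    return ends


print(all_match_ends(r"abc*", "abccccc"))   # one end offset per "|" above
print(re.findall(r"abc*?", "abccccc"))      # reluctant: a single short match
print(re.findall(r"abc*", "abccccc"))       # greedy: a single long match
```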

I think it's very common and ok that people reason about other engines in terms of backtracking but it works very differently. And fixed length lookbehinds are more of a Java/Python thing, other engines support all lookbehinds.

The main idea of linear regex and intuitive semantics is that it should be declarative and the engine does whatever is the fastest without you having to worry about it. Instead of describing character by character how to perform the search and where it can blow up, think of it as just a specification. Then you can truly express whatever is the shortest/most convenient to explain.

Something i'm still trying to figure out and perhaps failing to understand is what are the killer features of backtracking regex that you would really miss if you were to use linear regex? It would help me a lot to know, i'm trying to convince others to make the switch

MoonZ • yesterday at 8:42 PM

> In fact lookbehinds require a fixed-length pattern

Just a small note: some regex engines support "variable length lookbehind", check the last column on this wikipedia article : https://en.wikipedia.org/wiki/Comparison_of_regular_expressi...


Izkata • yesterday at 8:38 PM

If there's supposed to be a literal asterisk in there somewhere, you can escape it with a backslash. Right now two paragraphs are italic because of mismatched asterisks.


mjmas • today at 5:06 AM

> the search is quadratic because it has to repeat that O(n) work at every position

The problem is that this is one of the regexes that backtracking engines have a bad time with.

With an NFA implementation it can be done in O(regexlen * haystacklen) time, though that only holds for true regular expressions (no backreferences).

https://swtch.com/~rsc/regexp/regexp1.html

And then for re.search, since the NFA wants to just do it once, it should run it with the pattern as

  ^.*?(\d+\s+).*$

(where *? is a non-greedy repeat)
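A minimal sketch of the idea, hand-built for this one pattern rather than compiled from a regex: keep a set of active states, re-add the start state at every character (that is the implicit leading `.*?`), and read each character exactly once. The state numbering and the name `nfa_contains` are my own; it only answers whether a match exists, not where.

```python
import re

# States: 0 = start, 1 = inside \d+, 2 = inside \s+ (accepting).


def nfa_contains(text: str) -> bool:
    """Report whether \\d+\\s+ matches anywhere, in one pass over the text."""
    active = {0}
    for ch in text:
        nxt = {0}  # the start state stays alive for an unanchored search
        if ch.isdecimal() and (0 in active or 1 in active):
            nxt.add(1)
        if ch.isspace() and (1 in active or 2 in active):
            nxt.add(2)
        if 2 in nxt:
            return True  # accepting state reached
        active = nxt
    return False


print(nfa_contains("9" * 100000))        # no whitespace anywhere
print(nfa_contains("9" * 100000 + " "))  # whitespace after the digits
```

Since there are only three states, the work per character is constant, so the all-digits case costs one linear scan instead of one scan per start position.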
