logoalt Hacker News

nine_kyesterday at 8:53 PM3 repliesview on HN

> nearly everything that matters in practice: where the matches are, how long they are, and how many there are

I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior. In particular, I find it difficult to imagine a practical need to find all matches of an expression like /.*a|b/, as shown in the article. Realistically you'd have to handle /\b.*a|b\b/, or similar, because realistically when you need all matches, you don't want intersecting matches. This means you want to proceed past the end of the n-th match to look for n+1-th match, and never want to use indeterminate prefixes like /.*a/.

This OTOH gives a reasonably useful heuristic if your regexp comes from an untrusted source and could be adversarial. Check that it does not start with a prefix with a Kleene star, like /a*/. Require at least one positive match (in each alternate branch). Of course, /a+b|c/ would still be quadratic if your text is long sequences of "a" interspersed with characters other than "b". But this, again, is more of a theoretical case, to my mind.


Replies

hrmtst93837today at 7:49 AM

That only works when you control both the pattern and the shape of the input. Once you accept user regexes or start scanning big log dumps without tight boundaries the weird cases stop being weird.

Greedy globs plus unanchored branches can turn a boring batch job into a CPU bonfire, and defensive limits seems a lot less optional when one dumb pattern can pin a core for minutes. A timeout or a restricted regex subset is usually cheaper than pretending everybody writes sane patterns.

ievievyesterday at 9:17 PM

> I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior

I agree with you in the sense that most practical regexes do not expose this quadratic blowup (from all matches) but i do not think the same about backtracking. The effect of backtracking is immediately clear when you're searching inside text without a clear anchor like \A or ^ or a very rare string prefix. It is much more visible with either larger patterns or larger character classes like unicode

hrmtst93837yesterday at 9:14 PM

[dead]