kstrauser • yesterday at 7:33 PM • 1 reply

For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

  <!-- Don't count <hr> this! --> but do count <hr> this -->

and

  <!-- <!-- Ignore <hr> this --> but do count <hr> this -->

Now your regex has to include balanced comment markers. Solve that.

You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
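To make the nesting problem concrete, here is a minimal Python sketch, assuming the standard `re` module and a made-up `nested` input. The pattern name `DIV` is hypothetical. It only illustrates how a single pattern mispairs nested elements; it is not a proof of the general claim.

```python
import re

# Hypothetical pattern: lazily match from a <div> to the nearest </div>.
DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)

# Made-up input with one <div> nested inside another.
nested = "<div>outer <div>inner</div> tail</div>"

m = DIV.search(nested)
# The outer <div> gets paired with the inner </div>, so the captured
# "content" is cut short and still contains an unclosed <div>.
print(m.group(1))  # outer <div>inner
```

Making the quantifier greedy fixes this input but then merges two sibling `<div>` elements into one match, so neither form tracks nesting depth.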

Replies

marcosdumay • yesterday at 9:23 PM

HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.
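A minimal Python sketch of that kind of tokenizer, assuming a comment simply runs from `<!--` to the first `-->`. The names `TOKEN` and `count_hr` are hypothetical, and the sketch deliberately ignores attribute values containing `>`, script and style contents, CDATA, and the spec's other comment-closing forms.

```python
import re

# Hypothetical tokenizer: a comment runs from "<!--" to the FIRST "-->"
# (no nesting); outside comments, anything shaped like <hr ...> is a tag.
TOKEN = re.compile(r"<!--.*?-->|<hr\b[^>]*>", re.DOTALL | re.IGNORECASE)

def count_hr(html: str) -> int:
    """Count <hr> tags that are not inside a comment."""
    return sum(1 for m in TOKEN.finditer(html)
               if not m.group().startswith("<!--"))

first = "<!-- Don't count <hr> this! --> but do count <hr> this -->"
second = "<!-- <!-- Ignore <hr> this --> but do count <hr> this -->"
print(count_hr(first), count_hr(second))  # 1 1
```

Because the comment alternative is tried at every position and consumes everything up to the first `-->`, the `<hr>` inside each comment is never seen as a tag, and the stray `-->` at the end is just text.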
