This is interesting. The argument I’m gleaning from the essay is that the old proposed API, an intermediary `new Sanitizer()` class with a `sanitize(input)` method that returns a string, is actually insecure because of mutated XSS (mXSS) bugs.
The theory is that the parse->serialize->parse round-trip is not idempotent and that sanitization depends on the element context, so a pure string->string function opens up a new class of vulnerabilities. A context-aware `setHTML()` method defined on elements means the context-specific HTML rules for tables, SVG, MathML, etc. are baked in, which eliminates double-parsing errors.
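For instance, the non-idempotent round-trip is easy to demonstrate in the console with the classic foster-parenting case (a minimal sketch using standard DOM APIs):

```ts
const container = document.createElement("div");

// The HTML parser "foster parents" the <a> out of the table, so
// serializing yields different markup than went in:
container.innerHTML = "<table><a></a></table>";
console.log(container.innerHTML); // "<a></a><table></table>"

// A string-returning sanitizer checks its own parse of the input, but
// the caller may re-parse that string in a different context (say,
// inside an existing <table>), where it can mutate yet again.
```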
Are mXSS bugs actually that common?
Aside from the article's content, I really like the inline exercise for the reader with the hidden/expandable answer section. It's fun, and it successfully got me to read the preceding section more closely than I would have otherwise.
Makes sense. I think this is a variant of the "parse, don't validate" motto, though here it's more "parse, don't parse-serialize-parse" in the implementation.
This is a good API. I hope it gains adoption in at least one browser; that way, the browsers that don't adopt it will be called 'insecure'... Which would be warranted IMO... People have been wanting the ability to inject safe HTML for almost as long as JavaScript has existed.
Seriously, we got CSP before setHTML(). WTF!
CSP is nasty. It removes essential functionality to mitigate possible security flaws, ignoring the developer's intent. CSP is like taping your mouth shut to lose weight... but you still sit through three meals a day... basically smashing the food against your face.
> This is pretty similar to the Sanitizer that I wanted to build into the browser: […] But that is NOT the Sanitizer we ended up with.
> And the reason is essentially Mutated XSS (mXSS). To quickly recap, the idea behind mXSS is […]
No, the reason is that the problem is underspecified and unsatisfiable.
The whole notion of HTML "sanitization" is the ultimate "just do what I mean". It's the customer who cannot articulate what they need. It's «Hey, how about if there were some sort of `import "nobugs"`?»
"HTML sanitization" is never going to be solved because it's not solvable.
There's no getting around knowing whether any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated as text. This is a hard requirement. (And if you already have this information, then the necessary tools have been available for years—decades, even: `innerHTML` and `textContent`—or if you don't like the latter, then it's trivial to write your own `escapeText` subroutine that's correct, well-formed, and sound.) No new DOMPurify alternative or native API baked into the browser is going to change this, ever.
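For illustration, a minimal `escapeText` along those lines might look like this (a sketch; the name and exact entity set are choices, with quotes escaped so the output is also safe inside quoted attribute values):

```ts
// Escape the five characters with special meaning in HTML so the
// string is treated as plain text in element content and in quoted
// attribute values.
function escapeText(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```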
So is the use case for this that you save untrusted HTML from your user in your database, then send that untrusted HTML back to your users, but on the front end parse it down to just the safe bits?
I think maybe a better API would be to add an unsafe HTML tag, so it would look something like:
<unsafe>
all unsafe code here
</unsafe>
Then, if browsers do indeed support it, it would work even without JavaScript. But in any case, you really should be validating everything server-side.
Interesting, though not really a replacement for server-side sanitization. But as another layer of defense? Sure. I could see it being useful, especially in rich-text editors (RTEs).
I think this API makes more sense from another standpoint as well.
You don’t want developers relying on client-only sanitization for user input that gets submitted to the server. Sanitizing while setting user-facing UI makes sense.
> Traverse the HTML fragment and remove elements as configured.
Well, this is clearly wrong, isn't it? You need a whitelist of elements, not a blacklist. That lesson is at least two decades old.
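To be fair, as far as I can tell from the draft, allow-list configuration is supported; the quoted "remove elements as configured" is just one knob. A hedged sketch of an allow-list call, assuming the WICG draft shape (option names have shifted across drafts, e.g. `allowElements` vs `elements`):

```ts
// The proposed method isn't in standard TS lib typings yet, so declare
// a shape for it here purely for the sketch.
declare const element: Element & {
  setHTML(html: string, options?: { sanitizer?: unknown }): void;
};
declare const untrustedMarkup: string;

// Keep an explicit allow-list of elements; anything not listed is
// dropped, instead of trying to enumerate bad tags in a blocklist.
element.setHTML(untrustedMarkup, {
  sanitizer: { elements: ["p", "a", "ul", "ol", "li", "em", "strong"] },
});
```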
The downside of a new method is that it leaves `innerHTML` in place as a source of future security issues.
> HTML parsing is not stable and a line of HTML being parsed and serialized and parsed again may turn into something rather different
This is why people should really use XHTML, the strict XML dialect of HTML, in order to avoid these nasty parsing surprises. It has the predictable behavior that you want.
In XHTML, the code does exactly what it says it does. If you write <table><a></a></table>, like the example on the mXSS page, you get a table element with an anchor child. As another example, if you write <table><td>xyz</td></table>, that's exactly what you get, with no implicit <tbody> or <tr> elements inserted inside.
It's just wild to keep watching the world double down, decade after decade, on HTML and all its wild parsing behavior. Furthermore, HTML's syntax is a unique snowflake, whereas XML is a standardized language that happens to be used in SVG, MathML, Atom, and other standards - no need to relearn the syntax every single time.
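The difference is easy to see with `DOMParser`, which handles both MIME types (a quick sketch):

```ts
const parser = new DOMParser();

// text/html applies error recovery: the <a> is foster-parented out of
// the table during parsing.
const htmlDoc = parser.parseFromString("<table><a></a></table>", "text/html");
console.log(htmlDoc.body.innerHTML); // "<a></a><table></table>"

// application/xhtml+xml parses exactly what is written (malformed
// input yields a parsererror document instead of being repaired).
// It needs a single root element in the XHTML namespace:
const xhtmlDoc = parser.parseFromString(
  '<div xmlns="http://www.w3.org/1999/xhtml"><table><a></a></table></div>',
  "application/xhtml+xml"
);
console.log(xhtmlDoc.documentElement.outerHTML); // structure preserved as written
```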
With context, this article is more interesting than the title might imply.
> The Sanitizer API is a proposed new browser API to bring a safe and easy-to-use capability to sanitize HTML into the web platform [and] is currently being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.
This would remove the need to sanitize user-entered content with libraries like DOMPurify by building the capability into the browser.
The proposed specification has additional information: https://github.com/WICG/sanitizer-api/
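For reference, a hedged sketch of what usage would look like per the explainer (the surface is still being incubated and may change; `target` and `untrustedInput` are placeholders):

```ts
// setHTML isn't in standard TS lib typings yet; declare it for the sketch.
declare const target: Element & { setHTML(html: string): void };
declare const untrustedInput: string;

// Today: sanitize to a string, then re-parse the string on assignment.
//   target.innerHTML = DOMPurify.sanitize(untrustedInput);

// Proposed: parse, sanitize, and insert in one step, in the actual
// target context, so there is no string round-trip left to mutate.
target.setHTML(untrustedInput); // defaults to a safe built-in config
```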