XML lost because 1) the existence of attributes means a document cannot be automatically mapped to a...

acabal • today at 5:31 AM • 2 replies • view on HN

XML lost because 1) the existence of attributes means a document cannot be automatically mapped to a basic language data structure like an array of strings, and 2) namespaces are an unmitigated hell to work with. Even just declaring a default namespace and doing nothing else immediately makes your day 10x harder.

These items make XML deeply tedious and annoying to ingest and manipulate. Plus, some major XML libraries, like lxml in Python, are extremely unintuitive in their implementation of DOM structures and manipulation. If ingesting and manipulating your markup language feels like an endless trudge through a fiery wasteland then don't be surprised when a simpler, more ergonomic alternative wins, even if its feature set is strictly inferior. And that's exactly what happened.

I say this having spent the last 10 years struggling with lxml specifically, and my entire 25 year career dealing with XML in some shape or form. I still routinely throw up my hands in frustration when having to use Python tooling to do what feels like what should be even the most basic XML task.

Though xpath is nice.

Replies

masklinn • today at 7:59 AM

> Plus, some major XML libraries, like lxml in Python, are extremely unintuitive in their implementation of DOM structures and manipulation.

Lxml, or more specifically its inspiration ElementTree is specifically not a (W3C) DOM or dom-style API. It was designed for what it called “data-style” XML documents where elements would hold either text or sub-elements but not both, which is why mixed-content interactions are a chore (lxml augments the API by adding more traversal axis but elementtree does not even have that, it’s a literal tree of elements). effbot.org used to have a page explaining its simplified infoset before Fredrik passed and registration lapsed, it can be accessed through archive.org.

That means lxml is, by design, not the right tool to interact with mixed-content documents. But of course the issue is there isn’t really a right tool for that, as to my knowledge nobody has bothered building a fast DOM-style library for Python.

If you approach lxml as what ElementTree was designed as it’s very intuitive: an element is a sequence of sub-elements, with a mapping of attributes. It’s a very straightforward model and works great for data documents, as well as fits great within the langage. But of course that breaks down for mixed content documents as your text nodes get relegated to `tail` attributes (and ElementTree straight up discards comments and PIs, though lxml reverted that).

matkoniecz • today at 6:17 AM

> even if its feature set is strictly inferior

and often having less bizarre and overly complex features is a feature by itself

➕ show 1 reply

alt Hacker News

Replies