Hacker News

The lost art of XML

55 points by Curiositry today at 3:45 AM | 55 comments

Comments

in_a_society today at 4:45 AM

Smells like an article from someone who didn’t really USE the XML ecosystem.

First, there is modeling ambiguity: too many ways to represent the same data structure. That means you can’t parse into native structs; you parse into a heavy DOM object instead, and it sucks to interact with it.
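For example, here's a minimal sketch (the element and attribute names are made up) of two equally valid XML shapes for the same record, read with Python's stdlib ElementTree:

    # Sketch: the same record, two equally "valid" XML conventions.
    import xml.etree.ElementTree as ET

    as_attrs    = ET.fromstring('<user id="7" name="ada"/>')
    as_elements = ET.fromstring('<user><id>7</id><name>ada</name></user>')

    print(as_attrs.get("name"))          # 'ada' -- attribute access
    print(as_elements.findtext("name"))  # 'ada' -- child-element access

    # A consumer has to handle both conventions (and decide whether "7" is a
    # number), whereas {"id": 7, "name": "ada"} parses straight into a dict.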

Then, schemas sound great, until you run into DTD, XSD, and RELAX NG. RELAX NG only exists because XSD is pretty much incomprehensible.

Then let’s talk about entity escaping and CDATA. And how you break entire parsers because CDATA is a separate incantation on the DOM.
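A rough sketch of the CDATA trap with the stdlib minidom (the point being that CDATA is a different node type than plain text):

    # Sketch: CDATA shows up as a separate node type in the DOM.
    from xml.dom import minidom

    plain = minidom.parseString('<msg>x &lt; y</msg>')
    cdata = minidom.parseString('<msg><![CDATA[x < y]]></msg>')

    print(plain.documentElement.firstChild.nodeType)  # 3 = TEXT_NODE
    print(cdata.documentElement.firstChild.nodeType)  # 4 = CDATA_SECTION_NODE

    # Code that only walks TEXT_NODE children silently drops the CDATA content,
    # even though both documents mean the same string.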

And in practice, XML is always over-engineered. It’s the AbstractFactoryProxyBuilder of data formats. SOAP and WSDL are great examples of this, vs. looking at a JSON response and simply understanding what it is.

I worked with XML and all the tooling around it for a long time. Zero interest in going back. It’s not the angle brackets or the serialization efficiency. It’s all of the above brain damage.

lighthouse1212 today at 5:23 AM

XML was designed for documents; JSON for data structures. The 'lost art' framing implies we forgot something valuable, but what actually happened is we stopped using a document format for data serialization. That's not forgetting - that's learning. XML is still the right choice for its original domain (markup, documents with mixed content). It was never the right choice for API payloads and config files.
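Mixed content is the clearest illustration of that split. A small sketch with the stdlib ElementTree shows how a marked-up sentence splinters across .text and .tail, which has no natural JSON shape:

    # Sketch: mixed content (text interleaved with markup) is what XML is for,
    # and exactly what JSON has no natural layout for.
    import xml.etree.ElementTree as ET

    p = ET.fromstring('<p>Hello <b>world</b>!</p>')
    print(p.text)            # 'Hello '
    print(p.find('b').text)  # 'world'
    print(p.find('b').tail)  # '!'
    # Any dict/list encoding of this has to invent a convention to preserve
    # the interleaving.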

acabal today at 5:31 AM

XML lost because 1) the existence of attributes means a document cannot be automatically mapped to a basic language data structure like an array of strings, and 2) namespaces are an unmitigated hell to work with. Even just declaring a default namespace and doing nothing else immediately makes your day 10x harder.
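For anyone who hasn't hit the default-namespace problem, a rough sketch of it with lxml (the namespace URI is made up):

    # Sketch of the default-namespace tax, assuming lxml is installed.
    from lxml import etree

    doc = etree.fromstring('<catalog xmlns="http://example.com/cat"><item>a</item></catalog>')
    ns  = {'c': 'http://example.com/cat'}

    print(doc.xpath('//item'))                   # [] -- unprefixed XPath names never match a default namespace
    print(doc.xpath('//c:item', namespaces=ns))  # works, but only after inventing a prefix the document never uses
    print(doc.find('{http://example.com/cat}item').text)  # 'a' -- Clark notation, the other incantation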

These items make XML deeply tedious and annoying to ingest and manipulate. Plus, some major XML libraries, like lxml in Python, are extremely unintuitive in their implementation of DOM structures and manipulation. If ingesting and manipulating your markup language feels like an endless trudge through a fiery wasteland, then don't be surprised when a simpler, more ergonomic alternative wins, even if its feature set is strictly inferior. And that's exactly what happened.

I say this having spent the last 10 years struggling with lxml specifically, and my entire 25-year career dealing with XML in some shape or form. I still routinely throw up my hands in frustration when having to use Python tooling to do what should be even the most basic XML task.

Though xpath is nice.

mkozlows today at 4:39 AM

This is performance art, right? The very first bullet point it starts with is extolling the merits of XSD. Even back in the day when XML was huge, XSD was widely recognized as a monstrosity and a boondoggle -- the real XMLheads were trying to make RELAX NG happen, but XSD got jammed through because it was needed for all those monstrous WS-* specs.

XML did some good things for its day, but no, we abandoned it for very good reasons.

bni today at 6:17 AM

Developers (even web developers!) were familiar with XML for many years before JSON was invented.

Also "worse is better". Many developer still prefer to use something that is similar to notepad.exe, instead of actual tools that understand the formats on a deeper level.

brunoborges today at 6:09 AM

XML and XSD were not meant to be edited by hand, by humans. They thrived when we used proper XML/XSD editing tools.

Although, ironically, there are fewer production-time human mistakes when editing an XML file that is properly validated with an XSD than a YAML file, because Norway.
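For anyone who missed the reference, a sketch of the "Norway problem" via PyYAML (assuming its default YAML 1.1 behavior):

    # Sketch: unquoted NO is a boolean to a YAML 1.1 parser. Assumes PyYAML.
    import json
    import yaml

    print(yaml.safe_load('country: NO'))    # {'country': False}
    print(json.loads('{"country": "NO"}'))  # {'country': 'NO'} -- quoted, so no surprise

    # An element typed as xs:string in an XSD would never silently reinterpret the value.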

_heimdall today at 4:27 AM

This is a debate I've had many times. XML, and REST, are extremely useful for certain types of use cases that you quite often run into online.

The industry abandoned both in favor of JSON and RPC for speed and perceived DX improvements, and because for a period of time everyone was in fact building only against their own servers.

There are plenty of examples over the last two decades of us having to reinvent solutions to the same problems that REST solved way back then though. MCP is the latest iteration of trying to shoehorn schemas and self-documenting APIs into a sea of JSON RPC.

striking today at 4:31 AM

I tried using XML on a lark the other day and realized that XSDs are actually somewhat load-bearing. It's difficult to map data in XML to objects in your favorite programming language without the schema being known beforehand, as lists of a single element are hard to distinguish from just a property of the overall object.
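A quick sketch of that ambiguity with the xmltodict package (a generic, schema-free mapper):

    # Sketch: without a schema, one child and a repeated child map to different shapes.
    # Assumes the xmltodict package.
    import xmltodict

    one = xmltodict.parse('<order><item>a</item></order>')
    two = xmltodict.parse('<order><item>a</item><item>b</item></order>')

    print(one['order']['item'])  # 'a'        -- a plain value
    print(two['order']['item'])  # ['a', 'b'] -- suddenly a list

    # You either special-case both shapes or pass force_list=('item',),
    # which is exactly the knowledge an XSD would have carried.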

Maybe this is okay if you know your schema beforehand and are willing to write an XSD. My use case relied on not knowing the schema. Despite my excitement to use a SAX-style parser, I tucked my tail between my legs and switched back to JSONL. Was I missing something?

kenforthewin today at 4:50 AM

> This is insanity masquerading as pragmatism.

> This is not engineering. This is fashion masquerading as technical judgment.

The boring explanation is that AI wrote this. The more interesting theory is that folks are beginning to adopt the writing quirks of AI en masse.

kennethallen today at 4:59 AM

The fundamental reason JSON won over XML is that JSON maps exactly to universal data structures (lists and string-keyed maps) and XML does not.
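A trivial sketch of what "maps exactly" means in practice:

    import json

    # The parse result *is* the ordinary data structure; there is no mapping step.
    doc = json.loads('{"tags": ["a", "b"], "count": 2}')
    print(doc["tags"][1], doc["count"])  # b 2

    # The XML equivalent hands you a tree of elements and leaves the mapping to you.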

com2kid today at 5:26 AM

I remember spending hours just trying to properly define the XML schema I wanted to use.

Then, if there were any problems in my XML, I had to decipher horrible errors to figure out what I did wrong.

The docs sucked and were "enterprise grade", the examples sucked (either too complicated or too simple), and the tooling sucked.

I suspect it would be fine nowadays with LLMs to help, but back in the day, XML was a huge hassle.

I once worked on a robotics project where a full 50% of the CPU was used for XML serialization and parsing. Made it hard to actually have the robot do anything. XML is violently wordy and parsing strings is expensive.

Mikhail_Edoshin today at 6:19 AM

Another thing I disagree with is the idea that JSON uses fewer characters. This is not true: JSON uses more characters. Example:

    <aaaa bbbb="bbbb" cccc="cccc"/>
    {"bbbb":"bbbb","cccc":"cccc"}
See that the difference is only two characters? Yet XML also has a four-character element name, which JSON lacks. And JSON is packed to the limit, while XML is written naturally and is actually more readable than JSON.
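A quick check of the counts, for what it's worth:

    xml_form  = '<aaaa bbbb="bbbb" cccc="cccc"/>'
    json_form = '{"bbbb":"bbbb","cccc":"cccc"}'
    print(len(xml_form), len(json_form))  # 31 29 -- two characters apart, with the
                                          # four-character element name on the XML side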
stmw today at 5:35 AM

There were efforts to make XML (1) more ergonomic and (2) more performant, and while (2) was largely successful, (1) never got there, unfortunately. See https://github.com/yaml/sml-dev-archive for some history of just one of the discussions (the sml-dev mailing list).

culebron21 today at 5:32 AM

XML was a product of its time. After almost 20 years of CPUs rapidly getting quicker, we figured that the size of data wouldn't matter and that data types wouldn't matter either (hence XML doesn't have them, though JSON later brought them back). We expected languages with weak type systems to dominate forever, and that we would be working and thinking levels above all this, abstractly, and so on.

I remember XML proponents back then arguing that it allows semantics, although it was never clear how a non-human would understand and process it.

The funny thing about namespaces is that, per the XML spec, the prefix should be meaningless; what matters is the URL of the namespace it points to. It's as if we read a doc with snake:front-left-paw and asked how come a snake has paws? Because it's actually a bear: see the definition of "snake" at the URL! It feels like mathematical concepts (coordinate spaces, number spaces with a different unit and basis vectors) applied to HTML. It may be useful in rare cases. But few can wrap their heads around it, and right from the start most tools worked only with exactly named prefixes, so everyone had to follow suit.
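The prefix-vs-URI point in one sketch (stdlib ElementTree, made-up URI):

    # Sketch: two different prefixes bound to the same URI are the same name
    # to a conforming parser; the prefix itself carries no meaning.
    import xml.etree.ElementTree as ET

    a = ET.fromstring('<snake:paw xmlns:snake="http://example.com/bear"/>')
    b = ET.fromstring('<bear:paw xmlns:bear="http://example.com/bear"/>')

    print(a.tag)           # {http://example.com/bear}paw
    print(a.tag == b.tag)  # True -- tools that match on the literal prefix get this wrong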

edbaskerville today at 5:42 AM

Worse is better. Because better, it turns out, is often much, much worse.

cgio today at 5:54 AM

Not convincing. I was hoping it would go down the xslt path, which is a lost art. I despised and loved xslt at the same time, and there’s no question that using it was an artful enterprise.

shmerl today at 6:06 AM

For machine-to-machine communication, use Protobuf, not JSON.

Mikhail_Edoshin today at 6:01 AM

I like XML and I use it for myself daily. E.g. all documentation is XML; it is just the perfect tool for the task. Most comments that denigrate XML are very superficial. But I disagree with this article too.

My main point is that the very purpose of XML is not to transfer data between machines. XML's use case is to transfer data between humans and machines.

Look at the schemas. They are all grammatical. DTD is a textbook grammar: each term has a unique definition. XSD is much more powerful; here a term may change definition depending on the context: 'name' in 'human/name' may be defined differently than 'name' in 'pet/name' or 'ship/name'. But within a single context the definition stays. As far as I know, RELAX NG is even more powerful and can express even finer distinctions, but I don't know it well enough to elaborate.
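A rough sketch of that context sensitivity, validated with lxml (the schema and its constraints are made up): the two 'name' elements below are declared locally, so the same word gets a different definition under 'human' and under 'pet'.

    # Sketch: local element declarations give 'name' different definitions in
    # different contexts. Assumes lxml; the schema itself is invented.
    from lxml import etree

    schema = etree.XMLSchema(etree.fromstring("""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="record">
        <xs:complexType><xs:sequence>
          <xs:element name="human">
            <xs:complexType><xs:sequence>
              <xs:element name="name" type="xs:string"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="pet">
            <xs:complexType><xs:sequence>
              <xs:element name="name">
                <xs:simpleType>
                  <xs:restriction base="xs:string"><xs:maxLength value="8"/></xs:restriction>
                </xs:simpleType>
              </xs:element>
            </xs:sequence></xs:complexType>
          </xs:element>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:schema>"""))

    ok  = etree.fromstring('<record><human><name>Alexandra</name></human><pet><name>Rex</name></pet></record>')
    bad = etree.fromstring('<record><human><name>Alexandra</name></human><pet><name>Bartholomew</name></pet></record>')
    print(schema.validate(ok))   # True
    print(schema.validate(bad))  # False -- pet/name carries its own, stricter definition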

Machines do not need all that to talk to each other. It is pure overhead. A perfect form to exchange data between machines is a dump of a relational structure in whatever format is convenient, with pretty straightforward metadata about types. But humans cannot author data in relational form; anything more complex than a toy example will drive a human crazy. Yet humans can produce grammatical sequences in spades. To make that grammatical drive useful to a machine, it needs only a formal definition, and XML gives you exactly that.

So the use case for XML is to make NOTATIONS. Formal in the sense that they will be processed by a machine, but otherwise they can be pretty informal, that is, have no DTD or XSD. It is actually a strength of XML that I can just start writing it and invent a notation as I go. Later I may want to add formal validation, but that is entirely optional and only becomes necessary once the notation matures and needs to turn into a product.

What makes one XML a notation and another not a notation? Notations are about forming phrases. For example:

    <func name="open">
      <data type="int"/>
      <args>
        <addr mode="c">
          <data type="char"/>
        </addr>
        <data type="int"/>
        <varg/>
      </args>
    </func>
This is a description of a C function, 'open'. Of course, a conventional description is much more compact:

    int open(char const*, int, ...)
But let's ignore the verbosity for a moment and stay with XML a bit longer. What is grammatical about this form? 'func' has '@name' and contains 'data' and 'args'. 'data' is the result type, 'args' are the parameters. Either or both can be omitted, resulting in what C calls "void". The result and each parameter can be either 'data' or 'addr'. 'data' is final and has '@type'; 'addr' may be final (pointing to the unknown, 'void') or non-final, pointing to 'data', 'func', or another 'addr', as deep as necessary. 'addr' has '@mode', a combination of 'c', 'v', 'r' to indicate 'const', 'volatile', 'restrict'. The last child of 'args' may be 'varg', indicating variadic parameters.

Do you see that these terms are used as words in a mechanically composed phrase? Change a word; omit a word; link words into a tree-like structure. This is the natural form of XML: the result is phrase-like, not data-like. It can, of course, be data-like when necessary, but that does not play to XML's strengths. The power of XML comes when items start to interact with each other, like commands in Vim. Another example:

    <aaaa>
      <bbbb/>
    </aaaa>
This would be some data. Now assume I want to describe changes to that data:

    <aaaa>
      <drop>
        <bbbb/>
      </drop>
      <make>
        <cccc/>
      </make>
    </aaaa>
See those 'make' and 'drop'? Is it clear that they can enclose arbitrary parts of the tree? Again, what we do is write a phrase: we add a modifier, 'make' or 'drop', and the contents inside it take on a different meaning.

This only makes sense if XML is composed by hand. For machine-to-machine exchange all this is pure overhead. It is about as convenient as if programs talked to each other via shell commands. It is much more convenient to load a library and use it programmatically than to compose a command-line call.

But all this verbosity? Yes, it is more verbose. This is a no-go for code you write 8 hours a day. But for code that you write occasionally it may be fine: a build script, an interface specification, a diagram. (It is also perfect for anything that has human-readable text, such as documentation. That use is fine even for an 8-hour workday.)

And all these dialects will be compatible. All XML dialects can be processed with the same tools, merged, reconciled, whatever. This is powerful. They require no parsing. Parsing may appear to be a solved problem, but to build a parser you still must at least describe the grammar for a parser generator, and that is not so simple. And all that description gives you is a parser that takes a short form and converts it into an AST, which is exactly what XML starts with. The rest of the processing is still up to you.

With XML you can build the grammar bottom-up and experiment with it. Wrote a lot of XML in some grammar and then found a better way? Well, write a script to transform the old XML into the new grammar and continue. The transformer is a part of the common toolset.
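As a rough sketch of that last step, here is the "bbbb is now cccc" migration from the earlier toy example, done with lxml's XSLT support (an identity transform plus one override):

    # Sketch: migrate old-grammar XML to a new grammar with a stock identity
    # transform plus one override. Assumes lxml; the element names are the toy
    # ones from above.
    from lxml import etree

    migrate = etree.XSLT(etree.fromstring("""
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <xsl:template match="bbbb">
        <cccc><xsl:apply-templates select="@*|node()"/></cccc>
      </xsl:template>
    </xsl:stylesheet>"""))

    old = etree.fromstring('<aaaa><bbbb/></aaaa>')
    print(etree.tostring(migrate(old)))  # b'<aaaa><cccc/></aaaa>'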

g947o today at 4:34 AM

Is there anything new on this topic that has never been said before in 1000 other articles posted here?

I didn't see any.

shadowgovt today at 4:23 AM

XML was abandoned because we realized bandwidth costs money and while it was too late to do anything about how verbose HTML is, we didn't have to repeat the mistake with our data transfer protocols.

Even with zipped payloads, it's just way unnecessarily chatty without being more readable.
