I keep seeing people make the same mistake as XML made over and over; without learning from it. I wi...

Someone1234 • today at 2:00 PM • 9 replies • view on HN

I keep seeing people make the same mistake as XML made over and over; without learning from it. I will clarify the problem thusly:

> The more capabilities you add to a interchange format, the harder that format is to parse.

There is a reason why JSON is so popular, it supports so little, that it is legitimately easy to import. Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.

There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas.

CSV may be under-specified, but it remains popular largely due to its simplicity to produce/consume. Unfortunately, we're seeing people slowly ruin JSON by adding e.g. commands to the format, with others than using those "comments" to hold data (e.g. type information), which must be parsed. Which is a bad version of an XML Attribute.

Replies

GuB-42 • today at 2:58 PM

I think JSON has the opposite problem, it is too simple, the lack of comments in particular is particularly bad for many common usages of the format today.

I know some implementations of JSON support comments and other things, but is is not true JSON, in the same way that most simple XML implementations are not true XML. That's what I say "opposite problem", XML is too complex, and most practical uses of XML use incomplete implementations, while many practical uses of JSON use extended implementations.

By the way, this is not a problem for what JSON was designed for: a text interchange format, with JS being the language of choice, but it has gone beyond its design: configuration files, data stores, etc...

➕ show 2 replies

python-b5 • today at 4:46 PM

I've been working on an XML parser of my own recently and, to be honest, as long as you're fine with a non-validating parser (which are still compliant), it's really not that bad. You have to parse DTDs, but you don't need to actually _do_ anything with them. Namespaces are annoying but they're not in the main spec. CDATA sections aren't all that useful, but they're easy to parse. As far as I'm aware, parsers don't actually need to handle xml:lang/xml:space/etc themselves - they're for use by applications using the parser. Really the only thing that's been particularly frustrating for me is entity expansion.

If you want to support the wider XML ecosystem, with all the complex auxiliary standards, then yes, it's a lot of work, but the language itself isn't that awful to parse. It's a little messy, but I appreciate it at least being well-specified, which JSON is absolutely not.

conartist6 • today at 2:14 PM

Just gonna drop this here : ) https://docs.bablr.org/guides/cstml

CSTML is my attempt to fix all these issues with XML and revive the idea of HTML as a specific subset of a general data language.

As you mention one of the major learnings from the success of JSON was to keep the syntax stupid-simple -- easy to parse, easy to handle. Namespaces were probably the feature to get the most rework.

In theory it could also revive the ability we had with XHTML/XSLT to describe a document in a minimal, fully-semantic DSL, only generating the HTML tag structure as needed for presentation.

➕ show 2 replies

0xbadcafebee • today at 5:19 PM

The problem is that engineers of data formats have ignored the concept of layers. With network protocols, you make one layer (Ethernet), you add another layer (IP), then another (TCP), then another (HTTP). Each one fits inside the last, but is independent, and you can deal with them separately or together. Each one has a specialty and is used for certain things. The benefits are 1) you don't need "a kitchen sink", 2) you can replace layers as needed for your use-case, 3) you can ship them together or individually.

I don't think anyone designs formats this way, and I doubt any popular formats are designed for this. I'm not that familiar with enterprise/big-data formats so maybe one of them is?

For example: CSV is great, but obviously limited, and not specified all that well. A replacement table data format could be binary (it's 2026, let's stop "escaping quotes", and make room for binary data). Each row can have header metadata to define which columns are contained, so you can skip empty columns. Each cell can be any data format you want (specifically so you can layer!). The header at the beginning of the data format could (optionally) include an index of all the rows, or it could come at the end of the file. And this whole table data format could be wrapped by another format. Due to this design, you can embed it in other formats, you can choose how to define cells (pick a cell-data-format of your choosing to fit your data/type/etc, replace it later without replacing the whole table), you can view it out-of-order, you can stream it, and you can use an index.

➕ show 4 replies

necovek • today at 4:39 PM

Funnily enough, XML was an attempt to simplify SGML so it is easier to parse (as SGML only ever had one compliant parser, nsgml).

➕ show 1 reply

PunchyHamster • today at 3:19 PM

Constant erosion of data formats into the shittiest DSLs in existence is annoying. "Oh, hey, instead of writing Python, how about you write in

* YAML, with magical keywords that turn data into conditions/commands * template language for the YAML in places when that isn't enough * ....Python, because you need to eventually write stuff that ingests the above either way .... ansible is great isn't it?"

... and for some reason others decide "YES THIS IS AWESOME" and we now have a bunch of declarative YAML+template garbage.

> There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas.

It's just a bunch of records put in tables with pretty simple data types. And it's trivial to convert into other formats while being compact and queryable on its own. So as far as formats go, you could do a whole lot worse.

➕ show 2 replies

xienze • today at 2:03 PM

> Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.

But you don't have to use all those things. Configure your parser without namespace support, DTD support, etc. I'd much rather have a tool with tons of capabilities that can be selectively disabled rather than a "simple" one that requires _me_ to bolt on said extra capabilities.

➕ show 3 replies

moron4hire • today at 2:30 PM

I consider CSV to be a signal of an unserious organization. The kind of place that uses thousand line Excel files with VBA macros instead of just buying a real CRM already. The kind of place that thinks junior developers are cheaper than senior developers. The kind of place where the managers brow beat you into working overtime by arguing from a single personal perspective that "this is just how business is done, son."

People will blithely parrot, "it's a poor Workman who blames his tools." But I think the saying, as I've always heard it used to suggest that someone who is complaining is a just bad at their job, is a backwards sentiment. Experts in their respective fields do not complain about their tools not because they are internalizing failure as their own fault. They don't complain because they insist on only using the best tools and thus have nothing to complain about.

➕ show 5 replies

quotemstr • today at 4:10 PM

> XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.

Ah, the old "throw a bag of nouns at the reader and hope he's intimidated" rhetorical flutist. These things are either non-issues (like QName), things a parser does for you, or optional standards adjacent to XML but not essential to it, e.g. XInclude.

➕ show 3 replies

alt Hacker News

Replies