JSON is used for web apps because XML is not simple enough

Posted Monday, 03-Mar-2014 by Ingo Karkat

XML used to be the transmission format of choice between the server backend and graphical frontend. With ubiquitous language support, this never was an issue in the backend, and any Java(-Applet), Flex UI, or JavaScript client can also easily produce and consume it (e.g. via the E4X language extension). In fact, in the early 2000s, one could speak of an XML craze; almost everything got implemented with it (some arguably shouldn't have): file formats, application configuration, logging output, RPC calls, SOAP, the various WS-* standards.

However, virtually all new clients (at least in the area I'm working in) now use JSON. Where did XML go wrong (and how may that be related to the demise of the mentioned SOAP and WS-* protocols in favor of the lighter REST)?! I recently gained some insight into the motivation for XML while reading Neil Bradley's The XML Companion, Third Edition.

XML was meant to be simple, and its SGML + HTML heritage was a blessing

Two quotes from the above book (page 145) illustrate the motivation for XML:

One key decision that was made during the development of XML was to keep the language small.

This pulled the format out of its niche of professional document editing (think books and large texts, the stuff where Microsoft Word reaches its limits) and let it benefit from the tremendous success of HTML in the fledgling World Wide Web. The invention of HTML was a masterstroke: The markup enabled the goal of platform independence better than any binary format, the few tags were easy to learn because they had a tangible effect in the rendered web page, and though few early adopters of the WWW had an SGML background, the use of an established and proven syntax avoided NIH syndrome and put the format on a stable and consistent basis.

Another key decision made was that XML should be backward compatible with SGML, so that existing SGML tools could be utilized in XML projects, [...]

I don't know whether the existing SGML tools really kick-started the XML tool sets, but (especially in the Java / Apache project world, which feverishly adopted and built XML tools) one large early benefit of XML was its good support in terms of tools and libraries. Many developers had been burned by custom configuration and data formats, which started out small and simple, but as the complexity grew, so did the custom-written parsers and APIs, until they turned into an unmaintainable mess with confusing escaping rules and syntax exceptions. (Been there; done that, too.) It made sense to rely on an extensible standard that everybody else was adopting, too. XML (the standard) started out small, only later adding namespaces into the core syntax, and functionality for linking and addressing through accompanying standards (like XLink, XPath, XSLT).

The document processing background of SGML was too hard to bridge for application configuration and data exchange

Reading through and refreshing my knowledge of DTDs, namespaces, document transformations, and learning about all the extensions for document management (chapter 27: extended links via XLink, chapter 28: advanced links via XPointer; do you know the difference?!), I realize that most of that isn't needed for configuring applications and exchanging information (RPC-style) between them.

superfluous features

Just ignoring those parts can backfire: At work, we recently disabled the processing of external entities to avoid denial-of-service attacks via an XML entity bomb, as well as information leaks via references to external files that get included in a request to the backend and then sent back to the client in expanded form.
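
For illustration, this is roughly what that hardening looks like with the standard JAXP API in Java (a minimal sketch; the feature URIs assume a Xerces-based parser such as the JDK's default):

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;

    public class SafeXml {
        // Returns a DOM parser hardened against entity bombs and external
        // entity (XXE) inclusion.
        static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Forbid DOCTYPE declarations entirely; no DTD means no entities.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Belt and braces, should a DOCTYPE ever have to be tolerated:
            dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            dbf.setXIncludeAware(false);
            dbf.setExpandEntityReferences(false);
            return dbf.newDocumentBuilder();
        }
    }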

The validation features of DTDs look useful. But for internal data formats, this explicit documentation step is often skipped (tests would have to be written to verify that the DTD and the XML actually in use stay in sync), and I haven't really seen the editor support for authoring XML that is so often promised (maybe it's in costly commercial packages only).
It soon became apparent that the (again, document-centric) features of DTDs are not a good match for data exchange use cases, so XML Schema was provided as an alternative. Because of its generality, SGML couldn't easily express such adjacent standards in its own syntax, the way XML Schema and XSLT do, and had to invent different syntaxes for them. XML still suffered from the fragmentation between DTD, XML Schema, and other contenders like RELAX NG and Schematron.
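
To make the mismatch concrete: a DTD can only say that an element contains character data, whereas XML Schema can actually type the content. For a hypothetical since element holding a date:

    <!-- DTD: "some text", nothing more -->
    <!ELEMENT since (#PCDATA)>

    <!-- XML Schema: a validatable date type -->
    <xs:element name="since" type="xs:date"/>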

unknown features

When we needed to include large binary data in a custom data format at work, we switched from XML to a ZIP container with a mandatory XML file and optional file attachment data, with the linking established by conventions and certain attributes within the XML. Interestingly, XML would have covered that with binary entities. I doubt the colleagues who decided on the approach knew about them; neither would I have been sure about the necessary support in the parser and the rest of the infrastructure. The "road less traveled" might have been a nice straight shortcut, but also potentially a dead end.
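
For reference, this is roughly what binary entities look like in a DTD (an illustrative sketch; all names are made up). An unparsed entity references the binary file, a notation names its format, and an attribute of type ENTITY links to it; the application then has to resolve the entity to the actual data itself:

    <!NOTATION png SYSTEM "image/png">
    <!ENTITY logo SYSTEM "logo.png" NDATA png>
    <!ATTLIST product image ENTITY #IMPLIED>

    <product image="logo"/>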

attributes vs. tags

When flexibility isn't needed, it complicates the design. Choosing between an attribute value and contained tags is one such area. <foo bar="value"/> is equivalent to <foo><bar>value</bar></foo>, but which one is better depends on the data model and future extensions. Best practices and guidelines can alleviate that, but need to be agreed upon and enforced. Why use a syntax where element ordering is observed everywhere (in XML, only attributes are unordered) in order to persist object hierarchies that do not have such ordering (except for object lists)?! JSON gets this right: Its basic building blocks are ordered Arrays and unordered Objects. (It also avoids all of the various whitespace issues of XML.)
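
In JSON, that distinction is built into the syntax itself. A contrived example: the members of the object form an unordered map, while the elements of the "roles" array are guaranteed to keep their order, and there's no attribute-vs.-element decision to make in the first place:

    {
        "name": "J. Smith",
        "roles": ["admin", "user"]
    }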

namespaces

I get the motivation for XML namespaces; real-world examples are including an SVG image inline in XHTML, or using XHTML tags in XSLT. But most documents are still predominantly in a single namespace. Even though the XML namespaces standard reached Recommendation status back in 1999, quirks and problems still exist today (mostly due to ignorance on the side of library and application developers). Just recently, a fellow developer had to drop the namespace from an XML error document returned from our Grails backend because the Flex frontend wouldn't handle it correctly. Due to the various components and abstraction layers involved, it took half a day and multiple developers to find the root cause. Well, and how did the developer implement the workaround? Like this:

errorMessage = errorMessage.replaceAll("xmlns.*?(\"|\').*?(\"|\')", "")

API support

Something's wrong when, even a decade after the widespread adoption of a technology, its manipulation is still frequently done through a more convenient lower layer instead of through the intended APIs. Nobody would think of accessing the raw blocks on a disk instead of going through the file system API, so why do we still frequently treat XML as text strings and attack it with regular expressions?! Why do I see way more ad-hoc XML parsing with grep (which relies on a certain structure and can easily be fooled, e.g. with tactically placed comments) than proper use of an XML parser, either through a programming language's infrastructure (like Perl's XML::Parser) or a dedicated tool like xmlstarlet (whose homepage currently admits that the development of xmlstarlet has somewhat stalled)? So even though XML is a mature standard, you can't assume full, effortless support of all of its features.
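
For contrast, here's what the proper approach looks like with the XPath API that ships with the JDK (a sketch; the document structure and the expression are made up). Unlike a grep, it is immune to comments, attribute reordering, and whitespace changes:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class ExtractError {
        public static void main(String[] args) throws Exception {
            // Parse into a DOM; comments and attribute order become irrelevant.
            // (Note: JAXP factories are not namespace-aware by default; one of
            // the quirks alluded to above.)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("response.xml"));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Hypothetical structure: /error/message holds the readable text.
            String message = xpath.evaluate("/error/message", doc);
            System.out.println(message);
        }
    }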

I myself needed regexp-based pattern matching in the form of the matches() function from XPath 2.0 / XSLT 2.0. Unfortunately, even the latest 1.5 version of xmlstarlet only supports XSLT 1.0 through libxslt version 1.1. Even when APIs do exist, they may be cumbersome or confusing to use: In my article XML processing still hard in Groovy, I complain about the problems in Groovy's XML APIs.
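
For reference, this is the kind of test I was after (XSLT 2.0; the attribute and pattern are hypothetical):

    <xsl:if test="matches(@id, '^[A-Z]{2}[0-9]+$')">...</xsl:if>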

verbosity

Ironically, one of the few things from SGML that got the ax was markup minimization: techniques like omitting the end tag (remember standalone <li> and <p> from the early HTML versions), empty end tags (<name>J. Smith</>), or even null end tags (<name/J.Smith/). This would have helped against the common critique of XML's verbosity (and also would have made ad-hoc parsing of such denormalized forms close to impossible).

To me, this also shows the document-centeredness: Repeating the tag name in the end tag is fine, even helpful, when the tags are only loosely scattered around the text (like in the HTML I'm currently writing here). But in machine-produced and machine-read data formats, this is unnecessary baggage. (And we have better systems for transmission error checking at the lower layers.)

The contender

JSON is tailored to the serialization of object structures. Because it doesn't attempt to cover the field of document processing, there's no need for a lot of functionality and surrounding standards. It is a much better fit for exchanging data objects and persisting configuration. Though it has its roots in the JavaScript language, there now exist libraries for virtually all programming languages. The very lightweight syntax makes it easy to implement a parser (and usually faster, too). There's less overhead in the form of indentation and closing tags, yet it's still easier to read because there are fewer data types and no attributes. The attack surface for security risks is smaller, too.

Its simplistic approach has its dangers, too. Though it's a one-liner to parse a JSON document and navigate the resulting object structure (and there's now a viable, even hot, command-line JavaScript interpreter in the form of node.js, which doesn't have the penalty of a slow JVM startup), people will miss the query and transformation capabilities of XPath and XSLT, and clamor for support. Every tool can be misused, and the fact that the JavaScript development community largely consists of young developers spells danger in the form of unknowingly repeating mistakes from the history of XML.
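
Even in Java, parsing and navigation really is a matter of a line or two with a library like Jackson (an assumed choice; any of the many JSON libraries would do):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ParseJson {
        public static void main(String[] args) throws Exception {
            String json = "{\"name\": \"J. Smith\", \"roles\": [\"admin\", \"user\"]}";
            // One call to parse, then plain navigation; no factories, no handlers.
            JsonNode root = new ObjectMapper().readTree(json);
            System.out.println(root.get("roles").get(0).asText()); // prints "admin"
        }
    }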

Summary

I think what will go down in history is this: Mostly fueled by the popularity of HTML, XML got a huge initial boost, leading many to overreact and use it in situations that never properly fit its profile, as I hopefully have illustrated. The "discovery" of JSON and the passage of time allowed heads to cool down and recognize the corresponding benefits of each contender. XML continues to be great for data format standards like SVG, MathML, etc., which are used in a document context; but for the platform-independent exchange and persistence of object hierarchies, JSON provides compelling benefits and will further displace XML.

Ingo Karkat, 03-Mar-2014

ingo's blog is licensed under Attribution-ShareAlike 4.0 International
