Friday, November 18, 2011

Who needs XML?

With the latest beta release, Candle introduces several new features to its markup format, including a new object notation and a new, clean namespace syntax. With this release, I believe Candle Markup is one of the best general data-exchange formats.

Before I talk specifically about Candle's markup format, let's look at the existing general data-exchange formats. The best-known formats are XML and JSON. We've seen many heated debates about which one is better. I'm not going to restart the holy war here. I'll try to get you out of the tit-for-tat comparison of JSON against XML, and let you look at the problem from a more fundamental perspective. You can ask yourself a few questions: Why do we need a general data-exchange format? What purposes should it serve? What characteristics should it have? Once you've got your answers, you can read on to see if yours agree with mine.

I think most people would agree that the need for a general data-exchange format arises because of the Internet. It connects people from different cultures, computers running different OSs, and applications written in different programming languages. They need a general data-exchange format to communicate. I don't need to dwell on the importance of such a format.

A general format simplifies application implementation, but more importantly, it helps application users, who no longer have to remember many different abstract syntaxes. Remember how UNIX newbies sighed about having to learn all the different command-line syntaxes? Today, most applications and programming languages have adopted XML as the format of their configuration files. But why hasn't XML taken over the entire world? Because there's a pitfall here. A general-purpose format suffers from the problem of "good for everything, great at nothing". A carefully designed domain-specific format can be more convenient and user-friendly in its own domain. For example, RDF data expressed in Turtle or N3 notation is much more terse and readable than the corresponding XML format. So while there are still valid cases for domain-specific data formats, we want a good general data format to eliminate as many of them as possible.

What characteristics should it have then? Well, it must enable data exchange, of course. And I think there are 3 aspects we need to look at: 1) the syntax, 2) the data types, and 3) the data structure. XML addresses the 1st and 3rd aspects but does not touch on the 2nd. Syntax-wise, it is straightforward (if we ignore DTD declarations). Structure-wise, XML's hierarchical structure has advantages over flatter text formats like Windows ini files, Java property files and HTML name-value paired form data. And its built-in support for mixed content makes it good for complex textual documents.
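To make the structural point concrete, here is a small Python sketch using only the standard library. The server settings, element names and host are made up for illustration; the point is only that an ini file stops at sections and name-value pairs, while XML can nest to any depth and carry mixed content (text interleaved with elements).

```python
import configparser
import xml.etree.ElementTree as ET

# Flat format: an ini file offers sections and name-value pairs, nothing deeper.
# (Hypothetical settings, for illustration only.)
ini_text = "[server]\nhost = example.org\nport = 8080\n"
config = configparser.ConfigParser()
config.read_string(ini_text)
flat_port = config["server"]["port"]

# Hierarchical format: elements can nest, and text can be mixed with elements.
xml_text = (
    "<server>"
    "<listen><host>example.org</host><port>8080</port></listen>"
    "<note>Restart <b>after</b> editing.</note>"
    "</server>"
)
root = ET.fromstring(xml_text)
nested_port = root.find("listen/port").text

# Mixed content: the note holds plain text with an inline <b> element inside it.
note = root.find("note")
note_text = "".join(note.itertext())

print(flat_port, nested_port, note_text)
```

Note that both parsers return the port as a string, which leads to the data type question.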

However, XML's silence on data types is its major weakness. The inventors of XML may have wanted to make it very extensible, and thus purposely left it to the applications to define their own data types and the detailed syntactic encoding of literal values. But when XML doesn't touch on the data types and data model behind the format, that work becomes the burden of the applications. This makes XML less than ideal for structured object data exchange, and we saw the rise of JSON to reclaim this area.
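A minimal Python sketch of the difference, using only the standard library. The record and its field names are invented for illustration. The same data is parsed once from XML and once from JSON: the XML parser can only hand back strings, while the JSON parser returns typed values.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
xml_text = "<item><name>widget</name><count>3</count><active>true</active></item>"
json_text = '{"name": "widget", "count": 3, "active": true}'

# XML: every leaf comes back as a string. The application has to know,
# from a schema or by convention, that count is a number and active a boolean.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# JSON: the syntax itself distinguishes strings, numbers and booleans.
from_json = json.loads(json_text)

print(from_xml)
print(from_json)
```

With XML, the conversion from "3" to 3 has to be written, and agreed upon, by every application that reads the data.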

Yet there are XML gurus who still insist that XML should just be a common syntactic format, and resist the idea of common data types and a common data model. It might look more extensible and versatile that way, but that extensibility is just an illusion if people no longer use XML. Just look at the Commonwealth. If the alliance only brings a common Queen and a common language (English), it is too cheap for people to break away from it. The alliance has to go into deeper integration, for example a common currency, as in the European Union, or even a common constitution, as in the United States. Of course, every step up the staircase of standardization is harder to achieve, but it also brings more benefits. The entire Internet deserves something better.

If XML insists on being just a common syntactic format, its fate is going to be like that of the Commonwealth, with more and more applications breaking away from it (like JSON and HTML5). We have seen more and more people inventing or adopting domain-specific formats: RDF Turtle, JavaFX literal objects, GroovyMarkup, the DOT language in Graphviz, and Lua used as a configuration file format.

Is the XML ship sinking? I don't think so. But water is leaking in.

I'm glad that people are starting to rethink the design of XML. The discussion started by James Clark on MicroXML is good food for thought. But I think a simple XML subset like MicroXML is not going to save XML. People might just as well use JSON. While a MicroXML format may have its niche market, the more important direction for XML to evolve is to do more rather than less.


And the biggest area to patch up is the data types and data model. All the advanced processing on XML (schema and ontology, path selection, query and update) has to build on top of standardized data types and a standardized data model. However, with all the mess created by the various data-model-related XML standards, including XML Infoset, XML Schema, the XQuery data model and the RDF data model, I doubt W3C's ability to clean it up.

If the XML designers don't keep asking themselves questions such as why we should use a general format like XML and what good it buys us, and come up with solid answers, then the application designers will.

In the next blog article, I'll talk about the unique features of Candle Markup and how it compares to existing formats like XML, JSON, YAML and JavaFX literal object.
