Sunday, November 20, 2011

A Markup Notation Better Than Ever

In the previous blog article, I talked about general data-exchange formats and some of the problems of XML. The original design, which constrained XML to being just a general syntactic format, has caused great headaches for people building advanced processing capabilities on top of XML.

Among the people who are rethinking XML, James Clark has suggested 3 approaches, including XML 2.0 and MicroXML, which put forward a solid framework to start with. Candle is definitely along the same lines. Candle is not compatible with XML; it actually goes beyond XML to unify the markup data model with the object data model.

In this blog, I'll explain in more detail how Candle Markup (especially with the new object notation) addresses many problems of XML, and how it compares with other formats like JSON and YAML.

Syntax wise, Candle Markup has the following advantages over XML:

  • whitespace non-ambiguity: you might think this is trivial, but if you take a look at the recent discussion on the xml-dev list about how XML editing tools have to come up with "creative" ways to tackle this issue, you'll see the trouble it causes. And if you ask a DB administrator, he/she will tell you that a single whitespace character is definitely different from 1000 consecutive whitespace characters.
  • cleaner namespace syntax: there has been enough sighing over XML's namespaces, so I'll only show you the relief. In Candle, you can now write a fully expanded QName as ns:domain:foo:bar, a hierarchical name similar to Java's.
  • strongly typed literal values: Candle uses a unique syntax to denote the type of each literal value. Thus Candle is always strongly typed, whereas XML is only weakly typed without a schema. Some people have questioned why anyone would adopt Candle's literal syntax. I won't claim that Candle's literal syntax is the best, but in the end the world has to agree on some syntax so that we'll be able to exchange typed values.

    Candle's literal value syntaxes are carefully designed, and they are based on widely accepted conventions:
    • empty: (), "", '' - as intuitive as they can be;
    • boolean: true, false as usual;
    • string: double quoted as usual;
    • number
      • integer, decimal and double types follow common syntaxes you've always used; 
      • type suffix for integer types, like byte and short, is a convention used by major programming languages, including C/C++, Java, .Net, Python;
    • measure: widely used in CSS, SVG and SMIL;
    • qname: based on Candle's new hierarchical namespace syntax, and similar to C++ and Java. It can't be simpler than that;
    • uri: follows the standard URI syntax, except that some schemes are reserved to represent special literal values in Candle;
    • special single-quoted literal values: datetime, color, id, binary. Only this part might have other options, like:
      • option 1: use Turtle kind of postfix annotation, e.g. "2011-11-19"^date;
      • option 2: just use object notation, e.g. dt{"2011-11-19"};
      • option 3: constructor syntax, e.g. dt("2011-11-19");
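
To make the idea of "the syntax denotes the type" concrete, here is a minimal Python sketch that classifies a token by its shape alone. It is not Candle's parser: the post only names the literal kinds, so the function name `classify_literal`, the regular expressions, the list of measure units, and the rule that a double carries an exponent are all assumptions made for illustration.

```python
import re

# Assumed patterns: the post names these literal kinds but does not give a
# grammar, so every regular expression here is illustrative only.
_EMPTY = {'()', '""', "''"}
_PATTERNS = [
    ("boolean", re.compile(r"true|false")),
    ("string", re.compile(r'"[^"]*"')),                          # double quoted
    ("double", re.compile(r"[+-]?\d+(\.\d+)?[eE][+-]?\d+")),     # assumed: has exponent
    ("decimal", re.compile(r"[+-]?\d+\.\d+")),
    ("integer", re.compile(r"[+-]?\d+")),
    ("measure", re.compile(r"[+-]?\d+(\.\d+)?(px|pt|em|cm|mm|in|%)")),  # CSS-style unit
    ("special", re.compile(r"'[^']+'")),                         # datetime, color, id, binary
]


def classify_literal(token: str) -> str:
    """Return the literal kind implied by the token's syntax, or "unknown"."""
    if token in _EMPTY:
        return "empty"
    for kind, pattern in _PATTERNS:
        if pattern.fullmatch(token):
            return kind
    return "unknown"


if __name__ == "__main__":
    for tok in ['()', 'true', '"John"', '42', '3.14', '12px', "'2011-11-19'"]:
        print(tok, "->", classify_literal(tok))
```

The point of the sketch is that no schema is consulted: the reader of a document can tell a string from a number from a measure just by looking at the token.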

Data model wise, Candle Markup has the following advantages over XML:

  • strongly-typed: no harm to emphasize it one more time;
  • clean data type hierarchy: Candle strips all the DTD-related types from the XML Schema types; namespaces are not modeled as data nodes; and processing instructions are combined with comment nodes. This results in a very clean data type hierarchy compared to the XML Schema data model;
  • unification with the OOP object data model: I'll talk about this later;
  • extension of the data model to the file and directory level: this part has not been fully implemented in the current beta release, so I won't go into details. The basic idea is that you should be able to work with file and directory nodes just like element and text nodes within a document.

Unifying Markup Data Model with Object Data Model

For years, developers and architects have been baffled by the mismatch between the markup data model and the object data model, and have tried hard to find the best way to map one into the other. Candle takes a different approach by unifying the two data models. Whether this is the right direction down the road, I leave up to you to judge.

If you ask me what the major differences between the two models are, I think it boils down to just this:
  • in the markup data model, an attribute can only hold simple content, not complex content;
  • in the object data model, an object can only have attributes, but no child nodes.

So the unification is straightforward: extend attributes in the markup data model so that they can hold complex content, and extend objects in the object data model so that they can hold child nodes. As a result, a markup element and an OOP object are now the same thing.
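
The unified model can be pictured with a small Python sketch. This is not Candle's implementation; the class name `Node` and its fields are hypothetical, chosen only to show one node type whose attributes may hold other nodes (complex content) and which may also carry child nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A value is either a simple literal or another node (complex content).
Value = Union[str, int, float, bool, "Node"]


@dataclass
class Node:
    """Hypothetical unified node: an element and an object are the same thing."""
    name: str
    attributes: Dict[str, Value] = field(default_factory=dict)  # may hold nodes
    children: List[Value] = field(default_factory=list)         # text or nodes


# An attribute holding complex content (address) plus a child text node.
customer = Node(
    name="Customer",
    attributes={
        "firstName": "John",
        "lastName": "Doe",
        "address": Node("Address", {"city": "Santa Clara", "state": "CA"}),
    },
    children=["a text node"],
)

if __name__ == "__main__":
    print(customer.attributes["address"].attributes["city"])
    print(customer.children)
```

A classic markup element is the special case where every attribute value is simple, and a classic object is the special case where `children` is empty.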

In Candle, the new object notation is just an alternative syntax to the element notation. Data model wise, an object is treated exactly the same as an element.

Here's an example of an object in 3 different notations:

New Object Notation in Candle:

Customer {
  firstName = "John"
  lastName = "Doe"
  phoneNum = "9555-0101"
  address = Address {
    street = "1 Main Street"
    city = "Santa Clara"
    state = "CA"
    zip = "95050"
  }
  "a text node"
}

Equivalent Element Notation in Candle: (example not preserved)

Similar JavaFX Object Notation:

Customer {
  firstName: "John";
  lastName: "Doe";
  phoneNum: "9555-0101";
  address: Address {
    street: "1 Main Street";
    city: "Santa Clara";
    state: "CA";
    zip: "95050";
  }
  /* text node not supported */
}

Candle Markup Comparing to Other Alternatives

The following is a feature comparison of Candle against other alternatives. I've selected XML, JSON, YAML, and the JavaFX object notation. The list is not exhaustive, but sufficiently representative, I think:

Specific Features

| Feature | XML | JSON | YAML | JavaFX Literal Object | Candle |
| --- | --- | --- | --- | --- | --- |
| Unicode support | yes | yes | yes | yes | yes |
| Whitespace non-ambiguity | no | yes | yes | yes | yes |
| Strongly typed literal values | needs Schema | yes | yes | yes | yes |
| Extended literal values (like datetime, uri, qname) | needs Schema | no | needs type annotation | needs to use object constructor | yes |
| Namespace support | yes (but messy) | no | partial (only the type annotation is namespaced) | (clean hierarchical ns) | (clean hierarchical ns) |
| Complex attribute content | no (XML Schema defines a general value list syntax, but it is highly ambiguous and not usable at all) | yes | yes | yes | yes |
| Child node support | yes | no (no direct support) | (no direct support) | (no direct support) | |
| Formal data model | yes (but messy) | yes | yes | yes | yes |
| Schema language | yes (XML Schema is over-complicated; RELAX NG is cleaner, but less used) | no | no | yes (attribute only, no child content model support) | (similar to RELAX NG) |
| Embeddable in programming languages (as structured nodes, not as quoted strings) | (.Net, Scala, etc.) | | no | yes | |
| Advanced processing (path language, query and update language) | (but with overlapping and conflicting features) | no | no | limited (not as high-level as XPath, XQuery) | (unified query language) |

General Features

| Feature | XML | JSON | YAML | JavaFX Literal Object | Candle |
| --- | --- | --- | --- | --- | --- |
| Readability | good for mixed text content, but verbose for structured data | good for structured data | good for structured data | good for structured data | good for both (you have object notation; and literal values do not need to be quoted) |
| Cross platform | yes | yes | yes | yes | yes |
| Open source | yes | yes | yes | promised (but not delivered yet) | |
| Lightweight runtime | yes (if you only use XML); no (if you start to use XML Schema, XSLT, XQuery, WS, etc.) | yes | yes | no | yes (entire runtime is only 2MB when compressed) |
| Standards status | W3C standard | RFC standard | no | Oracle only (might become Java standard in future) | not yet |
| Good for structured data | no | yes | yes | yes | yes |
| Good for mixed text content | yes | no | no | no | yes |

Generally, YAML can be seen as a superset of JSON, and the JavaFX literal object can be seen as a superset of YAML. These 3 formats are good for structured data exchange. Candle can be seen as a superset of XML (excluding DTD) and of the other 3 object formats.
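
One way to read the superset claim is that both a JSON object and an XML element fit into a single node shape with attributes and children. The Python sketch below illustrates that reading only; it is not how Candle works internally. The helper names `node`, `from_json` and `from_xml` and the dict layout are hypothetical, and JSON objects are named after their key because JSON itself has no element names.

```python
import json
import xml.etree.ElementTree as ET


def node(name, attributes=None, children=None):
    """Build a plain-dict node; this shape is an assumption, not Candle's API."""
    return {"name": name, "attributes": attributes or {}, "children": children or []}


def from_json(name, value):
    """A JSON object becomes an attribute-only node; nested objects are complex attributes."""
    if isinstance(value, dict):
        return node(name, {k: from_json(k, v) for k, v in value.items()})
    return value  # simple values (and, in this sketch, lists) pass through unchanged


def from_xml(elem):
    """An XML element becomes a node with simple attributes plus child nodes."""
    children = []
    if elem.text:
        children.append(elem.text)          # leading text, kept verbatim
    for child in elem:
        children.append(from_xml(child))
        if child.tail:
            children.append(child.tail)     # text following the child element
    return node(elem.tag, dict(elem.attrib), children)


obj = from_json("Customer", json.loads('{"firstName": "John", "address": {"city": "Santa Clara"}}'))
doc = from_xml(ET.fromstring('<p class="note">Hello <b>world</b>!</p>'))

if __name__ == "__main__":
    print(obj)
    print(doc)
```

The JSON-derived node never uses `children`, and the XML-derived node never puts a node inside `attributes`; the unified shape simply allows both at once.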

In the next blog, I'll give you some illustrative examples to show you how Candle Markup can naturally express data which currently requires domain-specific formats.
