XML Validation

XML documents that follow all the rules for XML syntax are said to be well-formed. Well-formed XML documents have sufficient structure to guarantee that they can be represented as a hierarchical tree. DOM, XSL and typical SAX processing all inherently rely on this guarantee of well-formed XML documents.
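As a quick illustration of the well-formedness guarantee, here is a minimal sketch in Python using the standard library's ElementTree parser. The helper name is_well_formed is made up for this example; the point is only that a parser either builds a tree or refuses the document outright.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses into a tree, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<List name='Fruit List'><Item>Apple</Item></List>"
bad = "<List name='Fruit List'><Item>Apple</List>"  # Item is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Note that this says nothing about validity: the parser would accept any element names at all, as long as the tags nest properly.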

A valid XML document is a document that has been proven to follow a set of more stringent rules than those of the XML syntax alone. Most validation methods are concerned with the vocabulary and grammar of the XML document. XML’s built-in validation mechanism is the Document Type Definition (DTD). A good portion of the XML specification is dedicated to describing DTDs.

XML Schema is the 800-pound gorilla among XML validation standards. XML Schemas provide much greater control over both the structure and data types allowed in an XML document. RelaxNG and Schematron are two other validation methods worth being familiar with. This essay is not a reference for any particular validation method; rather it aims to demonstrate how validation saves a programmer time and effort.

We’ll begin with brief introductions and examples of several validation methods.

DTD Validation

DTDs allow for basic control of element and attribute names and the overall structure of an XML document. DTDs are usually simple to write and understand. The following XML document could be validated with the brief DTD that follows:

1 |<?xml version="1.0" ?>
2 |<!DOCTYPE List SYSTEM "list.dtd">
3 |<List name="Fruit List">
4 |   <Item>Apple</Item>
5 |   <Item>Banana</Item>
6 |   <Item>Pear</Item>
7 |</List>

<!-- list.dtd -->
<!ELEMENT List (Item+)>
<!ATTLIST List
   name CDATA #IMPLIED >
<!ELEMENT Item (#PCDATA)>

Note that on line two of the XML document above, a DOCTYPE declaration associates the document with the location of a DTD to validate the document. A DOCTYPE declaration is not the only way to validate using a DTD, but it is a common one.
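To get a feel for what even this small DTD saves you, here is roughly the same set of rules written by hand. This is only a sketch in Python using the standard library's ElementTree; the function name check_list is made up for illustration, it ignores details such as stray text between items, and a real validating parser would simply apply list.dtd for you.

```python
import xml.etree.ElementTree as ET


def check_list(xml_text: str) -> list[str]:
    """Hand-coded approximation of list.dtd; returns a list of error messages."""
    root = ET.fromstring(xml_text)
    # <!ELEMENT List (Item+)>: the root must be List ...
    if root.tag != "List":
        return [f"root element is {root.tag}, expected List"]
    errors = []
    # ... with one or more Item children and nothing else.
    if len(root) == 0:
        errors.append("List must contain at least one Item")
    for child in root:
        if child.tag != "Item":
            errors.append(f"unexpected element {child.tag} in List")
        elif len(child) > 0 or child.attrib:
            # <!ELEMENT Item (#PCDATA)>: text only, and no attributes declared.
            errors.append("Item may contain only text")
    # <!ATTLIST List name CDATA #IMPLIED>: name is optional, nothing else allowed.
    for attr in root.attrib:
        if attr != "name":
            errors.append(f"unexpected attribute {attr} on List")
    return errors


fruit = """<List name="Fruit List">
   <Item>Apple</Item>
   <Item>Banana</Item>
   <Item>Pear</Item>
</List>"""

print(check_list(fruit))                    # []
print(check_list("<List><Thing/></List>"))  # ['unexpected element Thing in List']
```

Five lines of DTD stand in for all of this conditional code, and the DTD is a lot easier to keep in sync with the document format.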

DTDs and Namespaces

DTDs and namespaces don’t mix very smoothly because DTDs pre-date XML Namespaces by several years. From the DTD validation perspective, namespace prefixes are just part of the string of characters that make up the element names.
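The contrast with a namespace-aware parser is easy to see in a short sketch. The Python below uses the standard library's ElementTree, which reports every name as {namespace-uri}local-name; the two sample documents are my own variations on the fruit list. The parser treats the default-namespace and prefixed forms as the same elements, while a DTD would see List and lh:List as two unrelated names.

```python
import xml.etree.ElementTree as ET

NS = "http://liquidhub.com/SimpleList"

# The same document written with a default namespace and with a prefix.
default_ns = f'<List xmlns="{NS}"><Item>Apple</Item></List>'
prefixed = f'<lh:List xmlns:lh="{NS}"><lh:Item>Apple</lh:Item></lh:List>'

root_a = ET.fromstring(default_ns)
root_b = ET.fromstring(prefixed)

# The prefix disappears: only the namespace URI and local name remain.
print(root_a.tag)                      # {http://liquidhub.com/SimpleList}List
print(root_a.tag == root_b.tag)        # True
print(root_a[0].tag == root_b[0].tag)  # True
```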

There are many ways to declare namespaces for elements. You could choose to use a default namespace applied to the root element alone or to use a namespace prefix for all elements. In your DTD you generally have to choose a single namespace declaration method because of the weak namespace support*.

*Technically speaking, there are ways to write a DTD that allow validating parsers to handle multiple namespace prefix schemes. The somewhat complex technique involves using an entity for the prefix and indirectly mapping all the element names in the content model to include the prefix entity. The prefix entity then needs to be defined or discovered in the instance document, so there is some burden beyond simply using namespaces correctly in your XML with this technique. With all the DTD maintenance trouble likely to arise from the indirect mappings, you’re probably best off pre-processing your XML to use a single namespace prefix scheme and keeping your DTD simple. Huh? Exactly.

Here’s the DTD from the above sample modified for namespaces declared in two different ways:

1 |<!-- list.dtd (default namespace)-->
2 |<!ELEMENT List (Item+)>
3 |<!ATTLIST List
4 |   name CDATA #IMPLIED
5 |   xmlns CDATA #FIXED "http://liquidhub.com/SimpleList">
6 |<!ELEMENT Item (#PCDATA)>

The default namespace technique shown above just requires adding the xmlns attribute to the List element with the namespace URI as a fixed value (line five).

<!-- list.dtd (namespace w/prefixes)-->
<!ELEMENT lh:List (lh:Item+)>
<!ATTLIST lh:List
   name CDATA #IMPLIED
   xmlns:lh CDATA #FIXED "http://liquidhub.com/SimpleList">
<!ELEMENT lh:Item (#PCDATA)>

This DTD adds the xmlns:lh attribute to the List element, prefix and all, then prefixes every element with the lh: namespace prefix.

Choose your DTD namespace approach depending on what you’re trying to accomplish with your DTD. The default namespace approach is easier if you’re not mixing namespaces and just want a simple validation.

XML Schema Validation

XML Schema is a considerably more powerful validation mechanism than DTD because it adds data types and more sophisticated structure constraints. Namespaces are fully supported in XML Schema.

One of the more intuitively advantageous aspects of XML Schemas is that they are expressed as XML documents. Having a schema expressed in XML means that the information in the schema is programmatically accessible through the same standard XML interfaces you’re likely already working with.

The XML Schema specification is divided into three parts. The first part, Part 0, is the primer and describes basic usage of XML Schema. The second part, Part 1, is devoted to the structure or content model mechanism. The final part, Part 2, describes data types.

XML Schema Structure

Complex types can contain child elements and attributes. Simple types can contain neither.

XML Schema takes a somewhat object-oriented approach to describing the content model of an XML document. Simple types are combined into more complex types and elements are defined in terms of these types. XML Schema supports rudimentary inheritance for complex types, allowing derived types to extend or restrict base types.

Here is a sample XML instance document that references an XML Schema:

1 |<?xml version="1.0" ?>
2 |<List name="Fruit List"
3 |   xmlns="http://liquidhub.com/SimpleList"
4 |   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
5 |   xsi:schemaLocation="http://liquidhub.com/SimpleList list.xsd">
6 |   <Item>Apple</Item>
7 |   <Item>Banana</Item>
8 |   <Item>Pear</Item>
9 |</List>

Note lines four and five in the XML instance document above. The schemaLocation attribute is a common method for associating an XML instance document with its schema. Though convenient for testing, you wouldn’t want to trust the schemaLocation for documents created outside of your control. Typically you use your XML implementation’s API to provide an authoritative XML Schema for validation.
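The idea of not trusting the document's own hint can be sketched as follows. This is Python with the standard library only, which has no XML Schema validator, so the sketch stops at choosing the schema; the TRUSTED_SCHEMAS registry, the /opt/schemas path, and the evil.example URL are all hypothetical names invented for the example.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical registry: namespace URI -> schema file that we control.
TRUSTED_SCHEMAS = {"http://liquidhub.com/SimpleList": "/opt/schemas/list.xsd"}


def schema_hints(root: ET.Element) -> dict[str, str]:
    """Parse xsi:schemaLocation into {namespace: location} pairs."""
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))


def trusted_schema_for(root: ET.Element) -> str:
    """Choose the schema from the root's namespace, ignoring the document's hint."""
    namespace = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    return TRUSTED_SCHEMAS[namespace]  # KeyError: no schema we trust for this namespace


doc = ET.fromstring(
    '<List xmlns="http://liquidhub.com/SimpleList" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://liquidhub.com/SimpleList http://evil.example/list.xsd">'
    "<Item>Apple</Item></List>"
)

print(schema_hints(doc))        # what the document asks for
print(trusted_schema_for(doc))  # what we actually validate against
```

Whatever validating API you use, the second value is the one to hand it.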

Here’s the sample XML Schema:

<!-- list.xsd -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
   xmlns:lh="http://liquidhub.com/SimpleList"
   targetNamespace="http://liquidhub.com/SimpleList"
   elementFormDefault="qualified">
   <complexType name="SimpleList">
      <sequence>
         <element name="Item" type="string"
            maxOccurs="unbounded" />
      </sequence>
      <attribute name="name" type="string" />
   </complexType>
   <element name="List" type="lh:SimpleList"/>
</schema>

After a bunch of boilerplate schema setup, a complexType definition is created for our SimpleList type. This type consists of a sequence of one or more Item elements and a name attribute. The Item element is defined directly inline because it is a simple string type. Finally the root List element is declared globally to be of the SimpleList type. Seems like a lot of work for such a simple validation, no?
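One payoff for the extra work is that the schema is itself XML, so ordinary XML tooling can read it. The following sketch, in Python with the standard library's ElementTree, pulls the declared elements and attributes out of the sample schema; it is just an illustration of programmatic access, not a schema processor.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

LIST_XSD = """\
<schema xmlns="http://www.w3.org/2001/XMLSchema"
   xmlns:lh="http://liquidhub.com/SimpleList"
   targetNamespace="http://liquidhub.com/SimpleList"
   elementFormDefault="qualified">
   <complexType name="SimpleList">
      <sequence>
         <element name="Item" type="string" maxOccurs="unbounded"/>
      </sequence>
      <attribute name="name" type="string"/>
   </complexType>
   <element name="List" type="lh:SimpleList"/>
</schema>
"""

schema = ET.fromstring(LIST_XSD)

# Every element declaration in the schema, global or nested, as (name, type).
elements = [(e.get("name"), e.get("type")) for e in schema.iter(XSD + "element")]
# Every attribute declared anywhere in the schema.
attributes = [a.get("name") for a in schema.iter(XSD + "attribute")]

print(elements)    # [('Item', 'string'), ('List', 'lh:SimpleList')]
print(attributes)  # ['name']
```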

XML Schema Data Types

XML Schema defines simple data types like int, string, and date, a marked improvement over the limited data types of DTDs. Some other useful data types included in XML Schema are types for URIs (http://www.liquidhub.com), international language codes (en-US), and valid XML names and IDs (QName, NCName, ID). Section 3 of Part 2 of the specification (Datatypes) includes a full listing of the built-in datatypes.

XML Schema also provides the ability to create user-defined data types. When defining your own data types, you typically build on top of one of the simple types, adding constraints like minimum and maximum values or length limits. In this regard, schema data types are much like database types.

You can also use regular expression patterns to build just about any kind of text data type you can imagine. The following schema fragment illustrates defining a data type for social security numbers using a regular expression pattern.

<!-- user-defined SSN type -->
<xsd:simpleType name="ssnType">
   <xsd:restriction base="xsd:string">
      <xsd:length value="11"/>
      <xsd:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
   </xsd:restriction>
</xsd:simpleType>
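For comparison, here is what the same constraint looks like as hand-written code. This is a small Python sketch; the name is_ssn is made up, and it mirrors only the two facets in the fragment above (the length and the pattern).

```python
import re

SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def is_ssn(value: str) -> bool:
    """Mirror ssnType: exactly 11 characters that match the pattern."""
    # A schema pattern must match the whole value, hence fullmatch.
    return len(value) == 11 and SSN_PATTERN.fullmatch(value) is not None


print(is_ssn("123-45-6789"))  # True
print(is_ssn("123456789"))    # False
```

With the type declared in the schema, checks like this one don't need to be repeated in every application that touches the data.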

It’s a good idea to write a common set of user-defined types for reuse in building schemas within your organization. If you’re in the retail business, for example, a SKU data type could be used in many product-related schemas. XML Schema provides an import mechanism for keeping schemas modular and allowing for reuse.

XML Schema is quite a sophisticated modeling language for XML data validation. We haven’t even scratched the surface in this brief fly-over.

RelaxNG Validation

DTD and XML Schema are not the only validation games in town. RelaxNG was created as a response to the complexity of XML Schema. XML Schema spent a long time in the standardization process and is criticized for being over-engineered. RelaxNG is a powerful validation mechanism that is much simpler than XML Schema.

Like XML Schema, RelaxNG schemas are expressed as XML documents, though RelaxNG also has a compact format that is not XML. You can compare the two formats in the equivalent RelaxNG schemas below:

<!-- list.rng -->
<element name="List" xmlns="http://relaxng.org/ns/structure/1.0">
   <attribute name="name">
      <text/>
   </attribute>
   <oneOrMore>
      <element name="Item">
         <text/>
      </element>
   </oneOrMore>
</element>

# list.rnc (compact format)
element List {
   attribute name { text },
   element Item { text }+
}

The compact format is quite spare! Toolsets for performing RelaxNG validation are widely available for Java and somewhat available for .NET. RelaxNG is unfortunately not likely to ever be part of the Microsoft XML services.

RelaxNG schemas are much easier to read and understand than XML Schemas or DTDs. I’ve successfully used DTDs converted (automagically) to RelaxNG schemas in order to explain complex XML structures to non-developer business users.

Schematron Validation

Schematron is a method of validation based on making XPath assertions against XML tree structures. Implementations of Schematron validation are available as XSL stylesheets. Schematron is meant to supplement other types of validation.

XML Schema has annotation elements designed for holding documentation or additional application information. The appinfo element is an ideal place to embed Schematron assertions within your schema. You can then write a validation wrapper class that extracts and executes these assertions after a successful schema validation.

Here’s a sample Schematron schema with a single assertion:

<schema xmlns="http://www.ascc.net/xml/schematron">
   <pattern name="No Duplicate Items">
      <rule context="Item">
         <assert test="not(preceding-sibling::Item=.)">
            Duplicate items not allowed!
         </assert>
      </rule>
   </pattern>
</schema>

In the XSL implementation of Schematron, you feed a Schematron schema like the one above into a Schematron skeleton XSL transform that produces a second validating XSL transform. This generated validating XSL transform runs against an XML instance resulting in a list of any validation errors. The errors are formatted in any way that you specify in the Schematron schema, including either plain text or XML. That’s clever!
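To make the sample assertion concrete, here is a rough hand-written equivalent of what the generated transform checks. It is a Python sketch, not Schematron: the standard library's ElementTree has no preceding-sibling axis, so the code walks each parent's Item children and remembers the values it has seen. The function name duplicate_item_errors is made up for the example.

```python
import xml.etree.ElementTree as ET


def duplicate_item_errors(xml_text: str) -> list[str]:
    """Report every Item whose text repeats an earlier sibling Item."""
    errors = []
    for parent in ET.fromstring(xml_text).iter():
        seen = set()
        for item in parent.findall("Item"):  # direct Item children only
            value = "".join(item.itertext())
            if value in seen:
                errors.append(f"Duplicate items not allowed! ({value})")
            seen.add(value)
    return errors


doc = "<List><Item>Apple</Item><Item>Banana</Item><Item>Apple</Item></List>"
print(duplicate_item_errors(doc))  # ['Duplicate items not allowed! (Apple)']
```

A rule like this depends on the content of sibling elements, which is exactly the kind of constraint a DTD content model can't express.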

The clever approach to Schematron validation is worth further investigation. The pattern boils down to this: Create a simple XML grammar that can be processed with XSL to produce an XSL transform that does something useful against other XML instances. XML transformed by XSL generating XSL to transform XML…makes me giddy, let’s move on.

Strengths and Weaknesses

XML Schema and RelaxNG can certainly validate anything a DTD can, but DTDs remain relevant because of their widespread use. Industries that jumped on the XML bandwagon early created DTDs.

Your choice of validation mechanism will naturally depend on the tools available to you. Java developers have the fewest barriers to using any of the validation methods discussed here. When you have a choice of validation methods, you need to consider the relative expressive power and ease of use of each:

[Figure: graph comparing the expressive power and ease of use of DTD, XML Schema, RelaxNG, and Schematron]

When examining this graph, consider that the effort required to implement the more expressive validation methods is likely to be paid back with decreased coding effort.

Relax NG really hits a sweet spot for both ease of use and expressive power. Even if you’re a .NET developer, just because Microsoft’s not likely to implement Relax NG in its core XML services doesn’t mean that you should dismiss Relax NG altogether.

Schematron was difficult to place on the graph. Schematron can express content dependent rules across elements and attributes that none of the other methods can. On the other hand, Schematron is not as well suited to basic content model restrictions as the others are. Schematron is best used as supplemental power to the other validation methods.

Few would argue with the placement of DTD and XML Schema on the graph. DTD has the advantage and disadvantage of being quick and easy. DTD is widely used and can greatly simplify the code in your applications with little effort. But once you see what XML Schema has to offer, it’s hard to be satisfied with DTD.

Tools exist for converting between the validation methods. Be warned that only downgrading to a less expressive method works well. Tools claiming to convert DTDs to Schemas may get the job done, but without a lot of human aid, a tool can’t hope to get the data model abstractions right. It’s better to start a Schema design from scratch than from the output of automated tools. A similar class of tools produces a schema from an XML instance. Your mileage may vary.

All of the validation methods can be broken into modules for reusing components across multiple schemas. When schemas get large, and they tend to get large when you’re doing anything complicated, breaking them down into components is a big help for developers.

Modules are also the key to designing extensible schemas. XML Schema has the most sophisticated extensibility mechanisms. By planning for extensibility in your schemas, you can enable the addition of new elements and attributes without breaking other applications depending on your schema.

In a B2B scenario, extensible schemas can really pay off, allowing different vendors to extend a common schema independently without getting zapped by other vendors’ extensions. This kind of schema design takes some planning, but fits a variety of real-world situations.

For the most part, validation is whatever you say it is. You can roll your own validation, but the available alternatives cover quite a bit of ground for you.

Tip: Always make sure your validation mechanism is working! An easy way to test is by making intentional errors in your instance documents.
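The tip can be turned into an automated check. The sketch below is Python; has_items is a stand-in for whatever real validator you use, always_valid simulates a validator that was never actually switched on, and all of the names are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def has_items(xml_text: str) -> bool:
    """Stand-in validator: a List must contain at least one Item."""
    root = ET.fromstring(xml_text)
    return root.tag == "List" and root.find("Item") is not None


def always_valid(xml_text: str) -> bool:
    """A misconfigured validator that silently accepts everything."""
    return True


def catches_errors(validate: Callable[[str], bool], good: str, bad: list[str]) -> bool:
    """True only if the good document passes and every broken one is rejected."""
    return validate(good) and not any(validate(doc) for doc in bad)


GOOD = "<List><Item>Apple</Item></List>"
BROKEN = ["<List/>", "<Basket><Item>Apple</Item></Basket>"]  # intentional errors

print(catches_errors(has_items, GOOD, BROKEN))     # True
print(catches_errors(always_valid, GOOD, BROKEN))  # False
```

A validator that never rejects anything looks exactly like a working one until you feed it a document that should fail.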

Validation adds strength to the structure of XML. No validation mechanism is going to cover everything you’re likely to need in practice. That’s why it can be helpful to add something like Schematron validation to your toolkit. Validation can take the place of a lot of conditional error checking code, making your applications simpler and easier to maintain.

XML Schema is a large and difficult specification compared to the other validation methods introduced in this essay. But XML Schema has far greater modeling capabilities than the other validation methods, and it’s likely to be here to stay. Identifying and using established XML Schema patterns can save you a lot of frustration when learning the language. The development time and effort saved by using a strong validation language is considerable and worth your investment.

References

XML 1.1 Specification, Section 4 Physical Structures (DTD): http://www.w3.org/TR/xml11/#sec-physical-struct

XML Schema Part 0: Primer Second Edition: http://www.w3.org/TR/xmlschema-0/

XML Schema Part 1: Structures Second Edition: http://www.w3.org/TR/xmlschema-1/

XML Schema Part 2: Datatypes Second Edition: http://www.w3.org/TR/xmlschema-2/

XML Schema Part 2: Datatypes, Section 3, Built-in Datatypes: http://www.w3.org/TR/xmlschema-2/#built-in-datatypes

Relax NG: http://www.relaxng.org/

Schematron: http://www.schematron.com/