XML documents: flavors versus essence

I have steered clear of the politics surrounding XML document formats both before and after joining Microsoft. But I was, and will always be, an outspoken advocate for the idea of XML documents. That’s a message that doesn’t make headlines but bears repeating. We have hardly begun to appreciate or exploit the value of XML. A couple of articles in the current issue of CTQuarterly, a journal about how cyberinfrastructure enables science, illuminate that point.

In “Next-Generation Implications of Open Access,” Paul Ginsparg writes:

One of the surprises of the past two decades is how little progress has been made in the underlying document format employed. Equation-intensive physicists, mathematicians, and computer scientists now generally create PDF from TeX. It is a methodology based on a pre-1980s print-on-paper mentality and not optimized for network distribution. The implications of widespread usage of newer document formats such as Microsoft’s Open Office XML or the OASIS OpenDocument format and the attendant ability to extract semantic information and modularize documents are scarcely appreciated by the research communities.

As the developer of the arXiv (formerly LANL) preprint archive, which predates the web, Ginsparg understands better than almost anyone how that “pre-1980s print-on-paper” mentality thwarts the advancement of knowledge.

In “The Shape of the Scientific Article in The Developing Cyberinfrastructure,” Clifford Lynch writes:

We are seeing the deployment of software that computes upon the entire corpus of scientific literature. Such computation includes not only the now familiar and commonplace indexing by various search engines, but also computational analysis, abstraction, correlation, anomaly identification and hypothesis generation that is often termed “data mining” or “text mining.”

I like his tagline for this: “Scientific literature that is computed upon, not merely read by humans.”

XML document formats aren’t a panacea, but when we use them to reduce friction and lower activation thresholds, data will find data, and people will find people. To achieve those effects, the essential property of machine readability matters more than its flavor.
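To make the flavor-versus-essence point concrete, here is a minimal sketch in Python. It assumes two tiny hand-written fragments, one in the style of WordprocessingML (the Office flavor) and one in the style of OpenDocument text. The fragments and the `paragraph_texts` helper are illustrative inventions, not real files or any library's API; only the two namespace URIs come from the formats themselves. The same few lines of standard-library code pull the paragraphs out of either flavor.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by the two formats for their text markup.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical, heavily simplified fragments. Real documents live inside
# zip packages and carry far more markup than this.
OOXML_FRAGMENT = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Data will find data.</w:t></w:r></w:p>
    <w:p><w:r><w:t>People will find people.</w:t></w:r></w:p>
  </w:body>
</w:document>"""

ODF_FRAGMENT = f"""<text:section xmlns:text="{TEXT_NS}">
  <text:p>Data will find data.</text:p>
  <text:p>People will find people.</text:p>
</text:section>"""


def paragraph_texts(xml_source, paragraph_tag):
    """Return the text of every element named paragraph_tag, in document order."""
    root = ET.fromstring(xml_source)
    return ["".join(p.itertext()).strip() for p in root.iter(paragraph_tag)]


# ElementTree spells a namespaced tag as "{namespace-uri}localname".
ooxml_paragraphs = paragraph_texts(OOXML_FRAGMENT, f"{{{W_NS}}}p")
odf_paragraphs = paragraph_texts(ODF_FRAGMENT, f"{{{TEXT_NS}}}p")

print(ooxml_paragraphs == odf_paragraphs)
```

The only thing that changes between the two calls is the tag name. Everything else, the parser, the traversal, the text extraction, works because both flavors are XML.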
