Last week Kevin Curry dug into some data about school violence in his district. In this case the data was made available as HTML, which means it was sort-of-but-not-really published on the web. Kevin writes:
Whenever I come across data like this the first thing I want to know is whether or not it can actually be used as data. In order to be used/usable as data the contents of this HTML table need to be, at minimum, copy-and-paste-able into a spreadsheet.
Or, alternatively, the HTML table needs to be parseable as data. In this case, I was surprised to find that a couple of tools I normally use to do that parsing — Dabble DB and Excel — didn’t work. That’s because Kevin’s target page doesn’t include a static HTML table. It’s dynamic instead: First you select a district, then the table appears. This mechanism defeats tools that try to parse data from HTML tables, so it’s a bad way to publish data that you want to be available as data.
Lacking the option to parse the HTML table, Kevin’s only choice was to copy and paste. That’s clumsy, and you have to be really motivated to do it, but it can be done. Here’s the Google spreadsheet Kevin made from the data he copied and pasted. And here’s the same stuff as an Excel Web App.
If you haven’t tried out the new Excel Web App, by the way, it’s interesting to compare the two. One key difference, at least from my point of view, is — not surprisingly — the Excel Web App’s ability to roundtrip with Excel. A Google spreadsheet is, at this point, more functional in standalone mode. While you can edit both a Google spreadsheet and an Excel Web App in the browser, for example, the Google spreadsheet can insert and modify charts, whereas the Excel Web App only edits data.
Of course if you have Excel you’d rather use it to insert and modify charts. It’s a lot more capable than any browser app is likely to be anytime soon. So it’s pretty sweet to be able to open the cloud-based Excel spreadsheet, edit locally, and then save to the web. A related limitation of the Google spreadsheet is that you lose charts when you download to, or upload from, Excel.
Another key difference: The Excel Web App currently lacks an API like the one Google provides. I really hope that the Excel Web App will grow an OData interface. In this comment at social.answers.microsoft.com, Christopher Webb cogently explains why that matters:
The big advantage of doing this [OData] would be that, when you published data to the Excel Web App, you’d be creating a resource that was simultaneously human-readable and machine-readable. Consider something like the Guardian Data Store (http://www.guardian.co.uk/data-store): their first priority is to publish data in an easily browsable form for the vast majority of people who are casual readers and just want to look at the data on their browsers, but they also need to publish it in a format from which the data can be retrieved and manipulated by data analysts. Publishing data as html tables serves the first community but not the second; publishing data in something like SQL Azure would serve the second community and not the first, and would be too technically difficult for many people who wanted to publish data in the first place.
The Guardian are using Google docs at the moment, but simply exporting the entire spreadsheet to Excel is only a first step to getting the data into a useful format for data analysts and writing code that goes against the Google docs API is a hassle. That’s why I like the idea of exposing tables/ranges through OData so much: it gives you access to the data in a standard, machine-readable form with minimal coding required, even while it remains in the spreadsheet (which is essentially a human-readable format). You’d open your browser, navigate to your spreadsheet, click on your table and you’d very quickly have the data downloaded into PowerPivot or any other OData-friendly tool.
Some newspapers may be capable of managing all of their data in SQL databases, and publishing from there to the web. For them, an OData interface to the database would be all that’s needed to make the same data uniformly machine-readable. But for most newspapers — including even the well funded and technically adept Guardian — the path of least resistance runs through spreadsheets. In those cases, it’ll be crucial to have online spreadsheets that are easy for both humans and machines to read.