Last week Kevin Curry dug into some data about school violence in his district. In this case the data was made available as HTML, which means it was sort-of-but-not-really published on the web. Kevin writes:
Whenever I come across data like this the first thing I want to know is whether or not it can actually be used as data. In order to be used/usable as data the contents of this HTML table need to be, at minimum, copy-and-paste-able into a spreadsheet.
Or, alternatively, the HTML table needs to be parseable as data. In this case, I was surprised to find that a couple of tools I normally use to do that parsing — Dabble DB and Excel — didn’t work. That’s because Kevin’s target page doesn’t include a static HTML table. It’s dynamic instead: First you select a district, then the table appears. This mechanism defeats tools that try to parse data from HTML tables, so it’s a bad way to publish data that you want to be available as data.
Lacking the option to parse the HTML table, Kevin’s only choice was to copy and paste. That’s clumsy, and you have to be really motivated to do it, but it can be done. Here’s the Google spreadsheet Kevin made from the data he copied and pasted. And here’s the same stuff as an Excel Web App.
If you haven’t tried out the new Excel Web App, by the way, it’s interesting to compare the two. One key difference, at least from my point of view, is — not surprisingly — the Excel Web App’s ability to roundtrip with Excel. A Google spreadsheet is, at this point, more functional in standalone mode. While you can edit both a Google spreadsheet and an Excel Web App in the browser, for example, the Google spreadsheet can insert and modify charts, whereas the Excel Web App only edits data.
Of course if you have Excel you’d rather use it to insert and modify charts. It’s a lot more capable than any browser app is likely to be anytime soon. So it’s pretty sweet to be able to open the cloud-based Excel spreadsheet, edit locally, and then save to the web. A related limitation of the Google spreadsheet is that you lose charts when you download to, or upload from, Excel.
Another key difference: The Excel Web App currently lacks an API like the one Google provides. I really hope that the Excel Web App will grow an OData interface. In this comment at social.answers.microsoft.com, Christopher Webb cogently explains why that matters:
The big advantage of doing this [OData] would be that, when you published data to the Excel Web App, you’d be creating a resource that was simultaneously human-readable and machine-readable. Consider something like the Guardian Data Store (http://www.guardian.co.uk/data-store): their first priority is to publish data in an easily browsable form for the vast majority of people who are casual readers and just want to look at the data on their browsers, but they also need to publish it in a format from which the data can be retrieved and manipulated by data analysts. Publishing data as html tables serves the first community but not the second; publishing data in something like SQL Azure would serve the second community and not the first, and would be too technically difficult for many people who wanted to publish data in the first place.
The Guardian are using Google docs at the moment, but simply exporting the entire spreadsheet to Excel is only a first step to getting the data into a useful format for data analysts and writing code that goes against the Google docs API is a hassle. That’s why I like the idea of exposing tables/ranges through OData so much: it gives you access to the data in a standard, machine-readable form with minimal coding required, even while it remains in the spreadsheet (which is essentially a human-readable format). You’d open your browser, navigate to your spreadsheet, click on your table and you’d very quickly have the data downloaded into PowerPivot or any other OData-friendly tool.
Some newspapers may be capable of managing all of their data in SQL databases, and publishing from there to the web. For them, an OData interface to the database would be all that’s needed to make the same data uniformly machine-readable. But for most newspapers — including even the well funded and technically adept Guardian — the path of least resistance runs through spreadsheets. In those cases, it’ll be crucial to have online spreadsheets that are easy for both humans and machines to read.
Thanks for the mention! In my opinion if the Excel Web App were able to expose the contents of tables and ranges as OData, as well as making all kinds of cool things possible it would also be a massive boost towards the uptake of OData as a standard. I hope the Excel Web App team are reading…
Re the Guardian datastore, by using Google spreadsheets they are making a queryable interface available to the data, as e.g. this example shows:
http://blog.ouseful.info/2010/06/25/guardian-datastore-mps-expenses-spreadsheet-as-a-database/
Absolutely. If one is willing to write code against the Google spreadsheet API, as you have done in that example, then a queryable interface becomes available to everyone.
What Chris is envisioning is the no-code version of that. An OData interface would enable all of the following without writing code:
– Hit an endpoint with a browser
– Discover available data sources and metadata
– Navigate within sets of related records
– Form queries that yield subsets of records
– Read the data into any OData-aware client
and then also, of course, what you have done here, namely:
– Write code to create a customized interface to the data
SO the no-code version I started trying to explore with this v clunky proof of concept:
http://ouseful.open.ac.uk/datastore/gspreadsheetdb4.php
The selection list is populated from a delicious feed for a tag that i can use to bookmark google spreadsheets.
Alternatively, you can just paste in a URL to a google spreadsheet.
If the data is simply arranged in the spreadsheet (essentially, a simple regular 2D grid that would export nicely to CSV!) then the explorer will help you write and run queries against the spreadsheet, and get URLs that point to HTML previews of the result, CSV exports etc
So the URL pattern for dumping from a sheet into the browser is:
http://spreadsheets.google.com/tq?tqx=out:html&tq=select%20*&key=r0marU6B-vs4fwjRzV1gDDg
Thanks! I’m putting it here as a mental placeholder for myself.
Although we data geeks can see the value of publishing data as data, I think that benefit is lost on many of the people who are creating and publishing it. We need to come up with incentives that give these authors a selfish reason to open the data. With our Exhibit framework (http://simile-widgets.org/exhibit), that incentive is rich visualization: if you use Exhibit, you get to show your data in maps, timelines, sortable lists, etc. The resulting exposure of the data is a side effect that the authors don’t care about. I discussed this point in more detail in a post last week: http://groups.csail.mit.edu/haystack/blog/2010/05/31/on-a-few-deadly-data-sins-and-the-entropy-of-open-data/
The resulting exposure of the data is a side effect that the authors don’t care about.
Of course rich visualization doesn’t necessarily entail data release. The NYTimes infographics are a prime example of that.
Still, great point! I would like all visualizers to be capable of disgorging their backing data in a standard way. Then choosing to do it or not is just a policy decision and a click.
Right, visualization doesn’t have to entail data release, and some authors may specifically want to hide their data (see above blog post). But as developers, we could and did choose a design that releases data by default. Such a design leverages the large mass of apathetic users who don’t care enough about the open data question to change the default. Making this design choice is a powerful way for developers of visualization tools to contribute to the spread of open data
Jon,
Thanks for picking up the thread, Chris and I will keep hammering on the door until we can break it down. In the meantime Chris has done a follow-up blog post for anyone that might be interested.
The Excel Web App and its missing API
(http://cwebbbi.spaces.live.com/blog/cns!7B84B0F2C239489A!7691.entry)
-Jamie
Damn those crappy Spaces URIs :(
I came across this DIDO an MIT project. http://projects.csail.mit.edu/exhibit/Dido
which lets it users to customize and edit data!!
Jon,
I stumbled across this post!
My scenario is more easily solved if the publisher publishes a well-formed XHTML document instead of mal-formed HTML. Such a page can be dynamically generated, too. That is what I truly seek. If the table is well-formed XHTML then we can transform the data into many more views than can be provided with charts widgets, alone. Although charts would not be as simple to create.
But switch gears for a moment and consider the perspective of analyst/journalist/researcher who daily encounters tables and other (seemingly) structured data in documents; HTMLs, PDFs, Word docs…
When I say, “…at minimum, copy-and-paste-able into a spreadsheet” I mean to suggest (if poorly so) that this how analysts/journalists/researchers often operate. And copy-and-paste is all we should expect in order for analysts/journalists/researchers to “reverse engineer” data from documents in order to be usable again as data, i.e., in formulae & charts.
Does that make sense?
Ultimately, my view is that if a newspaper is going to be in the data and data visualization publishing business it should publish structured data, not just HTML documents and images.
Best,
Kevin
Surprised that YQL hasn’t been mentioned at all here – it allows you to query web-based resources in a SQL-like fashion:
http://developer.yahoo.com/yql/
A question that arises, is: should we start teaching analysts, journalists, and researchers YQL? Some are already leaning in that direction: http://bit.ly/bhfnXJ
Check out my open source project Sheetster!:
http://sf.net/projects/sheetster
Or use it right now:
http://sheetster.com
It’s not only an open source Web Spreadsheet (yes!), but it round-trips with Excel really well as it is based upon the most robust Java Excel SDK out there — ExtenXLS.
The best part about Sheetster is that it is Open Source and was built for developers, so the REST api is awesome (if I do say so myself,) and it includes stuff like JSON data mapping (into spreadsheet cells) and automating Docs with a JavaScript API.
Here’s a link to a document on automation using the REST api:
http://extentech.com/nimbledocs/12252/2009/July/14/using_the_docsfree_sheetster_rest_api
This is some superior technology from a company that has been doing Java Spreadsheets way before Google docs was invented, and we are committed to Open Source which benefits all of us.
Thanks for mentioning this.The Excel Web App is able to show the contents of tables and ranges as OData. Good one…
http://godwinsblog.cdtech.in/2010/12/microsoft-office-excel-add-in-plug-in.html
Any chance, Jon, you could comment on the Dabble DB shutdown? There are a whole bunch of folks feeling abandoned and angry.
Hey team, DabbleDB shutdown is a sad thing to hear. It is an argument for having a data-driven solution that you can “take with you”.
The open source Sheetster web spreadsheet is not only available for immediate download to run as a server on your own or virtual machine, but it can be launched in minutes from StandingCloud.com.
Seriously, open source and virtualized servers mean you should never have to host your data in a vulnerable location again… nor should you have to settle for using a public service.
Open source services in the private cloud folks. Check it out.
-john