I’m making some progress in my quest to improve access to (and interpretation of) local public data. Yesterday’s meeting with the police department yielded a couple of spreadsheets — one with five-year historical data, and one with recent incident reports. The latter includes addresses which enabled me to plot the incidents in Virtual Earth.
It has been a while since I’ve done this, and the technology has matured. GeoRSS, in particular, seems to be a fairly new thing in the world. It’s a simple idea: use RSS (or Atom) to package sets of locations, encapsulating latitude and longitude coordinates in the GeoRSS namespace. Here’s the GeoRSS file I built from the police spreadsheet.
In poking around online in order to learn how to use GeoRSS, I ran across a familiar name: Jef Poskanzer. For many years I have been enlightened by Jef’s various experiments at acme.com. Way back in 1997, for example, I was using his implementation of Java servlets to explore that way of building online services. So I was delighted to see Jef’s name pop up again when I looked into GeoRSS.
On his GeoRSS page you can plug in the URL of a GeoRSS file and his service will map it for you in either Google Maps or Yahoo! Simple Maps. Nice! Jef’s page doesn’t include Virtual Earth as an option, but it also now supports GeoRSS so this was a good opportunity to try out that combination. It was, as you’d expect, quite easy to do. Given a well-formed GeoRSS file, all of the modern mapping APIs require very little of a developer who wants to spray the locations in the file onto a map.
But as I was reminded when going through this exercise, it requires a whole lot of work to transform a typical real-world document like the one I received yesterday — an Excel spreadsheet with manually-typed addresses — into a well-formed GeoRSS file. Data preparation is always the bottleneck.
Reflecting on how I got the job done, it’s amazing to consider the number and diversity of tools that I used. A partial list includes Excel for massaging and sorting data, Python for various bits of transfomational glue, and curl to pump addresses through an online geocoder.
I also leveraged a ton of tacit knowledge about the web, about XML, about regular expressions, and about the organization and display of data.
It’s always striking to me how we technical folk tend to focus on the endgame. “Look, ma, no hands! Just plug your [insert newest format] into your [insert newest tool] and it’s automatic!”
Here’s one small example of the difficulties we sweep under the rug. Consider a series of incidents at these addresses:
27 Damon Ct. 27 Damon Ct. 35 Castle St. 35 Castle St. 45 Damon Ct. 45 Damon Ct. 165 Castle St. 165 Castle St.
That’s how the address column will sort in a typical spreadsheet. But that’s not how you’d like to scan the legend in a mashup. To help visualize neighborhood patterns, you’d rather see something like:
Damon Ct. (27) Damon Ct. (27) Damon Ct. (45) Damon Ct. (45) Castle St. (35) Castle St. (35) Castle St. (165) Castle St. (165)
In reality it’s more complicated because I’ve omitted apartment numbers, dates, and annotations. The organization of these elements has a profound influence on which kinds of visualizations tend to come for free, and which will require a lot of extra work.
Now, because I’m handy with data, with text processing, and with regular expressions, I know how to reorganize the raw data. And because I have a view of the endgame, I know why to do it. But none of this is evident to a normal person sitting in an office compiling incident reports into an Excel log.
It seems, though, that we should now be able to normalize this kind of data entry in a way that would maximize the reuse value of the data. If I can feed random addresses into a geocoder and pretty reliably get back coordinates, I should also be able to feed unstructured addresses into some other online service and get back well-structured addresses. And I should be able to equip Excel to use that service to ensure that the structured addresses are logged with incidents. Is there a recipe for doing that?