What’s easy, and what’s hard, about getting from Excel to a GeoRSS-enabled mashup

I’m making some progress in my quest to improve access to (and interpretation of) local public data. Yesterday’s meeting with the police department yielded a couple of spreadsheets — one with five-year historical data, and one with recent incident reports. The latter includes addresses which enabled me to plot the incidents in Virtual Earth.

It has been a while since I’ve done this, and the technology has matured. GeoRSS, in particular, seems to be a fairly new thing in the world. It’s a simple idea: use RSS (or Atom) to package sets of locations, encapsulating latitude and longitude coordinates in the GeoRSS namespace. Here’s the GeoRSS file I built from the police spreadsheet.

In poking around online in order to learn how to use GeoRSS, I ran across a familiar name: Jef Poskanzer. For many years I have been enlightened by Jef’s various experiments at acme.com. Way back in 1997, for example, I was using his implementation of Java servlets to explore that way of building online services. So I was delighted to see Jef’s name pop up again when I looked into GeoRSS.

On his GeoRSS page you can plug in the URL of a GeoRSS file and his service will map it for you in either Google Maps or Yahoo! Simple Maps. Nice! Jef’s page doesn’t include Virtual Earth as an option, but it also now supports GeoRSS so this was a good opportunity to try out that combination. It was, as you’d expect, quite easy to do. Given a well-formed GeoRSS file, all of the modern mapping APIs require very little of a developer who wants to spray the locations in the file onto a map.

But as I was reminded when going through this exercise, it requires a whole lot of work to transform a typical real-world document like the one I received yesterday — an Excel spreadsheet with manually-typed addresses — into a well-formed GeoRSS file. Data preparation is always the bottleneck.

Reflecting on how I got the job done, it’s amazing to consider the number and diversity of tools that I used. A partial list includes Excel for massaging and sorting data, Python for various bits of transfomational glue, and curl to pump addresses through an online geocoder.

I also leveraged a ton of tacit knowledge about the web, about XML, about regular expressions, and about the organization and display of data.

It’s always striking to me how we technical folk tend to focus on the endgame. “Look, ma, no hands! Just plug your [insert newest format] into your [insert newest tool] and it’s automatic!”

Here’s one small example of the difficulties we sweep under the rug. Consider a series of incidents at these addresses:

27 Damon Ct.
27 Damon Ct.
35 Castle St.
35 Castle St.
45 Damon Ct.
45 Damon Ct.
165 Castle St.
165 Castle St.

That’s how the address column will sort in a typical spreadsheet. But that’s not how you’d like to scan the legend in a mashup. To help visualize neighborhood patterns, you’d rather see something like:

Damon Ct. (27)
Damon Ct. (27)
Damon Ct. (45)
Damon Ct. (45)
Castle St. (35)
Castle St. (35)
Castle St. (165)
Castle St. (165)

In reality it’s more complicated because I’ve omitted apartment numbers, dates, and annotations. The organization of these elements has a profound influence on which kinds of visualizations tend to come for free, and which will require a lot of extra work.

Now, because I’m handy with data, with text processing, and with regular expressions, I know how to reorganize the raw data. And because I have a view of the endgame, I know why to do it. But none of this is evident to a normal person sitting in an office compiling incident reports into an Excel log.

It seems, though, that we should now be able to normalize this kind of data entry in a way that would maximize the reuse value of the data. If I can feed random addresses into a geocoder and pretty reliably get back coordinates, I should also be able to feed unstructured addresses into some other online service and get back well-structured addresses. And I should be able to equip Excel to use that service to ensure that the structured addresses are logged with incidents. Is there a recipe for doing that?

9 Comments

  1. Commercial tools are available to do this, such as Trillium from Harte-Hanks. The software is expensive and requires frequent data updates to stay current with US and foreign addresses.

  2. Hi Jon… I also wonder how using geoRSS (or KML) this way scales for a high volume of points. Google Maps lets you “search” for a KML file (putting it in the querystring) and the features in it will be automatically be mapped. But suppose a KML/geoRSS file has thousands of elements in it or is a multi-MB file… it’s inefficient to load all that (esp. into the client) if the user’s current zoom/pan state only makes a fraction of them visible or relevant. It seems managing a high number of points/lines/regions dynamically as the user navigates will be a key concern for all but the most trivial of samples.

    Anyway, your current writing on public data (and the geocoding of it) is very interesting to follow. Thanks!

  3. GeoPY might come in handy for the actual geocoding of addresses:

    http://exogen.case.edu/projects/geopy/

    “geopy makes it easy for developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other sources of data, such as wikis.”

    See the four line examples on that page for geocoding using google, yahoo or virtual earth. There are also some useful functions which make it easy to find the distance in miles/km between two lat/long points.

    Take care,
    -Brian

  4. I took your incidents.xml file and processed it through Yahoo Pipes. I ignored the lat/long data you supplied just to see if Pipes and Yahoo’s geocoding api could make sense of the address (“Keene” AND “New Hampshire” hardcoded).
    http://pipes.yahoo.com/pipes/pipe.info?_id=bEi72dM23BGmLPTj1vC6Jw
    I don’t know how accurate and inclusive the results are, although the results I did check on the map looked pretty accurate.
    Pipes could possibly take the process a step further back, since it accepts csv input (I’ve not tried this yet). Alternatively, it would take the published rss feed of a Google spreadsheet

  5. Jon,

    I’ve been playing with this for a while, and started coding a bunch of Popfly blocks to do the data mapping and transformation. While you and I could enter regexps into the blocks to get the outputs we want, we can pretty easily build transformation editor blocks and so forth. Pipe block to block and then to the Popfly VE block.

    We’re getting there, slowly getting there.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s