Daily Archives: February 20, 2008

Overcoming data friction

This headline from Adrian Holovaty’s blog speaks volumes about the state of online data in 2008: EveryBlock hiring a Python screen-scraping expert. The recently-launched EveryBlock, a generalization of ChicagoCrime.org, extends that model to other cities and to a broader range of data types. I interviewed Adrian this week for an upcoming ITConversations show, and he confirmed that while some structured data sources are available from the first three EveryBlock cities — Chicago, San Francisco, and New York — the bulk of the data comes from scraping web pages.

One day soon, the person who lands that job will find himself or herself having this conversation at a cocktail party:

Friend: So, what do you do in this new job?

Screen Scraper: I write software to extract data from websites.

F: Where does the data come from?

S: It’s in a database. The website’s software reads the database and turns it into web pages.

F: So somebody got paid to write software to turn the database into web pages, and now you’re getting paid to write software that turns those web pages back into a database?

S: Yeah, basically.

F: So if they just gave you the database you’d be out of a job?

S: No. I’d have a much more interesting job. I’d be able to spend more time finding useful patterns in the data, and writing software to enable other people to find useful patterns in the data.

The irony is that I’d be great at that job. For me, web screen-scraping provides the kind of challenge that other people get from, say, solving crossword puzzles. But it’s not the highest and best use of anyone’s time.

Data friction can be intentional or not. When it’s intentional, you might have to file a FOIA request to get the data. But in a lot of cases, it’s unintentional. The data is public, and intended to be widely seen and used, but isn’t readily reusable.
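To make the round trip in that cocktail-party conversation concrete, here is a minimal sketch of what the screen-scraper's software does: it takes rows that were rendered from a database into HTML and turns them back into records. The markup, the restaurant names, and the `TableScraper` and `scrape` names are all invented for illustration. A real page would be messier and would need its own parsing rules.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this example. It stands in for a page
# that some website generated from its own database.
SAMPLE_HTML = """
<table>
  <tr><th>Restaurant</th><th>ZIP</th><th>Violation</th></tr>
  <tr><td>Example Deli</td><td>10001</td><td>Unapproved shellfish</td></tr>
  <tr><td>Sample Cafe</td><td>10002</td><td>Food not held at proper temperature</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape(html):
    """Turn the first row into field names and the rest into dicts."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    keys = [name.lower() for name in header]
    return [dict(zip(keys, row)) for row in body]


records = scrape(SAMPLE_HTML)
```

Every line of this exists only to undo what the publishing software did. If the site changes its layout, the scraper breaks, even though the underlying data has not changed at all.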

Consider the following two restaurant inspection records for Bully’s Deli in New York:

1. in the NYC Department of Health website

2. in EveryBlock

It’s the same data, from the same source, but EveryBlock makes better use of it. On the NYC website, you can search by ZIP code and number of violations. In EveryBlock you can search more powerfully, and you can ask and answer questions that matter to you. Maybe you care about shellfish. Have any Manhattan restaurants been cited recently for use of unapproved shellfish? Yes: five since January 21.
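Once the records are in structured form, a question like the shellfish one becomes a short query. The sketch below assumes a hypothetical `inspections` table with made-up rows; the schema, the names, and the dates are placeholders, not EveryBlock's actual data model or real inspection results.

```python
import sqlite3

# In-memory database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspections ("
    "restaurant TEXT, borough TEXT, inspected TEXT, violation TEXT)"
)

# Invented sample rows, for illustration only.
sample_rows = [
    ("Example Deli", "Manhattan", "2008-02-01", "Unapproved shellfish"),
    ("Sample Cafe", "Manhattan", "2008-01-10", "Unapproved shellfish"),
    ("Placeholder Grill", "Brooklyn", "2008-02-05", "Unapproved shellfish"),
    ("Demo Diner", "Manhattan", "2008-02-10", "Evidence of mice"),
]
conn.executemany("INSERT INTO inspections VALUES (?, ?, ?, ?)", sample_rows)


def shellfish_citations(conn, borough, since):
    """Return (restaurant, date) pairs for shellfish violations in a borough
    on or after the given ISO date."""
    cursor = conn.execute(
        "SELECT restaurant, inspected FROM inspections "
        "WHERE borough = ? AND inspected >= ? AND violation LIKE ? "
        "ORDER BY inspected",
        (borough, since, "%shellfish%"),
    )
    return cursor.fetchall()


hits = shellfish_citations(conn, "Manhattan", "2008-01-21")
```

With the sample rows above, only the one Manhattan shellfish citation dated after January 21 comes back. The point is that the question costs a few lines once the data is reusable, and nothing about it depends on how a web page happened to be laid out.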

What EveryBlock is doing is completely aligned with the interests of the NYC Department of Health. Tax dollars are paying for those restaurant inspections. The information is published in order to make New York a safer and healthier place. It’s great to have this data available in any form, and it’s great to see EveryBlock adding value to it.

Now it’s time to grease the wheels.

Here’s one way that can happen. An enlightened city government can decide to publish this kind of data in a reusable way. I’ve written extensively about Washington DC’s groundbreaking DCStat program, which does exactly that. I can’t wait to see what happens when EveryBlock goes to Washington.

But city governments shouldn’t have to go out of their way to provide web-facing data services and feeds. Databases should natively support them. That’s the idea behind Astoria (ADO.NET Data Services), which is discussed in this interview with Pablo Castro. If the NYC Department of Health had that kind of access layer sitting on top of its database, it wouldn’t put EveryBlock’s screen-scraper out of a job; it would just make that job a whole lot more interesting and effective.