This headline from Adrian Holovaty’s blog speaks volumes about the state of online data in 2008: EveryBlock hiring a Python screen-scraping expert. The recently-launched EveryBlock, a generalization of ChicagoCrime.org, extends that model to other cities and to a broader range of data types. I interviewed Adrian this week for an upcoming ITConversations show, and he confirmed that while some structured data sources are available from the first three EveryBlock cities — Chicago, San Francisco, and New York — the bulk of the data comes from scraping web pages.
One day soon, the person who lands that job will find himself or herself having this conversation at a cocktail party:
Friend: So, what do you do in this new job?
Screen Scraper: I write software to extract data from websites.
F: Where does the data come from?
S: It’s in a database. The website’s software reads the database and turns it into web pages.
F: So somebody got paid to write software to turn the database into web pages, and now you’re getting paid to write software that turns those web pages back into a database?
S: Yeah, basically.
F: So if they just gave you the database you’d be out of a job?
S: No. I’d have a much more interesting job. I’d be able to spend more time finding useful patterns in the data, and writing software to enable other people to find useful patterns in the data.
The irony is that I’d be great at that job. For me, web screen-scraping provides the kind of challenge that other people get from, say, solving crossword puzzles. But it’s not the highest and best use of anyone’s time.
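For the curious, here is roughly what that kind of work looks like. This is a minimal sketch only, assuming a hypothetical listings page with a simple HTML table; the URL and markup are made up, and real pages are invariably messier:

```python
# Sketch: turn an HTML table of inspection results back into records.
# The URL and the page structure are hypothetical.
import urllib.request
from bs4 import BeautifulSoup

url = "http://example.gov/inspections?zip=10003"   # hypothetical endpoint
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr")[1:]:                # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 3:
        records.append({
            "restaurant": cells[0],
            "date": cells[1],
            "violation": cells[2],
        })

print(records)
```

Every site needs its own variant of this, and the variants break whenever the markup changes. That is the friction.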
Data friction can be intentional or not. When it’s intentional, you might have to file a FOIA request to get it. But in a lot of cases, it’s unintentional. The data is public, and intended to be widely seen and used, but isn’t readily reusable.
Consider the following two restaurant inspection records for Bully’s Deli in New York:
1. in the NYC Department of Health website
2. in EveryBlock
It’s the same data, from the same source, but EveryBlock makes better use of it. In the NYC website, you can search by ZIP code and number of violations. In EveryBlock you can search more powerfully, and you can ask and answer questions that matter to you. Maybe you care about shellfish. Have any Manhattan restaurants been cited recently for use of unapproved shellfish? Yes: five since January 21.
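Once the data is structured, that shellfish question reduces to a trivial filter. Here is a sketch, assuming the inspections have already been boiled down to simple records; the field names and the sample entry are hypothetical:

```python
# Sketch: with structured records, the shellfish question is a simple filter.
# Field names and the sample record are hypothetical.
from datetime import date

inspections = [
    {"borough": "Manhattan", "date": date(2008, 2, 4),
     "violation": "use of unapproved shellfish"},
    # ... more records ...
]

hits = [r for r in inspections
        if r["borough"] == "Manhattan"
        and r["date"] >= date(2008, 1, 21)
        and "shellfish" in r["violation"]]

print(len(hits), "Manhattan citations since January 21")
```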
What EveryBlock is doing is completely aligned with the interests of the NYC Department of Health. Tax dollars are paying for those restaurant inspections. The information is published in order to make New York a safer and healthier place. It’s great to have this data available in any form, and it’s great to see EveryBlock adding value to it.
Now it’s time to grease the wheels.
Here’s one way that can happen. An enlightened city government can decide to publish this kind of data in a reusable way. I’ve written extensively about Washington DC’s groundbreaking DCStat program, which does exactly that. I can’t wait to see what happens when EveryBlock goes to Washington.
But city governments shouldn’t have to go out of their way to provide web-facing data services and feeds. Databases should natively support them. That’s the idea behind Astoria (ADO.NET Data Services), which is discussed in this interview with Pablo Castro. If the NYC Department of Health had that kind of access layer sitting on top of its database, it wouldn’t put EveryBlock’s screen-scraper out of a job; it would just make that job a whole lot more interesting and effective.
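To make the contrast concrete, here is roughly what consuming that kind of access layer might look like, instead of scraping HTML. The service URL is hypothetical, and the query syntax just follows the general Astoria convention as I understand it:

```python
# Sketch: querying a hypothetical Astoria-style (ADO.NET Data Services) endpoint
# instead of scraping HTML. The service URL is made up; the $filter and $format
# options follow the convention described in the Astoria design.
import json
import urllib.request
from urllib.parse import quote

base = "http://example.gov/inspections.svc/Violations"   # hypothetical service
query = "$filter=" + quote("Borough eq 'Manhattan' and Code eq 'UNAPPROVED_SHELLFISH'")

with urllib.request.urlopen(base + "?" + query + "&$format=json") as resp:
    data = json.load(resp)

# The exact payload shape depends on the service; the point is that this is
# structured data, not HTML that has to be reverse-engineered.
print(data)
```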
Adam Bosworth has written and spoken about this a bunch. I think GData was supposed to be an answer to this problem. Doesn’t seem like it took off, though.
Hey Jon … I love the term “Data Friction” …
I’d be interested in your views on my attempt to reduce the friction.
http://www.screencast.com/t/xhH3CwoRKTW
Jon,
I agree; I think the term “data friction” is superb and, with permission, would like to use it, with attribution, in some of my own analyses.
This is also a remarkably good plug for the use of standardized serialization formats, XML or JSON in particular. The challenge with the relational database model is that serialization was considered an afterthought, and as such, the opportunity to standardize on such a format was lost early. It was really only once people started passing XML around that its potential as a mechanism for data serialization re-raised the question of how to get data (real or virtual) from a data store to a data consumer.
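To make that concrete, here is a minimal sketch of the kind of serialization step being described; the SQLite file and column names are hypothetical:

```python
# Sketch: serializing relational rows into JSON, the kind of step that ought
# to be routine. The database file, table, and columns are hypothetical.
import json
import sqlite3

conn = sqlite3.connect("inspections.db")          # hypothetical database file
conn.row_factory = sqlite3.Row
rows = conn.execute("SELECT restaurant, date, violation FROM inspections")

print(json.dumps([dict(r) for r in rows], indent=2))
```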
Screen scraping, to me, is much akin to extreme sports: fraught with legal and moral peril, appealing to those with more cojones than brains, and increasingly unnecessary. That it is still somewhat necessary attests, as you point out, to the rather sorry state of data serialization on the web.
Jon –
This could be solved, at least in theory, by having cities write contracts (and RFPs) that specify performance requirements for their vendors, including APIs for civic data.
I don’t know what that RFP language would look like, though.
Ed
I thought the post made some good points on screen scrapers. I use Python for simple things, but for larger projects I used extractingdata.com’s screen scraper software, which worked great; they build custom screen scrapers and data extraction programs.