My guest for this week’s ITConversations show is Greg Elin, chief data architect with the Sunlight Foundation. Founded in 2006, the Sunlight Foundation aims to make the operation of Congress and the U.S. government more transparent and accountable. There are lots of obvious reasons why that’s a good thing. Greg adds a non-obvious reason that I hadn’t heard and find compelling:
I increasingly feel that the reason for Congressional hearings to be open and recorded and annotated is market efficiency. The Fed does not announce what it’s going to do with interest rates until it announces it to everybody. But is that the case for the rest of Congress and legislation? If I can afford to have a full-time lobbyist going to the committee meetings, don’t I have an inside track? Can’t I arbitrage my market investments based on that? It’s a question of market efficiency.
That was one of the moments in this conversation where I stopped and said: Wow, great point. Here’s another. We were talking about the difficulty of organizing information from disparate sources based on unique identifiers, whether for individual legislators or for sections and paragraphs of legislation. Greg made this excellent point:
As technologists, we forget how much we’ve gamed the system from the beginning in setting up our tools. That Ethernet card comes with a hardcoded ID, and it’s unique, but it took us a long time to get there, and it required the cooperation of a lot of people to make it work.
Having surveyed a wide range of government data sources, Greg’s conclusion is that the future is already here, but not yet evenly distributed. There are pockets within the government where data management practices are excellent, and large swaths where they are mediocre to horrible. The Sunlight Foundation has an interesting take on how to bootstrap better data practices across the board. By demonstrating them externally, in compelling ways, you can incent the government to internalize them:
The Sunlight Foundation made a grant to OMBWatch, which put together fedspending.org, and as that was happening the Coburn-Obama bill was passed, which basically said that the OMB had to put together the same type of website. If the Sunlight Foundation — and other organizations like the Heritage Foundation and Porkbusters — if we had not been doing a collaborative project at the time around earmarks, and at the same time working with OMBWatch on fedspending.org, I think that there wouldn’t have been the drumbeat pressure for the government to make this information available.
Later the conversation turned to data integrity and data provenance. What I mean by integrity, here, is the sort of question raised by my Hans Rosling wannabe screencast in which I observe that town-reported crime statistics rolled up to a statewide total don’t agree with state crime statistics as seen from a national perspective. Greg has a similar example:
Everything that CRP [Center for Responsive Politics] tracks is on a two-year election cycle. But OMBWatch is tracking contracts, and Taxpayers for Common Sense is tracking earmarks, on a budget-year cycle. So things don’t necessarily line up.
There’s never going to be an easy way to make these different gears mesh. But until now, we’ve never had any way to see exactly how they don’t mesh, and to factor that into our thinking. That’s one of the subtler effects of transparency.
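To see concretely why the gears don’t mesh, consider the same transaction date bucketed two ways. This is a minimal sketch, not any organization’s actual code: the cycle and fiscal-year rules are the standard ones (two-year election cycles ending in even years; the federal fiscal year starting October 1), but the function names are my own.

```python
from datetime import date

def election_cycle(d: date) -> str:
    # Two-year election cycles end in even years: 2005 and 2006
    # both fall in the "2006" cycle.
    end = d.year if d.year % 2 == 0 else d.year + 1
    return f"{end} cycle"

def fiscal_year(d: date) -> str:
    # The U.S. federal fiscal year starts October 1 of the prior
    # calendar year: October 2005 is already FY2006.
    fy = d.year + 1 if d.month >= 10 else d.year
    return f"FY{fy}"

a = date(2005, 10, 15)
b = date(2006, 11, 15)
print(election_cycle(a), fiscal_year(a))  # 2006 cycle FY2006
print(election_cycle(b), fiscal_year(b))  # 2006 cycle FY2007
```

Two records from the same election cycle land in different fiscal years, so totals rolled up by one calendar will never reconcile cleanly against totals rolled up by the other.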
Another is the possibility of a more complete view of data provenance — that is, where it comes from, and how it’s transformed along the way. Influenced by Jeff Jonas’ notions of sequence neutrality and data tethering, Greg envisions an open protocol for what he calls continuous data analysis:
If we can get an open protocol for reporting what we find in data, you’re beginning to make explicit the transformations that you apply. What I need to be able to do here at Sunlight, and what all of us working with public data need to be able to do, is instantly reprocess data that we’ve already processed, because any data we get is going to be missing something. If someone decides to change a taxonomy term, you ought to be able to rerun the data at every level with that new taxonomy term.
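The idea Greg describes can be sketched in a few lines: keep the raw records untouched, make categorization a pure function of the data plus the current taxonomy, and a taxonomy change becomes a rerun rather than a hand-patch of old output. This is only an illustration of the principle, with invented record fields and taxonomy terms, not the protocol Greg envisions.

```python
# Raw records stay immutable; everything derived from them is replayable.
raw_records = [
    {"id": 1, "payee": "Acme Corp", "code": "541"},
    {"id": 2, "payee": "Widgets LLC", "code": "237"},
]

def categorize(records, taxonomy):
    # Pure function of (raw data, taxonomy): same inputs, same outputs,
    # so reprocessing is just calling it again with the new taxonomy.
    return [
        {**r, "category": taxonomy.get(r["code"], "uncategorized")}
        for r in records
    ]

taxonomy_v1 = {"541": "services", "237": "construction"}
taxonomy_v2 = {"541": "professional services", "237": "construction"}

v1 = categorize(raw_records, taxonomy_v1)
v2 = categorize(raw_records, taxonomy_v2)  # instant reprocessing at every level
```

Because no transformation is baked into the stored data, renaming one taxonomy term updates every derived view consistently, which is the property Greg is after.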
This was an excellent conversation. Thanks, Greg!