OData for collaborative sense-making

OData, the Open Data Protocol, is described at odata.org:

The Open Data Protocol (OData) is a web protocol for querying and updating data. OData applies web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores.

The other day, Pablo Castro wrote an excellent post explaining how developers can implement aspects of the modular OData spec, and outlining some benefits that accrue from each. One of the aspects is query, and Pablo gives this example:

http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=zipcode eq 20007

One benefit of exposing query to developers, Pablo says, is:

Developers using the Data Services client for .NET would be able to use LINQ against your service, at least for the operators that map to the query options you implemented.

I’d like to suggest that there’s a huge benefit for users as well. Consider Pablo’s example, based on some Washington, DC datasets published using the Open Government Data Initiative toolkit. Let’s look at one of those datasets, BankLocations, through the lens of Excel 2010’s PowerPivot.

PowerPivot adds heavy-duty business analytics to Excel in ways I’m not really qualified to discuss, but for my purposes here that’s beside the point. I’m just using it to show what it can be like, from a user’s perspective, to point an OData-aware client, which could be any desktop or web application, at an OData source, which could be provided by any backend service.

In this case, I pointed PowerPivot at the following URL:


I previewed the Atom feed, selected a subset of the columns, and imported them into a pivot table. I used slicers to help visualize the zipcodes associated with each bank. And I wound up with a view which reports that there are three branches of WashingtonFirst Bank in DC, at three addresses, in two zipcodes.

If I were to name this worksheet, I’d call it WashingtonFirst Bank branches in DC. But it has another kind of name, one that’s independent of the user who makes such a view, and of the application used to make it. Here is that other name:

http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=name eq 'WashingtonFirst Bank'

If you and I want to have a conversation about banks in Washington, DC, and if we agree that this dataset is an authoritative list of them, then we — and anyone else who cares about this stuff — can converse using a language in which phrases like ‘WashingtonFirst Bank branches in DC’ or ‘banks in zipcode 20007’ are well defined.
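To make the idea of a query as a name concrete, here is a minimal sketch that builds such URLs. The service root and field names come from the examples above; the helper names (`odata_literal`, `filter_url`) are made up for illustration, and percent-encoding the filter expression is a choice made here for copy-and-paste safety, not something the service is known to require.

```python
from urllib.parse import quote

# Service root taken from the URLs quoted in this post.
SERVICE_ROOT = "http://ogdi.cloudapp.net/v1/dc"


def odata_literal(value):
    """Render a Python value as an OData literal for use in $filter."""
    if isinstance(value, str):
        # OData string literals use straight single quotes; an embedded
        # quote is escaped by doubling it.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def filter_url(entity_set, field, value):
    """Build a URL naming the rows of entity_set where field equals value."""
    expression = f"{field} eq {odata_literal(value)}"
    # Percent-encode spaces and quotes so the URL survives copy and paste.
    return f"{SERVICE_ROOT}/{entity_set}?$filter={quote(expression)}"


# "banks in zipcode 20007"
print(filter_url("BankLocations", "zipcode", 20007))
# "WashingtonFirst Bank branches in DC"
print(filter_url("BankLocations", "name", "WashingtonFirst Bank"))
```

Any two people who run this get the same two strings, which is the point: the phrase and the URL name the same slice of the dataset.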

If we incorporate this kind of fully articulated web namespace into public online discourse, then others can engage with it too. Suppose, to take just one small example, I find what I think is an error in the dataset. Maybe I think one of the branch addresses is wrong. Or maybe I want to associate some extra information with the address. Today, the way things usually work, I’d visit the source website and look for some kind of feedback mechanism. If there is one, and if I’m willing to provide my feedback in a form it will accept, and if my feedback is accepted, then my effort to engage with that dataset will be successful. But that’s a lot of ifs.

When public datasets provide fully articulated web namespaces, though, things can happen in a more loosely coupled way. I can post my feedback anywhere, for example right here on this blog. If I have something to say about the WashingtonFirst branch at 1500 K Street, NW, I can refer to it using a URL: 1500 K Street, NW.

That URL is, in effect, a trackback that points to one record in the dataset.[1] The service that hosts the dataset could scan the web for these inbound links and, if desired, reflect them back to its users. Or any other service could do the same. Discourse about the dataset can grow online in a decentralized way. The publisher need not explicitly support, maintain, or be liable for that discourse. But it can be discovered and aggregated by any interested party.
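Here is a rough sketch of what that scanning might look like, assuming the crawler has already fetched some pages as text. The dataset prefix is the BankLocations URL used above; the function name and the regex-based link extraction are illustrative shortcuts, and a real service would parse HTML properly.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, unquote

# Links that start with this prefix refer to the dataset.
DATASET_PREFIX = "http://ogdi.cloudapp.net/v1/dc/BankLocations"

# Rough pattern for URLs in page text; a real crawler would parse HTML.
URL_PATTERN = re.compile(r'https?://[^\s"<>]+')


def inbound_links(pages):
    """Map each dataset query to the pages that link to it.

    pages is a dict of {page_url: page_text}.
    """
    mentions = defaultdict(set)
    for page_url, text in pages.items():
        for link in URL_PATTERN.findall(text):
            link = link.rstrip(".,)")  # drop trailing sentence punctuation
            if not link.startswith(DATASET_PREFIX):
                continue
            # The query string is what identifies the record or subset.
            query = unquote(urlsplit(link).query)
            mentions[query].add(page_url)
    return {query: sorted(urls) for query, urls in mentions.items()}
```

The result groups commentary by the slice of data it talks about, which is the shape a publisher or any third party would need in order to reflect that discourse back to users.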

The open data movement, in government and elsewhere, aims to help people engage with and participate in processes represented by the data. When you publish data in a fully articulated way, you build a framework for engagement, a trellis for participation. This is a huge opportunity, and it’s what most excites me about OData.

[1] PowerPivot doesn’t currently expose that URL, but it could, and so could any other OData-aware application.

Contextual clothing for naked transparency

The other day I listened to a Spark (CBC Radio) interview with Larry Lessig about his New Republic essay Against Transparency, which begins:

We are not thinking critically enough about where and when transparency works, and where and when it may lead to confusion, or to worse. And I fear that the inevitable success of this movement–if pursued alone, without any sensitivity to the full complexity of the idea of perfect openness–will inspire not reform, but disgust. The “naked transparency movement,” as I will call it here, is not going to inspire change. It will simply push any faith in our political system over the cliff.

The essay was published in October 2009. In this interview from November, Prof. Lessig reflected on the reactions that it provoked. Although the delicious and bitly feedback now suggests that most people understood the essay to be a thoughtfully nuanced critique, there were evidently some early responders who read it as a retreat from openness and an assault on the Internet.

I’m glad I missed the essay when it first appeared. Reading it along with a cloud of feedback from readers and from the author amplifies one of the key points: We don’t really want naked transparency, we want transparency clothed in context.

The Net can be an engine for context assembly, a wonderful phrase I picked up years ago from Jack Ozzie and echoed in several essays. But it can also be a context destroyer.

In the interview, Lessig notes one example of context destruction. The article, which most people will read online, spans eleven pages, each of which wraps its nugget of “content” in layers of distraction. Some early negative comments, Lessig says, came from people who had clearly not read to the end.

Our increasingly compressed and fragmented attention can also be a context destroyer:

What about when the claims are neither true nor false? Or worse, when the claims actually require more than the 140 characters in a tweet?

This is the problem of attention-span. To understand something–an essay, an argument, a proof of innocence– requires a certain amount of attention. But on many issues, the average, or even rational, amount of attention given to understand many of these correlations, and their defamatory implications, is almost always less than the amount of time required. The result is a systemic misunderstanding–at least if the story is reported in a context, or in a manner, that does not neutralize such misunderstanding. The listing and correlating of data hardly qualifies as such a context. Understanding how and why some stories will be understood, or not understood, provides the key to grasping what is wrong with the tyranny of transparency.

Transparency is a necessary but not a sufficient condition. Recently my town’s crime data and council meetings have appeared online. But this remarkable transparency does not alone enable the sort of collaborative sense-making that we all rightly envision.

In the case of crime data, we require a context that includes historical trends, regional and national comparisons, guidance from government about how its local taxonomy relates to regional and national taxonomies, and reporting by newspapers and citizens.

In the case of city council meetings, we require a context that includes relevant state law and local code, and reporting by stakeholders, by newspapers, and by affected citizens.

To enable context assembly, we’ll need to organize the numeric and narrative data produced by the “naked transparency” movement in ways friendly to linking, aggregation, and discovery.

But these principles will need to be adopted more broadly than by governments alone. Everyone needs to understand the principles of linking, aggregation, and discovery, so that everyone can help create the context we crave.

Gov2.0 transparency: An enabler for collaborative sense-making

Recently my town has adopted two innovative web services that I’ve featured on my podcast: CrimeReports.com, which does what its name suggests, and Granicus.com, which delivers video of city council meetings along with synchronized documents.

You can see the Keene instance of CrimeReports here, and our Granicus instance here.

I’m delighted to finally become a user of these systems that I’ve advocated for, written about, and podcasted. I’m also eager to move forward. We’re still only scratching the surface of what Net-mediated democracy can and should become.

In the case of CrimeReports, the next step is clear: Publish the data. It’s nice to see pushpins on a map, but when you’re trying to answer questions — like “Are we having a crime wave?” — you need access to the information that drives the map. Greg Whisenant, the founder of CrimeReports.com, says he’d be happy to publish feeds. But so far the cities that hire him to do canned visualizations of crime data aren’t asking him to do so, because most people aren’t yet asking their city governments to provide source data. So a few intrepid hackers, like Ben Caulfield here in Keene, are reverse-engineering PDF files to get at the information. Check out Ben’s remixed police blotter — it’s awesome. Now imagine what Ben might accomplish if he hadn’t needed to move mountains to uncover the data.

In the case of Granicus, I’m reminded of this item from last year: Net-enhanced democracy: Amazing progress, solvable challenges. The gist of that item was that:

  • It’s amazing to be able to observe the processes of government.

  • It’s still a challenge to make sense of them.

  • Tools that we know how to build and use can help us meet that challenge.

Check out, for example, last week’s Keene city council meeting. Scroll down to an item labeled 2. Ordinance O-2009-21. In this clip, the council agrees to amend the city code for residential real estate tax exemptions. I wish I could link you directly to that portion of the video, which begins at 34:11, in the same way that I can link you to the associated document. But more broadly, I wish that a citizen who tunes in could understand — and help establish — the context for this amendment.
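For what it’s worth, the addressing part of that wish is simple. The sketch below turns a clock position like 34:11 into a link using the `#t=` temporal fragment syntax from the W3C Media Fragments URI specification. Whether a given video player honors that fragment is an assumption, and the video URL is a placeholder, not the real Granicus address.

```python
def clock_to_seconds(clock):
    """Convert 'MM:SS' or 'HH:MM:SS' to a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def link_to_moment(video_url, clock):
    """Append a temporal fragment (#t=seconds) to a video URL."""
    return f"{video_url}#t={clock_to_seconds(clock)}"


# Placeholder URL; 34:11 is 34 * 60 + 11 = 2051 seconds.
print(link_to_moment("http://example.org/council-meeting.mp4", "34:11"))
```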

Here’s the new language:

Sec. 86-29 Residential real estate tax exemptions and credits

With regard to property tax exemptions, the city hereby adopts the provisions of RSA 72:37 (Blind); RSA 72:37-b (Disabled); RSA 72:38-b (Deaf or Severely Hearing Impaired); RSA 72:39-a (Elderly); RSA 72:62 (Solar); RSA 72:66 (Wind); and RSA 72:70 (Wood).

With regard to property tax credits, the city hereby adopts the provisions of RSA 72:28, II, (Optional Veterans’ tax credit); RSA 72:29-a , II, (Surviving Spouse); and RSA 72:35, I-a, (Optional Tax Credit for Service-Connected total disability).

In this case, I just happen to know a bit of this amendment’s backstory. Earlier this year I found out — only thanks to a serendipitous encounter with a city councilor at a social event — that my wood gasifier qualified me for an exemption. This was the first such exemption, and to my knowledge is still the only one granted.

If I hadn’t gone through that experience, though, the video clip and its associated document would mean nothing to me. There would be no way to make a connection between state law on the one hand, and a documented case study on the other.

On the next turn of the crank, I hope that services like Granicus will enable us to make those connections. Seeing the process of government in action is a great step forward. Now we need to be able to use links and annotations to help one another make sense of that process.

Ask and ye may receive, don’t ask and ye surely will not

This fall a small team of University of Toronto and Michigan State undergrads will be working on parts of the elmcity project by way of Undergraduate Capstone Open Source Projects (UCOSP), organized by Greg Wilson. In our first online meeting, the students decided they’d like to tackle the problem that FuseCal was solving: extraction of well-structured calendar information from weakly-structured web pages.

From a computer science perspective, there’s a fairly obvious path. Start with specific examples that can be scraped, then work toward a more general solution. So the first two examples are going to be MySpace and LibraryThing. The recipes[1, 2] I’d concocted for FuseCal-written iCalendar feeds were especially valuable because they could be used by almost any curator for almost any location.

But as I mentioned to the students, there’s another way to approach these two cases. And I was reminded of it again when Michael Foord pointed to this fascinating post prompted by the open source release of FriendFeed’s homegrown web server, Tornado. The author of the post, Glyph Lefkowitz, is the founder of Twisted, a Python-based network programming framework that includes the sort of asynchronous event-driven capabilities that FriendFeed recreated for Tornado. Glyph writes:

If you’re about to undergo a re-write of a major project because it didn’t meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, “Oh, you’ve misunderstood the documentation, actually it does do that”. In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.

Whether FriendFeed could have improved the parts of Twisted that it found lacking, while leveraging its synergistic aspects, is a question only specialists close to both projects can answer. But Glyph is making a more general point. If you don’t communicate your intentions, such questions can never even be asked.

Tying this back to the elmcity project, I mentioned to the students that the best scraper for MySpace and LibraryThing calendars is no scraper at all. If these services produced iCalendar feeds directly, there would be no need. That would be the ideal solution — a win for existing users of the services, and for the iCalendar ecosystem I’m trying to bootstrap.
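To show how little is involved on the producing side, here is a minimal sketch of an iCalendar feed generator. It assumes each event is a dict with a `uid`, a timezone-aware `start`, and a `summary`; those field names and the `PRODID` value are invented for the example, and it skips details a production feed would want, such as folding long lines and end times.

```python
from datetime import datetime, timezone


def escape_text(value):
    """Escape a TEXT value as RFC 5545 requires."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def to_icalendar(events, prodid="-//example//event feed sketch//EN"):
    """Render events (dicts with uid, start, summary) as an iCalendar feed."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", f"PRODID:{prodid}"]
    for event in events:
        # Express every start time in UTC so no VTIMEZONE block is needed.
        start = event["start"].astimezone(timezone.utc)
        lines += [
            "BEGIN:VEVENT",
            f"UID:{event['uid']}",
            f"DTSTAMP:{stamp}",
            f"DTSTART:{start.strftime('%Y%m%dT%H%M%SZ')}",
            f"SUMMARY:{escape_text(event['summary'])}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    # RFC 5545 content lines end with CRLF.
    return "\r\n".join(lines) + "\r\n"
```

A service that already holds event titles and times in a database has everything this function needs, which is why asking for a feed is a reasonable first move.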

I’ve previously asked contacts at MySpace and LibraryThing about this. But now, since we’re intending to scrape those services for calendar info, it can’t hurt to announce that intention and hope one or both services will provide feeds directly and obviate the need. That way the students can focus on different problems — and there are plenty to choose from.

So I’ll be sending the URL of this post to my contacts at those companies, and if any readers of this blog can help move things along, please do. We may end up with scrapers anyway. But maybe not. Maybe iCalendar feeds have already been provided, but aren’t documented. Maybe they were in the priority stack and this reminder will bump them up. It’s worth a shot. If the problem can be solved by communicating intentions rather than writing redundant code, that’s the ultimate hack. And it’s one that I hope more computer science students will learn to aspire to.

FriendFeed for project collaboration

For me, FriendFeed has been a new answer to an old question — namely, how to collaborate in a loosely-coupled way with people who are using, and helping to develop, an online service. The elmcity project’s FriendFeed room has been an incredibly simple and effective way to interleave curated calendar feeds, blog postings describing the evolving service that aggregates those feeds, and discussion among a growing number of curators.

In his analysis of Where FriendFeed Went Wrong, Dare Obasanjo describes the value of a handful of services (Facebook, Twitter, etc.) in terms that would make sense to non-geeks like his wife. Here’s the elevator pitch for FriendFeed:

Republish all of the content from the different social networking media websites you use onto this site. Also one place to stay connected to what people are saying on multiple social media sites instead of friending them on multiple sites.

As usual, I’m an outlying data point. I’m using FriendFeed as a lightweight, flexible aggregator of feeds from my blog and from Delicious, and as a discussion forum. These feeds report key events in the life of the project: I added a new feature to the aggregator, the curator for Saskatoon found and added a new calendar. The discussion revolves around strategies for finding or creating calendar feeds, features that curators would like me to add to the service, and problems they’re having with the service.

I doubt there’s a mainstream business model here. It’s valuable to me because I’ve created a project environment in which key events in the life of the project are already flowing through feeds that are available to be aggregated and discussed. Anyone could arrange things that way, but few people will.

It’s hugely helpful to me, though. And while I don’t know for sure that FriendFeed’s acquisition by Facebook will end my ability to use FriendFeed in this way, I do need to start thinking about how I’d replace the service.

I don’t need a lot of what FriendFeed offers. Many of the services it can aggregate — Flickr, YouTube, SlideShare — aren’t relevant. And we don’t need realtime notification. So it really boils down to a lightweight feed aggregator married to a discussion forum.

One feature that FriendFeed’s API doesn’t offer, by the way, but that I would find useful, is programmatic control of the aggregator’s registry. When a new curator shows up, I have to manually add the associated Delicious feed to the FriendFeed room. It’d be nice to automate that.
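As a thought experiment, the subset I need might look something like the sketch below: a registry of feeds that can be extended in code, plus a merged newest-first timeline. Everything here (`Entry`, `Room`, `register`, `timeline`) is hypothetical and has nothing to do with FriendFeed’s actual API; each feed is modeled as a function that returns entries, standing in for whatever would really fetch and parse it.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Entry:
    published: datetime
    source: str
    title: str


@dataclass
class Room:
    """A feed registry plus a merged, newest-first timeline."""
    # Maps a feed name to a callable that returns a list of Entry objects.
    feeds: dict = field(default_factory=dict)

    def register(self, name, fetch):
        """Add a feed programmatically, e.g. when a new curator shows up."""
        self.feeds[name] = fetch

    def timeline(self):
        """Merge entries from every registered feed, newest first."""
        entries = [e for fetch in self.feeds.values() for e in fetch()]
        return sorted(entries, key=lambda e: e.published, reverse=True)
```

With a `register` call available, adding a new curator’s Delicious feed becomes one line in a signup script instead of a manual step.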

Ideally FriendFeed will coast along in a way that lets me keep using it as I currently am. If not, it wouldn’t be too hard to recreate something that provides just the subset of FriendFeed’s services that I need. But ideally, of course, I’d repurpose an existing service rather than build a new one. If you’re using something that could work, let me know.

Talking with Cathy Marshall about tags, digital archiving, and lifestreams

My guest for this week’s Innovators show is Cathy Marshall, a Senior Researcher in Microsoft’s Silicon Valley Lab. She’s long been intrigued by personal information management — and nowadays, also by its social dimension.

We kicked off the conversation with a discussion of her recent paper Do Tags Work?. (See also her slides from a talk about the project.) This was a clever study in which she collected a bunch of Flickr photos of people spinning on the bull’s balls in Milan. Notice how that fulltext query effectively retrieves a pile of images, taken by different people, of the same curious custom:

If you are passing through the Galleria Vittorio Emanuele II, you should spin around on the testicles of the bull mosaic found in the centre. Legend has it that this will bring you good luck!

Now try this query, which uses the same terms but looks at tags instead of the free text (title, description) associated with the photos. It finds nothing.

Cathy concludes that while many people think tags are effective hooks for information retrieval, they really aren’t.
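The difference between the two queries can be sketched in a few lines. The photo records below are invented for illustration and are not data from the study; the point is only that free text matches substrings in what people happened to write, while a tag query needs an exact tag that each person independently chose.

```python
def matches_text(photo, terms):
    """True if every term appears in the photo's title or description."""
    text = f"{photo['title']} {photo['description']}".lower()
    return all(term in text for term in terms)


def matches_tags(photo, terms):
    """True if every term is one of the photo's tags."""
    tags = {tag.lower() for tag in photo["tags"]}
    return all(term in tags for term in terms)


# Made-up records: same subject, uncoordinated tags.
photos = [
    {"title": "Spinning on the bull in Milan",
     "description": "For good luck", "tags": ["italy", "galleria"]},
    {"title": "Milan bull mosaic",
     "description": "Everyone spins here", "tags": ["milano", "vacation"]},
]

terms = ["milan", "bull"]
by_text = [p["title"] for p in photos if matches_text(p, terms)]
by_tags = [p["title"] for p in photos if matches_tags(p, terms)]
print(len(by_text), len(by_tags))
```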

Of course, those of us who attend conferences where the first order of business is to announce a tag know that tags can be a very effective way to aggregate all the blog postings, tweets, and photos associated with an event. Folksonomies that aren’t intended to converge don’t. Those that are meant to converge do, quite dramatically, which is why I’ve long been obsessed with intentional tagging as an enabler of loosely-coupled collaboration.

In the second half of the conversation we discussed personal digital archiving, curation, benign neglect, and lifestreams. Cathy tells a lot of stories about the ways in which people do, and also don’t, take care of their digital stuff. She observes, for example, that when people lose the contents of a computer, they react initially with horror, but then often feel a sense of relief. It turns out a lot of what was there wasn’t really needed. The burden of culling through it is lifted, and the guilt associated with not doing that culling goes away.

(I laughed harder than I have in a long time when Cathy described rental storage units as “garbage cans you pay for, and then when you realize you no longer care about the stuff in them, you stop paying for.”)

We ended by agreeing that the hardest thing about introducing a hosted lifebits service ecosystem will be the conceptual model. For psychological reasons, people will want to think in terms of monolithic containers that keep stuff in one place, and monolithic services that do everything related to that stuff. For architectural reasons, though, we’ll want to federate storage, and also decouple classes of service — so that storage, for example, is orthogonal to access control and authorization, which is orthogonal to social interaction.

Influencing the production of public data

In the latest installment of my Innovators podcast, which ran while I was away on vacation, I spoke with Steven Willmott of 3scale, one of several companies in the emerging business of third-party API management. As more organizations get into the game of providing APIs to their online data, there’s a growing need for help in the design and management of those APIs.

By way of demonstration, 3scale is providing an unofficial API to some of the datasets offered by the United Nations. The UN data at http://data.un.org, while browseable and downloadable, is not programmatically accessible. If you visit 3scale’s demo at www.undata-api.org/ you can sign up for an access key, ask for available datasets — mostly, so far, from the World Health Organization (see below) — and then query them.

The query capability is rather limited. For a given measure, like Births by caesarean section (percent), you can select subsets by country or by year, but you can’t query or order by values. And you can’t make correlations across tables in one query.
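A client can work around both limits once it has pulled the subsets down. The sketch below assumes each row arrives as a dict with `country`, `year`, and `value` keys; that shape is an assumption for illustration, not the demo API’s documented response format.

```python
def order_by_value(rows, descending=True):
    """Sort rows by their numeric value, skipping rows with no value."""
    present = [row for row in rows if row.get("value") is not None]
    return sorted(present, key=lambda row: row["value"], reverse=descending)


def correlate(left, right):
    """Pair up values from two measures that share a country and year."""
    index = {(r["country"], r["year"]): r["value"] for r in right}
    return [
        (r["country"], r["year"], r["value"], index[(r["country"], r["year"])])
        for r in left
        if (r["country"], r["year"]) in index
    ]
```

This works, but it pushes every consumer to re-implement ordering and joins on their own machine, which is the cost of a query interface that stops at selection.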

It’s just a demo, of course. If 3scale wanted to invest more effort, a more robust query system could be built. The fact that such a system can be built by an unofficial intermediary, rather than by the provider of the data, is quite interesting.

As I watch this data publication meme spread, here’s something that interests me even more. These efforts don’t really reflect the Web 2.0 values of engagement and participation to the extent they could. We’re now very focused on opening up flexible means of access to data. But the conversation is still framed in terms of a producer/consumer relationship that isn’t itself much discussed.

At the end of this entry you’ll find a list of WHO datasets. Here’s one: Community and traditional health workers density (per 10,000 population). What kinds of questions do we think we might try to answer by counting this category of worker? What kinds of questions can’t we try to answer using the datasets WHO is collecting? How might we therefore want to try to influence the WHO’s data-gathering efforts, and those of other public health organizations?

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.

Adolescent fertility rate
Adult literacy rate (percent)
Gross national income per capita (PPP international $)
Net primary school enrolment ratio female (percent)
Net primary school enrolment ratio male (percent)
Population (in thousands) total
Population annual growth rate (percent)
Population in urban areas (percent)
Population living below the poverty line (percent living on less than US$1 per day)
Population median age (years)
Population proportion over 60 (percent)
Population proportion under 15 (percent)
Registration coverage of births (percent)
Registration coverage of deaths (percent)
Total fertility rate (per woman)
Antenatal care coverage – at least four visits (percent)
Antiretroviral therapy coverage among HIV-infected pregnant women for PMTCT (percent)
Antiretroviral therapy coverage among people with advanced HIV infections (percent)
Births attended by skilled health personnel (percent)
Births by caesarean section (percent)
Children aged 6-59 months who received vitamin A supplementation (percent)
Children aged less than 5 years sleeping under insecticide-treated nets (percent)
Children aged less than 5 years who received any antimalarial treatment for fever (percent)
Children aged less than 5 years with ARI symptoms taken to facility (percent)
Children aged less than 5 years with diarrhoea receiving ORT (percent)
Contraceptive prevalence (percent)
Neonates protected at birth against neonatal tetanus (PAB) (percent)
One-year-olds immunized with MCV
One-year-olds immunized with three doses of Hepatitis B (HepB3) (percent)
One-year-olds immunized with three doses of Hib (Hib3) vaccine (percent)
One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (percent)
Tuberculosis detection rate under DOTS (percent)
Tuberculosis treatment success under DOTS (percent)
Women who have had PAP smear (percent)
Women who have had mammography (percent)
Community and traditional health workers density (per 10 000 population)
Dentistry personnel density (per 10 000 population)
Environment and public health workers density (per 10 000 population)
External resources for health as percentage of total expenditure on health
General government expenditure on health as percentage of total expenditure on health
General government expenditure on health as percentage of total government expenditure
Hospital beds (per 10 000 population)
Laboratory health workers density (per 10 000 population)
Number of community and traditional health workers
Number of dentistry personnel
Number of environment and public health workers
Number of laboratory health workers
Number of nursing and midwifery personnel
Number of other health service providers
Number of pharmaceutical personnel
Nursing and midwifery personnel density (per 10 000 population)
Other health service providers density (per 10 000 population)
Out-of-pocket expenditure as percentage of private expenditure on health
Per capita total expenditure on health (PPP int. $)
Per capita total expenditure on health at average exchange rate (US$)
Pharmaceutical personnel density (per 10 000 population)
Physicians density (per 10 000 population)
Private expenditure on health as percentage of total expenditure on health
Private prepaid plans as percentage of private expenditure on health
Ratio of health management and support workers to health service providers
Ratio of nurses and midwives to physicians
Social security expenditure on health as percentage of general government expenditure on health
Total expenditure on health as percentage of gross domestic product