Influencing the production of public data

In the latest installment of my Innovators podcast, which ran while I was away on vacation, I spoke with Steven Willmott of 3scale, one of several companies in the emerging business of third-party API management. As more organizations get into the game of providing APIs to their online data, there’s a growing need for help in the design and management of those APIs.

By way of demonstration, 3scale is providing an unofficial API to some of the datasets offered by the United Nations. The UN data at http://data.un.org, while browseable and downloadable, is not programmatically accessible. If you visit 3scale’s demo at www.undata-api.org/ you can sign up for an access key, ask for available datasets — mostly, so far, from the World Health Organization (see below) — and then query them.

The query capability is rather limited. For a given measure, like Births by caesarean section (percent), you can select subsets by country or by year, but you can’t query or order by values. And you can’t make correlations across tables in one query.
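
To make that limitation concrete, here is a minimal sketch of the work a client would have to do for itself after fetching two measures separately. The record layout (country, year, value), the helper names, and all the numbers are invented for illustration; they are not the demo API's actual response format, and they are not real WHO figures.

```python
from typing import NamedTuple


class Row(NamedTuple):
    country: str
    year: int
    value: float


# Made-up rows standing in for two separately fetched measures.
caesarean = [
    Row("A", 2005, 30.0),
    Row("B", 2005, 12.5),
    Row("C", 2005, 21.0),
]
physicians = [
    Row("A", 2005, 25.0),
    Row("B", 2005, 4.0),
    Row("D", 2005, 10.0),
]


def order_by_value(rows, descending=True):
    """Sort rows by value on the client side."""
    return sorted(rows, key=lambda r: r.value, reverse=descending)


def join_measures(left, right):
    """Pair values from two measures that share a (country, year) key."""
    lookup = {(r.country, r.year): r.value for r in right}
    return [
        (a.country, a.year, a.value, lookup[(a.country, a.year)])
        for a in left
        if (a.country, a.year) in lookup
    ]


ranked = order_by_value(caesarean)
paired = join_measures(caesarean, physicians)
```

Neither step is hard, but each one is something every consumer of the data has to reinvent when the query interface can't do it.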

It’s just a demo, of course. If 3scale wanted to invest more effort, a more robust query system could be built. The fact that such a system can be built by an unofficial intermediary, rather than by the provider of the data, is quite interesting.

As I watch this data publication meme spread, here’s something that interests me even more. These efforts don’t really reflect the Web 2.0 values of engagement and participation to the extent they could. We’re now very focused on opening up flexible means of access to data. But the conversation is still framed in terms of a producer/consumer relationship that isn’t itself much discussed.

At the end of this entry you’ll find a list of WHO datasets. Here’s one: Community and traditional health workers density (per 10,000 population). What kinds of questions do we think we might try to answer by counting this category of worker? What kinds of questions can’t we try to answer using the datasets WHO is collecting? How might we therefore want to try to influence the WHO’s data-gathering efforts, and those of other public health organizations?

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.


Adolescent fertility rate
Adult literacy rate (percent)
Gross national income per capita (PPP international $)
Net primary school enrolment ratio female (percent)
Net primary school enrolment ratio male (percent)
Population (in thousands) total
Population annual growth rate (percent)
Population in urban areas (percent)
Population living below the poverty line (percent living on less than US$1 per day)
Population median age (years)
Population proportion over 60 (percent)
Population proportion under 15 (percent)
Registration coverage of births (percent)
Registration coverage of deaths (percent)
Total fertility rate (per woman)
Antenatal care coverage – at least four visits (percent)
Antiretroviral therapy coverage among HIV-infected pregnant women for PMTCT (percent)
Antiretroviral therapy coverage among people with advanced HIV infections (percent)
Births attended by skilled health personnel (percent)
Births by caesarean section (percent)
Children aged 6-59 months who received vitamin A supplementation (percent)
Children aged less than 5 years sleeping under insecticide-treated nets (percent)
Children aged less than 5 years who received any antimalarial treatment for fever (percent)
Children aged less than 5 years with ARI symptoms taken to facility (percent)
Children aged less than 5 years with diarrhoea receiving ORT (percent)
Contraceptive prevalence (percent)
Neonates protected at birth against neonatal tetanus (PAB) (percent)
One-year-olds immunized with MCV
One-year-olds immunized with three doses of Hepatitis B (HepB3) (percent)
One-year-olds immunized with three doses of Hib (Hib3) vaccine (percent)
One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (percent)
Tuberculosis detection rate under DOTS (percent)
Tuberculosis treatment success under DOTS (percent)
Women who have had PAP smear (percent)
Women who have had mammography (percent)
Community and traditional health workers density (per 10 000 population)
Dentistry personnel density (per 10 000 population)
Environment and public health workers density (per 10 000 population)
External resources for health as percentage of total expenditure on health
General government expenditure on health as percentage of total expenditure on health
General government expenditure on health as percentage of total government expenditure
Hospital beds (per 10 000 population)
Laboratory health workers density (per 10 000 population)
Number of community and traditional health workers
Number of dentistry personnel
Number of environment and public health workers
Number of laboratory health workers
Number of nursing and midwifery personnel
Number of other health service providers
Number of pharmaceutical personnel
Nursing and midwifery personnel density (per 10 000 population)
Other health service providers density (per 10 000 population)
Out-of-pocket expenditure as percentage of private expenditure on health
Per capita total expenditure on health (PPP int. $)
Per capita total expenditure on health at average exchange rate (US$)
Pharmaceutical personnel density (per 10 000 population)
Physicians density (per 10 000 population)
Private expenditure on health as percentage of total expenditure on health
Private prepaid plans as percentage of private expenditure on health
Ratio of health management and support workers to health service providers
Ratio of nurses and midwives to physicians
Social security expenditure on health as percentage of general government expenditure on health
Total expenditure on health as percentage of gross domestic product

16 Comments

  1. Indeed, organizations that blindly follow the “give us the data” mantra often give us data that is unusable.

    Data portals often fail for the same reason that chart wizards fail — because they are constrained by pre-built casts, which data must be poured into.

    But we already have a widely-understood data query language that is interactive and expressive: it’s called SQL.

    If these data portals would merely allow end-users read-only SQL access to their underlying databases — they would be amazed at what innovative uses might emerge.

  2. We need to know “what the data mean” in its original form. For example, drug companies might say “this drug works 80% of the time”…but what does that mean? Self-report? Blind ratings by clinicians? Biological tests?

    Figures don’t lie, but liars figure.

  3. > If these data portals would merely allow
    > end-users read-only SQL access to their
    > underlying databases — they would be
    > amazed at what innovative uses might emerge.

    Very interesting point. Historically that was unthinkable because of the fear that unthrottled query would impact services. But as databases move to the cloud there is renewed incentive to manage such access. I hope what you envision will come to pass.

  4. Jon –

    I always wonder whether the right answer is to get an API to a query interface to a database, or whether you’re better off with a snapshot and dump of a collection of raw data.

    The API route sounds better, but it means that whatever development you do to pull things from a site starts with protocols and interfaces and software and queries and all sorts of things that are appealing to programmers but difficult for most everyone else.

    The alternative, a simple data dump in some easy to parse file format, lets you figure out what the query structure looks like based on your own data needs and gives you the opportunity either to restructure things for better efficiency or to apply much more primitive tools to do ad hoc queries.

    My experience with this so far has been on a street tree database (nicknamed “EveryTree”) in Ann Arbor – there’s now a CSV file with about 50000 geocoded and identified trees, and I was able to get something useful out of it with tools as simple as “grep” in a small amount of time to get a sense for what was possible.

    thanks

    Ed

    annarbor.com

    1. > The API route sounds better, but it
      > means that whatever development you
      > do to pull things from a site starts
      > with protocols and interfaces and
      > software and queries and all sorts
      > of things that are appealing to
      > programmers but difficult for most
      > everyone else.

      Of course it needn’t be either/or. There can be APIs and downloadable files. Arguably there should be, with a preference for the latter when data quantity is modest, and the former when it is vast.

      1. I definitely agree with both here; data dumps often can solve relatively simple problems much more quickly than APIs (though you don’t necessarily get the advantage of repeatability on new data). Also, even if you are planning on hitting the API long term, having a local datastore that you can use to mock up the remote API while you’re starting out can be very helpful.

  5. Thanks for the great conversation and write-up, Jon – it was a pleasure to talk. We definitely see the UNDATA API as an ongoing project and hope it will grow increasingly useful (we’ve just added IMF data + more sorts for the queries) – if we’d gone for an all or nothing on day one it would have taken a lot longer to launch (and maybe would never have made it). We’re certainly keen to hear what people would like from it next (especially if they have a concrete thing they’d like to do with it).

    Hopefully growing a useful resource will breed more ideas and then changes in the resource. Interestingly it’s probably a little easier for an unofficial skunkworks project to do this at least to begin with than the UN itself – since expectations on day one would be much higher.

    The SQL example is a nice one – it would actually be interesting to add different query languages that people find useful – though we’d have to look at security and load issues (plus the API runs on Google App Engine – i.e. Bigtable, not MySQL).

    Perhaps there should be one proviso though – asking for a feature means a commitment to use it :). That way we can fulfill Jon’s idea of co-evolution of the data and the Apps!

    We’d be very happy to have feedback / comments / suggestions here in this thread or over at http://www.undata-api.org/

  6. > Perhaps there should be one proviso though
    > – asking for a feature means a commitment
    > to use it.

    That’d be an interesting quid pro quo!

  7. Jon,
    I listened to the podcast a couple of weeks ago and it immediately piqued my interest – I’m fascinated by the exposure of web-based data in an easily consumable manner.

    To that end I’ve just seen a demo of something called Kapow (don’t worry, I’m not selling anything here. I have no vested interest, I just think it’s a cool technology) which is effectively a cross between a screen scraper and an ETL tool. It harvests data from a DOM and presents it in a RESTful data service – really fascinating stuff. I blogged about it in case you’re interested: http://blogs.conchango.com/jamiethomson/archive/2009/07/08/kapow-etl-for-html.aspx

    Keep up the great work – especially the evangelism of calendar subscription.

    cheers
    Jamie

  8. “By extracting it from HTML tables?”
    That was my first assumption when I heard about it but actually no, it CAN do that but it’s much more.

    If you consider that (a) markup is inherently data and (b) the markup is in a known format (i.e. it hasn’t changed since you last looked at it) then that data can be extracted.

    The example I saw was automating the process of going to a SERP, extracting the title/URL/Description of each of the “10 blue links” on the first 5 results pages and returning those 50 rows in a 3-column dataset. There are no HTML tables in a SERP but Kapow can still loop over the results because of the nature of HTML.
    Plus it also has workflow (i.e. visit each of the first 5 SERPs in turn and union the results together)

    -Jamie
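
Two suggestions from this thread, a plain data dump and read-only SQL access, fit together naturally. The sketch below is one hedged illustration of that combination, not a description of how any portal mentioned here works: it loads a small CSV dump into an in-memory SQLite database, switches the connection to query-only, and then orders by value in SQL. The table name `measure`, the column layout, and the sample rows are invented; `PRAGMA query_only` is SQLite's own switch for rejecting writes on a connection.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for a downloaded CSV dump.
CSV_DUMP = """country,year,value
A,2005,30.0
B,2005,12.5
C,2006,21.0
"""


def load_dump(csv_text):
    """Load CSV text into an in-memory SQLite table, then block writes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measure (country TEXT, year INTEGER, value REAL)")
    rows = [
        (r["country"], int(r["year"]), float(r["value"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO measure VALUES (?, ?, ?)", rows)
    conn.commit()
    # From here on, this connection rejects INSERT, UPDATE, DELETE, and DDL.
    conn.execute("PRAGMA query_only = ON")
    return conn


conn = load_dump(CSV_DUMP)
top = conn.execute(
    "SELECT country, value FROM measure WHERE year = 2005 ORDER BY value DESC"
).fetchall()
```

A query-only connection addresses accidental or malicious writes; it does nothing about load, so the throttling concern raised above would still need its own answer.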
