Influencing the production of public data

6 Jul 2009 / 29 Mar 2010 ~ Jon Udell

In the latest installment of my Innovators podcast, which ran while I was away on vacation, I spoke with Steven Willmott of 3scale, one of several companies in the emerging business of third-party API management. As more organizations get into the game of providing APIs to their online data, there’s a growing need for help in the design and management of those APIs.

By way of demonstration, 3scale is providing an unofficial API to some of the datasets offered by the United Nations. The UN data at http://data.un.org, while browseable and downloadable, is not programmatically accessible. If you visit 3scale’s demo at www.undata-api.org/ you can sign up for an access key, ask for available datasets — mostly, so far, from the World Health Organization (see below) — and then query them.

The query capability is rather limited. For a given measure, like Births by caesarean section (percent), you can select subsets by country or by year, but you can’t query or order by values. And you can’t make correlations across tables in one query.
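Both gaps can be worked around on the client once the subsets have been fetched. Here is a minimal sketch in Python, assuming each measure arrives as a list of records with country, year, and value fields. The field names, the choice of second measure, and every number are placeholders for illustration, not the demo API’s actual response format or real WHO figures.

```python
# Placeholder records standing in for two fetched measures.
# Field names and values are invented for illustration only.
caesarean = [
    {"country": "A", "year": 2005, "value": 12.0},
    {"country": "B", "year": 2005, "value": 30.5},
    {"country": "C", "year": 2005, "value": 21.3},
]
skilled_births = [
    {"country": "A", "year": 2005, "value": 60.0},
    {"country": "B", "year": 2005, "value": 99.0},
    {"country": "D", "year": 2005, "value": 75.0},
]


def order_by_value(records, descending=True):
    """Sort one measure by its value field."""
    return sorted(records, key=lambda r: r["value"], reverse=descending)


def join_measures(left, right):
    """Pair two measures on (country, year), keeping only shared keys."""
    index = {(r["country"], r["year"]): r["value"] for r in right}
    return [
        (r["country"], r["year"], r["value"], index[(r["country"], r["year"])])
        for r in left
        if (r["country"], r["year"]) in index
    ]


ranked = order_by_value(caesarean)
paired = join_measures(caesarean, skilled_births)
print([r["country"] for r in ranked])
print(paired)
```

The join is an inner join: a country that appears in only one measure drops out, which is itself a reminder of how uneven the coverage across tables can be.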

It’s just a demo, of course. If 3scale wanted to invest more effort, a more robust query system could be built. The fact that such a system can be built by an unofficial intermediary, rather than by the provider of the data, is quite interesting.

As I watch this data publication meme spread, here’s something that interests me even more. These efforts don’t really reflect the Web 2.0 values of engagement and participation to the extent they could. We’re now very focused on opening up flexible means of access to data. But the conversation is still framed in terms of a producer/consumer relationship that isn’t itself much discussed.

At the end of this entry you’ll find a list of WHO datasets. Here’s one: Community and traditional health workers density (per 10,000 population). What kinds of questions do we think we might try to answer by counting this category of worker? What kinds of questions can’t we try to answer using the datasets WHO is collecting? How might we therefore want to try to influence the WHO’s data-gathering efforts, and those of other public health organizations?

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.

Adolescent fertility rate

Adult literacy rate (percent)

Gross national income per capita (PPP international $)

Net primary school enrolment ratio female (percent)

Net primary school enrolment ratio male (percent)

Population (in thousands) total

Population annual growth rate (percent)

Population in urban areas (percent)

Population living below the poverty line (percent living on less than US$1 per day)

Population median age (years)

Population proportion over 60 (percent)

Population proportion under 15 (percent)

Registration coverage of births (percent)

Registration coverage of deaths (percent)

Total fertility rate (per woman)

Antenatal care coverage – at least four visits (percent)

Antiretroviral therapy coverage among HIV-infected pregnant women for PMTCT (percent)

Antiretroviral therapy coverage among people with advanced HIV infections (percent)

Births attended by skilled health personnel (percent)

Births by caesarean section (percent)

Children aged 6-59 months who received vitamin A supplementation (percent)

Children aged less than 5 years sleeping under insecticide-treated nets (percent)

Children aged less than 5 years who received any antimalarial treatment for fever (percent)

Children aged less than 5 years with ARI symptoms taken to facility (percent)

Children aged less than 5 years with diarrhoea receiving ORT (percent)

Contraceptive prevalence (percent)

Neonates protected at birth against neonatal tetanus (PAB) (percent)

One-year-olds immunized with MCV

One-year-olds immunized with three doses of Hepatitis B (HepB3) (percent)

One-year-olds immunized with three doses of Hib (Hib3) vaccine (percent)

One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (percent)

Tuberculosis detection rate under DOTS (percent)

Tuberculosis treatment success under DOTS (percent)

Women who have had PAP smear (percent)

Women who have had mammography (percent)

Community and traditional health workers density (per 10 000 population)

Dentistry personnel density (per 10 000 population)

Environment and public health workers density (per 10 000 population)

External resources for health as percentage of total expenditure on health

General government expenditure on health as percentage of total expenditure on health

General government expenditure on health as percentage of total government expenditure

Hospital beds (per 10 000 population)

Laboratory health workers density (per 10 000 population)

Number of community and traditional health workers

Number of dentistry personnel

Number of environment and public health workers

Number of laboratory health workers

Number of nursing and midwifery personnel

Number of other health service providers

Number of pharmaceutical personnel

Nursing and midwifery personnel density (per 10 000 population)

Other health service providers density (per 10 000 population)

Out-of-pocket expenditure as percentage of private expenditure on health

Per capita total expenditure on health (PPP int. $)

Per capita total expenditure on health at average exchange rate (US$)

Pharmaceutical personnel density (per 10 000 population)

Physicians density (per 10 000 population)

Private expenditure on health as percentage of total expenditure on health

Private prepaid plans as percentage of private expenditure on health

Ratio of health management and support workers to health service providers

Ratio of nurses and midwives to physicians

Social security expenditure on health as percentage of general government expenditure on health

Total expenditure on health as percentage of gross domestic product


16 thoughts on “Influencing the production of public data”

Michael E Driscoll says:

6 Jul 2009 at 1:45 pm

Indeed, organizations that blindly follow the “give us the data” mantra often give us data that is unusable.

Data portals often fail for the same reason that chart wizards fail — because they are constrained by pre-built casts, which data must be poured into.

But we already have a widely-understood data query language that is interactive and expressive: it’s called SQL.

If these data portals would merely allow end-users read-only SQL access to their underlying databases — they would be amazed at what innovative uses might emerge.

Larry Welkowitz says:

6 Jul 2009 at 3:21 pm

We need to know “what the data mean” in its original form. For example, drug companies might say “this drug works 80% of the time”…but what does that mean? Self-report? Blind ratings by clinicians? Biological tests?

Figures don’t lie, but liars figure.

Jon Udell says:

7 Jul 2009 at 10:26 am

> If these data portals would merely allow
> end-users read-only SQL access to their
> underlying databases — they would be
> amazed at what innovative uses might emerge.

Very interesting point. Historically that was unthinkable because of the fear that unthrottled query would impact services. But as databases move to the cloud there is renewed incentive to manage such access. I hope what you envision will come to pass.

Edward Vielmetti says:

7 Jul 2009 at 2:11 pm

Jon –

I always wonder whether the right answer is to get an API to a query interface to a database, or whether you’re better off with a snapshot and dump of a collection of raw data.

The API route sounds better, but it means that whatever development you do to pull things from a site starts with protocols and interfaces and software and queries and all sorts of things that are appealing to programmers but difficult for most everyone else.

The alternative, a simple data dump in some easy-to-parse file format, lets you figure out what the query structure looks like based on your own data needs and gives you the opportunity either to restructure things for better efficiency or to apply much more primitive tools to do ad hoc queries.

My experience with this so far has been on a street tree database (nicknamed “EveryTree”) in Ann Arbor – there’s now a CSV file with about 50000 geocoded and identified trees, and I was able to get something useful out of it with tools as simple as “grep” in a small amount of time to get a sense for what was possible.
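For anyone who would rather stay in a scripting language, the same grep-style pass over a CSV dump can be sketched in a few lines of Python. The column names and rows below are invented placeholders, not the layout or contents of the actual tree file.

```python
import csv
import io

# Placeholder rows; the column names and values are invented.
sample = io.StringIO(
    "id,species,lat,lon\n"
    "1,Sugar Maple,42.28,-83.74\n"
    "2,White Oak,42.27,-83.75\n"
    "3,Red Maple,42.29,-83.73\n"
)


def grep_rows(handle, column, needle):
    """Return rows whose column contains needle, ignoring case."""
    reader = csv.DictReader(handle)
    return [row for row in reader if needle.lower() in row[column].lower()]


maples = grep_rows(sample, "species", "maple")
print([row["id"] for row in maples])
```

Swapping the in-memory sample for an open file handle is the only change needed to run it against a real download.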

thanks

Ed

annarbor.com

1. Jon Udell says:
  
  7 Jul 2009 at 6:23 pm
  
  > The API route sounds better, but it
  > means that whatever development you
  > do to pull things from a site starts
  > with protocols and interfaces and
  > software and queries and all sorts
  > of things that are appealing to
  > programmers but difficult for most
  > everyone else.
  
  Of course it needn’t be either/or. There can be APIs and downloadable files. Arguably there should be, with a preference for the latter when data quantity is modest, and the former when it is vast.
  
  1. Ken Kennedy says:
    
    12 Jul 2009 at 7:08 pm
    
    I definitely agree with both here; data dumps often can solve relatively simple problems much more quickly than APIs (though you don’t necessarily get the advantage of repeatability on new data). Also, even if you are planning on hitting the API long term, having a local datastore that you can use to mock up the remote API while you’re starting out can be very helpful.
    
Steven Willmott says:

7 Jul 2009 at 5:02 pm

Thanks for the great conversation and write-up, Jon – it was a pleasure to talk. We definitely see the UNDATA API as an ongoing project and hope it will grow increasingly useful (we’ve just added IMF data + more sorts for the queries) – if we’d gone for all or nothing on day one it would have taken a lot longer to launch (and maybe would never have made it). We’re certainly keen to hear what people would like from it next (especially if they have a concrete thing they’d like to do with it).

Hopefully growing a useful resource will breed more ideas and then changes in the resource. Interestingly it’s probably a little easier for an unofficial skunkworks project to do this at least to begin with than the UN itself – since expectations on day one would be much higher.

The SQL example is a nice one – it would actually be interesting to add different query languages that people find useful – security and load issues we’d have to look at (plus the API runs on the Google AppEngine – i.e. Big Table, not MySQL).

Perhaps there should be one proviso though – asking for a feature means a commitment to use it :). That way we can fulfill Jon’s idea of co-evolution of the data and the Apps!

We’d be very happy to have feedback / comments / suggestions here in this thread or over at http://www.undata-api.org/

Jon Udell says:

7 Jul 2009 at 6:24 pm

> Perhaps there should be one proviso though
> – asking for a feature means a commitment
> to use it.

That’d be an interesting quid pro quo!

Jamie Thomson says:

8 Jul 2009 at 4:34 pm

Jon,
I listened to the podcast a couple of weeks ago and it immediately piqued my interest – I’m fascinated by the exposure of web-based data in an easily consumable manner.

To that end I’ve just seen a demo of something called Kapow (don’t worry, I’m not selling anything here. I have no vested interest, I just think it’s a cool technology) which is effectively a cross between a screen scraper and an ETL tool. It harvests data from a DOM and presents it in a RESTful data service – really fascinating stuff. I blogged about it in case you’re interested: http://blogs.conchango.com/jamiethomson/archive/2009/07/08/kapow-etl-for-html.aspx

Keep up the great work – especially the evangelism of calendar subscription.

cheers
Jamie

Jon Udell says:

9 Jul 2009 at 6:27 am

> It harvests data from a DOM

By extracting it from HTML tables?

Jamie Thomson says:

9 Jul 2009 at 7:07 am

“By extracting it from HTML tables?”
That was my first assumption when I heard about it but actually no, it CAN do that but it’s much more.

If you consider that (a) markup is inherently data and (b) the markup is in a known format (i.e. it hasn’t changed since you last looked at it) then that data can be extracted.

The example I saw was automating the process of going to a SERP, extracting the title/URL/description of each of the “10 blue links” on the first 5 results pages and returning those 50 rows in a 3-column dataset. There are no HTML tables in a SERP but Kapow can still loop over the results because of the nature of HTML.
Plus it also has workflow (i.e. visit each of the first 5 SERPs in turn and union the results together)
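To make the idea concrete, here is a small Python sketch of the same pattern using only the standard library. It is not how Kapow works internally, which I don’t know. The page markup, the "result" class name, and the example.com links are assumptions, and it collects only title and URL rather than all three columns.

```python
from html.parser import HTMLParser

# Two placeholder "result pages"; the markup layout is assumed.
pages = [
    '<div class="result"><a href="http://example.com/1">First</a></div>'
    '<div class="result"><a href="http://example.com/2">Second</a></div>',
    '<div class="result"><a href="http://example.com/3">Third</a></div>',
]


class ResultLinks(HTMLParser):
    """Collect (title, url) for each link inside a result block."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_result = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "result":
            self._in_result = True
        elif tag == "a" and self._in_result:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "div":
            self._in_result = False


def scrape(all_pages):
    """Visit each page in turn and union the extracted rows."""
    rows = []
    for html in all_pages:
        parser = ResultLinks()
        parser.feed(html)
        parser.close()
        rows.extend(parser.rows)
    return rows


results = scrape(pages)
print(results)
```

It only holds up while the markup keeps the shape the parser expects, which is point (b) above.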

-Jamie
