Facts and friction

Last weekend we all had a good chuckle when we saw that WolframAlpha knows — or anyway claims to know — the airspeed of an unladen swallow. But the more telling example, for me, was one that Stephen Wolfram showed in a post-demo discussion:

Suppose you want to know the distance to Pluto. We don’t just look it up. We answer the question: “What is the distance to Pluto right now?” And we compute the answer.

I reckon that this notion of computable knowledge is going to take a while to sink in. Here’s another example:

Q: length of grand canyon / height of mt. everest

A: 4.47.

These examples run the risk of seeming geeky and pointless. But twice in the last few days, I’ve found myself reaching for bits of computable knowledge that weren’t readily available, and that’s got me thinking about what things might be like when they are.

Both examples are from my elmcity+azure project. In one case, I needed to work out distances — based on latitude/longitude coordinates — for locations that might be written as Providence RI or Ann Arbor, MI. There’s no shortage of online services that can do this. But they all report results in different ways, and digging the answers out of XML responses — which may or may not require special handling for embedded namespaces — can be very tricky.
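Once a service has turned each place name into a latitude/longitude pair, the distance calculation itself is the easy part. Here is a minimal sketch in Python, assuming a spherical Earth with a mean radius of 3958.8 miles. The function name and the approximate city coordinates are illustrative assumptions, not values taken from any particular geocoding service.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean radius; assumes a spherical Earth


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Approximate coordinates, assumed here for illustration only.
ann_arbor = (42.28, -83.74)
providence = (41.82, -71.41)
print(round(haversine_miles(*ann_arbor, *providence)))
```

With these coordinates the result comes out in the low 600s of miles. The arithmetic is a dozen lines; the friction is in getting from "Ann Arbor, MI" to the two numbers it needs.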

In the other case I wanted population data for cities whose names are written the same way. Here I wound up digging it out of a CSV file published at http://www.census.gov. It’s perfectly doable, but you’ve got to really want to do it. If you have, say, a count of calendar events in Providence, and you want to divide that by population in order to produce an experimental metric for creative class activity, you can’t just write “population of Providence RI” in the denominator and proceed with your experiment. You have to overcome some fairly serious data friction.
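To make that friction concrete, here is a rough sketch in Python of what the per-capita calculation involves. The CSV layout (columns `city`, `state`, `population`), the helper names, and every number in the sample data are hypothetical stand-ins, not the actual Census file format or real counts.

```python
import csv
import io
import re

# Hypothetical sample data; the real Census file has a different layout,
# and these population figures are placeholders, not real counts.
SAMPLE_CSV = """city,state,population
Providence,RI,170000
Ann Arbor,MI,115000
"""


def parse_place(text):
    """Split 'Providence RI' or 'Ann Arbor, MI' into ('providence', 'RI')."""
    # City, then a comma and/or whitespace, then a two-letter state code.
    match = re.match(r"^\s*(.+?)(?:\s*,\s*|\s+)([A-Za-z]{2})\s*$", text)
    if not match:
        raise ValueError("unrecognized place: %r" % text)
    city, state = match.groups()
    return city.lower(), state.upper()


def load_populations(csv_file):
    """Map (city, state) keys to integer populations."""
    table = {}
    for row in csv.DictReader(csv_file):
        key = (row["city"].lower(), row["state"].upper())
        table[key] = int(row["population"])
    return table


def events_per_thousand(event_count, place, populations):
    """Events per 1,000 residents for a place written the way people write it."""
    return 1000.0 * event_count / populations[parse_place(place)]


populations = load_populations(io.StringIO(SAMPLE_CSV))
print(events_per_thousand(85, "Providence RI", populations))
```

None of this is hard, but every step (finding the file, learning its columns, normalizing the place name) is work that sits between the question and the answer.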

In a few months we’ll all get to tirekick WolframAlpha. Then we’ll draw our own conclusions about what it can or can’t do, and is or isn’t good for. I’m not expecting a Delphic oracle. But I would like to be able to compute with facts in a more frictionless way.

22 Comments

  1. This is not really directly pertinent to your post, but have you seen the geopy library?

    >>> from geopy import geocoders, distance
    >>> g = geocoders.Google()
    >>> aa = g.geocode("Ann Arbor, MI")
    >>> p = g.geocode("Providence, RI")
    >>> print distance.distance(aa[1], p[1]).miles
    634.170150868

    No API registration even required!

    http://pypi.python.org/pypi/geopy

  2. “This is not really directly pertinent to your post, but have you seen the geopy library?”

    It is highly pertinent, and that is sweet. Thanks!

  3. Replying to Joe – that’s a great trick but it misses the point – it’s a specific example that requires decent familiarity with a few Python packages, just for one query. It’s exactly what Jon and WA want to overcome.

    Btw Jon, did you coin data friction? It’s not a bad metaphor. Actually works well against information slippage (my term), which is when information moves TOO easily between contexts (think embarrassing fb photos). Funny how some kinds of data seem to work either too easily or not easily enough.

  4. > Btw Jon, did you coin data friction?

    Google seems to think so, FWIW.

    > Funny how some kinds of data seem to
    > work either too easily or not easily enough.

    Yeah, that’s just life I guess. When you want it to slip, it won’t. When you don’t, it does.

  5. > that’s why I said it wasn’t pertinent,
    > even if Jon says it was!

    Understood.

    I haven’t tried your example yet, but yesterday I wound up writing some silly little method to deal with “ann arbor, mi” where the comma is required, and “providence ri” where it can be optional.

    geopy already knows that, so I wasted my time. Except that geopy isn’t available as a service, though I expect that’ll change as the cloud platforms gain traction.

    Here, by the way, is the same issue in the calendar domain:

    https://blog.jonudell.net/2008/04/02/parsing-human-written-date-and-time-information/

    A big chunk of the Wolfram system is dedicated to these mundane but vexing kinds of linguistic recognition.

    > How will developers use it?

    We’ll have to wait and see. I hope that linguistic as well as computational interfaces will be available.

  6. The key thing here, to my mind at least, is that data is only any use if you can do something with it.

    What most people miss, though, is that there are two variables there, and one of them is who “you” is.
    The implication of that is that there are two ways to make data more useful. Most of us have focussed on access – making the data more available, which is all well and good, but

  7. …. argh. I must have hit submit at the wrong time. Sorry!

    anyway, what I was saying: focussing on access neglects the possibility of making the data useful to as many people as possible once it’s there. That, too, needs work in a couple of different directions, I reckon: in the systems we use to help people find the data, and in the systems we use to let people transform and analyse it.

    So I’m intrigued by Wolfram Alpha because it merges both these things into a natural-language-like interface: “find these facts, then do this bit of computation on them, and tell me the answer”. It’s the world’s weirdest DSL, if you want to look at it that way. But the downside is that it looks like it needs a lot of the knowledge “baked in” – how would you add your own facts?

    The startup I’m working on, http://timetric.com/, takes a bit of a different approach. On the face of it, it’s less ambitious than what Wolfram and co. are aiming to do, because we’ve restricted the domain of facts we deal with: we do numbers and their variation over time (time series, though you can always express constants as values which never change). We’ve built an infrastructure with three important bits: permanent URLs for given bits of data, tools for finding and viewing data and uploading new series, and a calculation engine for deriving new time series from data already there.

    One important thing we share with Wolfram Alpha, though, and which maybe speaks to your idea of data friction, is the idea that it’s about expressing the computations people want to do in a way they find immediately familiar. Achieving that means leveraging a language people already know. Wolfram Alpha’s using natural language; we’re using a formula language which looks like the one people know from spreadsheets. Both of these are accessible to non-programmers – any answer which starts with “learn Python” has to be the wrong one somehow. Take the geopy example above: as programmers, we can solve most of the problems we think about, even if it’s annoying and fiddly and involves writing altogether too many little parsers. But there are many more people who don’t program at all, and making data much more useful is going to mean writing tools for them.

  8. > how would you add your own facts?

    We can only wait and see what the API will eventually enable. But initially, accepting input is a non-goal. Very counter-web-2.0, to be sure.

    > Timetric

    I had read about that here: http://seanmcgrath.blogspot.com/2009/03/take-look-at-timetric.html

    My first thought was: Hmm. GeoCommons wants me to upload data so we can understand its spatial dimension. Timetric wants me to upload data so we can work with the temporal dimension. In many cases it’ll be the same data, but now analysis and curation will be happening in silos.

    There’s no easy answer to this. But it does make me very curious to see what will happen when a substantial core of curated data winds up in the Wolfram silo with a very robust set of tools and services wrapped around it.

    > permanent URLs for given bits of data

    At what granularity? For example, is this URL:

    http://timetric.com/series/K63mz40KTzy3TjGdyXEoVA/

    your alias for the source:

    http://www.ecb.int/stats/eurofxref/eurofxref-hist.zip

    Or for your transformation of the source?

    What is the atomic/versionable unit of reference? The datum? The row or column? The matrix? The collection?

    > as programmers, we can solve most of
    > the problems we think about, even if
    > it’s annoying and fiddly

    Yes but time is a scarce resource, so a lot of questions go unasked/unanswered even by those who could do the fiddling.

  9. Responding to both Bernie and to Andrew, I don’t think we disagree all that much, and of course I haven’t seen Wolfram Alpha… but I think there has to be some kind of “meeting in the middle.” People and our tools evolve in synergy.

    I’m basically talking about what Jon (and Jeannette Wing) have been framing around the idea of “computational literacy.” One of the reasons I’m enthusiastic about Python is that I think it is within reach of the “computationally literate” even if they don’t think of themselves as programmers.

    I definitely laud the idea of building tools that standardize and guide people into doing things the right way — but I hope that those tools expose their innards so that if they turn out to not have captured the right standardizations, we can adapt them. I boggle at the kinds of things “non-programmers” have achieved using MS Excel and cell formulas, and I have to believe that more people would be able to learn to do those things if they could use a more humane language to do them.

    Jon wrote:
    > it does make me very curious to see what will happen when
    > a substantial core of curated data winds up in the Wolfram
    > silo with a very robust set of tools and services
    > wrapped around it.

    Good point. Perhaps Wolfram will end up as the Adobe of data tools and de facto standards…

  10. > I boggle at the kinds of things
    > “non-programmers” have achieved using
    > Excel and cell formulas, and I have to
    > believe that more people would be able to
    > learn to do those things if they could use
    > a more humane language to do them.

    This is, by the way, the very motivation and rationale that’s driving Oslo. It takes a hybrid approach to domain-specific languages, positing that people could actually create these for themselves, and then use them textually or graphically.

  11. Jon wrote:
    > At what granularity? For example, is
    > this URL…

    Our URLs each point to a single time-series – they represent the value of a given quantity varying across time (plus any associated metadata). One of those pieces of metadata is the source (provenance) of the data – which for the case you highlighted is a spreadsheet from the European Central Bank containing the history of thirty-odd exchange rates.

    So what we’re doing is taking a document and transforming it into several addressable resources, putting a documented API onto the resources thus created, and building tools around that.

    Joe writes:
    > I hope that those tools expose their
    > innards so that if they turn out to not
    > have captured the right standardizations,
    > we can adapt them.

    Our APIs are public and the data’s downloadable, so although we by necessity impose a particular form on the data in Timetric, we’d hope that ultimately it’s no more siloed than it was up on the European Central Bank’s website. Which form is more useful will more than likely depend on what one’s trying to do.

    Jon wrote:
    > Yes but time is a scarce resource, so a
    > lot of questions go unasked/unanswered
    > even by those who could do the fiddling.

    Hopefully, the things services like ours can do – primarily through the API, in this context – help out in writing programs which give you the answers you’re after, and provide facilities which let non-programmers get results they couldn’t easily get otherwise. And as Joe writes, the boundaries between programmer and non-programmer are fuzzy; I used to be a computational physicist, and that’s a field full of people who write programs but mostly wouldn’t think of themselves as programmers. It’s something they (sometimes) do, rather than something they are.

  12. @Joe, yeah – I think we’re almost on the same page. I have a caveat, though.

    These queries still require exogenous knowledge that relates not to the data but to the way it is accessed. Python libraries are exogenous knowledge; they are simply there as a means to an end. The goal should be to absolutely minimize the need for additional such knowledge. Better syntax gets us part of the way there, but it still is incomplete. You still need to know which library and function. Python makes it almost as simple as possible, but it is as simple as possible within the domain of having code that is executed rather than a question that is asked.

    Computer literacy is not just about learning the limitations of the data, but also the seemingly arbitrary conventions of accessing it. We certainly need to learn better how to understand the limitations of various data. However, I don’t think the end game is teaching everyone basic scripting. It is a problem to me if everyone needs to know how to code in order to ask computers for things that would not require code if you asked a knowledgeable human being.

    That is different than asking computers for things that only computers are good at, such as lists of answers, sorted and filtered. But between sites like Yahoo Pipes and open.dapper.com we are getting closer to data feed nirvana.

