The joy of webscale identifiers

My guest for this week’s Innovators show, Ian Forrester, heads up the BBC’s Backstage project. Launched in 2005, Backstage lives at a cultural crossroads where legacy systems and methods intersect with their next-generation counterparts. The tagline for the feeds and APIs provided under the Backstage umbrella is “use our stuff to build your stuff.”

Admittedly that sounded a lot more exciting prior to 2006, when the BBC ended its trial of the Creative Archive service that was expected to “open the floodgates” to a “treasure trove” of cultural riches. Ian Forrester says those expectations were ratcheted back for two reasons. First, much of that treasure trove remains undigitized. Second, rights clearance proved to be an intractable problem.

So the “our stuff” that’s available to build “your stuff” turns out to be mostly metadata: news headlines, program titles and schedules. What’s more, that metadata comes from a plethora of BBC content management systems. What can you make out of these ingredients?

Here’s an evocative example: http://www.bbc.co.uk/nature/species/African_Bush_Elephant. The BBC’s Tom Scott explains:

Over the last few months we’ve been plundering the NHU’s [Natural History Unit’s] archive to find the best bits — segmenting the TV programmes, tagging them (with DBpedia terms) and then aggregating them around URIs for the key concepts within the natural history domain; so that you can discover those programme segments via both the originating programme and via concepts within the natural history domain — species, habitats, adaptations and the like.

This is just the sort of remixing that Backstage ought to enable anyone, inside or outside the BBC, to achieve. Since I’m a US resident, and don’t pay the UK’s television license fee, I can’t watch the videos on that page. There’s nothing that the Backstage team can do about that. But they can take a radically open and inclusive approach to the management of the metadata that supports this remixing, and that’s just what they’re doing.

In our conversation, Ian Forrester describes how the taxonomy that governs the Backstage feeds and APIs is shared with that of Wikipedia and its structured derivative, DBpedia. Tom Scott elaborates:

You might have noticed that the slugs for our URIs (the last bit of the URL) are the same as those used by Wikipedia and DBpedia that’s because I believe in the simple joy of webscale identifiers, you will also see that much like the BBC’s music site we are transcluding the introductory text from Wikipedia to provide background information for most things. This also means that we are creating and editing Wikipedia articles where they need improving (of course you are also more than welcome to improve upon the articles).

As someone who both practices and preaches collaborative curation, I’m delighted to see the BBC taking this approach. And I love the phrase webscale identifier. Here’s how Michael Smethurst defines it, in the post pointed to by Tom Scott:

I agree with the four Linked Data rules but I’d like to try to add a fifth: if possible don’t reinvent other people’s web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that’ll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn’t exist anywhere outside of /music. We’ll (hopefully) never make that mistake again.

Beautifully said. Enormous synergies have gone unrealized because web publishers have chosen to mint new namespaces rather than add value to existing ones.
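
To make the idea of a web identifier concrete, here is a minimal Python sketch that pulls the identifying fragment out of a URL and checks whether two sites share it. The function names and the list of extensions to strip are illustrative assumptions, and the DBpedia URL assumes the usual http://dbpedia.org/resource/ pattern; none of this is BBC or MusicBrainz code.

```python
from urllib.parse import unquote, urlparse

# Assumption: these are the only file extensions worth stripping.
KNOWN_EXTENSIONS = (".html", ".xml")


def web_identifier(url: str) -> str:
    """Return the last path segment of a URL, minus a known extension."""
    path = unquote(urlparse(url).path).rstrip("/")
    segment = path.rsplit("/", 1)[-1]
    for ext in KNOWN_EXTENSIONS:
        if segment.endswith(ext):
            return segment[: -len(ext)]
    return segment


def same_identifier(url_a: str, url_b: str) -> bool:
    """True when two URLs on different sites reuse the same identifier."""
    return web_identifier(url_a) == web_identifier(url_b)


if __name__ == "__main__":
    bbc = "http://www.bbc.co.uk/nature/species/African_Bush_Elephant"
    dbpedia = "http://dbpedia.org/resource/African_Bush_Elephant"  # assumed pattern
    old_music = "http://www.bbc.co.uk/music/artist/jb9x/"
    musicbrainz = (
        "http://musicbrainz.org/artist/"
        "d5da1841-9bc8-4813-9f89-11098090148e.html"
    )
    print(same_identifier(bbc, dbpedia))            # shared slug
    print(same_identifier(old_music, musicbrainz))  # privately minted ID
```

The first pair lines up because the slug was borrowed rather than reinvented; the second does not, which is the mistake Smethurst describes.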

What I realized when talking with Ian, though, is that there is one namespace for which the BBC is the appropriate mint, namely its own. Here, for example, are some of the family of URLs for a radio drama called The Archers:

homepage: http://www.bbc.co.uk/programmes/b006qpgr/

upcoming shows: http://www.bbc.co.uk/programmes/b006qpgr/episodes/upcoming.xml

In this example b006qpgr is, at least potentially, a webscale identifier. It’s a unique tag for the show that, if used on blogs, on Twitter, and elsewhere, would make it easy to assemble all kinds of online activity related to the show. But in fact only web developers using Backstage feeds and APIs will ever discover, or use, b006qpgr. In colloquial discourse people use The Archers.

If the BBC wants people to collaborate with its namespace in the same way that it collaborates with Wikipedia’s, this would be more inviting:

http://www.bbc.co.uk/programmes/The_Archers/

http://www.bbc.co.uk/programmes/The_Archers/episodes/upcoming.xml

It should go without saying, but right after the first rule for linked data, “Use URIs as names for things,” I would add “Where possible, choose names that make sense to people.”

20 thoughts on “The joy of webscale identifiers”

  1. Hello!

    Nice post! One thing though: it would be hard to re-use Wikipedia URL keys consistently across http://www.bbc.co.uk/programmes, as not all programmes have a Wikipedia page (far, far from that… brands are relatively well covered, but not all BBC episodes have a corresponding page on Wikipedia). BBC Earth is different, as they deliberately only publish their data when there is a corresponding entry on Wikipedia. It wouldn’t make sense for BBC Programmes, as it would amount to lots of data not being exposed.

    Another thing, I don’t think “webscale identifiers” is the same as “names that make sense to people”. For example, MusicBrainz GUIDs are well-established identifiers for things in the music domain. They are opaque, but very, very useful. Wikipedia URL keys are readable, but one side-effect of that is that they change all the time!! From “Madonna” to “Madonna_(entertainer)” for example…

    Best,
    y

    1. It’s not quite true that we only publish where there is a corresponding Wikipedia entry.

      There are instances where we have content and Wikipedia doesn’t – in those instances we create new Wikipedia pages.

      There might also be instances in the future where we could have problems – for example ‘World on the Move’ tracked individual animals, Big Cat does the same, indeed lots of wildlife programmes do – I would like to have a URI for those animals but will the Wikipedia community think of these animals as sufficiently important to have a page? Possibly not – if that’s the case we’ll need to mint our own identifiers.

      Glad you like the site and our design decisions, hopefully when we get radio content in there the site will become more engaging for those that live outside the UK.

  2. Yes, I agree with Yves, it will be a problem to keep readable identifiers unique in every case. I think people often aren’t really interested in what a full URL looks like; they are more interested in the information that is presented through it. That means that one can at least add a meaningful topic to that URL, which will be presented to the human. Handling the URLs is the task of the programmers and the applications; customers shouldn’t have to care about them, and will hopefully get the information they want.

    Cheers,

    zazi

  3. Agreed, Wikipedia page titles can change – pages can be moved, merged, redirected, deleted. MusicBrainz IDs are good because they’re intended to be persistent. Another nice example might be IMDb identifiers for films – again, opaque and persistent.

  4. > not all programmes have a Wikipedia page

    Right. The example is predicated on the assumption that topics in natural science will, and that to the extent they do, BBC aggregations can align to that taxonomy.

    This is an intriguing dilemma, of course. A URL namespace is a realm where the interests of computers/software/services intersect with the interests of humans/users/customers.

    I can guess http://wikipedia.org/wiki/Madonna, and you’ll know what I (probably) mean, and it’ll at least partly work, by taking you to a disambiguation page.

    I’ll never guess http://musicbrainz.org/artist/79239441-bfd5-4981-a70c-55c3f15c1287.html, and you’ll never know what I mean when you see it.

    I don’t think that opaqueness and persistence are necessarily connected. The service that mints a namespace controls that namespace and determines its persistence. There are zillions of dead links of both sorts: opaque and readable.

    Meanwhile, of course, thanks to Twitter, we are increasingly collapsing the readable names down to opaque IDs. Around and around we go!

    The bottom line for me, though, is that computers and humans share the use of URL namespace. Computers don’t care, but humans bookmark URLs, copy and paste them, make lists of them, email them, blog them. Opaqueness adds considerable friction to these activities. Does it reduce friction elsewhere by an equal or greater amount? Maybe in some cases, but if so I’m not sure what they are.

  5. Hi Jon

    Reading this I’m reminded that I really should get round to writing the blog post on the design decisions behind /programmes URIs that I’ve meant to write for the last year or so. They’ve been a source of contention inside and outside the BBC for quite a while but I think / hope there are sound reasons behind the final design.

    Way back in 2004 Tom Coates wrote a blog post on developing a URL structure for broadcast radio sites that formed the foundations of the /programmes work. Since then numerous things have changed in both the data model and the pipes and conduits that feed it.

    Afraid it’s all rather difficult to explain in the limited space of a comment box but factors we took into consideration were:

    – cross network repeats (so no /radio3)
    – programme name ambiguity (there are a lot of programmes called breakfast eg)
    – programme brands that change name over time
    – programmes that are never broadcast
    – programmes from the archive
    – the rather strange route that programme data takes before it reaches /programmes and iPlayer

    Basically, human readable URIs are kinda nice (although not everyone speaks English and browsers are beginning to get to the language accept headers stage!?!) but persistence outweighs. And in almost every domain language and labeling change over time. Wiki|DBpedia have an easier job cos they’ve got a whole army of willing galley slaves to mint the webscale identifiers for them. Unfortunately most organisations aren’t so lucky.

    For the record both /music and /programmes URIs are kinda hackable / guessable. If you type http://www.bbc.co.uk/programmes/programme_name:

    – if there’s only one programme with that text in its name you’ll go directly to the programme as in http://www.bbc.co.uk/programmes/Kermode%20and%20Mayos%20Film%20Review

    – or if there are many programmes with that text in the name you’ll be taken to a disambiguation page as in http://www.bbc.co.uk/programmes/kermode

    The same’s true of /music so http://www.bbc.co.uk/music/artists/u2 gives you a disambiguation page whereas http://www.bbc.co.uk/music/artists/stone%20roses takes you straight to the artist page. You don’t need to type the %20s – they’re just there for the benefit of WordPress comment processing – spaces work just as well. It’s a hackable feature of the URIs that we probably don’t make enough of…
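
The lookup behaviour described in this comment can be sketched roughly as follows. This is a guess at the logic, not the actual /programmes implementation: the catalogue, the Resolution type and the matching rule (case-insensitive substring) are all assumptions, and only the PID b006qpgr comes from the post; the other PIDs are invented placeholders.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import unquote

# Hypothetical catalogue of PID -> title. Only b006qpgr is real.
CATALOGUE = {
    "b006qpgr": "The Archers",
    "x0000001": "The Office",  # invented PID
    "x0000002": "The Office",  # invented PID
}


@dataclass
class Resolution:
    kind: str        # "redirect", "disambiguation" or "not_found"
    pids: List[str]  # matching PIDs, sorted for stable output


def resolve(slug: str) -> Resolution:
    """Map whatever was typed after /programmes/ to one or more PIDs."""
    if slug in CATALOGUE:  # already an opaque identifier
        return Resolution("redirect", [slug])
    # Treat %20 and underscores as spaces, and ignore case.
    text = unquote(slug).replace("_", " ").strip().lower()
    if not text:
        return Resolution("not_found", [])
    matches = sorted(
        pid for pid, title in CATALOGUE.items() if text in title.lower()
    )
    if len(matches) == 1:
        return Resolution("redirect", matches)  # straight to /programmes/:pid
    if matches:
        return Resolution("disambiguation", matches)
    return Resolution("not_found", [])


if __name__ == "__main__":
    for typed in ("The_Archers", "the%20office", "b006qpgr", "no such show"):
        print(typed, "->", resolve(typed))
```

The readable name is only ever an entry point here; the PID stays the canonical key, so a second programme with the same title changes the outcome from a redirect to a disambiguation page without moving any existing URI.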

    Anyway, hope this explains a little. And I’ll try to write a post to explain more fully…

    Michael

  6. Hi Michael,

    Thanks for your thoughtful response.

    > – if there’s only one programme with that
    > text in its name you’ll go directly to
    > the programme as in http://www.bbc.co.uk
    > /programmes
    > /Kermode%20and%20Mayos%20Film%20Review

    > – or if there are many programmes with
    > that text in the name you’ll be taken to
    > a disambiguation page as in
    > http://www.bbc.co.uk/programmes/kermode

    Well there you go. Beautiful! This works after all:

    http://www.bbc.co.uk/programmes/The_Archers

    This of course leads to the follow-on question: Should a system prefer to present the human-readable form when it is equivalent to a unique ID? In this case, for example, the system could resolve The_Archers to b006qpgr internally, as it does now, but hide the redirection.

    This would signal to users that the human-readable form is available and in fact encouraged for interaction in the human realm. But it’s arguably wrong for robots, assuming that they aren’t prepared to deal with disambiguation. Unless of course they are, in which case we’re entering the scary zone of content negotiation.

    Around and around we go!

    1. Keeping the human readable form and not redirecting to the PID (the opaque ID – b006qpgr) might work if the data set were static but unfortunately programme making people keep coming up with new ideas for programmes ;-)

      And many different programmes share the same name (there’ve been something like 7 “The Office”s over time). It’s made more difficult when you take into account not only all the programmes that may be made in the future but also the ambition to get all the programmes in the BBC archive into /programmes.

      So whilst The Archers might be THE one and only Archers for now (and so be ok to sit at /programmes/the%20archers), there might be a different The Archers in the future (or in the past). When another The Archers goes into PIPs (the system that powers /programmes and iPlayer) then THE Archers would have to move URI to /programmes/:pid and we try to resist movement if at all possible.

      Realise The Archers is a bit of a bad example here. It’s been around for decades and will probably be around for decades longer. I’m pretty sure there’s never been a different The Archers in the past and pretty certain no-one would be stupid enough to make a different The Archers in the future but the problem does exist for other programmes.

      Whilst it’s fine for the period of time where names are unique, as more data gets added and uniqueness gets lost then stuff has to move…

      You could build a system to manage human readable url keys over time and employ people to manage that system but I’m not sure that’s how I’d want my licence fee spent?!?

      Finally there are plans to support multiple language variants of programme data (English, Welsh etc) which will hopefully sit at the same URIs and be content negotiated to the appropriate representation. In which case should the url key be english or welsh or opaque???

      For such a simple thing as radio and telly programmes it does all get rather nastily complicated :-/

  7. > For such a simple thing as radio and
    > telly programmes it does all get rather
    > nastily complicated :-/

    Not simple at all, in any domain. Namespace management is one of the hardest problems of all.

    I’m sure you’ve thought about the possibility of decorating names with modifiers — sequence numbers, dates — to account for different programs using the same names.

    But your points about resources to manage that system, and multiple languages, are well taken.

    Given all this, do you think it’s feasible/desirable to promote the opaque ids as magnets for chatter in the various -spheres?

    1. > I’m sure you’ve thought about the
      > possibility of decorating names with
      > modifiers — sequence numbers, dates —
      > to account for different programs
      > using the same names.

      It’s been discussed, yes. Part of the trouble is that no-one wants to be /the_office_2 or /u2_2, and then things get political and you need human intervention.

      > Given all this, do you think it’s
      > feasible/desirable to promote the
      > opaque ids as magnets for chatter in
      > the various -spheres?

      YES :-) There’s a recent post on radiolabs discussing machine tagging the BBC that sets out some of it.

      Also Shownar is a prototype by Schulze and Webb that aims to track “buzz” around BBC programmes. It’s due to be integrated into /programmes at some point. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!

  8. Hi Jon, I worked on the original PIPs Radio3 version of the BBC giving URLs to programmes, which Tom Coates wrote up so well. I wanted to add two other points to the discussion.

    Take Pride and Prejudice, which has been several plays, radio plays and TV mini-series. This ambiguity was instrumental in our going for a short opaque identifier. We were also thinking of creating identifiers that would be unique for 30-40 years, and so not prone to the whims of a producer deciding that they had to have /programmes/prideandprejudice 5-10 years out and breaking the previous URL. The Archers is too strong and persistent a brand to be useful in this case.

    An amusing one is Bells on Sunday which is repeated on Monday morning…

    The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise. Amazon also face the issue of multiple editions and formats for their books and DVDs etc.
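
A rough sketch of that compromise, applied to programme URLs: the opaque identifier carries the meaning and the readable part is decoration that the server ignores. The path layout /programmes/<pid>/<slug> and the helper names are hypothetical; this is not a layout the BBC or Amazon is known to use in this form.

```python
import re
from typing import Optional


def slugify(title: str) -> str:
    """Lower-case a title and collapse anything non-alphanumeric to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_path(pid: str, title: str) -> str:
    """Build a path whose readable tail is purely cosmetic."""
    return f"/programmes/{pid}/{slugify(title)}"


def pid_from_path(path: str) -> Optional[str]:
    """Recover the PID and ignore whatever readable text follows it."""
    parts = [p for p in path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "programmes":
        return parts[1]
    return None


if __name__ == "__main__":
    path = build_path("b006qpgr", "The Archers")
    print(path)
    # A renamed or mistyped slug still resolves to the same programme.
    print(pid_from_path("/programmes/b006qpgr/some-older-title"))
```

Because only the PID is consulted, a programme can change its name, or share it with another, without breaking links that were copied with the old readable tail.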

    thanks
    Gavin
