The joy of webscale identifiers

My guest for this week’s Innovators show, Ian Forrester, heads up the BBC’s Backstage project. Launched in 2005, Backstage lives at a cultural crossroads where legacy systems and methods intersect with their next-generation counterparts. The tagline for the feeds and APIs provided under the Backstage umbrella is “use our stuff to build your stuff.”

Admittedly that sounded a lot more exciting prior to 2006, when the BBC ended its trial of the Creative Archive service that was expected to “open the floodgates” to a “treasure trove” of cultural riches. Ian Forrester says those expectations were ratcheted back for two reasons. First, much of that treasure trove remains undigitized. Second, rights clearance proved to be an intractable problem.

So the “our stuff” that’s available to build “your stuff” turns out to be mostly metadata: news headlines, program titles and schedules. What’s more, that metadata comes from a plethora of BBC content management systems. What can you make out of these ingredients?

Here’s an evocative example: The BBC’s Tom Scott explains:

Over the last few months we’ve been plundering the NHU’s [Natural History Unit’s] archive to find the best bits — segmenting the TV programmes, tagging them (with DBpedia terms) and then aggregating them around URIs for the key concepts within the natural history domain; so that you can discover those programme segments via both the originating programme and via concepts within the natural history domain — species, habitats, adaptations and the like.

This is just the sort of remixing that Backstage ought to enable anyone, inside or outside the BBC, to achieve. Since I’m a US resident, and don’t pay the UK’s television license fee, I can’t watch the videos on that page. There’s nothing that the Backstage team can do about that. But they can take a radically open and inclusive approach to the management of the metadata that supports this remixing, and that’s just what they’re doing.

In our conversation, Ian Forrester describes how the taxonomy that governs the Backstage feeds and APIs is shared with that of Wikipedia and its structured derivative, DBpedia. Tom Scott elaborates:

You might have noticed that the slugs for our URIs (the last bit of the URL) are the same as those used by Wikipedia and DBpedia that’s because I believe in the simple joy of webscale identifiers, you will also see that much like the BBC’s music site we are transcluding the introductory text from Wikipedia to provide background information for most things. This also means that we are creating and editing Wikipedia articles where they need improving (of course you are also more than welcome to improve upon the articles).

As someone who both practices and preaches collaborative curation, I’m delighted to see the BBC taking this approach. And I love the phrase webscale identifier. Here’s how Michael Smethurst defines it, in the post pointed to by Tom Scott:

I agree with the four Linked Data rules but I’d like to try to add a fifth: if possible don’t reinvent other people’s web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall ( that’ll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became where jb9x was the identifier. But jb9x doesn’t exist anywhere outside of /music. We’ll (hopefully) never make that mistake again.

Beautifully said. Enormous synergies have gone unrealized because web publishers have chosen to mint new namespaces rather than add value to existing ones.

What I realized when talking with Ian, though, is that there is one namespace for which the BBC is the appropriate mint, namely its own. Here, for example, are some of the family of URLs for a radio drama called The Archers:


upcoming shows:

In this example b006qpgr is, at least potentially, a webscale identifier. It’s a unique tag for the show that, if used on blogs, on Twitter, and elsewhere, would make it easy to assemble all kinds of online activity related to the show. But in fact only web developers using Backstage feeds and APIs will ever discover, or use, b006qpgr. In colloquial discourse people use The Archers.

If the BBC wants people to collaborate with its namespace in the same way that it collaborates with Wikipedia’s, this would be more inviting:

It should go without saying, but right after the first rule for linked data, “Use URIs as names for things,” I would add “Where possible, choose names that make sense to people.”