Speaking and writing webscale identifiers

I’ve really enjoyed the conversation about webscale identifiers. Naming web resources is such a crucial discipline, and yet one we’re all still making up as we go along. I ended the earlier post by suggesting that when we invent namespaces we should, where feasible, prefer names that make sense to people. In comments, a number of folks who have wrestled with the problem of ambiguity pointed out all sorts of reasons why that often just isn’t feasible.

Gavin Bell likes Amazon’s hybrid approach:

The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise.

Michael Smethurst agreed with me that the BBC’s opaque IDs — for example, b006qpgr for The Archers — could be promoted as a tag vocabulary that people would be encouraged to use:

Shownar is a prototype by Schulze and Webb that aims to track “buzz” around bbc programmes. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!

On Shownar, I find that this episode of Miss Marple was discussed in this blog entry:

BBC Radio have just started an Agatha Christie season and a whole host of programmes about the Queen of Crime are available to UK listeners on the iPlayer.

They include dramatizations of works starring super sleuths from Miss Marple to the Mysterious Mr Quin, as well as revealing documentaries.

The entry uses URLs that embed these BBC ids: b00mk71d, b007jvht. How did the author find them? Clearly, in this case, by way of the search URL which is also cited in the entry:

http://www.bbc.co.uk/iplayer/search/?q=agatha christie

The search term agatha christie is wildly ambiguous, of course. Shownar would never have included this item had it not cited specific BBC shows by way of their opaque IDs. Nor would the author have cited them if that had required typing b00mk71d or b007jvht. It only works thanks to copy/paste, but it works quite nicely, and it shows why site-specific search still matters in an era of uber search engines.

This example got me thinking about the character strings that we can and do type, easily and naturally, versus those we can’t and won’t. For example:

queries (what we can and do type) results (what we can’t and don’t type)
http://www.librarything.com/catalog/jonudell&deepsearch=
practical internet groupware

http://www.librarything.com/work/16804

http://www.librarything.com/work/16804/book/28447984

http://www.google.com/search?q=
practical internet groupware

http://oreilly.com/catalog/9781565925373

http://oreilly.com/catalog/pracintgr

http://www.bing.com/results.aspx?q=
practical internet groupware

http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/156592537

http://my.safaribooksonline.com/1565925378

http://www.worldcat.org/search?q=
practical internet groupware

http://www.worldcat.org/oclc/43188074

http://www.amazon.com/s?index=blended&field-keywords=
practical internet groupware

http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/1565925378

 

Looking at the consistency on the left column, and the variation on the right, I’ve got to conclude that:

  1. Practical Internet Groupware is the de facto webscale identifier for my book.

  2. 16804, 28447984, 9781565925373, pracintgr, 156592537, 1565925378, and 43188074 will never converge.

I’ve long imagined a class of equivalence services that would help us bridge the gap between vocabularies we can speak and write and those we’ll never speak and need help to write.

Both are sets of webscale identifiers that we’ll need to use in complementary ways. That’ll require a mix of social conventions and technical services.

The joy of webscale identifiers

My guest for this week’s Innovators show, Ian Forrester, heads up the BBC’s Backstage project. Launched in 2005, Backstage lives at a cultural crossroads where legacy systems and methods intersect with their next-generation counterparts. The tagline for the feeds and APIs provided under the Backstage umbrella is “use our stuff to build your stuff.”

Admittedly that sounded a lot more exciting prior to 2006, when the BBC ended its trial of the Creative Archive service that was expected to “open the floodgates” to a “treasure trove” of cultural riches. Ian Forrester says those expectations were ratcheted back for two reasons. First, much of that treasure trove remains undigitized. Second, rights clearance proved to be an intractable problem.

So the “our stuff” that’s available to build “your stuff” turns out to be mostly metadata: news headlines, program titles and schedules. What’s more, that metadata comes from a plethora of BBC content management systems. What can you make out of these ingredients?

Here’s an evocative example: http://www.bbc.co.uk/nature/species/African_Bush_Elephant. The BBC’s Tom Scott explains:

Over the last few months we’ve been plundering the NHU’s [Natural History Unit’s] archive to find the best bits — segmenting the TV programmes, tagging them (with DBpedia terms) and then aggregating them around URIs for the key concepts within the natural history domain; so that you can discover those programme segments via both the originating programme and via concepts within the natural history domain — species, habitats, adaptations and the like.

This is just the sort of remixing that Backstage ought to enable anyone, inside or outside the BBC, to achieve. Since I’m a US resident, and don’t pay the UK’s television license fee, I can’t watch the videos on that page. There’s nothing that the Backstage team can do about that. But they can take a radically open and inclusive approach to the management of the metadata that supports this remixing, and that’s just what they’re doing.

In our conversation, Ian Forrester describes how the taxonomy that governs the Backstage feeds and APIs is shared with that of Wikipedia and its structured derivative, DBpedia. Tom Scott elaborates:

You might have noticed that the slugs for our URIs (the last bit of the URL) are the same as those used by Wikipedia and DBpedia that’s because I believe in the simple joy of webscale identifiers, you will also see that much like the BBC’s music site we are transcluding the introductory text from Wikipedia to provide background information for most things. This also means that we are creating and editing Wikipedia articles where they need improving (of course you are also more than welcome to improve upon the articles).

As someone who both practices and preaches collaborative curation, I’m delighted to see the BBC taking this approach. And I love the phrase webscale identifier. Here’s how Michael Smethurst defines it, in the post pointed to by Tom Scott:

I agree with the four Linked Data rules but I’d like to try to add a fifth: if possible don’t reinvent other people’s web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that’ll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn’t exist anywhere outside of /music. We’ll (hopefully) never make that mistake again.

Beautifully said. Enormous synergies have gone unrealized because web publishers have chosen to mint new namespaces rather than add value to existing ones.

What I realized when talking with Ian, though, is that there is one namespace for which the BBC is the appropriate mint, namely its own. Here, for example, are some of the family of URLs for a radio drama called The Archers:

homepage: http://www.bbc.co.uk/programmes/b006qpgr/

upcoming shows: http://www.bbc.co.uk/programmes/b006qpgr/episodes/upcoming.xml

In this example b006qpgr is, at least potentially, a webscale identifier. It’s a unique tag for the show that, if used on blogs, on Twitter, and elsewhere, would make it easy to assemble all kinds of online activity related to the show. But in fact only web developers using Backstage feeds and APIs will ever discover, or use, b006qpgr. In colloquial discourse people use The Archers.

If the BBC wants people to collaborate with its namespace in the same way that it collaborates with Wikipedia’s, this would be more inviting:

http://www.bbc.co.uk/programmes/The_Archers/

http://www.bbc.co.uk/programmes/The_Archers/episodes/upcoming.xml

It should go without saying, but right after the first rule for linked data, “Use URIs as names for things,” I would add “Where possible, choose names that make sense to people.”