I’ve really enjoyed the conversation about webscale identifiers. Naming web resources is such a crucial discipline, and yet one we’re all still making up as we go along. I ended the earlier post by suggesting that when we invent namespaces we should, where feasible, prefer names that make sense to people. In comments, a number of folks who have wrestled with the problem of ambiguity pointed out all sorts of reasons why that often just isn’t feasible.
Gavin Bell likes Amazon’s hybrid approach:
The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise.
Michael Smethurst agreed with me that the BBC’s opaque IDs — for example, b006qpgr for The Archers — could be promoted as a tag vocabulary that people would be encouraged to use:
Shownar is a prototype by Schulze and Webb that aims to track “buzz” around bbc programmes. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!
On Shownar, I find that this episode of Miss Marple was discussed in this blog entry:
BBC Radio have just started an Agatha Christie season and a whole host of programmes about the Queen of Crime are available to UK listeners on the iPlayer.
They include dramatizations of works starring super sleuths from Miss Marple to the Mysterious Mr Quin, as well as revealing documentaries.
The entry uses URLs that embed these BBC ids: b00mk71d, b007jvht. How did the author find them? Clearly, in this case, by way of the search URL which is also cited in the entry:
The search term agatha christie is wildly ambiguous, of course. Shownar would never have included this item had it not cited specific BBC shows by way of their opaque IDs. Nor would the author have cited them if that had required typing b00mk71d or b007jvht. It only works thanks to copy/paste, but it works quite nicely, and it shows why site-specific search still matters in an era of uber search engines.
This example got me thinking about the character strings that we can and do type, easily and naturally, versus those we can’t and won’t. For example:
Looking at the consistency in the left column, and the variation in the right, I’ve got to conclude that:
Practical Internet Groupware is the de facto webscale identifier for my book.
16804, 28447984, 9781565925373, pracintgr, 156592537, 1565925378, and 43188074 will never converge.
I’ve long imagined a class of equivalence services that would help us bridge the gap between vocabularies we can speak and write and those we’ll never speak and need help to write.
Both are sets of webscale identifiers that we’ll need to use in complementary ways. That’ll require a mix of social conventions and technical services.
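As a small illustration of the technical half of that mix, two of the identifiers listed above, 1565925378 and 9781565925373, are the ISBN-10 and ISBN-13 forms of the same number, and one can be computed from the other. The sketch below shows that single conversion; the function names are my own, and it says nothing about the other identifiers in the list, which have no such mechanical relationship.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to its ISBN-13 form using the 978 prefix."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError("expected a 10-character ISBN-10")
    # Keep the first nine digits; the old check digit is discarded.
    core = "978" + chars[:9]
    if not core.isdigit():
        raise ValueError("ISBN-10 body must be digits")
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


def same_book(a: str, b: str) -> bool:
    """Compare two ISBNs (10 or 13 characters) after normalizing to ISBN-13."""
    def normalize(isbn: str) -> str:
        chars = isbn.replace("-", "").replace(" ", "")
        return isbn10_to_isbn13(chars) if len(chars) == 10 else chars
    return normalize(a) == normalize(b)


print(isbn10_to_isbn13("1565925378"))
print(same_book("1565925378", "9781565925373"))
```

An equivalence service would need far more than this, since most of the other identifiers can only be linked by someone asserting that they name the same thing.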
10 thoughts on “Speaking and writing webscale identifiers”
I haven’t explored what’s been going on with BBC URIs, but there is some naming starting to appear…
> but there is some naming starting to appear
Yes. As Gavin Bell noted, it’s the Amazon approach which combines an opaque ID and readable slug.
So far, the ID is promoted only for developers:
“To access these add .xml, .json or .yaml to the end of the url.”
However it can be used to find a cluster of related things:
And the ID finds more things than the name:
Interestingly, neither finds anything on the BBC site:
Facebook has been using profile IDs (long strings of digits) to identify people, so when you sign up for FB you end up by default with a long URL like
facebook.com/profile.php?id=12345678, which obviously you can’t easily pass around to other people, or even remember.
Then you had to go into settings and do some things to get a URL like
On the other hand, MySpace has always provided URLs of the form
Adding content hints to identifiers degrades into the case where you have several variations of the identifiers in the wild. For automation to help us, it needs to know which identifiers are for the same thing. If all identifiers used the same syntax then it would be possible to do this automatically. Using the same syntax does not seem likely to happen given the number of global identification systems we already have, e.g. Handle (http://www.handle.net/rfc/rfc3650.html), DOI, URL, ISSN, ISBN, LOC, etc…
The core of the problem is not coming up with another identification system but coming up with an identification relationship system. This would address not only the same thing with multiple identifiers but also the relationships of parts to whole and variations, such as revisions or translations. Let’s work on that for a while (independent of the larger semantic-web stuff).
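A minimal sketch of what such a relationship system might record, assuming identifiers are written as scheme-prefixed strings. The class name, the relation names ("part_of"), and the "example:" identifier are all hypothetical, invented for illustration; this is not an existing service or library.

```python
from collections import defaultdict


class IdentifierRelations:
    """Toy registry: 'same thing' equivalences plus typed relationships."""

    def __init__(self):
        self._parent = {}                # union-find forest for equivalence
        self._links = defaultdict(set)   # source -> {(relation, target)}

    def _find(self, x):
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            # Path halving keeps lookups short.
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def same_as(self, a, b):
        """Assert that two identifiers name the same thing."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[rb] = ra

    def relate(self, source, relation, target):
        """Record a typed relationship such as part_of or translation_of."""
        self._links[source].add((relation, target))

    def equivalents(self, x):
        """All identifiers known to name the same thing as x (including x)."""
        root = self._find(x)
        return {i for i in list(self._parent) if self._find(i) == root}

    def related(self, x, relation):
        """Targets related to x, or to any equivalent of x, by the relation."""
        return {
            target
            for alias in self.equivalents(x)
            for rel, target in self._links[alias]
            if rel == relation
        }


registry = IdentifierRelations()
registry.same_as("isbn:1565925378", "isbn:9781565925373")
registry.same_as("isbn:9781565925373", "title:Practical Internet Groupware")
# Hypothetical part-to-whole relationship for a chapter of the book.
registry.relate("example:chapter-1", "part_of", "isbn:1565925378")
print(sorted(registry.equivalents("title:Practical Internet Groupware")))
```

The equivalences here are asserted by hand; the hard part of a real system would be deciding who gets to make those assertions and how much to trust them.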
> The core of the problem is not coming up
> with another identification system but
> coming up with an identification relationship
I agree. And that is (not coincidentally) why I’ve been talking to Kingsley Idehen and Stefano Mazzocchi about that:
Thanks for the link to the long one.
John, would you please change “semitic-web” to “semantic-web”. I am sure there is much groundbreaking scholarship being done regards the semitic-web but I really wanted to reference the semantic-web work. The other typos can stay.
> groundbreaking scholarship being done regards the semitic-web
[Chuckle] OK, done.