Joining web namespaces

9 Mar 2010 / 29 Mar 2010 ~ Jon Udell ~ 5 Comments

The other day I read the following statement in the Economist:

Sensitivity of the data will decide if an application is suitable for processing in the cloud.

The writer does not mention, and probably is unaware of, the principle of translucent data. In a translucent database, the data is encrypted and thus opaque to the operator of the database. Users of the data share keys to unlock the data, and can do anything with cleartext copies that they keep locally. Can real and useful applications be built in this kind of regime? We don’t really know, because hardly anybody has tried. But if it turns out to be possible, it could become a foundation of cloud computing.

I wanted to advance the story. In particular, I wanted to help make a connection between that statement in the Economist and the idea of data translucency. I’ve written about translucency on my blog, and those entries are tagged on delicious. But nowadays the attention stream flows mainly through Twitter. So I composed this tweet:

Economist: “Sensitivity of the data will decide if an application is suitable for processing in the cloud.” Unless the data is #translucent.

There’s a limit to what you can do in 140 characters. That tweet uses all 140, but still falls short of what I wanted to do:

Quote from the Economist
Link to the Economist
Colonize a formerly empty hashtag namespace (#translucency)
Connect that namespace to its delicious counterpart

Inevitably I failed to do all that in 140 characters. Reflecting on the failure, I made this LazyWeb wish:

I wish I could tweet the command “join http://delicious.com/judell/translucency to #translucent and #translucency”

I’ve had some success joining tag namespaces from different domains. I mentioned the idea in this entry, and a commenter (engtech) provided a nifty solution based on Yahoo Pipes. I have since used it to keep track of items tagged icalvalid on blogs, on delicious, and on Twitter.¹

My LazyWeb wish came from that experience, plus another which I wrote up in an entry entitled To: elmcity, From: @curator, Message: start. That entry describes how elmcity curators can now use Twitter direct messages to send commands to the elmcity service. The mechanism harkens back to Rael Dornfest’s brilliant Sandy, a service that acted as a personal assistant and responded to a repertoire of command messages.

Sandy lost her job when Rael went to work for Twitter. I’ve wondered if she would be rehired there. If so, a command like the one I proposed might be an example of the kind of thing she could do.

On further reflection, I’m not really sure what such a command would mean, or whether it would make sense to use Twitter to send it, or indeed whether it would make sense for Twitter (rather than some other service) to respond to it. But I’m in an exploratory mood, so let’s explore.

It would be straightforward to create a service that would take the Yahoo Pipes trick to the next level. Instead of editing and saving a Yahoo Pipe, you’d just command that service to merge the set of feeds for some tag. That command might best take the form of a URL:

http://tagjoiner.org/join/TAG?delicious=yes&twitter=yes&wordpress=yes

As is true for my combined icalvalid feed, the result formats could be HTML for viewing and RSS for feed splicing. As the creator of the joined feed, I’m aware that it exists, and I can cite it when I want to direct people’s attention to the union of the namespaces.
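Here is a minimal sketch of what such a join service might do behind that URL. The tagjoiner.org host comes from the example above and is not a real service. The function names, the (date, title, link) item shape, and the de-duplication by link are all assumptions made for illustration. The source labels mirror the [DELICIOUS] and [TWITTER] markers used in the combined icalvalid feed.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical base URL, taken from the example in the text (not a real service).
JOIN_BASE = "http://tagjoiner.org/join/"

def build_join_url(tag, sources):
    """Build a join command URL for one tag across several sources."""
    query = urlencode({name: "yes" for name in sources})
    return f"{JOIN_BASE}{tag}?{query}"

def parse_join_url(url):
    """Recover the tag and the list of requested sources from a join URL."""
    parsed = urlparse(url)
    tag = parsed.path.rsplit("/", 1)[-1]
    sources = [name for name, values in parse_qs(parsed.query).items()
               if values == ["yes"]]
    return tag, sources

def merge_feeds(feeds):
    """Merge per-source item lists into one labeled feed, newest first.

    feeds maps a source name to a list of (iso_date, title, link) tuples.
    Items whose link was already seen in an earlier source are skipped.
    """
    seen = set()
    merged = []
    for source, items in feeds.items():
        for date, title, link in items:
            if link in seen:
                continue
            seen.add(link)
            merged.append((date, f"[{source.upper()}] {title}", link))
    # ISO dates sort correctly as plain strings.
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged

url = build_join_url("icalvalid", ["delicious", "twitter", "wordpress"])
print(url)
print(parse_join_url(url))
```

A real service would fetch each source's feed for the tag and render the merged list as HTML or RSS. The sketch covers only the command parsing and the merge step.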

But suppose I wanted the joined namespace to be more discoverable than that? Here’s where it might make sense for Twitter to be involved. If a hashtag search on Twitter did the join, it could be made evident to the followers of the person making the join request, or even to anyone searching for the hashtag involved in the request.

This is almost surely too indirect and too abstract to ever make sense as a mainstream feature. But it’s fun to imagine. If I’ve made an investment in a tag on delicious, or WordPress, or somewhere else, I’d like to be able to bring those items to the attention of people who encounter the corresponding Twitter hashtag.

The general idea behind all this goes way beyond Twitter, of course. Waiting in the wings is a whole class of services that reconcile different web namespaces.

¹ That feed used to include a mix of items marked [DELICIOUS] and [TWITTER]. But the Twitter items are less durable and seem to have aged out of the combined feed.

Speaking and writing webscale identifiers

17 Sep 2009 / 26 Mar 2010 ~ Jon Udell ~ 10 Comments

I’ve really enjoyed the conversation about webscale identifiers. Naming web resources is such a crucial discipline, and yet one we’re all still making up as we go along. I ended the earlier post by suggesting that when we invent namespaces we should, where feasible, prefer names that make sense to people. In comments, a number of folks who have wrestled with the problem of ambiguity pointed out all sorts of reasons why that often just isn’t feasible.

Gavin Bell likes Amazon’s hybrid approach:

The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise.
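To make the hybrid idea concrete, here is a small sketch that pulls the identifying part out of an Amazon-style URL and ignores the readable part. The /dp/ pattern is inferred from the Amazon URLs that appear later in this post. The function name is hypothetical, and this is not a description of how Amazon itself resolves URLs.

```python
import re
from urllib.parse import urlparse

def amazon_key(url):
    """Return the path segment after /dp/, or None if there isn't one.

    Everything before /dp/ is treated as decoration for human readers.
    """
    match = re.search(r"/dp/([^/?#]+)", urlparse(url).path)
    return match.group(1) if match else None

pretty = "http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/1565925378"
plain = "http://www.amazon.com/dp/1565925378"

# Both forms carry the same identifier, so they name the same thing.
print(amazon_key(pretty) == amazon_key(plain))
```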

Michael Smethurst agreed with me that the BBC’s opaque IDs — for example, b006qpgr for The Archers — could be promoted as a tag vocabulary that people would be encouraged to use:

Shownar is a prototype by Schulze and Webb that aims to track “buzz” around bbc programmes. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!

On Shownar, I find that this episode of Miss Marple was discussed in this blog entry:

BBC Radio have just started an Agatha Christie season and a whole host of programmes about the Queen of Crime are available to UK listeners on the iPlayer.

They include dramatizations of works starring super sleuths from Miss Marple to the Mysterious Mr Quin, as well as revealing documentaries.

The entry uses URLs that embed these BBC ids: b00mk71d, b007jvht. How did the author find them? Clearly, in this case, by way of the search URL which is also cited in the entry:

http://www.bbc.co.uk/iplayer/search/?q=agatha christie

The search term agatha christie is wildly ambiguous, of course. Shownar would never have included this item had it not cited specific BBC shows by way of their opaque IDs. Nor would the author have cited them if that had required typing b00mk71d or b007jvht. It only works thanks to copy/paste, but it works quite nicely, and it shows why site-specific search still matters in an era of uber search engines.

This example got me thinking about the character strings that we can and do type, easily and naturally, versus those we can’t and won’t. For example:

queries (what we can and do type)	results (what we can’t and don’t type)
http://www.librarything.com/catalog/jonudell&deepsearch= `practical internet groupware`	http://www.librarything.com/work/`16804` http://www.librarything.com/work/16804/book/`28447984`
http://www.google.com/search?q= `practical internet groupware`	http://oreilly.com/catalog/`9781565925373` http://oreilly.com/catalog/`pracintgr`
http://www.bing.com/results.aspx?q= `practical internet groupware`	http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/`156592537` http://my.safaribooksonline.com/`1565925378`
http://www.worldcat.org/search?q= `practical internet groupware`	http://www.worldcat.org/oclc/`43188074`
http://www.amazon.com/s?index=blended&field-keywords= `practical internet groupware`	http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/`1565925378`

Looking at the consistency in the left column, and the variation in the right, I’ve got to conclude that:

Practical Internet Groupware is the de facto webscale identifier for my book.
16804, 28447984, 9781565925373, pracintgr, 156592537, 1565925378, and 43188074 will never converge.

I’ve long imagined a class of equivalence services that would help us bridge the gap between vocabularies we can speak and write and those we’ll never speak and need help to write.

Both are sets of webscale identifiers that we’ll need to use in complementary ways. That’ll require a mix of social conventions and technical services.
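As a rough illustration of what an equivalence service might offer, here is a sketch that maps the name we can type to the identifiers we can't, and back again. The URL prefixes and identifiers are copied from the table above. The table structure, the namespace labels, and the function names are assumptions, and a real service would need a shared, curated store rather than a hardcoded dictionary.

```python
# Hypothetical equivalence table: spoken name -> {namespace: (URL prefix, id)}.
# Prefixes and identifiers come from the table in this post.
EQUIVALENTS = {
    "practical internet groupware": {
        "librarything": ("http://www.librarything.com/work/", "16804"),
        "oreilly": ("http://oreilly.com/catalog/", "9781565925373"),
        "worldcat": ("http://www.worldcat.org/oclc/", "43188074"),
        "amazon": ("http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/",
                   "1565925378"),
    },
}

def normalize(name):
    """Lowercase and collapse whitespace so typed names match loosely."""
    return " ".join(name.lower().split())

def identifiers_for(name):
    """Map a typed name to its opaque identifier in each known namespace."""
    entry = EQUIVALENTS.get(normalize(name), {})
    return {namespace: ident for namespace, (_, ident) in entry.items()}

def url_for(name, namespace):
    """Build the full URL for a name in one namespace, or None if unknown."""
    entry = EQUIVALENTS.get(normalize(name), {})
    if namespace not in entry:
        return None
    prefix, ident = entry[namespace]
    return prefix + ident

def name_for(identifier):
    """Reverse lookup: find the speakable name behind an opaque identifier."""
    for name, entry in EQUIVALENTS.items():
        if any(ident == identifier for _, ident in entry.values()):
            return name
    return None

print(identifiers_for("Practical Internet Groupware"))
print(url_for("practical internet groupware", "worldcat"))
print(name_for("9781565925373"))
```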

The joy of webscale identifiers

31 Aug 2009 / 29 Mar 2010 ~ Jon Udell ~ 20 Comments

My guest for this week’s Innovators show, Ian Forrester, heads up the BBC’s Backstage project. Launched in 2005, Backstage lives at a cultural crossroads where legacy systems and methods intersect with their next-generation counterparts. The tagline for the feeds and APIs provided under the Backstage umbrella is “use our stuff to build your stuff.”

Admittedly that sounded a lot more exciting prior to 2006, when the BBC ended its trial of the Creative Archive service that was expected to “open the floodgates” to a “treasure trove” of cultural riches. Ian Forrester says those expectations were ratcheted back for two reasons. First, much of that treasure trove remains undigitized. Second, rights clearance proved to be an intractable problem.

So the “our stuff” that’s available to build “your stuff” turns out to be mostly metadata: news headlines, program titles and schedules. What’s more, that metadata comes from a plethora of BBC content management systems. What can you make out of these ingredients?

Here’s an evocative example: http://www.bbc.co.uk/nature/species/African_Bush_Elephant. The BBC’s Tom Scott explains:

Over the last few months we’ve been plundering the NHU’s [Natural History Unit’s] archive to find the best bits — segmenting the TV programmes, tagging them (with DBpedia terms) and then aggregating them around URIs for the key concepts within the natural history domain; so that you can discover those programme segments via both the originating programme and via concepts within the natural history domain — species, habitats, adaptations and the like.

This is just the sort of remixing that Backstage ought to enable anyone, inside or outside the BBC, to achieve. Since I’m a US resident, and don’t pay the UK’s television license fee, I can’t watch the videos on that page. There’s nothing that the Backstage team can do about that. But they can take a radically open and inclusive approach to the management of the metadata that supports this remixing, and that’s just what they’re doing.

In our conversation, Ian Forrester describes how the taxonomy that governs the Backstage feeds and APIs is shared with that of Wikipedia and its structured derivative, DBpedia. Tom Scott elaborates:

You might have noticed that the slugs for our URIs (the last bit of the URL) are the same as those used by Wikipedia and DBpedia that’s because I believe in the simple joy of webscale identifiers, you will also see that much like the BBC’s music site we are transcluding the introductory text from Wikipedia to provide background information for most things. This also means that we are creating and editing Wikipedia articles where they need improving (of course you are also more than welcome to improve upon the articles).

As someone who both practices and preaches collaborative curation, I’m delighted to see the BBC taking this approach. And I love the phrase webscale identifier. Here’s how Michael Smethurst defines it, in the post pointed to by Tom Scott:

I agree with the four Linked Data rules but I’d like to try to add a fifth: if possible don’t reinvent other people’s web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that’ll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn’t exist anywhere outside of /music. We’ll (hopefully) never make that mistake again.

Beautifully said. Enormous synergies have gone unrealized because web publishers have chosen to mint new namespaces rather than add value to existing ones.
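Michael Smethurst's definition of a web identifier, the fragment of a URL that uniquely identifies a resource within a domain, can be sketched in a few lines. The function names here are hypothetical, and the rule used (take the last path segment, then drop a trailing .html or .xml) is a simplifying assumption that happens to fit the URLs quoted in this post, not a general law of URL design.

```python
from urllib.parse import urlparse

# Extensions treated as presentation detail rather than part of the identifier.
# This short list is an assumption chosen to fit the examples in the post.
IGNORED_EXTENSIONS = (".html", ".xml")

def web_identifier(url):
    """Return the last non-empty path segment of a URL, minus a known extension."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if not segments:
        return None
    last = segments[-1]
    for extension in IGNORED_EXTENSIONS:
        if last.endswith(extension):
            return last[: -len(extension)]
    return last

def share_identifier(url_a, url_b):
    """True when two URLs from different domains reuse the same identifier."""
    identifier = web_identifier(url_a)
    return identifier is not None and identifier == web_identifier(url_b)

print(web_identifier(
    "http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html"))
print(web_identifier("http://www.bbc.co.uk/music/artist/jb9x/"))
```

Under this rule the old /music URL yields jb9x, which matches nothing at MusicBrainz. That mismatch is exactly the missed connection the quoted passage describes.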

What I realized when talking with Ian, though, is that there is one namespace for which the BBC is the appropriate mint, namely its own. Here, for example, are some of the family of URLs for a radio drama called The Archers:

homepage: http://www.bbc.co.uk/programmes/b006qpgr/

upcoming shows: http://www.bbc.co.uk/programmes/b006qpgr/episodes/upcoming.xml

In this example b006qpgr is, at least potentially, a webscale identifier. It’s a unique tag for the show that, if used on blogs, on Twitter, and elsewhere, would make it easy to assemble all kinds of online activity related to the show. But in fact only web developers using Backstage feeds and APIs will ever discover, or use, b006qpgr. In colloquial discourse people use The Archers.

If the BBC wants people to collaborate with its namespace in the same way that it collaborates with Wikipedia’s, this would be more inviting:

http://www.bbc.co.uk/programmes/The_Archers/

http://www.bbc.co.uk/programmes/The_Archers/episodes/upcoming.xml

It should go without saying, but right after the first rule for linked data, “Use URIs as names for things,” I would add “Where possible, choose names that make sense to people.”
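The difference between the two URL families can be sketched as follows. The URL patterns are the ones shown above. The function names are hypothetical, and the underscore slug rule is an assumption modeled on the Wikipedia-style slugs discussed earlier; the readable URLs are a proposal in this post, not something the BBC is known to serve.

```python
BASE = "http://www.bbc.co.uk/programmes/"

def programme_urls(identifier):
    """Build the family of URLs for one programme identifier."""
    return {
        "homepage": f"{BASE}{identifier}/",
        "upcoming shows": f"{BASE}{identifier}/episodes/upcoming.xml",
    }

def readable_slug(title):
    """Turn a spoken title into a slug by joining its words with underscores."""
    return "_".join(title.split())

# The opaque identifier the BBC uses today.
print(programme_urls("b006qpgr"))

# The name people actually say, used as the identifier instead.
print(programme_urls(readable_slug("The Archers")))
```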