Talking with Herbert Van de Sompel about a web that remembers

The endnotes for the book I’m now reading are a mixture of conventional citations and URLs. The former, expressed as publisher, book or journal title, author, date, and page number, seem not nearly so useful as the latter. Would you rather visit the library or click a link? But nowadays cited URLs also come with disclaimers like this: Accessed July 27, 2009. It might be inconvenient to verify a conventional citation in its original context, but I know that if I had to, I could. There’s no guarantee that I’ll be able to revisit a cited URL. Even if the page itself has not gone missing, there’s no way to know that the page I view on April 22, 2010 is the same one that the author viewed on July 27, 2009.

This anecdote was the springboard for my conversation with Herbert Van de Sompel about Memento, a proposed (and prototyped) method for adding the dimension of time to the web’s existing mechanism for content negotiation.

That mechanism has, to be sure, not taken the world by storm. The most common scenario involves a browser telling a multilingual server that its user prefers to read, say, French. A paper about Memento published last fall walks through the HTTP protocol that enables this negotiation. Odds are, though, that you’ve never seen this actually happen. It’s much more likely for a multilingual website to present itself as “a multiplication of language-specific mini-sites, instead of thinking of it as one site, with one set of URIs, only with different versions and languages available.” Wikipedia, for example, works that way.

The quote comes from a 2006 W3C article, Content Negotiation: Why it is useful, and how to make it work. The article lays some of the blame on the awkwardness of Apache’s implementation of the protocol (since corrected):

For a long time, with the most popular negotiation-enabled Web server (the ubiquitous apache), failed negotiation (for instance, a reader of french being proposed only english and german variants of a document), resulted in a nasty “406 not acceptable” HTTP error, which, while technically conforming to HTTP, failed to follow the recommendation that a server should try to serve some resource rather than an error message, whenever possible.
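To make the failure mode concrete, here is a minimal sketch of language negotiation in Python. It is not Apache's algorithm, only an illustration under stated assumptions: the function names are hypothetical, the matching rule is a simple prefix match, and the `default` parameter stands in for the W3C recommendation to serve some variant rather than an error.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (tag, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        pieces = part.split(";")
        tag = pieces[0].strip().lower()
        q = 1.0  # a missing q parameter means full preference
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((tag, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate_language(header, available, default=None):
    """Return (status, language) for a request.

    `available` lists the language variants the server holds.
    `default` is a hypothetical fallback: when set, a failed
    negotiation serves that variant instead of answering 406.
    """
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if tag == "*":
            return 200, available[0]
        for lang in available:
            # simplified matching: "fr" matches "fr" and "fr-ca"
            if lang == tag or lang.startswith(tag + "-"):
                return 200, lang
    if default is not None:
        return 200, default
    return 406, None


# A reader of French offered only English and German variants:
strict = negotiate_language("fr", ["en", "de"])
lenient = negotiate_language("fr", ["en", "de"], default="en")
```

With no fallback the reader of French gets the 406 the article complains about; with a fallback the same request yields the English variant.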

Is there any reason to suppose that time negotiation will succeed where language negotiation has so far mainly failed? That’s a hard question, and one I wish I’d thought to ask Herbert in the interview, but maybe we can continue the dialogue here.

Meanwhile, the fact that content negotiation is tricky to get right doesn’t invalidate the core of the Memento proposal. Time is fundamental, the web could have a reliable memory, and if we can build such a memory into the fabric of the web the benefits will be profound.

Examples are everywhere. Consider mediabugs.org. Founded by Scott Rosenberg, whom I interviewed last week, the site is dedicated to finding and fixing errors in media reports. A few days ago, the first bug was marked Closed:Corrected. The mediabugs.org bug page initially said:

Listing for Josh Kornbluth’s show “Andy Warhol: Good for the Jews?” says the show is at the Jewish Community Center in SF, but actually it’s at The Jewish Theater in the Theater Artaud building.

There’s a comment pointing out the error but it’s still showing with the wrong info on the Express home page.

And later:

This is fixed now!

If you visit the original news report, though, there’s no record of the correction. It’s no big deal in this particular case, but media organizations should want to be transparent about when and how they alter published items.

Likewise governments. The Citability project aims to account for the history of changes made to items published on government websites. As with mediabugs.org, the approach will initially require third parties to monitor and chronicle the changes.

The Memento idea is that media organizations, governments, and other kinds of web publishers will be accountable for their own change histories.1 And they’ll do so in a standard way, so that people viewing these sites in browsers can straightforwardly say: “Show me this page as it existed on July 7, 2009.”
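Here is a rough sketch of what answering that request could look like on the server side. The assumptions are mine, not the Memento paper's: the header names `Accept-Datetime` and `Memento-Datetime` are the ones later standardized in RFC 7089, the snapshot list is made up, and the selection policy (latest version at or before the requested time, else the earliest one held) is just one reasonable choice.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def select_snapshot(snapshots, accept_datetime):
    """Pick the stored version to serve for a requested datetime.

    `snapshots` is a sorted list of timezone-aware UTC datetimes,
    one per stored version of the page. `accept_datetime` is the
    HTTP-date string sent by the client.
    """
    requested = parsedate_to_datetime(accept_datetime)
    index = bisect_right(snapshots, requested)
    # Nothing that old: fall back to the earliest version we hold.
    return snapshots[index - 1] if index else snapshots[0]


def response_headers(snapshot):
    """Report which moment in time the served version represents."""
    return {"Memento-Datetime": format_datetime(snapshot, usegmt=True)}


# Hypothetical change history for one page.
history = [
    datetime(2009, 5, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2009, 6, 30, 12, 0, tzinfo=timezone.utc),
    datetime(2009, 8, 15, 18, 30, tzinfo=timezone.utc),
]

# "Show me this page as it existed on July 7, 2009."
chosen = select_snapshot(history, "Tue, 07 Jul 2009 00:00:00 GMT")
headers = response_headers(chosen)
```

In this made-up history the request resolves to the June 30 version, since that is what the page looked like on July 7.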

This is wildly ambitious, but I applaud the ambition. Ever since I made the Heavy Metal umlaut screencast, I have imagined what it would be like to scroll back and forth along the timelines of evolving web pages. At one point Andy Baio sponsored a contest to write a script that would animate the revision history for any Wikipedia page, and I made a screencast of Dan Phiffer’s solution.

Clearly we want this. Will it be hard to arrive at a well-known and well-used standard? Sure. Is it worth doing? Absolutely.


1 Third-party watchdogs will often be needed, of course. We’d like to trust self-reported change histories, but we’d also like to verify them. Even so, third parties shouldn’t be the only mechanisms. Self-reported histories should exist.


5 thoughts on “Talking with Herbert Van de Sompel about a web that remembers”

  1. Loved the heavy metal umlaut examples… ;-)

    There has been some research wrt the UI components of visualizing changes, including:

    * Zoetrope
    http://bit.ly/cfg3nN

    * Past Web Browser
    http://bit.ly/caduPh

    The former is a really neat demo, but it builds a special-purpose archive from which to operate. The latter did some interesting user evaluation based on change detection and presentation.

    We’re hoping that Memento can facilitate further UI development wrt the past web by providing a unified interface to the various caches & archives. As you mentioned in the interview, not a lot of people even know the past web exists, so tools that make use of it have been slow to arrive.

    Michael

  2. Thanks Jon for a great post, and Herbert/Michael/et al. for a mind-mending project in Memento!

    Jon, I think you’ve answered your own key question even as you’ve asked it:

    Is there any reason to suppose that time negotiation will succeed where language negotiation has so far mainly failed? …Time is fundamental, the web could have a reliable memory, and if we can build such a memory into the fabric of the web the benefits will be profound…

    I believe the key difference between negotiating for a representation in time vs. alternate formats or encodings is that previous versions actually have existed, while representations in other languages or formats likely never did or must be synthesized. Indeed, one of the core ideas behind Memento is to “surface” alternate versions of content that have been managed by a site’s CMS (Drupal, say) but have not been easily accessible.

    The bottom line is that version control is already standard practice in the management of much of the Web of Documents — not to mention in software development! — and therefore support for time negotiation isn’t as much of a stretch as for other alternate representations and encodings.

  3. Content negotiation over time is a great idea and I’m all for it, but in case anyone finds their way to this blog post b/c they are looking for a way to read an old version of a web page, I’d like to point out that this capability is often already available via Internet Archive (http://archive.org).

  4. Yes, thanks for pointing that out. The Wayback Machine is an amazing resource which I certainly should have mentioned but didn’t.
