The persistent blogosphere

In response to last Friday’s podcast with Tony Hammond about publishing for posterity, David Magda wrote to point out that our main topic of discussion — the DOI (digital object identifier) system — is one implementation of the CNRI (Corporation for National Research Initiatives) Handle System but there are others, including DSpace. I wondered whether this class of software might work its way into the realm of mainstream blogging. David responded:

A weblog (or web pages in general) is simply a collection of text, links, and pictures. This is no different than any other document / object / entity that DSpace would handle. It’d simply be another type of CMS IMHO. I think this would be a really good project to implement for an undergrad thesis, or perhaps as part of a master’s thesis.

However as neat as all this is, I don’t think it would be implemented soon, or at least not in mainstream software. Few people will care whether their MySpace page survives over the aeons (and many people don’t want their kids to know what they did twenty years in the past).

But some of us do, and more of us will. The other day, for example, my daughter walked into my office while I was in the middle of a purge. Among the items destined for the recycling bin was a pile of InfoWorld magazines.

She: You’re throwing all these out?

Me: No, I’m keeping a few of my favorites. But as for the rest, I don’t have the space, and anyway it’s all on the web.

She: Don’t you want your grandkids to be able to see what you did?

Heh. She had me there. A pile of magazines sitting on a shelf is almost certainly a more reliable long-term archive than a website running on any current content management system.

Here’s another example. Back in 2002 I cited an essay by Ray Ozzie that appeared on what was then his blog, at ozzie.net. But if you follow the link I cited today, you’ll land on the home page of the latest incarnation of Ray’s blog. The original essay is still available, but to find it you have to do something like this:

My Blog v1 & v2 -> stories -> Why?

So OK, the web rots, get over it, we should all accept that, right?

Well, libraries and academic publishers don’t accept that. Nothing lasts forever, but they’re building content management systems that are far more durable and resilient than any of the current blogging systems.

Conventional wisdom says that it wouldn’t make sense to make blogging systems similarly durable and resilient, for two reasons. First, because the investment would be too costly. Second, because blogs aren’t meant to last anyway, they’re just throwaway content.

The first point is well taken. As Tony Hammond points out in our podcast, the cost isn’t just software. Even when that’s free, infrastructure and governance are costly.

But I violently disagree with the second point. Just because most blog entries aren’t written for posterity doesn’t mean that many can’t be or shouldn’t be. My view is that blogs are becoming our resumes, our digital portfolios, our public identities. We’re already forced to think long-term about the consequences of what we put into those public portfolios because, though no real persistence infrastructure exists, stuff does tend to hang around. And if it’s going to be remembered, it should be remembered properly.

So a logical next step, and a business opportunity for someone, is to provide real persistence. This service likely won’t emerge in the context of enterprise blogging, because enterprises nowadays are more focused on the flip side of document retention: forgetting rather than remembering. Instead it’s a service that individuals will pay for, to ensure that the public record they write will persist across a series of employers and content management systems.

22 Comments

  1. And because it’s a service that people will pay for, the permanent record of the web is going to be warped and scarred compared to the actual sphere of everything that’s going on.

    The Internet Archive is a valiant attempt at persistence that won’t be able to continue indefinitely as the web continues to expand. Servers die, protocols change, and in twenty years it will seem anachronistic to be looking at the web as we know it at all – yet at the same time, the content we pour into it is increasing.

    Oddly, printing things out still seems like the best chance for longevity, which is why things like LJBook are popular.

  2. Sorry, to expand on “the permanent record of the web is going to be warped and scarred compared to the actual sphere of everything that’s going on”: this is because only certain types of users will pay to keep their content. A lot of the most valuable and revealing material will be lost forever; one of the great things about the web is its democratisation of publishing. However, it’s likely to be the same old publishing stalwarts that pay.

  3. “However, it’s likely to be the same old publishing stalwarts that pay.”

    I hope not, and the reason I hope not is that we’re trending toward affordable and easily replicable infrastructure.

    It floors me what $8/month buys you in the way of commodity hosting nowadays. So much more capability than just a few years ago. Why shouldn’t this trend continue?

  4. Pingback: Preoccupations

  5. DSpace just embeds the CNRI Handle System Libraries, and I’m pretty sure DOI does the same. Interestingly, I’ve found that the handle system actually reduces the reliability of the system it is embedded in.

  6. “Interestingly, I’ve found that the handle system actually reduces the reliability of the system it is embedded in.”

    Because it multiplies moving parts?

    Would you then say that the infrastructure and governance aspects of these efforts might just as well play out on the plain old web?

  7. “Interestingly, I’ve found that the handle system actually reduces the reliability of the system it is embedded in.”

    [note: I’ve been a primary developer on the handle system since about 1998]

    Because it is an extra step in the “getting to what you want” stage it can seem like a one-more-thing-to-break type of scenario, but it doesn’t have to be. The handle system was designed to be ultra-scalable and reliable in that object identifiers are replicated across multiple servers (similar to DNS, but faster and more flexible) and handle clients talk to each server in the service until they get an answer.

    DOI resolution is done using the handle system, but DOI is larger than that – it is a community in which everyone has a stake in making sure the identifiers continue to work and are useful to everyone involved. The handle servers running the DOI system are replicated over a number of sites and are generally very reliable. The only weak spot in the system is the proxy – http://dx.doi.org/ – but even that is load-balanced over a number of geographically separated servers so that the weakness comes down to HTTP itself, and the fact that if an HTTP client can’t get through to one host it will stop trying. Native handle resolvers do not have that problem.

    The DSpace use of handles on the other hand does not use any mirroring or failover of its handle servers so there can be errors if a client can’t get to the single DSpace handle server that is responsible for a certain namespace. This is something I’ve been meaning to fix for a while since DSpace is open source. However it is a temporary problem – not a problem that is inherent in the system as is the case with straight up HTTP services.

  8. Hanzo Archives have been working on this very problem for a couple of years. We have two solutions.

    Hanzoweb – http://www.hanzoweb.com – is a web archiving tool with which individuals and institutions can collect pages, sites and blogs via a bookmarklet in their browser, via feeds or a WordPress plugin. The plugin is open source and still a little underdeveloped, but a promising tool in this context. As Hanzoweb is archiving the public web, the crawlers obey robots.txt, so we don’t necessarily get everything a user requests, but what we do collect is stored in our archive using the same archive containers as Internet Archive. Furthermore we have an agreement with Internet Archive to donate our archive to them regularly, to ensure material we collect is safe for a very long time. Hanzoweb provides __real persistence__ for websites and blogs.

    DOI and Handles systems are identifiers or naming schemes AFAIR and are used to point to URLs on the live web. These can equally be used to point to archived material in our archive or that of the Internet Archive as both are accessible on the live web too. More important is that several archives, including Hanzo, are working on ideas for federated archive access; the combination of archives and libraries together with federated archive access will ultimately lead to a persistent blogosphere, but not DOI or Handles, certainly not in isolation.

    Hanzo also sell a solution for Intranet archiving, which organisations can use to archive internal resources for compliance or other reasons.

    Here’s an example of persistence for you, Jon: http://www.hanzoweb.com/search/?search=udell

    BTW, as I know you’re a fan of Amazon’s S3 and EC2, you might like to know the whole Hanzoweb app is hosted on EC2 and the archive is stored on S3.

  9. i have a hard time acknowledging a demand for persistent identifiers other than URLs. the problem here is the short life span of web apps. i’d label most of the issue as one of deployment. deploying web apps is never fun; there’s usually a mess of dependencies and newer, ever so slightly incompatible versions.

    in all, it’s actually a rather good case for virtualization: just create an OS image for each web app so that the image can be easily and rapidly migrated between systems. it ensures absolute OS compatibility wherever the server is moved to. with virtualization, it’s very low impact, bordering on free, to keep a prehistoric, mostly unused webserver online, spooling out sporadic responses to requests for antiquated content.

    VMware has rapidly shifted to providing exactly this type of product. i’ve always thought of them as a company focused on creating developer tools and serving niches of people who for whatever reason need multiple OSes, but they’ve recently retargeted towards providing deployable OSes to counter these migration woes. it’s certainly a little odd and perhaps extreme that web technology moves so fast that sites will just go black for lack of supported and compatible runtimes, and that the solution is to make the entire OS deployable. but since web deployment *is* such a huge can of worms and so problematic, it’s a rather ingenious solution.

    imho, URLs should always be permanent. engineer room to grow from the start, and ensure old systems remain online or that new systems will be back-compatible. and create and maintain your personal systems with the complete expectation of persistence.

  10. ben werdmuller, hosting is already cheap, and with the multi-core war about to get underway, should only be getting cheaper. i have 500gb/mo on a vps for $20/mo and at my peak i’ve hosted a dozen-odd friends’ pages for them. the barrier to entry is willpower and tech know-how, not resources. further, it’s not unreasonable to consider that perhaps Brewster Kahle will in fact save humanity and provide free content hosting for all eternity. i argue that the limiting factor is people’s scope and imagination for their content. if publishers really are the only ones creating persistent information, it’s because they’re the only ones that value persistent information.

  11. “SaLam KhoB KhoGBiN HaMatOn NoKare TaKBe TaKetOn”

    It’s hilarious to watch search engines try to make sense of this.

    Google:

    Did you mean: SaLam Khobi Hogbin Hamilton Nokair TaK Be Taken?

    Live:

    Other searches you may want to try:

    * Take Off Shift Knob
    * Hampton Inn
    * Hampton University
    * Hampton Bay
    * Hampton Lakes
    * Hampton VA
    * Hampton Virginia
    * Hampton Beach

    A9:

    Do you mean salad knob? khogbin? hampton? nokie? take? take on?

  12. I blame the flagrant disregard for URLs as meaningful names for pages. How can a page with a url like http://foo.bar/_cms-of_/the/week?sid=83g4h2t3he&x=232323&glarp=ethu&booger=9 be possible to retain indefinitely as _cms-of_/the/week is replaced by new software, or some kind of permanent static archive off in a corner somewhere?

    CMS software needs to use useful URLs and it needs to be the easiest thing in the world to drop in an existing set of content into a new CMS and retain the old URLs or transparently redirect them to new ones.

    This can be done at the HTTP server level (with Apache URL rewriting for example) but you have to go in and edit config files; you can’t easily do it in a blog or CMS web-based control panel.
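
The idea in the last comment, keeping old URLs alive by redirecting them when a CMS is replaced, can be sketched in a few lines of Python. This is only an illustration, not how any particular blog engine or Apache itself does it. The `LEGACY_PATHS` table, the `/archive/the-week` target, and the `permanent_redirect` function are all hypothetical names, and the sketch assumes that the query parameters in the example URL are session noise that can safely be dropped.

```python
from urllib.parse import urlsplit

# Hypothetical table mapping paths used by a retired CMS to their stable
# replacements. A real site would load this from its own migration records.
LEGACY_PATHS = {
    "/_cms-of_/the/week": "/archive/the-week",
}


def permanent_redirect(url, legacy_paths=LEGACY_PATHS):
    """Return (status, location) for a requested URL.

    A known legacy path yields a 301 to its stable location. The query
    string is discarded on the assumption that it only carries session
    noise. Any other path is returned unchanged with status 200.
    """
    parts = urlsplit(url)
    # Treat "/path" and "/path/" as the same legacy entry.
    path = parts.path.rstrip("/") or "/"
    target = legacy_paths.get(path)
    if target is None:
        return 200, parts.path
    return 301, target


if __name__ == "__main__":
    old = "http://foo.bar/_cms-of_/the/week?sid=83g4h2t3he&x=232323&glarp=ethu&booger=9"
    print(permanent_redirect(old))
```

A table like this is the kind of thing a web-based control panel could expose, so that the person migrating the content would not have to edit server config files by hand.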
