Annotation is not (only) web comments

Annotation looks like a new way to comment on web pages. “It’s like Medium,” I sometimes explain, “you highlight the passage you’re talking about, you write a comment about it, the comment anchors to the passage and displays to its right.” I need to stop saying that, though, because it’s wrong in two ways.

First, annotation isn’t new. In 1968 Doug Engelbart showed a hypertext system that could link to regions within documents. In 1993, NCSA Mosaic implemented the first in a long lineage of modern annotation tools. We pretend that tech innovation races along at breakneck speed. But sometimes it sputters until conditions are right.

Second, annotation isn’t only a form of online discussion. Yes, we can converse more effectively when we refer to selected passages. Yes, such conversation is easier to discover and join when we can link directly to a context that includes the passage and its anchored conversation. But I want to draw attention to a very different use of annotation.

A web document is a kind of database. Some of its fields may be directly available: the title, the section headings. Other fields are available only indirectly. The author’s name, for example, might link to the author’s home page, or to a Wikipedia page, where facts about the author are recorded. The web we weave using such links is the map that Google reads and then rewrites for us to create the most powerful information system the world has yet seen. But we want something even more powerful: a web where the implicit connections among documents become explicit. Annotation can help us weave that web of linked data.

The semantic web is, of course, another idea that’s been kicking around forever. In that imagined version of the web, documents encode data structures governed by shared schemas. And those islands of data are linked to form archipelagos that can be traversed not only by people but also by machines. That mostly hasn’t happened because we don’t yet know what those schemas need to be, nor how to create writing tools that enable people to easily express schematized information.

Suppose we agree on a set of standard schemas, and we produce schema-aware writing tools that everyone can use to add new documents to a nascent semantic web. How will we retrofit the web we already have? Annotation can help us make the transition. A project called SciBot has given me a glimpse of how that can happen.

Hypothesis’ director of biosciences, Maryann Martone, and her colleagues at the Neuroscience Information Framework (NIF) project are building an inventory of antibodies, model organisms, and software tools used by neuroscientists. NIF has defined and promoted a way to identify such resources when they are mentioned in scientific papers. It entails a registry of Research Resource Identifiers (RRIDs) and a protocol for including those identifiers in papers.

Here’s an example of some RRIDs cited in “Dopaminergic lesioning impairs adult hippocampal neurogenesis by distinct modification of α-synuclein”:

Free-floating sections were stained with the following primary antibodies: rat monoclonal anti-BrdU (1:500; RRID:AB_10015293; AbD Serotec, Oxford, United Kingdom), rabbit polyclonal anti-Ki67 (1:5,000; RRID:AB_442102; Leica Microsystems, Newcastle, United Kingdom), mouse monoclonal antineuronal nuclei (NeuN; 1:500; RRID:AB_10048713; Millipore, Billerica, MA), rabbit polyclonal antityrosine hydroxylase (TH; RRID:AB_1587573; Millipore), goat polyclonal anti-DCX (1:250; RRID:AB_2088494; Santa Cruz Biotechnology, Santa Cruz, CA), and mouse monoclonal anti-α-syn (1:100; syn1; clone 42; RRID:AB_398107; BD Bioscience, Franklin Lakes, NJ).

The term “goat polyclonal anti-DCX” is not necessarily unique. So the author has added the identifier RRID:AB_2088494, which corresponds to a record in NIF’s registry. RRIDs are embedded directly in papers, rather than attached as metadata, because as Dr. Martone says, “papers are the only scientific artifacts that are guaranteed to be preserved.”
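
To make the convention concrete, here is a minimal sketch, in Python, of how inline RRIDs might be spotted in a passage like the one above. The regular expression is deliberately simplified and is my assumption, not part of the RRID specification:

```python
import re

# Simplified pattern for RRIDs as they appear inline, e.g. "RRID:AB_2088494".
# Real RRIDs span several registries (AB_, SCR_, CVCL_, ...), so this is only a sketch.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Z]+_\d+)")

def find_rrids(text):
    """Return the unique RRIDs mentioned in a passage of article text."""
    return sorted(set(RRID_PATTERN.findall(text)))

passage = (
    "goat polyclonal anti-DCX (1:250; RRID:AB_2088494; Santa Cruz Biotechnology), "
    "mouse monoclonal anti-a-syn (1:100; RRID:AB_398107; BD Bioscience)"
)
print(find_rrids(passage))  # ['AB_2088494', 'AB_398107']
```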

But there’s no guarantee an RRID means what it should. It might be misspelled. Or it might point to a flawed record in the registry. Could annotation enable a process of computer-assisted validation? Thus was born the idea of SciBot. It’s a human/machine partnership that works as follows.

A human validator sends the text of an article to a web service. The service scans the article for RRIDs. For each that it finds, it looks up the corresponding record in the registry, then calls the Hypothesis API to post an annotation that anchors to the text of the RRID and includes the lookup result in the body of the annotation. That’s the machine’s work. Now comes the human partner.
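
Here is a rough sketch of that machine step in Python. The Hypothesis annotation endpoint and the TextQuoteSelector structure follow the public Hypothesis API, but the registry resolver URL, its JSON shape, and the tagging convention are assumptions made for illustration:

```python
import re
import requests

HYPOTHESIS_API = "https://api.hypothes.is/api/annotations"
API_TOKEN = "YOUR_HYPOTHESIS_API_TOKEN"  # a personal Hypothesis developer token
RRID_PATTERN = re.compile(r"RRID:\s*([A-Z]+_\d+)")  # simplified pattern, as above

def lookup_rrid(rrid):
    """Look up an RRID in the registry. Resolver URL and JSON shape are assumptions."""
    resp = requests.get(f"https://scicrunch.org/resolver/RRID:{rrid}.json")
    return resp.json() if resp.ok else None

def post_annotation(article_url, exact_text, body, tags):
    """Post an annotation anchored to `exact_text` on `article_url` via the Hypothesis API."""
    payload = {
        "uri": article_url,
        "text": body,
        "tags": tags,
        "target": [{
            "source": article_url,
            "selector": [{"type": "TextQuoteSelector", "exact": exact_text}],
        }],
    }
    resp = requests.post(
        HYPOTHESIS_API,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()

def scibot_pass(article_url, article_text):
    """The machine half of the partnership: find RRIDs, look them up, post annotations."""
    for rrid in sorted(set(RRID_PATTERN.findall(article_text))):
        record = lookup_rrid(rrid)
        body = (f"Registry record found for RRID:{rrid}:\n{record}"
                if record else f"No registry record found for RRID:{rrid}")
        post_annotation(article_url, f"RRID:{rrid}", body, tags=[f"RRID:{rrid}"])
```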

If the RRID is well-formed, and if the lookup found the right record, a human validator tags it as a valid RRID, one that can now be associated mechanically with occurrences of the same resource in other contexts. If the RRID is not well-formed, or if the lookup fails to find the right record, a human validator tags the annotation as an exception and can discuss with others how to handle it. If an RRID is simply missing, the validator notes that with another kind of exception tag.
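
Once annotations carry tags like these, the association really is mechanical: any client can ask the Hypothesis search API for every annotation bearing a given tag. A small sketch, assuming the machine pass above tagged each annotation with its RRID:

```python
import requests

def annotations_with_tag(tag, limit=200):
    """Fetch annotations carrying a given tag via the Hypothesis search API."""
    resp = requests.get(
        "https://api.hypothes.is/api/search",
        params={"tag": tag, "limit": limit},
    )
    resp.raise_for_status()
    return resp.json().get("rows", [])

# Every annotated mention of this antibody, across all annotated articles.
for row in annotations_with_tag("RRID:AB_2088494"):
    print(row["uri"], row.get("tags"))
```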

If you’re not a neuroscientist, as I am not, that all sounds rather esoteric. But this idea of humans and machines working together to enhance web documents is, I think, powerful and general. When I read Katherine Zoepf’s article about emerging legal awareness among Saudi women, for example, I was struck by odd juxtapositions along the timeline of events. In 2004, reforms opened the way for women to enter law schools. In 2009, “the Commission for the Promotion of Virtue and the Prevention of Vice created a specially trained unit to conduct witchcraft investigations.” I annotated a set of these date-stamped statements and arranged them on a timeline. The result is a tiny data set extracted from a single article. But as with SciBot, the method could be applied by a team of researchers to a large corpus of documents.
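
For what it’s worth, here is a minimal sketch of that extraction step, assuming the date-stamped statements were annotated with a shared, hypothetical tag (“saudi-women-timeline”) and that each annotation body begins with the relevant year:

```python
import re
import requests

def tagged_annotations(tag):
    """Fetch annotations carrying a shared tag from the Hypothesis search API."""
    resp = requests.get("https://api.hypothes.is/api/search",
                        params={"tag": tag, "limit": 200})
    resp.raise_for_status()
    return resp.json().get("rows", [])

def quoted_text(row):
    """Pull the highlighted passage out of an annotation's TextQuoteSelector, if present."""
    for target in row.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == "TextQuoteSelector":
                return selector.get("exact", "")
    return ""

# Hypothetical tag; each annotation body is assumed to start with a four-digit year.
events = []
for row in tagged_annotations("saudi-women-timeline"):
    match = re.match(r"(\d{4})", row.get("text", ""))
    if match:
        events.append((int(match.group(1)), quoted_text(row)))

for year, passage in sorted(events):
    print(year, passage)
```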

Web documents are databases full of facts and assertions that we are ill-equipped to find and use productively. Those documents have already been published, and they are not going to change. Using annotation we can begin to make better use of the documents that exist today, and more clearly envision tomorrow’s web of linked data.

This hybrid approach is, I think, the viable middle path between two unworkable extremes. People won’t be willing or able to weave the semantic web. Nor will machines, though perfectly willing, be able to do that on their own. The machines will need training wheels and the guidance of human minds and hands. Annotation’s role as a provider of training and guidance for machine learning can powerfully complement its role as the next incarnation of web comments.

3 thoughts on “Annotation is not (only) web comments”

  1. Liking the “hybrid approach”, as it sounds much more likely to succeed than many others. Part of it is indeed about the relative degree of agency from machines and humans. Despite all the hype about “robots” and a clear movement towards fully-automated processing of just about anything under the sun, there’s also been an increase in the number of times we all talk about working in partnership with non-human agents. Maybe we’re leaving the “robot butler” imagery behind (following the master/slave model which informed previous generations). In the process, we also recognise that it need not be a race to replace human agency with a “new and improved” version of Artificial Intelligence.

    Advocates of Deep/Machine Learning often dismiss the Semantic Web, claiming that algorithms are much better at constructing knowledge from large amounts of data than are these painstaking efforts to encode knowledge. Dismissive attitudes are often the result of a profound misunderstanding. This hybrid approach could lead to neat projects which would help people on both “sides of the fence” understand what the others are trying to do.

    To me, there’s another layer to this hybrid model. It’s about “pragmatic” vs. “pure” approaches to the issue.

    To caricature:
    Some practitioners of the Semantic Web and Linked Open Data might balk at the idea that work would be done at any scale without ensuring that every single entity has its own URI, that every statement be made in RDF triplets, and that everything conforms to existing standards. These could be called “purists”. Their tools are remarkably difficult to use, often very costly, and require a deep commitment to “the cause” of Linked Data. Anything deviating from the goal of having all information linked together is dismissed out of hand as not serious enough.
    On the other hand, some of us are quite enthusiastic about web ontologies, the 5-star deployment scheme for linked data, and the prospect of encoding some insight through solid procedures backed by standards. But we may also be a bit messier than the purists, possibly because we recognise that databases aren’t always that clean. We may want to assign UUIDs to diverse resources without declaring the whole endeavour a failure if it can’t be accomplished through SPARQL queries. As you say about well-formed RRIDs, we’ve all had enough experiences with databases to know that the ideal is rarely attained. In other words, we realise that any dataset requires cleanup and that this type of procedure takes a whole lot of time. (Cf. just about any hackathon happening during the past ten years…)

    So, the hybrid approach you propose, Jon, is also one of taking some steps towards semantic annotations without taking for granted that everything will be fully integrated in the ultimate model for “Life, the Universe, and Everything” by a set deadline.

    My hunch is that you (we) are more likely to succeed than both purists and machine learning enthusiasts. Maybe not at finding resources more quickly or more efficiently. But at constructing actual insight, something which remains lacking in most Google products.
