Annotation looks like a new way to comment on web pages. “It’s like Medium,” I sometimes explain, “you highlight the passage you’re talking about, you write a comment about it, the comment anchors to the passage and displays to its right.” I need to stop saying that, though, because it’s wrong in two ways.
First, annotation isn’t new. In 1968 Doug Engelbart showed a hypertext system that could link to regions within documents. In 1993, NCSA Mosaic implemented the first in a long lineage of modern annotation tools. We pretend that tech innovation races along at breakneck speed. But sometimes it sputters until conditions are right.
Second, annotation isn’t only a form of online discussion. Yes, we can converse more effectively when we refer to selected passages. Yes, such conversation is easier to discover and join when we can link directly to a context that includes the passage and its anchored conversation. But I want to draw attention to a very different use of annotation.
A web document is a kind of database. Some of its fields may be directly available: the title, the section headings. Other fields are available only indirectly. The author’s name, for example, might link to the author’s home page, or to a Wikipedia page, where facts about the author are recorded. The web we weave using such links is the map that Google reads and then rewrites for us to create the most powerful information system the world has yet seen. But we want something even more powerful: a web where the implicit connections among documents become explicit. Annotation can help us weave that web of linked data.
The semantic web is, of course, another idea that’s been kicking around forever. In that imagined version of the web, documents encode data structures governed by shared schemas. And those islands of data are linked to form archipelagos that can be traversed not only by people but also by machines. That mostly hasn’t happened because we don’t yet know what those schemas need to be, nor how to create writing tools that enable people to easily express schematized information.
Suppose we agree on a set of standard schemas, and we produce schema-aware writing tools that everyone can use to add new documents to a nascent semantic web. How will we retrofit the web we already have? Annotation can help us make the transition. A project called SciBot has given me a glimpse of how that can happen.
Hypothesis’ director of biosciences Maryann Martone and her colleagues at the Neuroscience Information Framework (NIF) project are building an inventory of antibodies, model organisms, and software tools used by neuroscientists. NIF has defined and promoted a way to identify such resources when they are mentioned in scientific papers. It entails a registry of Research Resource Identifiers (RRIDs) and a protocol for including RRIDs in scientific papers.
Here’s an example of some RRIDs cited in “Dopaminergic lesioning impairs adult hippocampal neurogenesis by distinct modification of α-synuclein”:
Free-floating sections were stained with the following primary antibodies: rat monoclonal anti-BrdU (1:500; RRID:AB_10015293; AbD Serotec, Oxford, United Kingdom), rabbit polyclonal anti-Ki67 (1:5,000; RRID:AB_442102; Leica Microsystems, Newcastle, United Kingdom), mouse monoclonal antineuronal nuclei (NeuN; 1:500; RRID:AB_10048713; Millipore, Billerica, MA), rabbit polyclonal antityrosine hydroxylase (TH; RRID:AB_1587573; Millipore), goat polyclonal anti-DCX (1:250; RRID:AB_2088494; Santa Cruz Biotechnology, Santa Cruz, CA), and mouse monoclonal anti-α-syn (1:100; syn1; clone 42; RRID:AB_398107; BD Bioscience, Franklin Lakes, NJ).
The term “goat polyclonal anti-DCX” is not necessarily unique. So the author has added the identifier RRID:AB_2088494, which corresponds to a record in NIF’s registry. RRIDs are embedded directly in papers, rather than attached as metadata, because as Dr. Martone says, “papers are the only scientific artifacts that are guaranteed to be preserved.”
But there’s no guarantee an RRID means what it should. It might be misspelled. Or it might point to a flawed record in the registry. Could annotation enable a process of computer-assisted validation? Thus was born the idea of SciBot. It’s a human/machine partnership that works as follows.
A human validator sends the text of an article to a web service. The service scans the article for RRIDs. For each that it finds, it looks up the corresponding record in the registry, then calls the Hypothesis API to post an annotation that anchors to the text of the RRID and includes the lookup result in the body of the annotation. That’s the machine’s work. Now comes the human partner.
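The machine’s side of that partnership can be sketched in a few lines of Python. This is an illustrative reconstruction, not SciBot’s actual code: the RRID pattern is simplified (real identifiers use several prefixes, such as AB_, SCR_, and IMSR_JAX), and `build_annotation` shows only the shape of a payload one might send to Hypothesis’ annotation API, using its TextQuoteSelector to anchor the note to the exact text of the RRID.

```python
import re

# Simplified pattern: matches identifiers like RRID:AB_10015293 or RRID:SCR_002865.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")

def find_rrids(text):
    """Scan an article's text and return every RRID it mentions."""
    return RRID_PATTERN.findall(text)

def build_annotation(uri, rrid, lookup_result):
    """Build a Hypothesis-style annotation payload (illustrative shape).

    It anchors to the literal text of the RRID via a TextQuoteSelector
    and carries the registry lookup result in the annotation body.
    """
    return {
        "uri": uri,
        "text": lookup_result,            # summary of the registry record
        "tags": ["RRID:" + rrid],
        "target": [{
            "source": uri,
            "selector": [{
                "type": "TextQuoteSelector",
                "exact": "RRID:" + rrid,
            }],
        }],
    }
```

In the real service, each payload would be POSTed to the Hypothesis API with an authorization token, one annotation per RRID found, ready for the human partner to review.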
If the RRID is well-formed, and if the lookup found the right record, a human validator tags it a valid RRID — one that can now be associated mechanically with occurrences of the same resource in other contexts. If the RRID is not well-formed, or if the lookup fails to find the right record, a human validator tags the annotation as an exception and can discuss with others how to handle it. If an RRID is just missing, the validator notes that with another kind of exception tag.
If you’re not a neuroscientist, as I am not, that all sounds rather esoteric. But this idea of humans and machines working together to enhance web documents is, I think, powerful and general. When I read Katherine Zoepf’s article about emerging legal awareness among Saudi women, for example, I was struck by odd juxtapositions along the timeline of events. In 2004, reforms opened the way for women to enter law schools. In 2009, “the Commission for the Promotion of Virtue and the Prevention of Vice created a specially trained unit to conduct witchcraft investigations.” I annotated a set of these date-stamped statements and arranged them on a timeline. The result is a tiny data set extracted from a single article. But as with SciBot, the method could be applied by a team of researchers to a large corpus of documents.
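The timeline exercise can be sketched the same way. Assume each annotation carries a date tag (the `date:` tag convention here is my invention for illustration, not a Hypothesis feature) and that the rows have been fetched from the Hypothesis search API; sorting them by that tag yields the timeline.

```python
from datetime import date

def timeline(rows, date_tag_prefix="date:"):
    """Arrange annotation rows on a timeline.

    Each row is a dict with 'tags' and 'text' keys, as annotation search
    results might provide. Rows tagged with an ISO date, e.g.
    'date:2009-03-15', are sorted chronologically; untagged rows are skipped.
    """
    events = []
    for row in rows:
        for tag in row.get("tags", []):
            if tag.startswith(date_tag_prefix):
                when = date.fromisoformat(tag[len(date_tag_prefix):])
                events.append((when, row.get("text", "")))
                break  # use the first date tag on each annotation
    return sorted(events)
```

The point isn’t the sorting, of course; it’s that annotations tagged by human readers become a small structured data set a machine can then arrange, filter, and merge across documents.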
Web documents are databases full of facts and assertions that we are ill-equipped to find and use productively. Those documents have already been published, and they are not going to change. Using annotation we can begin to make better use of the documents that exist today, and more clearly envision tomorrow’s web of linked data.
This hybrid approach is, I think, the viable middle path between two unworkable extremes. People won’t be willing or able to weave the semantic web. Nor will machines, though perfectly willing, be able to do that on their own. The machines will need training wheels and the guidance of human minds and hands. Annotation’s role as a provider of training and guidance for machine learning can powerfully complement its role as the next incarnation of web comments.