My guest for this week’s Innovators show, Randy Julian, founded the bioinformatics company Indigo BioSystems to help modernize the process of drug discovery. The challenge — and opportunity — is partly to standardize the data formats used to represent experimental data, and to locate that data in shared spaces where it can be linked and recombined.
There’s also the crucial issue of reproducibility. One requirement, as Victoria Stodden said in my conversation with her, is to publish not just data but also the code that processes the data, ideally in an environment where data-transforming computation can be replayed and verified. One of the ways Indigo’s system does that is by hosting instances of R, the wildly popular statistical programming system, in the cloud.
Another key requirement for reproducing an experiment, Randy Julian says, is a robust and machine-readable representation of the design of the experiment. If I don’t know what you’re trying to prove, and how you’re trying to prove it, your data are just numbers to me. If I do know those things, I may be able to verify your results. And we may be able to automate more of the work using machine intelligence and machine labor — a vision that also inspires Jean-Claude Bradley, Cameron Neylon, and others to pursue open-notebook science.
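To make "machine-readable representation of the design" a little more concrete, here is a minimal sketch in Python. It is not Indigo's format or anything Randy Julian described; the `ExperimentDesign` class, its field names, and the example dose/batch values are all hypothetical, chosen only to show how a design that travels alongside the data lets software check what it has been handed.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentDesign:
    """Hypothetical, minimal description of an experiment's design."""
    hypothesis: str    # what the experimenter is trying to show
    factors: dict      # factor name -> list of levels tested
    response: str      # name of the measured variable
    replicates: int = 1

    def to_json(self) -> str:
        # Serialize so the design can be published alongside the data.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ExperimentDesign":
        return cls(**json.loads(text))

    def expected_runs(self) -> int:
        # Full factorial: every combination of levels, times replicates.
        runs = self.replicates
        for levels in self.factors.values():
            runs *= len(levels)
        return runs


def missing_columns(design, rows):
    """Return the columns the design names that some data row lacks."""
    needed = set(design.factors) | {design.response}
    missing = set()
    for row in rows:
        missing |= needed - set(row)
    return sorted(missing)


# Illustrative values only (assumed, not from any real study).
design = ExperimentDesign(
    hypothesis="Higher dose lowers the measured response",
    factors={"dose_mg": [0, 10, 50], "batch": ["A", "B"]},
    response="signal",
    replicates=2,
)
restored = ExperimentDesign.from_json(design.to_json())
```

With a description like this in hand, a program can tell how many runs to expect and whether a data set even contains the variables the design refers to, before any statistics are replayed.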
On this week’s Innovators show I spoke with Victoria Stodden about Science Commons, an effort to bring the values and methods of Creative Commons to the realm of science. Because modern science is so data- and computation-intensive, Science Commons provides legal tools that govern the sharing of data and code. There are lots of good reasons to share the artifacts of scientific computation. Victoria particularly focuses on the benefit of reproducibility. It’s one thing to say that your analysis of a data set leads to a conclusion. It’s quite another to give me your data, and the code you used to process it, and invite me to repeat the experiment.
In this kind of discussion, the word “repository” always comes up. If you put your stuff into a repository, I can take it out and work with it. But I’ve always had a bit of an allergic reaction to that word, and during this podcast I realized why: it connotes a burial ground. What goes into a repository just sits there. It might be looked at, it might be copied, but it’s essentially inert, a dead artifact divorced from its live context.
Sooner or later, cloud computing will change that. The live context in which primary research happens will be a shareable online space. Publishing won’t entail pushing your code and data to a repository, but rather granting access to that space.
It’s a hard conceptual shift to make, though. We think of publishing as a way of pushing stuff out from where we work on it to someplace else where people can get at it. But when we do our work in the cloud, publishing is really just an invitation to visit us there.