On this week’s Innovators show I spoke with Victoria Stodden about Science Commons, an effort to bring the values and methods of Creative Commons to the realm of science. Because modern science is so data- and computation-intensive, Science Commons provides legal tools that govern the sharing of data and code. There are lots of good reasons to share the artifacts of scientific computation. Victoria particularly focuses on the benefit of reproducibility. It’s one thing to say that your analysis of a data set leads to a conclusion. It’s quite another to give me your data, and the code you used to process it, and invite me to repeat the experiment.
In this kind of discussion, the word “repository” always comes up. If you put your stuff into a repository, I can take it out and work with it. But I’ve always had a bit of an allergic reaction to that word, and during this podcast I realized why: it connotes a burial ground. What goes into a repository just sits there. It might be looked at, it might be copied, but it’s essentially inert, a dead artifact divorced from its live context.
Sooner or later, cloud computing will change that. The live context in which primary research happens will be a shareable online space. Publishing won’t entail pushing your code and data to a repository, but rather granting access to that space.
It’s a hard conceptual shift to make, though. We think of publishing as a way of pushing stuff out from where we work on it to someplace else where people can get at it. But when we do our work in the cloud, publishing is really just an invitation to visit us there.
It feels to me like Git is already starting this change. There just seems to be a conceptual difference when you can ‘clone’ a repository with no penalty (and no guarantee that your changes will be merged). GitHub and similar sites feel to me like clouds of useful software you can adapt to your own needs.
You’re right about the term ‘repository’, though.
It seems like one of the biggest barriers to entry is that scientists are very “clingy” with their data. What can be done to provide a platform for protecting data so that scientists are more willing to open up?