Another of the many interesting stories coming out of Microsoft External Research these days is the one Roger Barga tells in this week’s installment of Perspectives. When Roger told me that Trident, the system he’s developing to automate scientific workflow, was inspired by Jim Gray, it was a déjà vu moment. Everywhere I turn, I find new evidence of Jim’s profound and far-reaching influence at the intersection of science and computing.
I never met Jim in person, but we collaborated briefly on this 1995 BYTE feature that condenses his career-long work in the field of scalable databases and transaction monitors into a lucid taxonomy. Well, it’s a stretch to say that we collaborated. Jim delivered the article in pristine condition, and there were only minor editorial details needing attention. But when he did attend to them, he exhibited the qualities I’ve since heard about from many others. He was gracious, fully attentive, deeply wise, broadly connected. It’s remarkable to watch the connections he formed continue to ripple through MSR and out to MSR’s external partners.
In Roger’s case, here was the seed:
Jim Gray was the first person who had the vision of an oceanographer’s workbench. His insight was that scientists really want to interact with visualizations of the ocean, but there was a huge gap between the raw data and those visualizations.
In our interview, Roger describes a project called Trident, a system for authoring, running, and tracking the provenance of scientific workflows — that is, sequences of computational steps that bridge the gap between the data produced by the Neptune sensor array and the COVE visualization system.
Oceanography is only the first scientific discipline that will benefit from Trident. Astronomy is next in line, and other fields are expected to follow. As all scientific disciplines become increasingly data intensive, two related requirements emerge. There needs to be a general framework for creating pipelines of reusable data transformations, and it needs to be coupled with the ability to document, version, and reliably reproduce the results that come out of those pipelines.
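The pairing Roger describes — reusable transformation pipelines plus a provenance record that makes each run reproducible — can be sketched in a few lines of Python. This is only an illustration of the general idea; the class and step names here are hypothetical and are not Trident's actual API.

```python
import hashlib
from datetime import datetime, timezone

class Pipeline:
    """A toy workflow: chain reusable transformations and, for each step,
    record provenance (step name, input/output hashes, timestamp) so a
    result can later be traced back through the exact steps that produced it."""

    def __init__(self):
        self.steps = []

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self  # allow chaining

    def run(self, data):
        provenance = []
        for name, fn in self.steps:
            before = hashlib.sha256(repr(data).encode()).hexdigest()[:12]
            data = fn(data)
            after = hashlib.sha256(repr(data).encode()).hexdigest()[:12]
            provenance.append({
                "step": name,
                "input_hash": before,
                "output_hash": after,
                "at": datetime.now(timezone.utc).isoformat(),
            })
        return data, provenance

# Hypothetical sensor-data transformations, standing in for real
# oceanographic processing steps.
pipe = (Pipeline()
        .add_step("calibrate", lambda xs: [x * 0.98 for x in xs])
        .add_step("despike", lambda xs: [min(x, 100.0) for x in xs]))

result, log = pipe.run([10.0, 250.0, 11.0])
```

The point of the hashes in the provenance log is that two runs over the same inputs with the same steps should yield identical records, which is the basis for verifying that a published result can be reproduced.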
Today, as Roger points out, reproducing a scientific result is often a dicey thing:
If you happen to know the person who did the experiment, or if you happen to capture enough stuff in your lab notebook or on your whiteboard, then you have a chance of being able to do it again.
In the domain of software engineering, both commercial and open source, that would simply be unsustainable. So strong traditions of version control and provenance have developed. But as Greg Wilson has been observing for many years, those traditions have not sufficiently taken hold in many computationally intensive areas of science. In this interview Greg takes the HPC (high-performance computing) community to task for caring too little about verifying the correctness of models and ensuring that code and data are managed in ways that make experiments reliably reproducible.
Some scientists can and do assimilate the best practices from software engineering. But most will need a system that embodies those best practices, and that is what Trident aims to be.
One final comment of Roger’s particularly struck me:
The hope is that here in External Research, because we’re building these tools not just in the context of one science project, but many, you can have community tools that bridge communities. We’re talking to people in the earth sciences doing atmospheric studies, and their workflows and analyses are so similar to what the oceanographers are doing. But right now, since those two communities aren’t talking or sharing tools, it’s very difficult for one community to interact with the other.
Now more than ever there is a pressing need to make interdisciplinary science as frictionless as it can possibly be. I hope that what Roger and his team are doing will supply some of the necessary lubricant.