George Hripcsak, professor of biomedical informatics, is one of the recipients of a Microsoft Research grant to support work on the computational challenges of genome-wide association studies. These studies involve scanning complete human genomes, and looking for correlations between certain markers of genetic variation and certain diseases.
Doing that correlation is a computational challenge, but as I learned in my interview with George Hripcsak for Perspectives, that isn’t the challenge his research addresses. Instead he’s tackling a different challenge: mining electronic health records to figure out what they say about the diseases patients may have.
Why? Suppose you’ve sequenced the DNA of thousands of people for a study. If you’re trying to correlate genetic markers with disease, you need to know what diseases those people have. George calls this “collecting the phenotype” — that is, the expression of the genes responsible for diabetes, or a tendency to complications in labor, or whatever.
Traditionally that’s done by interviewing patients, a painstaking process that doesn’t scale. Given electronic health records, how much of this phenotype collection can be done automatically, and to what level of accuracy? That’s a different kind of computational challenge.
There are basically two ways to go. You can try to templatize the process of clinical data collection, so that health records can be harvested more effectively by researchers. Or you can try to understand the language that clinicians actually use when they describe patients.
For a decade now, George Hripcsak and his colleagues have been pursuing the latter approach, using a system for understanding natural language called MedLEE, which was developed at Columbia.
Ultimately I believe, as George Hripcsak does, that we’ll need a hybrid system that makes use of both structured templates and natural language understanding. But given that health records must primarily serve patient care, and can only secondarily serve research, I like how he harmonizes those objectives:
To the degree we make documentation efficient in serving health care, I think it’ll also be more accurate for the sake of research. If you’re filling out a record for the sake of billing, you’ll have an incentive to use diagnosis codes that optimize billing. Does that then reflect clinical accuracy? And would that then be useful for research?
The important thing is to be grounded in the clinical truth. Put health care first, and then use new computational methods to extract accurate information.