A conversation with George Hripcsak about electronic health records and clinical truth

George Hripcsak, professor of biomedical informatics, is one of the recipients of a Microsoft Research grant to support work on the computational challenges of genome-wide association studies. These studies scan complete human genomes, looking for correlations between markers of genetic variation and particular diseases.

Doing that correlation is a computational challenge, but as I learned in my interview with George Hripcsak for Perspectives, it isn’t the one his research addresses. Instead he’s mining electronic health records to figure out what they say about the diseases patients may have.

Why? Suppose you’ve sequenced the DNA of thousands of people for a study. If you’re trying to correlate genetic markers with disease, you need to know what diseases those people have. George calls this “collecting the phenotype” — that is, the expression of the genes responsible for diabetes, or a tendency to complications in labor, or whatever.

Traditionally that’s done by interviewing patients, a painstaking process that doesn’t scale. Given electronic health records, how much of this phenotype collection can be done automatically, and to what level of accuracy? That’s a different kind of computational challenge.
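To make the accuracy question concrete, here is a minimal sketch of one way to score automatically extracted phenotypes against interview-based ones, using precision and recall. It is only an illustration and not how George's group evaluates its work: the function name, the patient ids, and the labels are all made up.

```python
def precision_recall(reference, extracted):
    """Score extracted phenotype labels against a reference standard.

    Both arguments map a patient id to True/False for a single condition.
    A patient missing from `extracted` is treated as False (an assumption).
    """
    tp = sum(1 for pid, has in reference.items() if has and extracted.get(pid, False))
    fp = sum(1 for pid, has in reference.items() if not has and extracted.get(pid, False))
    fn = sum(1 for pid, has in reference.items() if has and not extracted.get(pid, False))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for one condition, for illustration only.
interview = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": True}
from_records = {"p1": True, "p2": False, "p3": True, "p4": False, "p5": True}

precision, recall = precision_recall(interview, from_records)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision asks how many of the automatically flagged patients really have the condition. Recall asks how many of the patients who have it were found.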

There are basically two ways to go. You can try to templatize the process of clinical data collection, so that health records can be harvested more effectively by researchers. Or you can try to understand the language that clinicians actually use when they describe patients.

For a decade now, George Hripcsak and his colleagues have been pursuing the latter approach, using a system for understanding natural language called MedLEE, which was developed at Columbia.

Ultimately I believe, as George Hripcsak does, that we’ll need a hybrid system that makes use of both structured templates and natural language understanding. But given that health records must primarily serve patient care, and can only secondarily serve research, I like how he harmonizes those objectives:

To the degree we make documentation efficient in serving health care, I think it’ll also be more accurate for the sake of research. If you’re filling out a record for the sake of billing, you’ll have an incentive to use diagnosis codes that optimize billing. Does that then reflect clinical accuracy? And would that then be useful for research?

The important thing is to be grounded in the clinical truth. Put health care first, and then use new computational methods to extract accurate information.



  1. Interesting – are these records attached to samples of DNA? Wonder how he gets around informed consent (for going through protected health information / data collection)…
