Big Data

March 1, 2016
Interview with Dana Ludwig, MD
Enterprise Information and Analytics, UCSF
Dan:
In your job, whom do you serve?
Dana:
My department covers both the medical center and the University campus, though I've been focused on research, not patient care.
Dan:
What is Big Data?
Dana:
I assume it's a marketing term. To me, the thing that is interesting is the readily accessible, highly rich clinical information from the medical record. Before, we just had discharge diagnoses from systems focused on billing. Now, with the push for electronic medical records, there are a lot more codes.

My research interest is in looking at the records across several million people. We can discover new medical knowledge: the causes of diseases and maybe even the cures -- which procedures have worked and which haven't.

Anything you do with that you have to take with a grain of salt, because the association that you find may be due to something that you didn't anticipate. Even something that seems cool has a high chance of being incorrect. It could be related to something more mundane.

The fact that you can't do a randomized clinical trial is a big limitation. But over the years I've come to believe that there are a lot of things that you'd want to ask in medicine that are not amenable to clinical trials. That they would be unethical is the biggest barrier. So there are a lot of good ideas that could, instead, come out of the data.

Dan:
What is an example of being led astray by not asking the right question?
Dana:
If people who took drug A instead of drug B got a worse outcome, that would suggest that drug A is a bad drug. But it might turn out that the reason they did worse has nothing to do with the effectiveness of the drug. Perhaps the reason is that drug A is cheaper and those people have lower incomes and are forced to use the cheaper drug, or maybe those people are sicker, or they can't afford follow-up appointments. Those are so hard to untangle. We must treat each of the associations as a possibility. If it looks very compelling, you could follow it up with a randomized clinical trial.
Dan:
If you are clever with your controls, can you create studies that are reliable?
Dana:
You can go in that direction. At Kaiser, we looked for covariates, things that could affect outcomes, and corrected for them, either by balancing the covariates in each subset of the patient population or by including them statistically in a multivariate regression model. But you never know that you've found the last one, the last variable that may provide an alternative explanation. When you publish a paper, you have to do that; you have to offer every reasonable explanation for the phenomenon that you're hypothesizing. But it still doesn't prove that it's correct.
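(A minimal sketch of the regression adjustment described above, assuming a hypothetical patient-level table with a treatment flag, an outcome, and a few covariates; the file and column names are illustrative, not Kaiser's actual data.)

```python
# Compare drug A vs. drug B with and without covariate adjustment.
# All column names (drug_a, bad_outcome, age, income, severity) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cohort.csv")  # one row per patient (hypothetical extract)

# Unadjusted comparison: does drug A look worse than drug B?
print(df.groupby("drug_a")["bad_outcome"].mean())

# Adjusted comparison: put the plausible confounders into a logistic
# regression alongside the drug indicator.
model = smf.logit("bad_outcome ~ drug_a + age + income + severity", data=df).fit()
print(model.summary())

# Caveat from the interview: an unmeasured covariate could still explain
# whatever association remains after adjustment.
```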
Dan:
When you were at Kaiser, you had the most phenomenal data resource, including the "Multiphasic" data going all the way back to the 1960s. How deep is your data resource at UCSF?
Dana:
Now that I'm making some progress with STOR (Summary Time-Oriented Record, developed by Drs. Whiting-O'Keefe, Arellano, and Simborg at UCSF in the 1960s and 1970s), we may go back farther than Kaiser. But the size of the patient population is about one fourth of Kaiser's. Northern California Kaiser has about three million active patients and UCSF has 600,000 or 700,000.

More important than the population size is the continuity. If I get a drug at Kaiser, I know whether the prescription was filled because the Kaiser patients get better prices, whereas at UCSF I know only that they got a prescription, not whether they filled it. UCSF doesn't have an outpatient pharmacy, so that data does not exist.

Further, if a patient at Kaiser has a two-year blank in the medical record it's probably because they were healthy and needed no care. But at UCSF if a patient disappears for a few years we don't know whether it's because they went somewhere else.

Kaiser data is better for longitudinal studies. My hope is that at UCSF we use the data for the things in which it is strong: acute care medicine. For the duration of one illness, we can look at hospital practices -- the cutting-edge stuff -- but you wouldn't have the whole picture of the patient over ten years.

Dan:
What is the service you provide to your customers?
Dana:
My official role is to build systems and also to serve as an "Honest Broker". Honest Brokers can't do research because they have access to all the data including the confidential parts. I de-identify it for the real researcher. If I were a researcher, I would want to get as much data as I could, but without the explicit approval of the Institutional Review Board (IRB), I might be tempted to peek at stuff that I'm not authorized to see, whereas the Honest Broker has no temptation to misuse the data. As an Honest Broker I give them de-identified, high-value data: a massive flat file data set from which they can extract diagnoses, procedures, and medications.
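(An illustrative sketch of the kind of de-identified flat-file extract an Honest Broker might produce: drop direct identifiers, replace the medical record number with a study ID, and shift dates per patient. The schema and column names are hypothetical, not UCSF's actual pipeline, and real de-identification involves considerably more than this.)

```python
# Hypothetical Honest Broker style extract: remove identifiers, assign a
# random study ID per patient, and shift each patient's dates by a random offset.
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
encounters = pd.read_csv("encounters.csv")   # identified extract (hypothetical)

# One random study ID and one random date shift (0-364 days) per patient.
patients = encounters[["mrn"]].drop_duplicates().reset_index(drop=True)
patients["study_id"] = rng.permutation(len(patients)) + 1
patients["shift_days"] = rng.integers(0, 365, len(patients))

deid = encounters.merge(patients, on="mrn")
deid["encounter_date"] = (pd.to_datetime(deid["encounter_date"])
                          + pd.to_timedelta(deid["shift_days"], unit="D"))

# Keep the high-value columns (diagnoses, procedures, medications), keyed by study_id.
deid = deid.drop(columns=["mrn", "patient_name", "address", "shift_days"])
deid.to_csv("deidentified_flat_file.csv", index=False)
```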

Kaiser doesn't do it that way. There, the analysts who access the data all have 100% access. The presumption is that once an investigator has IRB approval, they can have access, but they never lose it. It's more of a trust system.

By comparison, the Honest Broker system is more awkward. Investigators' hands are tied, limiting their ability to do exploratory work. But they can explore in the de-identified data that I give them. They could have an idea in the middle of the night and say, "I wonder whether there is an association between this and that..."

The problem with the de-identified data is that it is more limited in scope. For example, "I know that these patients were exposed to that drug. What was the dose? Was it delivered by IV? How much did they get per day?" When you go deeper, trying to know things in a more precise way, you're looking up more esoteric things, like: for every patient who was intubated, what were the beginning and end dates of their intubation? Often you can't find that in the de-identified data sets that I supply; it's too specialized. It's in Epic but it's hard to find. It's in a deep layer of the onion, not in the surface layer that everyone uses, which has things like diagnoses.

Another example is blood bank data. Which patients have received transfusions? This is not in the data that I have de-identified though it is in Epic.

Epic has 13,000 tables. The high-value data set is in around twenty tables. A majority of the researchers will want more than that; eventually they'll require deeper access to the data.

Dan:
Are the topics of the studies most commonly about drug efficacy? What are typical questions that your researcher clients ask you to help them answer?
Dana:
The most typical one is, "Give me a list of people with disease X so I can recruit them for a study." Researchers will do their own data collection once they recruit the people. The data that they collect in their own studies do not come back to us. We'd have no place to put it. Our hands are already full with those 13,000 tables.
Dan:
In addition to the databases in Epic, what resources do you use?
Dana:
I've been blessed with not having to go beyond Epic. Epic includes both financial and clinical data. It has a lot of text, too.

Others I work with do use other sources. The biggest is SFGH (San Francisco General Hospital), which has non-Epic EHRs (Electronic Health Records).

The next biggest is probably STOR. It has an excellent database written in MUMPS. It was the best database at UCSF from 1988 until go-live of Epic -- called APeX at UCSF -- in June of 2012. Once I get done with my STOR project we should be able to do our own requests on its data.

Dan:
What's the hardest part of your job?
Dana:
Not really knowing for sure what you're looking at. In the ideal world, we would work with our client, the investigator, and say, "This is what I'm seeing. What are you seeing?" Every department has its own view, for example the blood bank. We can consult with the specialists, for example in the pharmacy, but it's hard to do because they are not experts in Epic. They know it well enough to get their work done. In a big project I would work with that department to understand their workflow and what their codes really mean. There is a vast amount of Epic documentation about the data, but it's never enough, since there is a lot of flexibility in how a hospital uses Epic. Epic themselves couldn't answer many questions.

For example, say code 3 is defined as inpatient, but what it really means is either inpatient or emergency department; and if it's emergency department, it's hard to know whether the patient went home afterward or was admitted to the hospital. It's possible to figure this out, but not easy; detailed analysis is needed. We wouldn't have known this without having talked to the people who set it up. They are the ones who decided what code 3 means. Many of those people were on Russ Cucina's team.
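(A sketch of the kind of detailed analysis that disambiguation might take: check whether an inpatient admission follows the code-3 encounter. Table names, columns, and the six-hour window are all hypothetical, not Epic's actual schema.)

```python
# Disambiguate hypothetical "code 3" encounters: ED visit that went home vs.
# ED visit followed by an inpatient admission.
import pandas as pd

enc = pd.read_csv("encounters.csv")    # study_id, enc_id, type_code, end_time
adm = pd.read_csv("admissions.csv")    # study_id, admit_time

enc["end_time"] = pd.to_datetime(enc["end_time"])
adm["admit_time"] = pd.to_datetime(adm["admit_time"])

code3 = enc[enc["type_code"] == 3]
merged = code3.merge(adm, on="study_id", how="left")

# Treat the visit as "admitted" if any admission starts within 6 hours of the
# encounter's end; otherwise assume the patient went home.
window = pd.Timedelta(hours=6)
merged["admitted"] = ((merged["admit_time"] >= merged["end_time"]) &
                      (merged["admit_time"] <= merged["end_time"] + window))
print(merged.groupby("enc_id")["admitted"].any().value_counts())
```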

What I wish we could do is let the data speak for itself through neural networks. Let's say I was trying to look for diabetics. We could start by looking for the diagnosis codes, but we would also want algorithms for looking in the 13,000 tables to find variables that discriminate between people with diabetes and everybody else. For example, it could find that diabetics are also taking insulin or oral hypoglycemic drugs. Machine learning algorithms might find associations that I don't know about by crawling through the massive tables. Given the 100,000 variables that we have online, have the machines crawl through and see which ones are predictive of that diagnosis code. I don't know whether the algorithms can do a better job than the experts can, but I'd like to find out. The goal of the neural networks is to let the algorithm discover the attributes that are predictive rather than presupposing that the experts can name them.
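(A sketch of that idea using off-the-shelf machine learning rather than a neural network: train a classifier to predict the diabetes diagnosis code from a wide table of variables, then rank which variables it leans on. The feature file and its columns are hypothetical; the real work is flattening the 13,000 tables into such a matrix.)

```python
# Let an algorithm rank which variables discriminate diabetics from everyone else.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("patient_features.csv")   # hypothetical wide per-patient table
y = data["has_diabetes_dx"]                  # label: diabetes diagnosis code present
X = data.drop(columns=["has_diabetes_dx", "study_id"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Variables the model found most predictive; insulin and oral hypoglycemic
# orders should float to the top if the data behave as expected.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```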

Dan:
Are you involved in the transition mandated by the ACA toward data interchange and interoperability?
Dana:
I find that frustrating because when we go between institutions everybody has a different standard. My group was trying to map our data to three different standards. It's labor-intensive.

An important one is ICD-10 which is central to how Medicare reimburses.

A promising one did come out of Obamacare, called PCORnet. (PCORI stands for Patient-Centered Outcomes Research Institute, a government-sponsored program to investigate the relative effectiveness of various medical treatments).

Others are not being fully adopted, for example the drug standard RxNorm. Only one third of our drugs are coded with RxNorm.

Another is the lab standard LOINC. It's mandated in certain subsets but I find that in our database it's used to code only half of our labs.

ICD-9 is 100%. ICD-10 is 100% -- you need it if you're going to bill.

Dan:
Who are your big data heroes? What were some insights that you got from them?
Dana:
My current hero is Marina Sirota. She's in Atul Butte's lab (here at UCSF). She's a hero because she found out something that Atul's lab is very interested in reproducing in other domains. They take an old drug and apply it to a new disease. From publicly available databases, they look at gene expression in the RNA. The RNA is transient. It's different in different organs -- liver genes do liver work; brain genes do brain work -- so you have to know which sample you have. The RNA goes up when it's needed and comes back down when it's not, because the cell doesn't need to produce more of the protein. They look at the RNA expression profile in a disease state and then they look at it in patients who have been given a drug.

The theory that Marina pioneered was that these should be moving in opposite directions. If you want to reverse the effects of a disease, you want a drug that suppresses or increases the RNA expression in the opposite direction of what the disease is doing. With that simple principle, they were able to take an old drug that was used for epilepsy and hypothesize that it would be effective for Crohn's disease. As a grad student, she found this association in a pattern of gene expressions in publicly available data. Then, in Atul's lab, they tested the epilepsy drug in rats with laboratory-induced Crohn's disease. The half of the rats that got the old epilepsy drug had their Crohn's disease cured.
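(A toy sketch of the reversal principle: a drug becomes a repurposing candidate when its expression changes run opposite to the disease's. The gene names and fold-change numbers are invented; the published work used ranked signatures drawn from public expression databases.)

```python
# Toy illustration of signature reversal: anti-correlated expression changes.
import numpy as np
from scipy.stats import spearmanr

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
disease = np.array([ 2.1, -1.4,  0.9,  1.7, -2.0])   # log fold-change vs. control
drug    = np.array([-1.8,  1.1, -0.7, -1.5,  1.9])   # log fold-change vs. control

for g, d, r in zip(genes, disease, drug):
    print(f"{g}: disease {d:+.1f}, drug {r:+.1f}")

rho, p = spearmanr(disease, drug)
print(f"rank correlation = {rho:.2f} (p = {p:.3f})")

# A strongly negative correlation means the drug pushes these genes in the
# opposite direction from the disease, the pattern this method looks for.
if rho < -0.8:
    print("candidate for repurposing against this disease")
```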

I heard her present this about ten years ago. Now she is an assistant professor on a tenure track. She came with Atul from Stanford. She is extremely capable and a really deep thinker. She's also dabbled in neural networks.

Dan:
At Kaiser, one of your heroes was Morris Collen. Why?
Dana:
His idea was that Kaiser's data could be used to discover medical knowledge. The first couple of years I worked with him, he said, "Build the database." After that he said, "I don't want to keep building databases. Let's do something with it." He wanted to study complications of polypharmacy in the elderly. He was elderly. He was taking twelve pills. Surely there were unexpected interactions. It would take a machine learning algorithm to cope with something that complicated.

Another of his intuitions was that we would need supercomputers. Before neural networks were actually working, Morris Collen said in 2006 that we needed supercomputers to do this discovery. Everybody said, no, it's more important to get the right algorithm and the computing power is secondary. Now the neural network people say the algorithm is less important than the computing power. He was way ahead of his time.

He wrote several books on medical informatics. We called it data mining back then; today it's called data analytics or machine learning. He thought that was a worthy goal. At the time, others in DOR (Kaiser's Department of Research) scoffed at the idea but Morris -- ironically for an old-timer -- was thinking ahead of us.

Those were inspirational relationships.

Dan:
Where is the best work being done today?
Dana:
I hate to say this because it's not here, but at Vanderbilt University (Nashville, TN) they use the medical record to define cohorts for genetic analysis. They call them phenotypes. Diabetes is a phenotype, and they're using the medical record to get a refined definition of what a diabetic is. Then they compare that with their DNA assays to discover new knowledge. When I read their papers I think, that's what I want to do.
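(A sketch of a rule-based EHR phenotype in the spirit of those cohort definitions: call a patient a diabetes case when at least two kinds of evidence agree. The file layouts are hypothetical; the ICD-10 prefix E11 for type 2 diabetes, the LOINC code 4548-4 for hemoglobin A1c, and the 6.5% threshold are real, but actual phenotyping algorithms are far more elaborate.)

```python
# Rule-based phenotype sketch: "diabetic" = at least two kinds of evidence.
import pandas as pd

dx = pd.read_csv("diagnoses.csv")     # study_id, icd10_code (hypothetical layout)
rx = pd.read_csv("medications.csv")   # study_id, drug_class
labs = pd.read_csv("labs.csv")        # study_id, loinc_code, value

has_dx = set(dx.loc[dx["icd10_code"].str.startswith("E11"), "study_id"])
on_rx = set(rx.loc[rx["drug_class"].isin(["insulin", "oral_hypoglycemic"]),
                   "study_id"])
high_a1c = set(labs.loc[(labs["loinc_code"] == "4548-4") &
                        (labs["value"] >= 6.5), "study_id"])

# Require at least two independent kinds of evidence before calling a case.
evidence = [has_dx, on_rx, high_a1c]
cases = {pid for pid in set().union(*evidence)
         if sum(pid in e for e in evidence) >= 2}
print(f"{len(cases)} phenotype-positive patients")
```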

Stanford is doing a lot more informatics than UCSF but hopefully that will change in the near future.

Dan:
What are trends or future directions that you see?
Dana:
The most amazing trend is that the medical community now takes seriously the value of medical data. When I was in training (to be an MD) in the late 1970s, they considered it "cookbook medicine". They didn't want standards or protocols imposed on them. They said medicine was as much an art as a science and requires the judgment of the expert. Now with all the talk of evidence-based care that has changed.

Now we have "personalized" or "precision" medicine. I'm not a big believer. But it's a big trend, with a lot of buzz. Multiple myeloma has some sub-groups, based on the tumor genetics, that make it more susceptible to one drug than another. So there is some reality to it, but the hype is that everybody should have their own very personalized treatment.

Dan:
Are you vulnerable to the ups and downs of academic funding?
Dana:
Not too much in my day job, since we're a shared resource across the whole University. I don't mind the risks of academic funding because I'm near retirement, and I would rather have a shot at doing something significant.
