Anya Katsevich is one of four faculty joining the Department of Statistical Science this fall. (John West/Trinity Communications)

Anya Katsevich: Taming Big Data With Old Statistics

Do old statistical methods work with today’s big data? That’s one of the questions that drives Anya Katsevich, assistant professor of Statistical Science. “There are a number of classical methods — nice reliable methods — which work in the regime of small data,” she said, “but we’re not sure anymore whether we can rely on them.”

She’s witnessed the confusion: “I’ve seen scientists use a method when it’s questionable, and I’ve also seen people lament in papers, ‘We wish we could use this method but we’re not sure it works anymore, so we’re going to use something else.’”

While some theoretical statisticians are developing entirely new methods, Katsevich finds it more practical to investigate whether the tried-and-true methods still work and, if so, under what conditions.

One “old reliable” in the statistics world is the Laplace approximation, which has been around since the 1700s. It works as long as the dimension — that is, the number of variables for each item in a sample — is not too large. In many medical studies, there might be tens of thousands of gene expression levels (variables) per patient in the sample. Is that too many variables?
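For readers who want the mathematics, the textbook form of the approximation (a general statement, not Katsevich's specific result) replaces a hard-to-compute integral over all d variables with a quantity evaluated at a single peak point:

```latex
% Textbook Laplace approximation in d dimensions: for a smooth function f with a
% unique interior maximizer \hat{\theta} and sample size n,
\[
  \int_{\mathbb{R}^d} e^{\,n f(\theta)} \, d\theta
  \;\approx\;
  e^{\,n f(\hat{\theta})}
  \left(\frac{2\pi}{n}\right)^{d/2}
  \det\!\left(-\nabla^2 f(\hat{\theta})\right)^{-1/2}.
\]
```

The error in this shortcut grows with the dimension d, which is exactly where her analysis comes in.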

In her postdoc at MIT, Katsevich set out to answer the question: “How big can the dimension get for us to still trust this method, and when does it start to break?” 

She found a pretty straightforward answer: “My research shows that the dimensionality can be as large as the square root of the sample size. If you have a million patients, then you can do this procedure for a thousand genes.” (For you scientists out there, she cautions there is some additional nuance to the square-root rule: “It does depend on the exact problem you’re working on.”)

She also works on methods that more accurately describe the margin of error for the probability of certain events, particularly when that probability is small and the margin of error is large enough to push it all the way down to zero. “If you’re trying to understand whether a particular gene is implicated in cancer, it really matters if it’s zero or non-zero,” she said.

As a theoretician, Katsevich doesn’t work on scientific or clinical studies herself, but illuminates the mathematical principles behind statistical methods. By doing so, she hopes to support scientific research as a whole by providing clarity about when it’s appropriate to apply certain statistical methods and when it’s not.

Katsevich is glad to be at Duke, where the Department of Statistical Science is well known for its focus on Bayesian statistical methods, which can involve intensive and expensive numerical computation. “Computer power is what enables us to carry out these computations,” she said. “That’s one reason why Bayesian inference is becoming more tenable and popular these days. Before, we couldn’t do these calculations except in very simple scenarios.”

Yet even today’s powerful computers can’t keep up with all the calculations that Bayesian statistics requires if the dataset is huge. “It’s important for people to come up with streamlined computational techniques to help the computer,” she said. And that’s one of the things she does.
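The article doesn't spell out her specific techniques, but the flavor of such a shortcut can be sketched with the Laplace approximation itself: instead of drawing expensive samples from a posterior distribution, approximate it with a Gaussian centered at its peak. The toy logistic-regression model, simulated data and variable names below are invented purely for illustration.

```python
# A minimal sketch (not Katsevich's own code) of one such computational shortcut:
# replacing expensive posterior sampling with a Gaussian centered at the posterior
# mode, i.e. the Laplace approximation. The toy model and data are illustrative only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 1_000, 5                      # sample size and dimension (d well below sqrt(n))
X = rng.normal(size=(n, d))          # simulated covariates (e.g., gene expression levels)
beta_true = rng.normal(size=d)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))   # simulated outcomes

def neg_log_posterior(beta):
    """Negative log posterior: logistic likelihood plus a standard normal prior."""
    logits = X @ beta
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    log_prior = -0.5 * np.sum(beta ** 2)
    return -(log_lik + log_prior)

def neg_log_posterior_hessian(beta):
    """Hessian of the negative log posterior (available in closed form here)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ ((p * (1.0 - p))[:, None] * X) + np.eye(d)

# Step 1: find the posterior mode with a standard optimizer (cheap).
mode = minimize(neg_log_posterior, x0=np.zeros(d), method="BFGS").x

# Step 2: approximate the posterior by a Gaussian with that mode as its mean
# and the inverse Hessian at the mode as its covariance.
cov = np.linalg.inv(neg_log_posterior_hessian(mode))

print("approximate posterior means:", np.round(mode, 2))
print("approximate posterior std devs:", np.round(np.sqrt(np.diag(cov)), 2))
```

A shortcut like this runs in a fraction of the time a sampling-based computation would take; her theory speaks to when the answer it gives can be trusted.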

When not figuring out statistical theory, Katsevich enjoys tackling much simpler problems. “I’m trying to figure out if there is good swing dancing in Durham,” she said.

Katsevich is one of four faculty joining the Statistical Science department this fall. Read more about her new colleagues, Omar Melikechi, Lasse Vuursteen and Sifan Liu.