Assistant Professor of Statistical Science
I am an Assistant Professor in the Department of Statistical Science at Duke University and affiliated faculty in Computer Science, Biostatistics and Bioinformatics, the Information Initiative at Duke (iiD), and the Social Science Research Institute. I also hold a Schedule A appointment at the U.S. Census Bureau.
My main research focus is entity resolution (also known as record linkage or deduplication), where the goal is to remove duplicated information from large, noisy databases in the absence of unique identifiers. In my research, I develop flexible methods for entity resolution that handle the uncertainty of the record linkage process and can be easily integrated with post-linkage statistical analyses, such as logistic regression or capture-recapture. A strength of the methods I propose is that they maintain low error rates (as measured through precision and recall) and outperform the state-of-the-art methods in the literature on these metrics. Furthermore, I have developed the first performance bounds for a general class of entity resolution models, illustrating when the bounds hold in practice. I also proposed a new methodology for entity resolution based on the observation that the size of the clusters grows sublinearly with the number of records, which contrasts with many other clustering processes. In turn, this has led to a general, scalable class of models for clustering tasks with sublinear growth, whose success I have illustrated for entity resolution.
In addition to approaching entity resolution from a Bayesian perspective, I also approach it using statistical machine learning. Specifically, I have leveraged locality sensitive hashing (LSH) as a dimension reduction technique for entity resolution and developed fast ways of estimating the number of unique clusters in very large databases. My collaborators and I have also shown that these methods have desirable theoretical properties and scale well.
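For readers unfamiliar with LSH, the following is a minimal sketch of one common variant, MinHash with banding, used to reduce the number of record pairs that need to be compared. It is a generic illustration under stated assumptions, not the specific method described above: the function names, the character-shingle representation, and the parameter values (32 hash functions, 16 bands) are all hypothetical choices made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a whitespace-normalized record."""
    s = " ".join(text.lower().split())
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record ids whose signatures agree on at least one band.

    `records` maps a record id to its text. Only the returned pairs would be
    passed on to a more expensive comparison or linkage model.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Records sharing a full band of the signature land in one bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records; r1 and r2 have identical text.
records = {
    "r1": "John Smith, 12 Main St",
    "r2": "John Smith, 12 Main St",
    "r3": "Maria Garcia, 400 Elm Ave",
}
pairs = candidate_pairs(records)
```

Records with identical text always share every band and are therefore always returned as a candidate pair. Records that are merely similar are returned with a probability that increases with the Jaccard similarity of their shingle sets, and the number of bands and rows per band control that trade-off between missed matches and wasted comparisons.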
Duke Machine Learning
I am heavily involved in integrating computation into both the graduate and undergraduate statistics curricula, emphasizing reproducible research and real, complex data sets. All of the courses I teach at Duke can be found on GitHub. In addition, I have taught the first undergraduate machine learning course in Statistical Science, and I am working with students and the undergraduate ML board to give machine learning a greater presence on campus through the Duke Undergraduate Machine Learning (ML) Program (http://dukeml.org/). We have a student-run seminar series (MLBytes), bootcamps, a machine learning day, and a datathon.