Instructor in the Social Science Research Institute
Asst. Professor of the Practice, Duke Social Science Research Institute, May 2019-Present
A Comparison of Multiple Imputation Methods for Categorical Data
This thesis evaluates the performance of several multiple imputation methods for categorical data: multiple imputation by chained equations using generalized linear models, multiple imputation by chained equations using classification and regression trees, and nonparametric Bayesian multiple imputation for categorical data (using the Dirichlet process mixture of products of multinomial distributions model). The performance of each method is evaluated with repeated sampling studies using housing unit data from the 2012 American Community Survey. These data afford exploration of practical problems such as multicollinearity and large dimensions. This thesis highlights some advantages and limitations of each method compared to the others. Finally, it provides suggestions on which method should be preferred, and the conditions under which those suggestions hold.
Bayesian Models for Imputing Missing Data & Editing Erroneous Responses in Surveys
This thesis develops Bayesian methods for handling unit nonresponse, item nonresponse, and erroneous responses in large-scale surveys and censuses containing categorical data. I focus on applications to nested household data where individuals are nested within households and certain combinations of the variables are not allowed, such as the U.S. Decennial Census, as well as surveys subject to both unit and item nonresponse, such as the Current Population Survey. The first contribution is a Bayesian model for imputing plausible values for item nonresponse in data nested within households, in the presence of impossible combinations. The imputation is done using a nested data Dirichlet process mixture of products of multinomial distributions model, truncated so that impossible household configurations have zero probability in the model. I show how to generate imputations from the Markov chain Monte Carlo sampler, and describe strategies for improving the computational efficiency of the model estimation. I illustrate the performance of the approach with data that mimic the variables collected in the U.S. Decennial Census. The results indicate that my approach can generate high-quality imputations in such nested data. The second contribution extends the imputation engine in the first contribution to allow for the editing and imputation of household data containing faulty values. The approach relies on a Bayesian hierarchical model that uses the nested data Dirichlet process mixture of products of multinomial distributions as a model for the true unobserved data, but also includes a model for the location of errors, and a reporting model for the observed responses in error. I illustrate the performance of the edit and imputation engine using data from the 2012 American Community Survey.
I show that my approach can simultaneously estimate multivariate relationships in the data accurately, adjust for measurement errors, and respect impossible combinations in estimation and imputation. The third contribution is a framework for using auxiliary information to specify nonignorable models that can handle both item and unit nonresponse simultaneously. My approach focuses on how to leverage auxiliary information from external data sources in nonresponse adjustments. This method is developed for specifying imputation models so that users can posit distinct specifications of missingness mechanisms for different blocks of variables, for example, a nonignorable model for variables with auxiliary marginal information and an ignorable model for the variables exclusive to the survey. I illustrate the framework using data on voter turnout in the Current Population Survey. The final contribution extends the framework in the third contribution to handle nonresponse in complex surveys, so that we can still leverage auxiliary data while respecting the survey design through survey weights. Using several simulations, I illustrate the performance of my approach when the sample is generated primarily through stratified sampling.
I work on developing statistical methodology for handling missing and faulty data, with particular emphasis on applications that intersect with the social sciences. I am especially motivated to develop methods that can be readily applied by statistical agencies and data analysts. I completed my PhD in statistical science at Duke in 2019, under the supervision of Jerry Reiter. I obtained an MSc in Statistical and Economic Modeling from Duke in 2015, and a BSc in Mathematics and Statistics from the University of Lagos, Nigeria, in 2010. Prior to coming to Duke, I worked as an analyst at KPMG Professional Services, Nigeria, between 2011 and 2012.