Multiple Imputation for Large-scale Incomplete Categorical Data

Presenter: 
Sophie (Yajuan) Si, Duke University, Department of Statistical Science
yajuan.si@stat.duke.edu
Date: 
Monday, January 30, 2012 - 4:30pm to 5:30pm
Location: 
Old Chemistry 116


Abstract: 
Large-scale educational assessment surveys collect student, teacher, and school background information to investigate impacts on student achievement. Because the questionnaires are self-reported and carry no penalty for nonresponse, the data typically have a notable proportion of missing values. Multiple imputation (MI) is a popular tool for propagating the uncertainty introduced by missing data and for shifting the burden of handling missingness from data analysts to data imputers. Standard MI modeling approaches have drawbacks in high-dimensional settings. For example, multivariate normal distribution assumptions are not reliable for categorical data; multiple imputation via chained equations (MICE) with only main effects fails to capture complex structures; and loglinear models become impracticable for high-dimensional datasets because of extremely sparse cell counts. To avoid restricting the dependence structure a priori while still favoring sparsity, we propose using a Dirichlet process mixture of products of multinomial distributions for high-dimensional nominal categorical variables, with efficient and exact posterior computation algorithms. Repeated sampling studies show that this approach outperforms competitors such as MICE and sequential classification and regression trees. Joint work with Jerry Reiter.
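
For readers unfamiliar with the model class named in the abstract, one common way to write a Dirichlet process mixture of products of multinomials is sketched below. This is an illustration only, not necessarily the speaker's exact specification: the truncation level H, the stick-breaking representation, and the hyperparameters a_{jc}, a_\alpha, b_\alpha are assumptions that do not appear in the abstract. Here x_{ij} is the response of individual i on categorical variable j, which has d_j levels, and z_i is a latent class label.

```latex
% Sketch under stated assumptions: truncated stick-breaking with H components;
% H, a_{jc}, a_\alpha, b_\alpha are illustrative and not taken from the abstract.
\begin{align*}
x_{ij} \mid z_i, \psi &\sim \text{Categorical}(\psi_{z_i j 1}, \dots, \psi_{z_i j d_j}),
  && i = 1,\dots,n,\; j = 1,\dots,p,\\
z_i \mid \pi &\sim \text{Categorical}(\pi_1, \dots, \pi_H),\\
\pi_h &= V_h \prod_{g<h} (1 - V_g),
  && V_h \sim \text{Beta}(1, \alpha),\; h < H,\; V_H = 1,\\
\psi_{hj} &\sim \text{Dirichlet}(a_{j1}, \dots, a_{j d_j}),\\
\alpha &\sim \text{Gamma}(a_\alpha, b_\alpha).
\end{align*}
```

Within a latent class the variables are independent, so the joint cell probability is the mixture \sum_h \pi_h \prod_j \psi_{h j c_j}; dependence among variables arises only through the mixing over classes. In a Gibbs sampler for a model of this form, and assuming the missingness is ignorable, a missing x_{ij} can be imputed by drawing from Categorical(\psi_{z_i j 1}, \dots, \psi_{z_i j d_j}) given the current class label z_i.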




Series: 
Statistical Science Seminar Series