Presenter:
Sophie (Yajuan) Si, Duke University, Department of Statistical Science
Abstract:
Large-scale educational assessment surveys collect student/teacher/school
background information to investigate
impacts on student achievement. Because questionnaires are self-reported
without penalty for nonresponse,
the data typically have a notable proportion of missingness. Multiple
imputation has been a popular tool
for propagating the uncertainty introduced by missing data and shifting the
burden of handling missingness from data analysts to data imputers.
Standard MI modeling approaches have drawbacks in high-dimensional
settings. For example, multivariate normal distribution assumptions are not
reliable for categorical data; multiple imputation via chained equations
(MICE) with only main effects fails
to capture complex structures; and loglinear models become impractical
for high-dimensional datasets due to extremely sparse cell counts. Without
restricting the dependence structure a priori while favoring sparsity, we
propose a Dirichlet process mixture of products of multinomial
distributions for high-dimensional nominal categorical
variables, with efficient and exact posterior computation
algorithms.
Repeated sampling studies show that this approach outperforms competitors
such as MICE and sequential classification and regression trees.
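
To make the mixture-of-products structure concrete, here is a minimal sketch of how one incomplete record could be imputed once mixture parameters are in hand. It is an illustration only, not the algorithm from the talk: the function names, the truncation of the Dirichlet process to a fixed number of components (an approximation), and the toy parameter layout are all assumptions.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Draw mixture weights by truncated stick-breaking (approximates a DP prior)."""
    weights, remaining = [], 1.0
    for h in range(n_components):
        # The last stick takes all remaining mass so the weights sum to 1.
        v = 1.0 if h == n_components - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def impute_record(record, weights, probs, rng):
    """Fill the None entries of one record, given (hypothetical) mixture parameters.

    probs[h][j][c] is the probability that variable j takes category c
    in mixture component h.
    """
    # The posterior over the latent component uses only the observed entries.
    post = []
    for h, w in enumerate(weights):
        like = w
        for j, x in enumerate(record):
            if x is not None:
                like *= probs[h][j][x]
        post.append(like)
    z = rng.choices(range(len(weights)), weights=post)[0]
    # Within a component the variables are independent, so each missing
    # entry is drawn separately from its own multinomial.
    return [
        x if x is not None
        else rng.choices(range(len(probs[z][j])), weights=probs[z][j])[0]
        for j, x in enumerate(record)
    ]
```

In a full sampler this step would alternate with updates of the weights and category probabilities; repeating it across several posterior draws is what yields multiple completed datasets.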
Joint work with Jerry Reiter