Research Statistician, RTI International, 2015-Present
Model selection and multivariate inference using data multiply imputed for disclosure limitation and nonresponse
This thesis proposes inferential methods for use with multiple imputation for missing data and statistical disclosure limitation, and describes an application of multiple imputation to protect data confidentiality. A third component concerns model selection in random effects models.

The use of multiple imputation to generate partially synthetic public release files for confidential datasets has the potential to limit unauthorized disclosure while allowing valid inferences to be made. When confidential datasets contain missing values, it is natural to use multiple imputation to handle the missing data simultaneously with the generation of synthetic data. This is done in a two-stage process so that the variability may be estimated properly. The combining rules for data multiply imputed in this fashion differ from those developed for multiple imputation in a single stage. Combining rules for scalar estimands have been derived previously; here hypothesis tests for multivariate components are derived.

Longitudinal business data are widely desired by researchers, but difficult to make available to the public because of confidentiality constraints. An application of partially synthetic data to the U.S. Census Longitudinal Business Database is described. This is a large, complex economic census for which nearly the entire database must be imputed in order for it to be considered for public release. The methods used and analytical results for synthetic data generated for a subgroup are described. Modifications to the multiple imputation combining rules for population data are also developed.

Model selection is an area in which few methods have been developed for use with multiply-imputed data. Careful consideration is given to how Bayesian model selection can be conducted with multiply-imputed data. The usual assumption of correspondence between the imputation and analyst models is not amenable to model selection procedures. Hence, the model selection procedure developed incorporates the imputation model and assumes that the imputation model is known to the analyst.

Lastly, a model selection problem outside the multiple imputation context is addressed. A fully Bayesian approach for selecting fixed and random effects in linear and logistic models is developed, utilizing a parameter-expanded stochastic search Gibbs sampling algorithm to estimate the exact model-averaged posterior distribution. This approach automatically identifies subsets of predictors having nonzero fixed coefficients or nonzero random effects variance, while allowing for uncertainty in the model selection process.
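For readers unfamiliar with the single-stage combining rules that the abstract contrasts with the two-stage rules, the following is a minimal sketch of the standard scalar versions: Rubin's rule for missing data and the corresponding rule for partially synthetic data. It is not the two-stage or multivariate procedure developed in the thesis. The function name `combine_single_stage` and its arguments are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance


def combine_single_stage(estimates, variances, synthetic=False):
    """Combine m point estimates and their variance estimates.

    Hypothetical helper for illustration only. It implements the standard
    single-stage scalar combining rules, not the two-stage rules from the
    thesis.

    estimates: point estimates q_1..q_m, one per imputed dataset
    variances: variance estimates u_1..u_m, one per imputed dataset
    synthetic: False -> missing-data rule,      T = u_bar + (1 + 1/m) * b
               True  -> partially synthetic rule, T = u_bar + b / m
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = mean(estimates)   # combined point estimate
    u_bar = mean(variances)   # average within-imputation variance
    b = variance(estimates)   # between-imputation variance (m - 1 denominator)

    if synthetic:
        total = u_bar + b / m
    else:
        total = u_bar + (1 + 1 / m) * b
    return q_bar, total


if __name__ == "__main__":
    # Made-up numbers purely to exercise the function.
    q = [1.0, 2.0, 3.0]
    u = [0.5, 0.5, 0.5]
    print(combine_single_stage(q, u))
    print(combine_single_stage(q, u, synthetic=True))
```

The two rules differ only in how the between-imputation variance enters the total, which is why data imputed in two stages (missing data first, synthesis second) need their own variance estimator.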