Lecturer, University of Southampton, UK
Bayesian Methods to Impute Missing Covariates for Causal Inference and Model Selection
This thesis presents new approaches to handling missing covariate data in two settings: matching in observational studies and model selection for generalized linear models.

In observational studies, inferences about treatment effects are often affected by confounding covariates. Analysts can reduce bias due to differences between the observed covariates of treated and control units using propensity score matching, which yields a matched control group with characteristics similar to those of the treated group. Propensity scores are typically estimated from the data using logistic regression. When covariates are partially observed, missing values can be filled in using multiple imputation, and propensity scores estimated from the imputed data sets can then be used to find a matched control set.

Typically, covariates in observational studies are spread thinly over a large space, and it is not always clear what an appropriate imputation model for the missing data should be. Implausible imputations can influence which matches are selected and hence the estimate of the treatment effect. In propensity score matching, control units tend to be selected from among those lying in the treated units' covariate space, so we would like to generate plausible imputations for these units' missing values. I investigate the use of a general location model with two latent classes to impute missing covariates: one class comprises units whose covariates lie in the region of the treated units' covariate space, and the other comprises all remaining units.

When multiply imputing missing covariates in observational studies, the analyst has several options for estimating treatment effects. I consider two. The first averages propensity scores across imputed data sets; the averaged scores are then used to find a matched control set and estimate the treatment effect. The second estimates the treatment effect within each imputed data set and then averages the resulting estimates.
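The two ways of combining matching across multiply imputed data sets can be sketched in a toy simulation. Everything below (the single covariate, the crude normal imputation model, the gradient-ascent logistic fit, and 1:1 nearest-neighbour matching with replacement) is an illustrative assumption, not a method taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one confounder x, binary treatment t, outcome y with true effect 2.
n = 200
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(int)
y = 2.0 * t + x + rng.normal(scale=0.5, size=n)
x_obs = x.copy()
miss = rng.random(n) < 0.2          # 20% of the covariate goes missing
x_obs[miss] = np.nan

def impute(x_obs, rng):
    """One stochastic imputation: draw missing values from a normal fitted to
    the observed entries (a crude stand-in for a full imputation model)."""
    out = x_obs.copy()
    obs = out[~np.isnan(out)]
    out[np.isnan(out)] = rng.normal(obs.mean(), obs.std(), np.isnan(out).sum())
    return out

def propensity(x, t, iters=500, lr=0.1):
    """Propensity scores from a logistic regression of t on x (gradient ascent)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
        b0 += lr * np.mean(t - p)
        b1 += lr * np.mean((t - p) * x)
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def matched_effect(ps, t, y):
    """1:1 nearest-neighbour matching on the propensity score, with replacement."""
    treated = np.where(t == 1)[0]
    controls = np.where(t == 0)[0]
    nearest = controls[np.argmin(np.abs(ps[treated][:, None] - ps[controls][None, :]), axis=1)]
    return y[treated].mean() - y[nearest].mean()

M = 5
imputations = [impute(x_obs, rng) for _ in range(M)]

# Approach 1: average the propensity scores across imputations, then match once.
ps_avg = np.mean([propensity(xi, t) for xi in imputations], axis=0)
effect_across = matched_effect(ps_avg, t, y)

# Approach 2: match and estimate within each imputation, then average the estimates.
effect_within = np.mean([matched_effect(propensity(xi, t), t, y) for xi in imputations])
```

Both estimators target the same treatment effect; how their bias and variance depend on the number of imputations M is exactly the comparison studied in the thesis.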
I investigate the properties of both approaches for different numbers of imputations, with a focus on bias-variance trade-offs.

The final chapter of my thesis develops an approach to Bayesian model selection in generalized linear models with missing covariate data. Stochastic search variable selection (SSVS) offers an efficient way to search the model space and make posterior inferences simultaneously using an MCMC algorithm. When the covariates contain missing values, SSVS cannot be applied directly. I develop an SSVS algorithm that handles missing covariate data: I place a joint distribution on the covariates using a sequence of generalized linear models and use data augmentation techniques to impute missing values within the SSVS algorithm. In addition, I incorporate model uncertainty into the distribution of the missing data, which results in a two-level SSVS algorithm.
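A minimal sketch of SSVS combined with a data-augmentation step for a missing covariate. It uses a Gaussian linear model with continuous spike-and-slab priors rather than the full generalized linear model setting of the thesis, and all specifics (prior variances, fixed residual variance, a single partially observed covariate with a standard normal prior) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 3 covariates, only the first is active; x0 is partially observed.
n, p = 150, 3
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < 0.15
X_obs = X.copy()
X_obs[miss, 0] = np.nan

tau0, tau1 = 0.01, 10.0      # "spike" and "slab" prior variances for each beta_j
sigma2 = 0.25                # residual variance, held fixed for simplicity
n_iter, burn = 1000, 200

gamma = np.ones(p, dtype=int)                 # inclusion indicators
beta = np.zeros(p)                            # regression coefficients
Xc = np.where(np.isnan(X_obs), 0.0, X_obs)    # current completed covariate matrix
draws = np.zeros((n_iter - burn, p))

def log_marginal(xj, r, tau):
    """Log marginal likelihood of the partial residual r under beta_j ~ N(0, tau),
    up to terms that cancel between spike and slab; also returns the conditional
    posterior mean and variance of beta_j."""
    v = 1.0 / (xj @ xj / sigma2 + 1.0 / tau)
    m = v * (xj @ r) / sigma2
    return 0.5 * (np.log(v) - np.log(tau) + m * m / v), m, v

for it in range(n_iter):
    # Data augmentation: impute missing x0 from its full conditional, combining
    # the N(0, 1) covariate prior with the regression likelihood.
    resid = y - Xc @ beta + Xc[:, 0] * beta[0]      # y minus the other terms
    prec = 1.0 + beta[0] ** 2 / sigma2
    mean = (beta[0] * resid / sigma2) / prec
    proposal = mean + rng.normal(size=n) / np.sqrt(prec)
    Xc[miss, 0] = proposal[miss]

    # SSVS step: update each (gamma_j, beta_j) from its full conditional.
    for j in range(p):
        r = y - Xc @ beta + Xc[:, j] * beta[j]      # partial residual for x_j
        xj = Xc[:, j]
        lo, m0, v0 = log_marginal(xj, r, tau0)
        hi, m1, v1 = log_marginal(xj, r, tau1)
        p_slab = 1.0 / (1.0 + np.exp(np.clip(lo - hi, -50.0, 50.0)))
        gamma[j] = int(rng.random() < p_slab)       # P(gamma_j = 1) = 0.5 a priori
        m, v = (m1, v1) if gamma[j] else (m0, v0)
        beta[j] = m + rng.normal() * np.sqrt(v)
    if it >= burn:
        draws[it - burn] = gamma

incl_prob = draws.mean(axis=0)   # posterior inclusion probabilities
```

The averaged indicator draws give posterior inclusion probabilities for each covariate, so model search and imputation proceed within a single MCMC run; the thesis's two-level algorithm additionally places model uncertainty on the covariate distribution itself, which this sketch omits.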