Lan "Sophia" Wei

Graduation Year: 2016

Employment Info

Director of Marketing Science
In4mation Insights
2016 - Present

Dissertation

Methods for Imputing Missing Values and Synthesizing Confidential Values for Continuous and Magnitude Data

Continuous variables are among the major data types collected by survey organizations. They can be incomplete, so that data collectors need to fill in the missing values, or they can contain sensitive information that needs to be aggregated so that respondents cannot be re-identified. In this thesis, I present novel multiple imputation (MI) methods that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data. The first method limits the disclosure risk of continuous microdata whose marginal sums are fixed. The motivation comes from magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution, so that the marginals of the synthetic data are guaranteed to sum to the original totals. I also present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. An illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable. The second method releases synthetic continuous microdata using interval-protected MI. Typically, MI fits a synthesis model directly on the confidential values and then generates multiple synthetic datasets from the model. Its disclosure risk can therefore be high, especially when the original data contain extreme values. Taking a new perspective, I present MI approaches conditioned on protective intervals. The basic idea is to estimate the parameters of the synthesis model from these intervals and/or to draw the synthetic values from truncated distributions. The results of simple simulation studies are encouraging, suggesting the potential of interval-protected MI for limiting the posterior disclosure risk of continuous microdata. The third method imputes missing values in continuous and categorical variables.
It extends a hierarchically coupled mixture model with local dependence. The new method, however, separates the variables into non-focused (e.g., almost fully observed) and focused (e.g., heavily missing) ones. The sub-model structure of the focused variables is more complex than that of the non-focused ones, with their cluster indicators linked together by a tensor factorization. In addition, the focused continuous variables depend locally on non-focused values. This model property suggests that moving strongly associated non-focused variables to the focused side can help improve estimation accuracy, which is examined in several simulation studies. The method is applied to data from the American Community Survey.