Christine N Kohnen
Using Multiply Imputed, Synthetic Data to Facilitate Data Sharing
The collection of data by statistical agencies and other statistical organizations for internal use and public release is a complex process. Researchers and policy makers demand high-quality public-use data, while agency concerns regarding confidentiality and respondent protection limit the information that can be released. Even the sharing of data between statistical agencies cannot be done without first protecting the data in question. Advances in computer technology pose a threat to data confidentiality because data intruders are equipped with tools and resources that can be used to link public records with released data. Therefore, to limit disclosures, agencies apply disclosure control techniques to their data prior to release to ensure that respondent information is protected. However, the application of such techniques reduces the utility of the released data. The requirements of agencies to safeguard their data from disclosures also limit their ability to share and exchange unperturbed data with one another. Even in situations where agencies wish to cooperate honestly and the exchange of data would benefit both the agencies and the researchers who study public-use data, data sharing is limited. One approach agencies can use to share their data safely, and to create public-use data in the process, is to exchange synthetic data rather than real data. If the agencies have mutual interests, then it may be advantageous for them to create a combined data set that is accessible to all contributing agencies. This combined data set would allow agencies and public-use data users to incorporate more records or attributes into their analyses than were previously available from the individual data sources. To facilitate the sharing of confidential data between agencies, synthetic data methods are used to create multiply imputed, synthetic data sets that can be shared among participating agencies.
Inferential methods for combining data sets from multiple sources are derived and then validated through simulation studies that use several different analysis models. Implementing the proposed data sharing methods on real data requires creativity and a thorough understanding of the data in order to maintain both the overall structure of the data and the underlying relationships.