Data Confidentiality

Many agencies collect data that they intend to share with others. However, releasing the data as collected might reveal data subjects' identities or sensitive attributes. Simple actions like stripping names and addresses may not sufficiently protect confidentiality when the data contain other variables, such as demographic variables or employment and education histories, that ill-intentioned users can match to external data files. Thus, agencies need sophisticated methods to facilitate safe data sharing and dissemination. Faculty members in the department work extensively on the theory and application of such methods, including (i) synthetic data, in which original data values at high risk of disclosure are replaced with values simulated from probability distributions specified to reproduce as many of the relationships in the original data as possible, (ii) metrics that agencies can use to quantify the risk of disclosure and the potential loss of information due to confidentiality protection procedures, and (iii) methods of secure data analysis that offer users output from statistical models without allowing them to see the actual data. Application areas include government databases, such as data from the U.S. Census Bureau, and health data.
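
As a rough illustration of approach (i), the sketch below replaces high-risk values of one variable with draws from a model fitted to the original data. Everything in it is an assumption made for illustration: the variables (age, income), the simulated records, the rule that flags the top 5% of incomes as high risk, and the function name synthesize_partial. It also uses plug-in parameter estimates from a simple linear regression, which is a simplification; methods used in practice typically also account for uncertainty in the model parameters and use richer models.

```python
import random
from statistics import mean


def synthesize_partial(x, y, at_risk, rng):
    """Return a copy of y in which flagged values are replaced by draws
    from a fitted simple linear regression of y on x (plug-in estimates)."""
    n = len(y)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard deviation with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma = (rss / (n - 2)) ** 0.5
    # Keep unflagged values; simulate flagged ones from the fitted model.
    return [
        rng.gauss(intercept + slope * xi, sigma) if flagged else yi
        for xi, yi, flagged in zip(x, y, at_risk)
    ]


# Hypothetical data: 500 simulated records, not real agency data.
rng = random.Random(0)
n = 500
age = [rng.uniform(20, 65) for _ in range(n)]
income = [10_000 + 800 * a + rng.gauss(0, 5_000) for a in age]

# Illustrative risk rule: treat the 25 largest incomes as high risk.
cutoff = sorted(income)[int(0.95 * n)]
at_risk = [v >= cutoff for v in income]

income_syn = synthesize_partial(age, income, at_risk, rng)
```

Only the flagged incomes change; all other values are released as collected, and the simulated values follow the age and income relationship estimated from the original records.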

Faculty in this Research Area