Data Science Lead, Democratic National Committee, Jan 2019-Present
Record Linkage Methods with Applications to Causal Inference and Election Voting Data
Probabilistic record linkage enables researchers and analysts to combine data from multiple sources for statistical analysis, whether the goal is to answer causal questions, to predict future outcomes, or to provide descriptive statistics. In this dissertation, I develop methodology for probabilistic record linkage in two settings: general causal inference applications with linked data, and identifying previously removed voters in North Carolina who cast provisional ballots in 2016.

In Chapter 2, we develop methodology for causal inference in observational studies when using propensity score subclassification on data constructed with probabilistic record linkage techniques. We focus on scenarios where covariates and binary treatment assignments are in one file and outcomes are in another file, and the goal is to estimate an additive treatment effect by merging the files. We assume that the files can be linked using variables common to both files, e.g., names or birth dates, but that links are subject to errors, e.g., due to reporting errors in the linking variables. We develop methodology for cases where such reporting errors are independent of the other variables in the files. We describe conceptually how linkage errors can affect causal estimates in subclassification contexts. We also present and evaluate several algorithms for deciding which record pairs to use when estimating causal effects. Using simulation studies, we demonstrate that case selection procedures can yield more accurate estimates of treatment effects from linked data than using only cases known to be true links.

In Chapter 3, we introduce a model for Bayesian record linkage and clustered sub-models, which we call BRACS. The model is designed for combining two sets of data in which the comparison distributions for links and non-links differ conditional on attributes observed in one of the files. We use simulation studies to demonstrate that the proposed approach can yield improvements in classifying record pairs as links versus non-links.

In Chapter 4, we apply BRACS to 2016 voting data from North Carolina. We describe the process of provisional voting and the list of provisional voters provided by the North Carolina Board of Elections. We provide background on the North Carolina voter file, of which we use a snapshot from November 2016. We outline the limitations of exact-matching the two files using only the state-provided identifiers. Finally, we use BRACS to link the two files, with and without the state-provided identifiers, in order to estimate the number of removed voters who cast provisional ballots in the November 2016 election in North Carolina.

In Chapter 5, we modify BRACS to relax the assumption of conditionally independent field comparisons, motivated by the correlation between party registration and race in North Carolina. We outline a method for accounting for this correlation, in which we combine two dependent comparison fields into one joint comparison field. We use simulation studies to demonstrate that this can yield improvements in linkage quality, and we also outline when the method may not be appropriate. Finally, we apply this approach to the data from Chapter 4 and re-estimate the number of removed voters with the joint-comparison BRACS.
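
As a rough illustration of the comparison-based scoring that probabilistic record linkage typically builds on, the following Python sketch scores a record pair by summing log-likelihood ratios over comparison fields, and treats party and race agreement as a single joint comparison field rather than two independent ones. It is not the BRACS model itself: it is a simple Fellegi-Sunter-style calculation, and every field name, probability value, and function name (`compare`, `match_weight`, `link_probability`) is a hypothetical chosen for the example.

```python
import math

# Hypothetical probabilities of each comparison level among true links (m)
# and among non-links (u). The values are illustrative, not estimated from data.
M_PROBS = {
    "name": {True: 0.95, False: 0.05},
    "birth_date": {True: 0.90, False: 0.10},
    # Joint field: (party agrees, race agrees) modeled as one four-level comparison.
    "party_race": {(True, True): 0.85, (True, False): 0.05,
                   (False, True): 0.07, (False, False): 0.03},
}
U_PROBS = {
    "name": {True: 0.01, False: 0.99},
    "birth_date": {True: 0.001, False: 0.999},
    "party_race": {(True, True): 0.20, (True, False): 0.15,
                   (False, True): 0.25, (False, False): 0.40},
}


def compare(rec_a, rec_b):
    """Build the comparison vector for a pair of records (dicts)."""
    return {
        "name": rec_a["name"] == rec_b["name"],
        "birth_date": rec_a["birth_date"] == rec_b["birth_date"],
        "party_race": (rec_a["party"] == rec_b["party"],
                       rec_a["race"] == rec_b["race"]),
    }


def match_weight(comparison, m_probs, u_probs):
    """Sum of log-likelihood ratios, assuming the listed fields are
    conditionally independent given link status."""
    return sum(
        math.log(m_probs[field][level] / u_probs[field][level])
        for field, level in comparison.items()
    )


def link_probability(weight, prior_link):
    """Posterior probability of a link given the weight and a prior link rate."""
    prior_log_odds = math.log(prior_link / (1 - prior_link))
    return 1 / (1 + math.exp(-(weight + prior_log_odds)))


voter = {"name": "ana diaz", "birth_date": "1980-01-02", "party": "UNA", "race": "W"}
ballot = {"name": "ana diaz", "birth_date": "1980-01-02", "party": "DEM", "race": "W"}
weight = match_weight(compare(voter, ballot), M_PROBS, U_PROBS)
print(round(link_probability(weight, prior_link=0.01), 4))
```

Folding the two dependent fields into one joint field means the conditional independence assumption only has to hold between the joint field and the remaining fields, at the cost of more probabilities to specify or estimate.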