Kelly R Moran
Instructor in the Department of Marine Science and Conservation
Scientist II, U.S. Department of Energy, Statistical Science Group (beginning January 2021)
Advances in Bayesian Factor Modeling & Scalable Gaussian Process Regression
Thesis title: Advances in Bayesian Factor Modeling and Scalable Gaussian Process Regression
Author: Kelly R. Moran
Abstract
Correlated measurements arise across a diverse array of disciplines such as epidemiology, toxicology, genomics, economics, and meteorology. Factor models describe the association between variables by assuming that a small number of latent factors drive the structured variation among them. Gaussian process (GP) models, on the other hand, describe the association between variables using a distance-based covariance kernel. This dissertation introduces two novel extensions of Bayesian factor models driven by applied problems, and then proposes an algorithm to allow for scalable approximate Bayesian GP sampling. First, the FActor Regression for Verbal Autopsy (FARVA) model is developed for using verbal autopsies to predict the cause of death (COD) for individuals and the cause-specific mortality fraction for populations in low-resource settings. Both the mean of the symptoms and the association between them provide information that helps differentiate decedents across COD groups. This class of hierarchical factor regression models avoids the restrictive independence assumptions of standard methods, allows both the mean and covariance to vary with COD category, and can include covariate information on the decedent, region, or events surrounding death. Next, the Bayesian partially Supervised Sparse and Smooth Factor Analysis (BS3FA) model is developed to enable toxicologists, who are faced with a rising tide of chemicals under regulation and in use, to choose which chemicals to prioritize for screening and to predict the toxicity of as-yet-unscreened chemicals based on their molecular structure. Latent factors driving structured variability are assumed to be shared between the molecular structure and dose-response observations from high-throughput screening.
These shared latent factors allow the model to learn a distance between chemicals targeted to toxicity, rather than one based on molecular structure alone. Finally, the Fast Increased Fidelity Approximate GP (FIFA-GP) allows for the association between observations to be modeled by a high-fidelity Gaussian process approximation even when the number of observations is on the order of 10^5. A sampling algorithm that scales at O(n log^2 n) time is described, and a proof showing that the approximation's Kullback-Leibler divergence to the true posterior can be made arbitrarily small is provided.
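
The abstract contrasts two ways of describing association between variables: latent factors and a distance-based covariance kernel. The Python sketch below illustrates that contrast in its simplest textbook form. It is not the FARVA, BS3FA, or FIFA-GP model; the function names, the squared-exponential kernel, and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np


def factor_covariance(loadings, noise_var):
    """Covariance implied by x = loadings @ eta + eps,
    with eta ~ N(0, I) and eps ~ N(0, diag(noise_var))."""
    loadings = np.asarray(loadings, dtype=float)
    return loadings @ loadings.T + np.diag(noise_var)


def sq_exp_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between 1-D inputs x
    (an assumed kernel; the abstract does not name one)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]  # pairwise differences
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


rng = np.random.default_rng(0)

# Factor model: 5 observed variables driven by 2 latent factors.
Lambda = rng.normal(size=(5, 2))          # hypothetical loadings
Sigma_factor = factor_covariance(Lambda, np.full(5, 0.1))

# GP: covariance decays with the distance between input locations.
locations = np.linspace(0.0, 1.0, 5)      # hypothetical inputs
K = sq_exp_kernel(locations, lengthscale=0.3)
```

In the factor covariance, dependence comes entirely from the low-rank term `Lambda @ Lambda.T`; in the kernel matrix, it comes from how close two inputs are, so `K[0, 1]` exceeds `K[0, 4]`.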