Assistant Professor, Department of Biostatistics, Brown University School of Public Health, 2017 - Present
Bayesian Kernel Models for Statistical Genetics and Cancer Genomics
Abstract: The main contribution of this thesis is to examine the utility of kernel regression approaches and variance component models for solving complex problems in statistical genetics and molecular biology. Methods of both types have already been applied to similar biological problems. Kernel regression models have a long history in statistics, applied mathematics, and machine learning, and, more recently, variance component models have been used extensively to study the genetic basis of phenotypic variation. However, because of large combinatorial search spaces and other confounding factors, many current methods face enormous computational challenges and often suffer from low statistical power, particularly when phenotypic variation is driven by complicated underlying genetic architectures (e.g. the presence of epistatic effects involving higher-order genetic interactions). This thesis highlights two novel methods that address the statistical and computational hurdles posed by complex biological data sets. The first is a Bayesian nonparametric statistical framework that allows for efficient variable selection in nonlinear regression, which we refer to as "Bayesian approximate kernel regression", or BAKR. The second is a novel algorithm for identifying genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact. We refer to this method as the "MArginal ePIstasis Test", or MAPIT. Here, we develop the theory of these two approaches and demonstrate their power, interpretability, and computational efficiency for analyzing complex phenotypes.
We also illustrate their ability to facilitate novel biological discoveries in several real data sets, each representing a particular class of analyses: genome-wide association studies (GWASs), molecular trait quantitative trait loci (QTL) mapping studies, and cancer biology association studies. Lastly, we explore the potential of these approaches in radiogenomics, an emerging subfield of genetics and genomics that studies correlations between imaging or network features and genetic variation.