Statistical Computation For Model Space Exploration In High-Dimensional Problems
The increasing dimension of data sets, and of the resulting parameter spaces in modern statistical models, raises demands for new methods of statistical computation that are scalable and efficient. Model space exploration, in particular, is an increasingly important and challenging area. This dissertation focuses on graphical and regression model space exploration arising in statistical models for high-dimensional data.

In contrast to traditional graphical model space exploration algorithms, which explore the graphical model of all variables, this dissertation develops and evaluates an innovative concept: local graphical model search. Local graphical model search algorithms apply to problems where we are interested in a single targeted variable, for example a single gene, Y, among thousands of genes in gene expression data, and wish to understand the graphical structure of Y and its neighborhood. The usual (global) graphical model search methods are neither efficient nor precise in such problems. To implement local graphical model search, this dissertation employs stochastic search algorithms subject to restrictions on the model space, and develops a novel Metropolis-Hastings method referred to as targeted Metropolis-Hastings (TMH). TMH is empirically compared with the usual Metropolis-Hastings (UMH) algorithm in terms of local convergence and the convergence of the stationary "local edge" inclusion distributions. The performance of the methods developed herein is tested in simulation studies and on high-dimensional cardiovascular genomics data.

Variable selection in generalized linear models with many candidate covariates is a very challenging problem that has been widely studied across many applications. Because current stochastic regression model search algorithms rely on conjugacy, they are not appropriate for generalized linear models without the use of approximation methods for the marginal likelihood.
This dissertation studies two possible marginal likelihood approximation methods: variational Bayes and the Laplace approximation. These methods are compared in simulation studies and then applied to the problem of predicting conception using data on the timing of intercourse in the menstrual cycle.

The final topic of this dissertation concerns large-scale modeling of high-dimensional data in a problem of forecasting click events with content match data in computational advertising. This challenging problem of modeling and computation arises generally in internet advertising, and the study discussed in this dissertation is part of a collaboration with Yahoo! Research. In models that reflect the hierarchy of the high-dimensional data structure, Kalman filtering and Expectation-Maximization algorithms provide scalability without much loss of precision, yielding relevant, applied computational approaches. Studies using both simulated and real "content match" data sets demonstrate the feasibility, utility and efficacy of the developed approach.
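
To give a feel for the Kalman filtering step mentioned above, here is a minimal sketch in Python. It is not the dissertation's hierarchical model. It assumes the simplest possible setting, a scalar local-level model with known variances, and the function name `kalman_filter_local_level` and its parameters are hypothetical names chosen for illustration. Each observation is processed once with constant work, which is the property that makes this kind of recursion attractive for large data streams.

```python
import math

def kalman_filter_local_level(ys, obs_var, state_var, m0=0.0, c0=1e6):
    """Kalman filter for an assumed scalar local-level model (illustrative only):

        y_t     = theta_t + v_t,        v_t ~ N(0, obs_var)
        theta_t = theta_{t-1} + w_t,    w_t ~ N(0, state_var)

    m0 and c0 are the prior mean and variance of the initial state.
    Returns the filtered means, filtered variances, and the log-likelihood.
    """
    m, c = m0, c0
    means, variances = [], []
    loglik = 0.0
    for y in ys:
        # Predict: the random-walk evolution inflates the state variance.
        a = m
        r = c + state_var
        # One-step-ahead forecast of the observation and its variance.
        f = a
        q = r + obs_var
        e = y - f  # forecast error
        loglik += -0.5 * (math.log(2.0 * math.pi * q) + e * e / q)
        # Update: the Kalman gain weights the forecast error.
        k = r / q
        m = a + k * e
        c = r * (1.0 - k)
        means.append(m)
        variances.append(c)
    return means, variances, loglik


if __name__ == "__main__":
    # Hypothetical toy series, not data from the dissertation.
    ys = [1.2, 0.9, 1.4, 1.1, 1.6, 1.3]
    means, variances, loglik = kalman_filter_local_level(ys, obs_var=1.0, state_var=0.1)
    print(means[-1], variances[-1], loglik)
```

The returned log-likelihood is the quantity one would typically maximize over `obs_var` and `state_var`. An Expectation-Maximization scheme is one way to do that, although this sketch treats the variances as fixed inputs.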