Assistant Professor, Department of StatisticsTexas A&M University
Bayesian Semi-Parametric Factor Models for High-Dimensional Data
Identifying a lower-dimensional latent space for representation of high-dimensional observations is of significant importance in numerous biomedical and machine learning applications. In many such applications, it is now routine to collect data where the dimensionality of the outcomes is comparable or even larger than the number of available observations. Motivated in particular by the problem of predicting the risk of impending diseases from massive gene expression and single nucleotide polymorphism profiles, this dissertation focuses on building parsimonious models and computational schemes for high-dimensional continuous and unordered categorical data, while also studying theoretical properties of the proposed methods. Sparse factor modeling is fast becoming a standard tool for parsimonious modeling of such massive dimensional data and the content of this thesis is specifically directed towards methodological and theoretical developments in Bayesian sparse factor models. The first three chapters of the thesis studies sparse factor models for high- dimensional continuous data. A class of shrinkage priors on factor loadings are introduced with attractive computational properties, with operating characteristics explored through a number of simulated and real data examples. In spite of the methodological advances over the past decade, theoretical justifications in high-dimensional factor models are scarce in the Bayesian literature. Part of the dissertation focuses on exploring estimation of high-dimensional covariance matrices using a factor model and studying the rate of posterior contraction as both the sample size & dimensionality increases. To relax the usual assumption of a linear relationship among the latent and observed variables in a standard factor model, extensions to a non-linear latent factor model are also considered. Although Gaussian latent factor models are routinely used for modeling of dependence in continuous, binary and ordered categorical data, it leads to challenging computation and complex modeling structures for unordered categorical variables. As an alternative, a novel class of simplex factor models for massive-dimensional and enormously sparse contingency table data is proposed in the second part of the thesis. An efficient MCMC scheme is developed for posterior computation and the methods are applied to modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features. Building on a connection between the proposed model & sparse tensor decompositions, we propose new classes of nonparametric Bayesian models for testing associations between a high dimensional vector of genetic markers and a phenotypical outcome.