Senior Staff Data Scientist & Director, Alibaba Group
Nonparametric Bayes Models for High-Dimensional and Sparse Data
Research has evolved at a dramatic rate in the past decade, with improvements in technology leading to a fundamental shift in the way data are collected and analyzed. It has become routine to collect large amounts of information, and it has become necessary to consider new statistical paradigms that perform well in characterizing complex data from a broad variety of problems. In this dissertation, we develop novel nonparametric Bayes models for high-dimensional and sparse data. Bayesian nonparametric methods are useful for modeling data without having to define the complexity of the entire model a priori; instead, this complexity is determined by the data. The flexibility of Bayesian nonparametric priors arises from the prior's definition over an infinite-dimensional parameter space. Therefore, there are theoretically an infinite number of latent components and an infinite number of latent factors. Nevertheless, draws from each respective prior will produce only a small number of components or factors that appear in a given data set. The number of these components and factors, and their corresponding parameter values, are left for the data to decide. This dissertation is divided into four parts, which motivate novel Bayesian nonparametric methods and illustrate their utility:
Chapter 1: We review the Dirichlet process (DP) in detail. There are many other approaches to nonparametric modeling, but owing to the availability of efficient computation and a well-developed theory, the DP is the most popular and has been developed and studied extensively. We also review the most recent developments of the DP in this chapter.
Chapter 2: We propose the multiple Bayesian elastic net (MBEN), a new regularization and variable selection method. High-dimensional and highly correlated data are commonplace.
In such situations, maximum likelihood procedures typically fail: their estimates are unstable and have large variance. To address this problem, a number of shrinkage methods have been proposed, including ridge regression, the lasso and the elastic net; these methods encourage coefficients to be near zero (in fact, the lasso and the elastic net perform variable selection by forcing some regression coefficients to equal zero). In this paper we describe a semiparametric approach that allows shrinkage to multiple locations, where the location and scale parameters are assigned Dirichlet process hyperpriors. The MBEN prior encourages variables to cluster, so that strongly correlated predictors tend to be in or out of the model together. We apply the MBEN prior to a multi-task learning (MTL) problem, using text data from Wikipedia. An efficient MCMC algorithm and an automated Monte Carlo EM algorithm enable fast computation in high dimensions. The methods are applied to Wikipedia data using shared words to predict article links.
Chapter 3: Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint.
An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.
Chapter 4: In studies involving multi-level data structures, problems of data sparsity are often encountered and it becomes necessary to borrow information to improve inferences and predictions. This article is motivated by studies collecting data on different outcomes following congenital heart surgery. If there were sufficient numbers of patients receiving each type of procedure, one could potentially fit a procedure-specific multivariate random effects model to relate the outcomes of surgery to patient predictors while allowing variability among hospitals. However, as there are approximately 150 procedures, many of them conducted on few patients, it is important to borrow information. Allowing variability among hospitals, procedures and outcome types in the regression coefficients relating patient factors to outcomes, we obtain a three-way tensor of regression coefficient vectors. To borrow information in estimating these coefficients, we propose a Bayesian multiway tensor co-clustering model. In particular, the model works by reducing the dimension of the table through separately clustering hospitals, procedures and outcome types. This soft probabilistic clustering proceeds via nonparametric Bayesian latent class models, which favor clustering of dimensions that have similar values for feature vectors. Efficient MCMC and fast approximation approaches are proposed for posterior computation. The methods are illustrated using simulated data, and applied to heart surgery outcome data from a Duke study.
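The abstract notes that a Dirichlet process prior allows infinitely many latent components in principle, while any finite data set occupies only a small number of them, with that number growing with sample size. A minimal sketch of this behavior, using the Chinese restaurant process representation of the DP, is given below. The function name `crp_cluster_sizes`, the concentration value `alpha = 1.0`, and the sample sizes are assumptions chosen for illustration; they are not taken from the dissertation, and this is not the posterior computation used in any of its chapters.

```python
import random


def crp_cluster_sizes(n, alpha, seed=0):
    """Simulate cluster sizes for n observations under a Chinese restaurant
    process with concentration alpha (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of observations currently in cluster k
    for i in range(n):
        # Observation i+1 opens a new cluster with probability alpha / (i + alpha)
        # and joins existing cluster k with probability sizes[k] / (i + alpha).
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
        else:
            u -= alpha
            for k, s in enumerate(sizes):
                if u < s:
                    sizes[k] += 1
                    break
                u -= s
            else:
                # Guard against floating-point round-off at the upper boundary.
                sizes[-1] += 1
    return sizes


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        sizes = crp_cluster_sizes(n, alpha=1.0, seed=1)
        print(f"n={n:>6}  occupied clusters={len(sizes)}")
```

Under this representation the expected number of occupied clusters grows roughly like alpha * log(n), so the count stays far below n even though the prior places no upper bound on it.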