Assistant Professor of StatisticsPennsylvania State University
Bayesian Variable Selection in Clustering and Hierarchical Mixture Modeling
Clustering methods are designed to separate heterogeneous data into groups of similar objects such that objects within a group are similar, and objects in different groups are dissimilar. From the machine learning perspective, clustering can also be viewed as one of the most important topics within the unsupervised learning problem, which involves finding structures in a collection of unlabeled data. Various clustering methods have been developed under different problem contexts. Specifically, high dimensional data has stimulated a high level of interest in combining clustering algorithms and variable selection procedures; large data sets with expanding dimension have provoked an increasing need for relevant, customized clustering algorithms that offer the ability to detect low probability clusters. This dissertation focuses on the model-based Bayesian approach to clustering. I first develop a new Bayesian Expectation-Maximization algorithm in fitting Dirichlet process mixture models and an algorithm to identify clusters under mixture models by aggregating mixture components. These two algorithms are used extensively throughout the dissertation. I then develop the concept and theory of a new variable selection method that is based on an evaluation of subsets of variables for the discriminatory evidence they provide in multivariate mixture modeling. This new approach to discriminative information analysis uses a natural measure of concordance between mixture component densities. The approach is both effective and computationally attractive for routine use in assessing and prioritizing subsets of variables according to their roles in the discrimination of one or more clusters. I demonstrate that the approach is useful for providing an objective basis for including or excluding specific variables in flow cytometry data analysis. These studies demonstrate how ranked sets of such variables can be used to optimize clustering strategies and selectively visualize identified clusters of the data of interest. Next, I create a new approach to Bayesian mixture modeling with large data sets for a specific, important class of problems in biological subtype identification. The context, that of combinatorial encoding in flow cytometry, naturally introduces the hierarchical structure that these new models are designed to incorporate. I describe these novel classes of Bayesian mixture models with hierarchical structures that reflect the underlying problem context. The Bayesian analysis involves structured priors and computations using customized Markov chain Monte Carlo methods for model fitting that exploit a distributed GPU (graphics processing unit) implementation. The hierarchical mixture model is applied in the novel use of automated flow cytometry technology to measure levels of protein markers on thousands to millions of cells. Finally, I develop a new approach to cluster high dimensional data based on Kingman's coalescent tree modeling ideas. Under traditional clustering models, the number of parameters required to construct the model increases exponentially with the number of dimensions. This phenomenon can lead to model overfitting and an enormous computational search challenge. The approach addresses these issues by proposing to learn the data structure in each individual dimension and combining these dimensions in a flexible tree-based model class. The new tree-based mixture model is studied extensively under various simulation studies, under which the model's superiority is reflected compared with traditional mixture models.