Nonparametric Methods for Analysis & Modeling of Complex Multivariate Distributions
ABSTRACT
Modern statistical science is challenged by data sets that grow rapidly in both size and complexity. These data sets are very often multivariate, including, for instance, both continuous and categorical variables. In addition, such data may encode information about multiple data-generating mechanisms. Traditional ``parametric'' statistical models and methods are limited, either in their ability to capture nuances that low-dimensional models cannot generate or because their inferential procedures rely on restrictive assumptions that are rarely met. In this work, we present three novel nonparametric methods that tackle different challenges posed by large and complex multivariate data sets.
Our first contribution introduces a scalable method to test the independence between two random vectors. It breaks the task down into simple univariate tests of independence, transforming the inference task into a multiple testing problem that can be completed with nearly linear complexity in the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures to examine the sample space more effectively. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. Through an extensive simulation study, we demonstrate the substantial computational advantage of the procedure over existing approaches, as well as its decent statistical power under various dependency scenarios. We illustrate how the divide-and-conquer nature of the procedure can be exploited not only to test independence but also to learn the nature of the underlying dependency.
Our second method is motivated by the task of classification and calibration of flow cytometry observations. An important step in comparative analyses of multi-sample flow cytometry data is cross-sample calibration, whose goal is to align cell subsets across multiple samples in the presence of variations in their locations, so that variation due to technical reasons is minimized and true biological variation can be meaningfully compared. We introduce a Bayesian nonparametric hierarchical modeling approach that accomplishes calibration and cell classification jointly in a unified probabilistic manner. Three important features of our method make it particularly effective for analyzing multi-sample flow cytometry data: a nonparametric mixture avoids prespecifying the number of cell clusters; hierarchical skew normal kernels allow flexibility in the shapes of the cell subsets and cross-sample variation in their locations; and a ``coarsening'' strategy makes inference robust to small departures from the model, a feature that becomes crucial with the massive numbers of observations encountered in flow cytometry data.
Our third method concerns hierarchical modeling of the weights of a Dirichlet Process Mixture. We build on the Hierarchical Dirichlet Process, in which an infinite-parameter mean measure is taken as a Dirichlet Process Mixture and child measures are drawn as Dirichlet Process Mixtures whose base distribution is that mean measure. The Hierarchical Dirichlet Process admits only a scalar dispersion parameter, a formulation that prevents it from capturing structures that may have been generated by different data-generating mechanisms. Our approach is based on mixing over latent classes of Hierarchical Dirichlet Processes, where each class corresponds to a certain level of dispersion and a portion of the shared sample space; this allows the variation among multiple distributions to be heterogeneous across that space.
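For reference, the standard two-level Hierarchical Dirichlet Process on which the third method builds is commonly written as follows. The symbols $\gamma$, $\alpha$, $H$, $F$ and the indices $j$, $i$ are notation assumed here for illustration; they do not appear in this abstract, and the proposed latent-class extension is not shown.

```latex
\begin{align*}
G_0 \mid \gamma, H &\sim \mathrm{DP}(\gamma, H), \\
G_j \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0), \qquad j = 1, \dots, J, \\
\theta_{ji} \mid G_j &\sim G_j, \\
x_{ji} \mid \theta_{ji} &\sim F(\theta_{ji}), \qquad i = 1, \dots, n_j .
\end{align*}
```

In this form a single scalar $\alpha$ governs how closely every child measure $G_j$ concentrates around the shared mean measure $G_0$, uniformly over the sample space.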
We demonstrate the strengths of our three methods through extensive simulation studies and case studies that can yield valuable scientific insights.
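The idea behind the first contribution, reducing a test of independence between two random vectors to many simple univariate tests combined by a multiple testing correction, can be illustrated with a deliberately simplified sketch. This is not the coarse-to-fine adaptive procedure developed in this work: it assumes a single median split per coordinate, Fisher's exact test on each resulting 2x2 table, and a Bonferroni correction, and all function names are hypothetical.

```python
import math
import random


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = math.comb(n, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def median_split(values):
    """Indicator of whether each value is at or above the (upper) sample median."""
    med = sorted(values)[len(values) // 2]
    return [v >= med for v in values]


def pairwise_independence_test(xs, ys):
    """Bonferroni-adjusted global p-value for independence of two random vectors.

    xs, ys: equal-length lists of tuples, one tuple per observation.
    Simplifying assumption: each pair of coordinates is tested only once,
    at the coarsest (median-split) resolution.
    """
    pvals = []
    for x_col in zip(*xs):
        bx = median_split(x_col)
        for y_col in zip(*ys):
            by = median_split(y_col)
            a = sum(1 for u, v in zip(bx, by) if u and v)
            b = sum(1 for u, v in zip(bx, by) if u and not v)
            c = sum(1 for u, v in zip(bx, by) if not u and v)
            d = len(bx) - a - b - c
            pvals.append(fisher_exact_2x2(a, b, c, d))
    # Bonferroni: scale the smallest p-value by the number of tests.
    return min(1.0, len(pvals) * min(pvals))


# Simulated example: the first coordinate of y depends strongly on that of x.
random.seed(0)
x = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
y = [(xi[0] + 0.1 * random.gauss(0, 1), random.gauss(0, 1)) for xi in x]
print("adjusted p-value:", pairwise_independence_test(x, y))
```

Each 2x2 test costs time linear in the sample size once the split is computed, and the individual p-values indicate which pairs of coordinates drive the dependence. A single coarse split can miss dependence that only appears at finer resolutions, which is the kind of limitation a multi-resolution procedure is meant to address.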