Bayesian Conditional Density Filtering for Big Data

Authors: 
Rajarshi Guhaniyogi, Shaan Qamar, David Dunson
Duke University

October 31, 2013

We propose a novel Conditional Density Filtering (C-DF) algorithm for efficient online Bayesian inference. Parameter sampling in massive or streaming data settings typically proceeds by updating MCMC transition kernels as a function of the data or its sufficient statistics (SS) over time. In high-dimensional parameter spaces and other settings, sufficient statistics are either unavailable or computationally onerous to propagate. C-DF is a powerful adaptation of Gibbs sampling that enables online inference in such settings. We propose tracking conditional sufficient statistics (CSS), namely summaries of the observed data combined with consistent estimators of a subset of model parameters; this eliminates the need to store the entire dataset, which is prohibitive for large data. When both SS and CSS exist, propagating CSS can substantially reduce data storage without sacrificing inferential power. Furthermore, draws from C-DF are shown to have the correct joint posterior distribution asymptotically.
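To fix ideas, here is a minimal sketch of a C-DF-style update for a toy stream of N(mu, 1/tau) observations with conjugate Normal and Gamma priors. This is not the paper's reference implementation; the priors, batch size, and number of draws are illustrative assumptions. At each time point, the CSS for tau's full conditional is updated by plugging in the current estimate mu_hat, after which Gibbs draws are taken from the resulting approximate full conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stream: N(mu, 1/tau) observations arriving in batches
# (hypothetical example; the paper's illustrations use richer models).
mu_true, tau_true = 2.0, 4.0

# Priors: mu ~ N(0, 1/kappa0), tau ~ Gamma(a0, b0) (shape/rate).
kappa0, a0, b0 = 1.0, 1.0, 1.0

# Conditional sufficient statistics (CSS), propagated instead of raw data:
#   n   - running sample size
#   sx  - running sum of observations (enters mu's full conditional)
#   ssq - running sum of (x - mu_hat)^2, plugging in the current estimate
#         mu_hat (enters tau's full conditional; this is the C-DF step)
n, sx, ssq = 0.0, 0.0, 0.0
mu_hat, tau_hat = 0.0, 1.0  # current plug-in estimates
S = 200                     # Gibbs draws per time point

for t in range(50):
    x = rng.normal(mu_true, 1.0 / np.sqrt(tau_true), size=20)

    # 1) Update CSS with the new batch, conditioning on current estimates.
    n += x.size
    sx += x.sum()
    ssq += np.sum((x - mu_hat) ** 2)

    # 2) Draw from the approximate full conditionals given the CSS.
    mu_draws, tau_draws = np.empty(S), np.empty(S)
    mu, tau = mu_hat, tau_hat
    for s in range(S):
        prec = kappa0 + tau * n
        mu = rng.normal(tau * sx / prec, 1.0 / np.sqrt(prec))
        tau = rng.gamma(a0 + n / 2.0, 1.0 / (b0 + ssq / 2.0))
        mu_draws[s], tau_draws[s] = mu, tau

    # 3) Refresh the plug-in estimates used in the next CSS update.
    mu_hat, tau_hat = mu_draws.mean(), tau_draws.mean()

print(f"mu_hat  = {mu_hat:.3f} (truth {mu_true})")
print(f"tau_hat = {tau_hat:.3f} (truth {tau_true})")
```

Note that the raw observations are discarded after each batch; only the fixed-dimensional triple (n, sx, ssq) is carried forward, which is the storage saving the abstract describes.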

The C-DF algorithm is illustrated through several motivating examples. In Section 5, we develop a novel framework for factor regression, with application to compressed regression in high-dimensional streaming data. Using C-DF for parameter learning in our Bayesian model, we achieve near state-of-the-art performance compared to "batch" implementations of the Lasso, while vastly outperforming competing Bayesian shrinkage methods. Comparison metrics include prediction error, variable selection and support recovery, and computation time.
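As a rough illustration of the compressed-regression idea referenced above, the sketch below projects streaming high-dimensional predictors through a fixed random matrix and accumulates sufficient statistics in the compressed space. It is a hypothetical stand-in, not the paper's factor-regression model: the Gaussian projection, the ridge point estimate, and all dimensions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 2000, 50  # ambient and compressed predictor dimensions (illustrative)

# Fixed random projection; Gaussian entries scaled by 1/sqrt(m) is one
# common choice (an assumption here, not the paper's prescription).
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, p))

# Streaming accumulators in the compressed space: only these m x m and
# m-vector summaries are stored, never the raw high-dimensional data.
ZtZ = np.zeros((m, m))
Zty = np.zeros(m)

beta_true = np.zeros(p)
beta_true[:10] = 3.0  # sparse truth for the toy stream

for t in range(500):
    x = rng.normal(size=p)
    y = x @ beta_true + rng.normal()
    z = Phi @ x  # compress p -> m before any learning
    ZtZ += np.outer(z, z)
    Zty += z * y

# Ridge-style point estimate in the compressed space -- a crude stand-in
# for the Bayesian factor-regression posterior developed in the paper.
beta_z = np.linalg.solve(ZtZ + np.eye(m), Zty)

x_new = rng.normal(size=p)
y_hat = (Phi @ x_new) @ beta_z
print(f"predicted {y_hat:.2f} vs truth {x_new @ beta_true:.2f}")
```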

Keywords: 

Compressed regression, Dimension reduction, Density filtering, Bayesian updating, Approximate MCMC

Manuscript: 

2013-06.pdf