Instructor in Medicine, Channing Library, Brigham and Women's Hospital, Harvard Medical School
Bayesian Function Estimation Using Overcomplete Dictionaries with Application in Genomics
In this dissertation we present a Bayesian approach for nonparametric function estimation based on continuous wavelet dictionaries, where the unknown function is modeled by a sum of wavelet functions at arbitrary locations and scales. By avoiding the dyadic constraints of orthonormal wavelet bases, the continuous wavelet dictionaries have greater flexibility to adapt to the structure of the data, and lead to sparser representations. The price for this flexibility is the computational challenge of searching efficiently over an infinite number of potential dictionary elements. We develop a reversible jump Markov chain Monte Carlo algorithm which utilizes local features in the proposal distributions for the addition of new wavelet elements to improve mixing of the Markov chain. By utilizing continuous wavelets, we have the flexibility to handle data with non-equal spacing without resorting to interpolation or imputation of missing data. In Chapter 1 we start with a review of wavelets and function estimation and provide an overview of array Comparative Genomic Hybridization (CGH) and gene expression data. Chapter 2 introduces the continuous wavelet dictionaries. We discuss the basic setting of the model and estimation. We present simulation results using standard wavelet test functions, which show that the new method leads to greater sparsity and improved mean squared error over translation-invariant wavelets, another overcomplete representation. We illustrate the method on non-equally spaced data, and show that the method compares favorably to methods using interpolation or imputation. In Chapters 3 and 4 we present applications with array CGH, a technology used to detect DNA copy number alterations that could help identify the relevant genes for cancer development. This recent technology calls for new statistical methods for analyzing array CGH data.
In Chapter 3 we present a hierarchical model to analyze multiple samples via a functional data analysis approach using the overcomplete dictionaries. The hierarchical model is based on samples grouped according to disease progression and survival status. The posterior probabilities of copy gain/loss are estimated for each gene at the group level. From these estimates, we can also classify new patients and identify the genes relevant to the group differences. We demonstrate the performance of our method using simulated and real data sets. In Chapter 4 we extend our model to analyze gene expression and gene copy number alterations jointly. Both types of data have been linked to cancer development and progression and have been studied extensively to describe the pattern of expression levels and copy number changes in cancer. However, uncovering the genes related to cancer development is still a difficult task, and few studies have analyzed both data types jointly. Here we discuss our model and inference methods for joint analysis of these two genomic measurements. We present results from simulation studies and the breast cancer cell line data published by Hyman et al. (2002). We provide estimates for both gene expression levels and DNA copy numbers, along with the degree to which the two types of data are associated. We identify a subset of genes for which the expression levels are most likely attributable to gene copy number alterations across the samples, including some of the oncogenes that were previously associated with breast cancer and some new targets.
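The representation at the core of the abstract, a function written as a sum of wavelets at arbitrary locations and scales, can be illustrated with a minimal sketch. It is not the dissertation's implementation: the Mexican hat wavelet, the 1/sqrt(scale) normalization, and the names `mexican_hat` and `dictionary_expansion` are assumptions made for illustration, and the coefficients, locations, and scales are fixed by hand rather than sampled by reversible jump MCMC.

```python
import numpy as np


def mexican_hat(u):
    """Mexican hat wavelet, up to a normalizing constant (illustrative choice)."""
    return (1.0 - u**2) * np.exp(-0.5 * u**2)


def dictionary_expansion(x, coefs, locations, scales):
    """Evaluate f(x) = sum_k beta_k * psi((x - b_k) / a_k) / sqrt(a_k).

    Locations b_k and scales a_k are arbitrary real numbers; they are not
    restricted to a dyadic grid. The 1/sqrt(a_k) factor is an assumed
    normalization, not one taken from the dissertation.
    """
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for beta, b, a in zip(coefs, locations, scales):
        total += beta * mexican_hat((x - b) / a) / np.sqrt(a)
    return total


# Non-equally spaced design points: the expansion is evaluated directly at
# the observed locations, with no interpolation onto a regular grid.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=50))

# Two hypothetical dictionary elements with hand-picked parameters.
f = dictionary_expansion(
    x, coefs=[2.0, -1.0], locations=[0.3, 0.71], scales=[0.05, 0.2]
)
```

Because each element is evaluated at whatever points are supplied, the same function applies unchanged to irregular designs, which is the property the abstract highlights for non-equally spaced data.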