Bayesian Models and Machine Learning with Gene Expression Analysis Applications
The present thesis is divided into two major parts. The first part focuses on developing model-based estimates for gene expression indices in the Bayesian framework. In the application of oligonucleotide expression array technology, reliable estimation of expression indices is critical for "high-level analysis" such as classification, clustering and regulatory network exploration. A statistical model (Li and Wong, 2001a) has been proposed to develop model-based estimates for gene expression indices and outlier detection. Chapter 1 illustrates an extension of the model in the Bayesian framework. Proper constraints on model parameters, heavy-tail distributions for noise, and mixture priors are introduced with the help of Gibbs sampling. Our model is applied to both artificial probe data and real microarray probe data, with a demonstration that it is more robust and reliable than the original model.

The second part of the thesis concerns novel Bayesian models for the problem of nonlinear regression for prediction. Kernel methods have recently been introduced and have become an increasingly popular tool for various regression, classification and function estimation problems. They exhibit good generalization performance on many real-life problems, and the approach is properly motivated theoretically. After a brief introduction to kernel models and methods in Chapter 2, a new class of Bayesian kernel models is proposed in Chapter 3. First, we derive a novel Bayesian version of radial basis functions (RBFs) by utilizing the Dirichlet process prior on the distribution of location variables. This results in Bayesian kernel models as a special case. To achieve a sparse solution similar to SVMs, we introduce two classes of structured priors for regression parameters: mixture priors with point masses and Student-t priors. Orthogonalized kernel models are introduced to achieve better model mixing and faster computation for problems with large sample sizes n.
The problem of inference on kernel parameters is addressed and a new discrete updating algorithm is proposed. For all the models introduced in this section, we develop both MCMC algorithms for fully Bayesian inference and EM algorithms for MAP estimation. Experimental results on some benchmark data sets show that the performance of our Bayesian kernel models is among the best of current state-of-the-art nonlinear models. Chapter 4 concludes the thesis by outlining directions for future development.
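
To make the idea of MAP estimation in a kernel regression model concrete, the following is a minimal sketch and not the thesis's own algorithm. It assumes a Gaussian RBF kernel with a fixed length scale, Gaussian noise, and a plain Gaussian prior on the regression weights, which is a simpler stand-in for the mixture and Student-t priors described above; with these choices the MAP estimate has a closed form and no EM iterations are needed. All function names, parameter names and default values (`rbf_kernel`, `fit_map_weights`, `length_scale`, `noise_var`, `prior_var`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def rbf_kernel(X, Z, length_scale=1.0):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Z."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def fit_map_weights(X, y, length_scale=1.0, noise_var=0.1, prior_var=1.0):
    """MAP weights w for y = K w + noise, assuming w ~ N(0, prior_var * I).

    The Gaussian prior is an illustrative assumption; it yields the
    closed-form solution (K'K / noise_var + I / prior_var) w = K'y / noise_var.
    """
    K = rbf_kernel(X, X, length_scale)
    A = K.T @ K / noise_var + np.eye(len(X)) / prior_var
    return np.linalg.solve(A, K.T @ y / noise_var)


def predict(X_train, w, X_new, length_scale=1.0):
    """Predict at X_new using kernels centred on the training inputs."""
    return rbf_kernel(X_new, X_train, length_scale) @ w


# Synthetic data: a noisy sine curve (illustrative only, not from the thesis).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

w = fit_map_weights(X, y)
y_hat = predict(X, w, X)
rmse = float(np.sqrt(np.mean((y_hat - np.sin(X[:, 0])) ** 2)))
print("RMSE against the noise-free curve:", rmse)
```

Under a Gaussian prior every training point keeps a nonzero weight. Sparsity-inducing priors such as point-mass mixtures or Student-t priors would instead shrink many weights toward zero, at the cost of an iterative fitting procedure.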