Vinicius D. Mayrink
Associate ProfessorFederal University of Minas Gerais (UFMG)
Factor Models to Describe Linear and Non-linear Structure in High Dimensional Gene Expression Data
An important problem in the analysis of gene expression data is the identification of groups of features that are coherently expressed. For example, one often wishes to know whether a group of genes, clustered because of correlation in one data set, is still highly co-expressed in another data set. For some microarray platforms there are many, relatively short, probes for each gene of interest. In this case, it is possible that a given probe is not measuring its targeted transcript, but rather a different gene with a similar region (called cross-hybridization). Similarly, the incorrect mapping of short nucleotide sequences to a target gene is a common issue related to the young technology producing RNA-Seq data. The expression pattern across samples is a valuable source of information, which can be used to address distinct problems through the application of factor models. Our first study is focused on the identification of the presence/absence status of a gene in a sample. We compare our factor model to "state of the art" detection methods; the results suggest superior performance of the factor analysis for detecting transcripts. In the second study, we apply factor models to investigate gene modules (groups of coherently expressed genes). Variation in the number of copies of regions of the genome is a well known and important feature of most cancers. Copy number alteration is detected for a group of genes in breast cancer; our goal is to examine this abnormality in the same chromosomal region for other types of tumors (Ovarian, Lung and Brain). In the third application, the expression pattern related to RNA-Seq count data is evaluated through a factor model based on the Poisson distribution. Here, the presence/absence of coherent patterns is closely associated with the number of incorrect read mappings. The final study of this dissertation is dedicated to the analysis of multi-factor models with linear and non-linear structure of interactions between latent factors. The interaction terms can have important implications in the model; they represent relationships between genes which cannot be captured in an ordinary analysis.