Arpita Mandan

Student

Graduation Year: 2018

Employment Info

Data Scientist
Akamai Technologies
September 2018-

Master's Thesis

Two Applications of Summary Statistics: Integrating Information Across Genes & Confidence Intervals with Missing Data

Abstract

Gene set enrichment methods are useful for mapping individual genes or proteins to pathways and signatures. We use this approach to study the expression levels of proteins encoded by different genes and to compare individuals who have Alzheimer's disease (AD) with those who are cognitively normal (CN). Different gene sets might show differential enrichment in the two classes. For each gene, a correlation statistic measures how strongly a sample is associated with one class rather than the other. This allows us to find the enrichment score for the sample with respect to an entire gene set, and to analyze the gene sets that are differentially expressed in the two classes. We use the linear model to estimate the correlation statistic, which accounts for the class as well as other covariates such as the age and sex of the individual.

We also study the Jeffreys and Clopper-Pearson intervals for binomial proportions when data are missing, using multiple imputation (MI) to deal with the missing data. Using simulation studies, we compare the MI Wilson, MI Clopper-Pearson, and MI Jeffreys intervals. We show that the MI Wilson interval has the best repeated sampling properties of the three in the case of high missingness. In the case of low missingness, the MI Wilson and MI Clopper-Pearson intervals produce similar empirical coverage rates that are close to the nominal coverage. For a very low value of the binomial proportion, the Jeffreys interval has the largest coverage with the smallest average interval length.
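For readers unfamiliar with the intervals named in the abstract, the following is a minimal sketch of the standard complete-data Wilson score interval for a binomial proportion. It is not the thesis's MI Wilson procedure, which additionally combines estimates across imputed datasets. The function name `wilson_interval` and its parameters are illustrative assumptions, not taken from the thesis.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, conf_level=0.95):
    """Wilson score interval for a binomial proportion (complete data only).

    Hypothetical helper for illustration; the MI variants compared in the
    thesis are not implemented here.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    # Two-sided standard normal quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p_hat = successes / n

    # The Wilson interval shrinks the centre toward 1/2 and rescales the width.
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    lower, upper = wilson_interval(successes=5, n=10)
    print(f"95% Wilson interval for 5/10: ({lower:.4f}, {upper:.4f})")
```

Unlike the simple Wald interval, the Wilson interval stays inside [0, 1] and does not collapse to a single point when the observed proportion is 0 or 1, which is one reason it is often considered for small or extreme proportions.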