Christine Chai

Teaching Assistant

Phone: +1 919 684 4210
Graduation Year: 2017

Employment Info

Mathematical Statistician
U.S. Census Bureau, Decennial Statistical Studies Division
August 2017-Present

Dissertation

Statistical Issues in Quantifying Text Mining Performance

Abstract: Text mining is an emerging field in data science because text information is ubiquitous, but analyzing text data is much more complicated than analyzing numerical data. Topic modeling is a commonly-used approach to classify text documents into topics and identify key words, so the text information of interest is distilled from the large corpus sea. In this dissertation, I investigate various statistical issues in quantifying text mining performance, and Chapter 1 is a brief introduction. Chapter 2 addresses adequate pre-processing for text data. For example, words of the same stem (e.g., "study" and "studied") should be assigned the same token because they share the exact same meaning. In addition, specific phrases such as "New York" and "White House" should be retained because many topic classification models focus exclusively on words. Statistical methods, such as conditional probability and p-values, are used as an objective approach to discover these phrases. Chapter 3 starts the quantification of text mining performance; this measures the improvement of topic modeling results from text pre-processing. Retaining specific phrases increases their distinctivity because the "signal" of the most probable topic becomes stronger (i.e., the maximum probability is higher) than the "signal" generated by either of the two words separately. Therefore, text pre-processing helps recover semantic information at the word level. Chapter 4 quantifies the uncertainty of a widely-used topic model: latent Dirichlet allocation (LDA). A synthetic text dataset was created with known topic proportions, and I tried several methods to determine the appropriate number of topics from the data. Currently, the pre-set number of topics is important to the topic model results because LDA tends to utilize all topics allotted, so that each topic has about equal representation.
Last but not least, Chapter 5 explores a few selected text models as extensions, such as supervised latent Dirichlet allocation (sLDA), survey data application, sentiment analysis, and the infinite Gaussian mixture model.

Chai, YH, Lee, HT, Chan, LY, Lin, YS, and Chai, CP. "Automatic Sterilization Trash-Can with Application of UV-LED and Photocatalytic Oxidation." Northern Taiwan Journal (March 31, 2009). Open Access Copy

Henry, T, Banks, D, Chai, C, and Owens-Oas, D. "Modeling community structure and topics in dynamic text networks." Journal of Classification (submitted). Open Access Copy

Chai, CP, Ruan, HM, Yang, YS, Fan, IA, Chuang, YH, Lei, CL, Huang, CY, Teng, CW, Shaw, YS, and Liang, M. "Lightweight Mutual Authentication Scheme for Advanced Metering Infrastructure." International Conference on Applied and Theoretical Information Systems Research. December 27, 2012 - December 29, 2012. Taipei, Taiwan. December 31, 2012. Open Access Copy

Ruan, HM, Yang, YS, Fan, IA, Chai, CP, Huang, CY, and Lei, CL. "Security Threats in Advanced Metering Infrastructure." Joint Workshop on Information Security. October 5, 2011 - October 6, 2011. Taipei, Taiwan. 2011. Open Access Copy

Avery, RB, Bilinski, MF, Bucks, BK, Chai, C, Critchfield, T, Keith, IH, Mohamed, IE, Pafenberg, FW, Patrabansh, S, Schultz, JD, and Wood, CE. A Profile of 2013 Mortgage Borrowers: Statistics from the National Survey of Mortgage Originations. National Mortgage Database, Federal Housing Finance Agency, March 21, 2017. Open Access Copy

Avery, RB, Bilinski, MF, Bucks, BK, Chai, C, Chow, M, Clement, A, Critchfield, T, Frumkin, S, Keith, IH, Mohamed, IE, Pafenberg, FW, Patrabansh, S, Schultz, JD, and Wood, CE. A Profile of 2014 Mortgage Borrowers: Statistics from the National Survey of Mortgage Originations. National Mortgage Database, Federal Housing Finance Agency, March 21, 2017. Open Access Copy

Avery, RB, Bucks, B, Chai, C, Critchfield, T, Keith, IH, Mohamed, IE, Pafenberg, FW, Patrabansh, S, Schultz, JD, and Wood, CE. A Profile of 2013 Mortgage Borrowers: Statistics from the National Survey of Mortgage Originations. National Mortgage Database, Federal Housing Finance Agency, May 27, 2016. Open Access Copy

Beckman, E, Chai, C, Lyu, J, Mahserejian, S, Tran, H, Yavari, S, Mitchell, H, Calatroni, A, and Kang, EL. Investigating the Relationship Between the Microbiome and Environmental Characteristics (Published online). North Carolina State University, July 31, 2015. Open Access Copy

Chai, CP. Facebook Account Misuse Detection -- A Statistical Approach. Ed. CL Lei. June 30, 2013. (Master's Thesis) Open Access Copy