Machine Learning Seminar

Gary King
Wednesday, March 1, 2017, 3:30pm to 5:00pm

Gross Hall, 330 -- Ahmadieh Family Grand Hall

Title: An Improved Method of Automated Nonparametric Content Analysis for Social Science

Abstract: A vast literature in computer science and statistics develops methods to automatically classify textual documents into chosen categories. In contrast, social scientists are often more interested in aggregate generalizations about populations of documents --- such as the percent of social media posts that speak favorably of a candidate's foreign policy. Unfortunately, trying to maximize the proportion of individual documents correctly classified often yields biased estimates of statistical aggregates. Fortunately, classification is neither a necessary nor always a desirable step in estimating aggregate proportions, as in the widely used nonparametric method developed in King and Lu (2008) and Hopkins and King (2010). In this paper, we first prove the properties of this methodology, develop ways around its weaknesses, and show how to improve its estimates in real applications. We then develop a unified approach to inference about statistical aggregates that uses this approach, along with the best classifiers for extrapolations when language changes over time, to produce better estimates than either method can accomplish alone. We evaluate our approach with analyses of 74 separate data sets. This talk is based on joint work with Connor Jerzak and Anton Strezhnev.
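
To make the abstract's central contrast concrete, here is a minimal toy sketch in Python. It is not the method presented in the talk. It only illustrates the general idea of estimating category proportions directly, by solving P(S) = P(S|D) P(D) for P(D), rather than classifying each document and counting. All numbers and variable names (`p_s_given_d`, `true_p_d`, `train_prior`) are hypothetical and chosen for illustration.

```python
import numpy as np

# Hypothetical P(S | D): rows are three word-profile types S,
# columns are two document categories D. Each column sums to 1.
p_s_given_d = np.array([
    [0.5, 0.1],
    [0.3, 0.4],
    [0.2, 0.5],
])

# Assumed true category proportions in the target population.
true_p_d = np.array([0.8, 0.2])

# Profile distribution we would observe in the unlabeled population.
p_s = p_s_given_d @ true_p_d

# Direct estimate: solve P(S) = P(S|D) P(D) for P(D) by least squares,
# then clip and renormalize so the result is a valid proportion vector.
direct, *_ = np.linalg.lstsq(p_s_given_d, p_s, rcond=None)
direct = np.clip(direct, 0, None)
direct = direct / direct.sum()

# Classify-and-count: label each profile with its most probable category
# under a balanced (50/50) training prior, then add up the labeled mass.
train_prior = np.array([0.5, 0.5])
labels = (p_s_given_d * train_prior).argmax(axis=1)
classify_and_count = np.bincount(labels, weights=p_s, minlength=2)

print("true proportions:  ", true_p_d)
print("direct estimate:   ", direct.round(3))
print("classify-and-count:", classify_and_count.round(3))
```

In this toy setting the direct estimate recovers the assumed proportions, while classify-and-count is pulled toward the balanced training prior. Real applications estimate P(S|D) from a hand-coded sample and face sampling noise that this sketch ignores.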