Preanalysis of Superlarge Industrial Datasets

Authors: 
Giovanni Parmigiani, David L. Banks
Carnegie Mellon University, Duke University

Nov 30 1990

Successful analysis of superlarge datasets requires statistical procedures that automatically clean the data and uncover simple structure. The protocol we describe applies to multivariate industrial data from continuous manufacturing processes with feedback and feedforward control. Our methods form a twelve-step sequence that edits and relags the time series and applies diagnostics to detect subtle data flaws. At different stages, the protocol rejects data, imputes data, relags the time series, flags categories of suspicious data, and divides the dataset into more homogeneous subsets. The output is a cleaned dataset suitable for analysis with standard statistical packages or software tools. Although there is no guarantee that every corruption has been caught and corrected, the output dataset is more thoroughly examined than traditional human-intensive methods can achieve. To assist in this preliminary analysis, we describe four graphical methods developed in studies of glass manufacture at PPG Industries' production plants and sheet aluminum production at Alcoa.
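
The twelve-step protocol itself is given in the manuscript. As a rough illustration of the kind of operation it describes, the sketch below relags one downstream sensor series against an upstream one by choosing the lag that maximizes their cross-correlation. The function name lag_by_xcorr, the cross-correlation criterion, and all parameter values are illustrative assumptions, not the procedure specified in the paper.

    # Illustrative sketch only: align a downstream sensor series to an upstream
    # one by the lag that maximizes their cross-correlation. The criterion and
    # the function are assumptions for illustration, not the paper's procedure.
    import numpy as np

    def lag_by_xcorr(upstream, downstream, max_lag=60):
        """Return the lag (in samples) at which downstream best matches upstream."""
        u = (upstream - upstream.mean()) / upstream.std()
        d = (downstream - downstream.mean()) / downstream.std()
        scores = []
        for k in range(max_lag + 1):
            if k == 0:
                scores.append(np.mean(u * d))
            else:
                # Correlate upstream at time t with downstream at time t + k.
                scores.append(np.mean(u[:-k] * d[k:]))
        return int(np.argmax(scores))

    # Example: a noisy copy of the upstream signal delayed by 12 samples.
    rng = np.random.default_rng(0)
    upstream = rng.normal(size=1000)
    downstream = np.roll(upstream, 12) + 0.1 * rng.normal(size=1000)
    print(lag_by_xcorr(upstream, downstream))  # expected to print 12

In a production setting the estimated lag would then be used to shift the downstream series before the remaining cleaning and diagnostic steps are applied.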

Keywords: 

Imputation, Multivariate Time Series, Graphics, Data Analysis

Manuscript: 

1991-20.pdf