High Dimensional Random Forests Estimation and Inference
Yingying Fan, Professor, USC Marshall
This talk contributes to a fine-grained understanding of the random forests algorithm by discussing its consistency and variable selection properties in a general high-dimensional nonparametric regression setting. Specifically, we derive consistency rates for random forests built with the sample CART splitting criterion, as used in the original version of the algorithm (Breiman, 2001), through a bias-variance decomposition analysis. Our new theoretical results show that random forests can indeed adapt to high dimensionality and allow for discontinuous regression functions. Our bias analysis takes a global approach that characterizes explicitly how the random forests bias depends on the sample size, tree height, and column subsampling parameter, while our variance analysis takes a local approach that bounds the forest variance by bounding the variance of individual trees. A major technical innovation of our work is the sufficient impurity decrease (SID) condition, which makes our bias analysis possible and precise.
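As a rough illustration of the sample CART splitting criterion with column subsampling discussed above, the following Python sketch searches a single node for the split that most reduces the within-node sum of squares. The function name `best_cart_split`, its arguments, and the midpoint threshold convention are assumptions made for this illustration and are not taken from the talk.

```python
import numpy as np


def best_cart_split(X, y, feature_subset):
    """Search one node for the split with the largest sample impurity decrease.

    X: (n, p) array of features; y: (n,) array of responses.
    feature_subset: indices of the columns that may be split on
    (a stand-in for column subsampling).
    Returns (feature, threshold, decrease); feature is None if no split helps.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    parent_sse = np.sum((y - y.mean()) ** 2)
    best = (None, None, 0.0)

    for j in feature_subset:
        order = np.argsort(X[:, j], kind="stable")
        xs, ys = X[order, j], y[order]
        csum = np.cumsum(ys)        # running sums of responses
        csum2 = np.cumsum(ys ** 2)  # running sums of squared responses

        for i in range(1, n):       # left child = first i sorted points
            if xs[i] == xs[i - 1]:
                continue            # cannot split between tied feature values
            left_sse = csum2[i - 1] - csum[i - 1] ** 2 / i
            right_sum = csum[-1] - csum[i - 1]
            right_sse = (csum2[-1] - csum2[i - 1]) - right_sum ** 2 / (n - i)
            # Impurity decrease: drop in sum of squares, averaged over the node.
            decrease = (parent_sse - left_sse - right_sse) / n
            if decrease > best[2]:
                best = (j, (xs[i - 1] + xs[i]) / 2, decrease)
    return best
```

In a full forest this search would be repeated recursively at every node, with a fresh random `feature_subset` each time, and the resulting trees would be averaged. The sketch covers only the single-node step.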
We then quantify the usefulness of individual features in random forests learning, which can greatly enhance the interpretability of the learning outcome. Existing studies have shown that some widely used feature importance measures are biased. In addition, most existing methods lack comprehensive size and power analyses. We approach the problem via hypothesis testing and propose a general framework, the self-normalized feature-residual correlation test (FACT), for evaluating the significance of a given feature. The vanilla version of FACT can itself be biased in the presence of feature dependency, so we exploit the techniques of imbalancing and conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for enhanced power. Through nonasymptotic analyses, we formally establish that FACT provides theoretically justified random forests feature p-values and enjoys appealing power.
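The abstract does not give the exact form of the FACT statistic. The Python sketch below therefore illustrates only the generic idea of a self-normalized correlation between a feature and a set of residuals, with a normal-approximation p-value. It is not the authors' procedure: it omits the imbalancing, conditioning, and ensemble feature transformations, and the function name `self_normalized_corr_test` and the assumption that residuals come from a model fitted without the feature on separate data are hypothetical choices made for illustration.

```python
import math

import numpy as np


def self_normalized_corr_test(feature, residuals):
    """Self-normalized test of zero correlation between a feature and residuals.

    Returns (statistic, p_value), where the two-sided p-value uses a
    standard normal approximation. Illustrative only; not the FACT test.
    """
    x = np.asarray(feature, dtype=float)
    r = np.asarray(residuals, dtype=float)
    n = len(x)
    # Center both series so the mean product estimates their covariance.
    prod = (x - x.mean()) * (r - r.mean())
    # Self-normalization: scale by the sample standard deviation of the
    # products rather than by a separately estimated variance.
    stat = math.sqrt(n) * prod.mean() / prod.std(ddof=1)
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# Usage on synthetic data where the residuals still depend on the feature.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
resid = x + rng.normal(size=500)
stat, p = self_normalized_corr_test(x, resid)
```

A small p-value here would suggest that the feature carries signal not captured by the model that produced the residuals. With dependent features, such a naive statistic can be biased, which is the issue the bias-correction steps described above are meant to address.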