Random Forests for Big Data

Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, Nathalie Villa-Vialaneix

DOI: https://doi.org/10.48550/arXiv.1511.08327

2017-03-22

Abstract: Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one made of real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relates to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.

Machine Learning, Statistics Theory

What problem does this paper attempt to address?

This paper addresses how to scale the Random Forests (RF) algorithm effectively in the context of Big Data. Specifically, it focuses on making Random Forests practical on large-scale datasets through parallel computing environments or online adaptations. The main issues are:

1. **Computational Efficiency**: The traditional Random Forest algorithm takes too long on large datasets to provide results within an acceptable time frame.
2. **Memory Limitations**: Large datasets often exceed the memory capacity of a single computer, necessitating distributed storage and computation.
3. **Algorithm Adaptability**: The method must adapt to the characteristics of Big Data, such as data streams and data heterogeneity, while maintaining the original statistical performance of Random Forests.

The paper reviews several existing methods, including subsampling, parallel implementations, and online learning, and compares them experimentally to highlight their relative performance and limitations when scaling Random Forests in a Big Data environment.
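As a rough illustration of the "divide-and-conquer" idea mentioned in the abstract, the sketch below splits the data into chunks, grows a small forest on each chunk from bootstrap samples, and merges all trees into one ensemble that predicts by majority vote. It is a minimal sketch under stated assumptions, not the authors' implementation: every function name is hypothetical, each tree is a depth-1 stump on one randomly chosen feature (a stand-in for a full CART tree), the toy data are two simulated Gaussian clusters, and the chunks are processed in a plain loop where a real system would use parallel workers.

```python
import random
from collections import Counter


def fit_stump(X, y, rng):
    """Depth-1 tree on one randomly chosen feature (stand-in for a CART tree)."""
    j = rng.randrange(len(X[0]))
    best = None
    for t in sorted({row[j] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[j] <= t]
        right = [lab for row, lab in zip(X, y) if row[j] > t]
        if not left or not right:
            continue
        l_lab, l_cnt = Counter(left).most_common(1)[0]
        r_lab, r_cnt = Counter(right).most_common(1)[0]
        errors = len(y) - l_cnt - r_cnt  # points not in their side's majority class
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # feature is constant in this sample: predict the majority class
        maj = Counter(y).most_common(1)[0][0]
        return (j, 0.0, maj, maj)
    return (j, best[1], best[2], best[3])


def fit_forest(X, y, n_trees, rng):
    """Grow n_trees stumps, each on a bootstrap sample of (X, y)."""
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n points with replacement
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict(forest, row):
    """Majority vote over all trees in the (merged) forest."""
    votes = Counter(l if row[j] <= t else r for j, t, l, r in forest)
    return votes.most_common(1)[0][0]


def divide_and_conquer_forest(X, y, n_chunks, trees_per_chunk, seed=0):
    """Split the data into chunks, fit one sub-forest per chunk, merge the trees."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)  # random partition, so each chunk resembles the full data
    merged = []
    for k in range(n_chunks):  # in practice, each chunk would go to its own worker
        part = order[k::n_chunks]
        merged += fit_forest([X[i] for i in part], [y[i] for i in part],
                             trees_per_chunk, rng)
    return merged


def sample(n, rng):
    """Toy data (assumption): class c is a Gaussian cluster centred at (3c, 3c)."""
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(3.0 * label, 1.0)])
        y.append(label)
    return X, y


data_rng = random.Random(42)
X_train, y_train = sample(400, data_rng)
X_test, y_test = sample(200, data_rng)

forest = divide_and_conquer_forest(X_train, y_train, n_chunks=4, trees_per_chunk=10)
accuracy = sum(predict(forest, r) == lab for r, lab in zip(X_test, y_test)) / len(y_test)
print(len(forest), accuracy)
```

The chunks here come from a random partition, so each one resembles the full dataset; if the data were instead split in storage order, a chunk could be unrepresentative and its sub-forest biased. Each tree also sees only its own chunk, so out-of-bag points are out-of-bag relative to that chunk rather than to the whole dataset.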
