Abstract

Background: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future diagnostic use in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated data and nine microarray data sets, we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Conclusion: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
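
The abstract does not spell out the gene selection procedure, so the following is only a rough sketch of one way variable importance from a random forest might drive gene selection. It assumes an iterative backward elimination: fit a forest, discard the least important fraction of genes, refit, and finally keep the smallest gene set whose out-of-bag (OOB) error stays close to the best one observed. The function name `select_genes`, its parameters (`drop_fraction`, `tolerance`, `min_genes`, `n_trees`), the use of scikit-learn's impurity-based importances, and the synthetic data are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def select_genes(X, y, drop_fraction=0.2, tolerance=0.01,
                 min_genes=2, n_trees=200, seed=0):
    """Illustrative backward elimination guided by random forest importances.

    Returns (selected gene indices, their OOB error, full elimination history).
    """
    kept = np.arange(X.shape[1])   # column indices of genes still in play
    history = []                   # (gene indices, OOB error) for each fit

    while True:
        forest = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        forest.fit(X[:, kept], y)
        history.append((kept.copy(), 1.0 - forest.oob_score_))
        if len(kept) <= min_genes:
            break
        # Discard the least important fraction of the remaining genes.
        # feature_importances_ is scikit-learn's impurity-based measure;
        # a permutation-based measure could be swapped in here.
        n_drop = max(1, int(drop_fraction * len(kept)))
        n_keep = max(min_genes, len(kept) - n_drop)
        order = np.argsort(forest.feature_importances_)[::-1]  # best first
        kept = kept[order[:n_keep]]

    # Keep the smallest gene set whose OOB error is within `tolerance`
    # of the lowest OOB error seen during elimination.
    best = min(err for _, err in history)
    genes, err = min(
        (h for h in history if h[1] <= best + tolerance),
        key=lambda h: len(h[0]),
    )
    return np.sort(genes), float(err), history


# Synthetic stand-in for expression data: 100 samples, 200 "genes",
# 3 classes, only 5 informative variables (all values are made up).
X, y = make_classification(
    n_samples=100, n_features=200, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
selected, oob_error, history = select_genes(X, y, n_trees=100)
print("genes kept:", len(selected), "OOB error:", round(oob_error, 3))
```

Because the same OOB error is used both to rank candidate gene sets and to report their quality, the figure printed here is optimistic; an honest error estimate would wrap the whole selection in an outer resampling loop such as cross-validation or the bootstrap.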

Random Forest for Genomic Prediction

A zero altered Poisson random forest model for genomic-enabled prediction

Gene selection and classification of microarray data using random forest

The random forest algorithm for statistical learning

Random Forest for Bioinformatics

A random forest guided tour

Random Forests: some methodological insights

ggRandomForests: Visually Exploring a Random Forest for Regression

Random Forests for Big Data

Random Bits Forest: a Strong Classifier/Regressor for Big Data

Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data

ggRandomForests: Exploring Random Forest Survival

Random Forest Algorithm for Prediction of HIV Drug Resistance

Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling

A Benchmarking Between Deep Learning, Support Vector Machine and Bayesian Threshold Best Linear Unbiased Prediction for Predicting Ordinal Traits in Plant Breeding

Artificial Neural Networks and Deep Learning for Genomic Prediction of Binary, Ordinal, and Mixed Outcomes

A graph model for genomic prediction in the context of a linear mixed model framework

A Penalized Regression Method for Genomic Prediction Reduces Mismatch between Training and Testing Sets

Common, uncommon, and novel applications of random forest in psychological research

Random Forests for time-fixed and time-dependent predictors: The DynForest R package

pyRforest: A comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R