uniForest: an unsupervised machine learning technique to detect outliers and restrict variance in microbiome studies

R.J. Leigh, R.A. Murphy, F. Walsh
DOI: https://doi.org/10.1101/2021.05.17.444491
2021-05-17
Abstract: Isolation Forests is an unsupervised machine learning technique for detecting outliers in continuous datasets. It does not require an underlying equivariant or Gaussian distribution and is suitable for use on small datasets. While this procedure is widely used across quantitative fields, to our knowledge this is the first study dedicated to assessing its use on microbiome datasets. Here we present uniForest, an interactive Python notebook (which can be run from any desktop computer using the Google Colaboratory web service) for the processing of microbiome outliers. We used uniForest to apply Isolation Forests to the Healthy Human Microbiome project dataset, imputing outliers with the mean of the remaining inliers to maintain sample size, and assessed the resulting variance reduction in both community structure and derived ecological statistics (α-diversity). We also assessed its effect on anatomical site differentiation (pre- and post-processing) using principal component analysis, dissimilarity matrices, and ANOSIM. We observed a minimum variance reduction of 81.17% across the entire dataset and in α-diversity at the phylum level. Application of Isolation Forests also separated the dataset with extremely high specificity, reducing variance within taxa samples by a minimum of 81.33%. It is evident that Isolation Forests are a potent tool for restricting the effect of variance in microbiome analysis and have potential for broad application in studies where high levels of microbiome variance are expected. This software allows for clean analyses of otherwise noisy datasets.
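The workflow described in the abstract (score each value with an Isolation Forest, flag outliers, replace them with the mean of the remaining inliers, then compare variance) can be illustrated with a minimal sketch. This is not the uniForest notebook code. It is a small standard-library implementation of a one-dimensional Isolation Forest, written under several assumptions: each taxon's abundances are scored independently, the score cutoff of 0.7 is arbitrary, and the function names (`anomaly_scores`, `impute_outliers`) and the example abundances are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup
    over n points; used to normalise isolation path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(values, depth, limit, rng):
    """Recursively split values at random points. A leaf is stored as an
    int (its size); an internal node as (split, left, right)."""
    lo, hi = min(values), max(values)
    if depth >= limit or len(values) <= 1 or lo == hi:
        return len(values)
    split = rng.uniform(lo, hi)
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    if not left or not right:  # degenerate split, stop here
        return len(values)
    return (split,
            build_tree(left, depth + 1, limit, rng),
            build_tree(right, depth + 1, limit, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, plus the leaf-size adjustment."""
    if isinstance(tree, int):
        return depth + c_factor(tree)
    split, left, right = tree
    return path_length(left if x < split else right, x, depth + 1)


def anomaly_scores(values, n_trees=100, sample_size=256, seed=0):
    """Isolation Forest score per value: near 1 means easy to isolate."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)
    psi = min(sample_size, len(values))
    limit = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(values, psi), 0, limit, rng)
             for _ in range(n_trees)]
    return [2 ** (-mean(path_length(t, x) for t in trees) / c_factor(psi))
            for x in values]


def impute_outliers(values, threshold=0.7, **kwargs):
    """Flag values scoring above threshold (an assumed cutoff) and replace
    them with the mean of the remaining inliers, keeping sample size."""
    scores = anomaly_scores(values, **kwargs)
    flags = [s > threshold for s in scores]
    inliers = [v for v, f in zip(values, flags) if not f]
    if not inliers:
        raise ValueError("every value was flagged as an outlier")
    fill = mean(inliers)
    return [fill if f else v for v, f in zip(values, flags)], flags


# Hypothetical abundances of one taxon across ten samples.
abundances = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 9.9, 10.3, 95.0]
cleaned, flags = impute_outliers(abundances)
reduction = 100 * (1 - pvariance(cleaned) / pvariance(abundances))
print(f"flagged: {flags.count(True)}, variance reduction: {reduction:.2f}%")
```

Because imputation pulls flagged values to the inlier mean, the variance of the cleaned vector can only shrink relative to the original. In practice a maintained implementation such as scikit-learn's `IsolationForest` would be a more natural choice than this hand-rolled version.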