Abstract: Like other standard classifiers, the Random Forests classifier, a widely used off-the-shelf classification tool, assumes that training and test samples come from the same distribution. However, in safety-critical scenarios such as medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples that do not appear during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and flag unseen outliers by generating an empty set. Theoretically, we establish that CSForest covers the true labels of previously observed inlier classes under arbitrary label shift in the test data. We compare CSForest with state-of-the-art methods using synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest's effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Code for CSForest is available at https://github.com/yujinhan98/CSForest.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the poor performance of traditional classifiers (such as Random Forest) in safety-critical scenarios (e.g., medical diagnosis and network attack detection) when there is a distribution shift between the training set and the test set. Specifically, when the test set contains new anomalous samples that were not seen in the training set, traditional classifiers may produce incorrect predictions, with severe consequences.

To tackle this challenge, the authors propose **Conformalized Semi-Supervised Random Forest (CSForest)**, a tree ensemble classifier that combines semi-supervised learning and conformal prediction techniques. The main objectives of CSForest include:

1. **Improving classification accuracy**: Enhancing the model's accuracy by utilizing unlabeled test data.
2. **Detecting new anomalous samples**: Generating empty sets to flag anomalous samples that were not seen in the training set.
3. **Adapting to distribution changes**: Ensuring coverage of the true labels of observed normal classes when the label distribution of the test data changes.

### Main Contributions

1. **Proposing a novel classifier**: CSForest is a new classifier that provides calibrated uncertainty quantification in the presence of distribution shifts between the training and test sets.
2. **Theoretical guarantees**: CSForest is shown to cover the true labels of observed normal classes under arbitrary label shift in the test data.
3. **Experimental validation**: Extensive experiments on synthetic data and multiple real-world datasets demonstrate the effectiveness and robustness of CSForest under different types of distribution shifts.

### Solution

CSForest achieves its goals through the following methods:

- **Semi-supervised learning**: Enhancing the model's accuracy by utilizing unlabeled test data.
- **Conformal prediction techniques**: Using the Jackknife+aB technique to handle the joint and asymmetric utilization of training and test samples.
- **Empty set generation**: Generating empty sets to flag anomalous samples that were not seen before.

### Experimental Results

- **Synthetic data**: On simple 2-dimensional synthetic datasets, CSForest showed better anomaly detection capability and higher classification accuracy than the other methods compared.
- **Real-world data**: On multiple real-world datasets (such as MNIST, FashionMNIST, and CIFAR-10), CSForest excelled in detecting anomalous samples while maintaining classification accuracy for normal samples.

### Conclusion

By combining semi-supervised learning and conformal prediction techniques, CSForest addresses the classification and anomaly detection problems that arise when the training and test distributions differ. It not only improves classification accuracy but also detects unseen anomalous samples, making it suitable for safety-critical applications.
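
To make the idea of a set-valued prediction $C(x)$ with an empty-set outlier flag concrete, here is a minimal sketch in Python. It is not the CSForest algorithm: it uses plain class-conditional split-conformal p-values instead of Jackknife+aB with semi-supervised trees, and it assumes that a nonconformity score (higher means more atypical for that class) has already been computed for each class. The function names `conformal_p_value`, `prediction_set`, and `is_flagged_outlier`, and all the numbers, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def conformal_p_value(cal_scores: List[float], test_score: float) -> float:
    """Conformal p-value: share of calibration scores at least as
    nonconforming as the test score, with the usual +1 correction."""
    n_at_least = sum(1 for s in cal_scores if s >= test_score)
    return (1 + n_at_least) / (len(cal_scores) + 1)


def prediction_set(
    cal_scores_by_class: Dict[str, List[float]],
    test_scores_by_class: Dict[str, float],
    alpha: float,
) -> Set[str]:
    """Keep every class whose p-value exceeds alpha.
    Calibrating within each class separately is what makes this kind of
    construction insensitive to changes in class proportions."""
    return {
        label
        for label, cal_scores in cal_scores_by_class.items()
        if conformal_p_value(cal_scores, test_scores_by_class[label]) > alpha
    }


def is_flagged_outlier(pred_set: Set[str]) -> bool:
    """An empty prediction set means no observed class is plausible."""
    return len(pred_set) == 0


# Hypothetical calibration scores for two inlier classes (assumed data).
calibration = {
    "A": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "B": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
}

# A test point that looks typical for class A and atypical for class B.
inlier_set = prediction_set(calibration, {"A": 0.15, "B": 0.95}, alpha=0.2)

# A test point that looks atypical for every observed class.
outlier_set = prediction_set(calibration, {"A": 0.99, "B": 0.97}, alpha=0.2)

print(inlier_set, is_flagged_outlier(inlier_set))
print(outlier_set, is_flagged_outlier(outlier_set))
```

In this sketch the first test point receives a single-label set and the second receives the empty set, which is the signal that would be read as "unseen outlier". CSForest differs in how the scores and calibration are obtained, since it also uses the unlabeled test samples when building the ensemble.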

Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection
