Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection

Yujin Han, Mingwenchan Xu, Leying Guan
2024-02-29
Abstract: The Random Forests classifier, a widely used off-the-shelf classification tool, assumes, like other standard classifiers, that training and test samples come from the same distribution. However, in safety-critical scenarios like medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples not appearing during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction $C(x)$. Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and flags unseen outliers by generating an empty set. Theoretically, we establish that CSForest covers the true labels of previously observed inlier classes under arbitrary label shift in the test data. We compare CSForest with state-of-the-art methods using synthetic examples and various real-world datasets, under different types of distribution changes in the test domain. Our results highlight CSForest's effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Code for CSForest is available at https://github.com/yujinhan98/CSForest.
Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

Traditional classifiers such as Random Forest perform poorly in safety-critical scenarios (e.g., medical diagnosis and network attack detection) when there is a distribution shift between the training set and the test set. In particular, when the test set contains anomalous samples of a kind never seen during training, a traditional classifier still assigns them one of the known labels, and the resulting incorrect predictions can have severe consequences.

To tackle this challenge, the authors propose **Conformalized Semi-Supervised Random Forest (CSForest)**, a tree ensemble classifier that combines semi-supervised learning with conformal prediction. The main objectives of CSForest are:

1. **Improving classification accuracy**: Enhancing the model's accuracy by utilizing unlabeled test data.
2. **Detecting new anomalous samples**: Generating empty sets to flag anomalous samples that were not seen in the training set.
3. **Adapting to distribution changes**: Covering the true labels of observed normal classes even when the label distribution of the test data changes.

### Main Contributions

1. **Proposing a novel classifier**: CSForest is a new classifier that provides calibrated uncertainty quantification in the presence of distribution shifts between the training and test sets.
2. **Theoretical guarantees**: CSForest is shown to cover the true labels of observed normal classes under arbitrary label shift in the test data.
3. **Experimental validation**: Extensive experiments on synthetic data and multiple real-world datasets demonstrate the effectiveness and robustness of CSForest under different types of distribution shifts.

### Solution

CSForest achieves its goals through the following methods:

- **Semi-supervised learning**: Enhancing the model's accuracy by utilizing unlabeled test data.
- **Conformal prediction techniques**: Using Jackknife+aB to handle the joint and asymmetric utilization of training and test samples.
- **Empty set generation**: Generating empty sets to flag anomalous samples that were not seen before.

### Experimental Results

- **Synthetic data**: On simple 2-dimensional synthetic datasets, CSForest showed better anomaly detection capability and higher classification accuracy compared to other methods.
- **Real-world data**: On multiple real-world datasets (such as MNIST, FashionMNIST, and CIFAR-10), CSForest excelled in detecting anomalous samples while maintaining classification accuracy for normal samples.

### Conclusion

By combining semi-supervised learning and conformal prediction techniques, CSForest addresses classification and anomaly detection in the presence of distribution shifts between the training and test sets. It improves classification accuracy and also detects unseen anomalous samples, making it suitable for safety-critical applications.
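
### Illustrative Sketch

To make the idea of a set-valued prediction $C(x)$ with an empty-set outlier flag concrete, the following is a minimal sketch only. It is not the authors' CSForest and does not implement Jackknife+aB or the semi-supervised tree ensemble. It assumes a much simpler class-conditional split-conformal scheme, in which some classifier has already produced a conformity score per sample (higher meaning more typical of the class). The function name `class_conditional_prediction_set` and all scores are hypothetical and invented for illustration.

```python
from collections import defaultdict


def class_conditional_prediction_set(cal_scores, cal_labels, test_scores, alpha=0.1):
    """Return the labels whose class-wise conformal p-value exceeds alpha.

    cal_scores[i]  : conformity score of calibration sample i for its own label
                     (assumption: higher means more typical of that class).
    cal_labels[i]  : label of calibration sample i.
    test_scores[k] : conformity score of the test point under candidate label k.
    An empty returned set flags the test point as a possible outlier.
    """
    # Group calibration scores by class so each class is calibrated separately.
    by_class = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_class[label].append(score)

    prediction_set = set()
    for label, scores in by_class.items():
        # Count calibration scores of this class that are no larger than the test score.
        n_below = sum(1 for s in scores if s <= test_scores[label])
        p_value = (n_below + 1) / (len(scores) + 1)
        if p_value > alpha:
            prediction_set.add(label)
    return prediction_set


# Toy calibration data (invented): four samples per class.
cal_scores = [0.6, 0.7, 0.8, 0.9, 0.5, 0.6, 0.7, 0.8]
cal_labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

# A point that looks like class "a" and a point that looks like neither class.
inlier = class_conditional_prediction_set(
    cal_scores, cal_labels, {"a": 0.85, "b": 0.1}, alpha=0.25
)
outlier = class_conditional_prediction_set(
    cal_scores, cal_labels, {"a": 0.1, "b": 0.2}, alpha=0.25
)
print(inlier)   # {'a'}
print(outlier)  # set()
```

Because each class is calibrated only against its own samples, the per-class p-values in this sketch do not depend on how often each class appears in the test data, which is the intuition behind robustness to label shift. A test point with low scores under every known class receives an empty set and can be treated as a candidate outlier.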