CAST: Cluster-Aware Self-Training for Tabular Data via Reliable Confidence

Minwook Kim, Juseong Kim, Ki Beom Kim, Giltae Song
2024-08-29
Abstract: Tabular data is one of the most widely used data modalities, encompassing numerous datasets with substantial amounts of unlabeled data. Despite this prevalence, there is a notable lack of simple and versatile methods for utilizing unlabeled data in the tabular domain, where both gradient-boosting decision trees and neural networks are employed. In this context, self-training has gained traction due to its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous confidence. Several solutions have been proposed to handle this problem, but they often compromise the inherent advantages of self-training, resulting in limited applicability in the tabular domain. To address this issue, we explore a novel direction of reliable confidence in self-training contexts and conclude that self-training can be improved by ensuring that the confidence, which represents the value of the pseudo-label, aligns with the cluster assumption. In this regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which enhances existing self-training algorithms at a negligible cost while maintaining simplicity and versatility. Concretely, CAST calibrates confidence by regularizing the classifier's confidence based on local density for each class in the labeled training data, resulting in lower confidence for pseudo-labels in low-density regions. Extensive empirical evaluations on up to 21 real-world datasets confirm not only the superior performance of CAST but also its robustness in various setups in self-training contexts.
Machine Learning
What problem does this paper attempt to address?
The paper addresses how to make pseudo-labels more reliable when self-training on tabular data, and thereby improve the overall performance of self-training algorithms. Specifically, it focuses on three issues:

1. **Noisy pseudo-labels**: Existing self-training methods rely on the classifier's confidence to generate pseudo-labels, but that confidence can be unreliable because of classifier bias or overfitting. The resulting pseudo-labels contain noise, which degrades the final model's performance.
2. **Limitations of existing solutions**: To deal with noisy pseudo-labels, some studies modify the self-training algorithm or the model architecture. These methods often introduce additional computational overhead and are not compatible with Gradient Boosting Decision Trees (GBDTs), which limits their use on tabular data.
3. **Lack of simple and effective methods for tabular data**: Although neural networks and GBDTs are both widely used on tabular data, there is currently no method that keeps the simplicity and versatility of self-training while still using unlabeled data effectively.

To solve these problems, the paper proposes a new self-training method, CAST (Cluster-Aware Self-Training), which calibrates confidence so that it aligns with the cluster assumption, making the pseudo-labels more reliable. Specifically, CAST improves self-training in the following ways:

- **Calibrating confidence based on local density**: CAST adjusts the classifier's confidence according to the local density of each class in the labeled training data, so that pseudo-labels in high-density regions keep higher confidence while pseudo-labels in low-density regions receive lower confidence.
- **Maintaining the simplicity and versatility of self-training**: CAST does not require modifying the self-training algorithm or the model architecture and can be applied to both neural networks and GBDTs, which keeps the method general and easy to use.

In this way, CAST can improve the performance and robustness of self-training algorithms on tabular data at negligible additional computational cost.
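The density-based calibration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel density estimate, the blending weight `alpha`, the `bandwidth`, the `threshold`, and all function names are assumptions chosen for illustration. The sketch only shows the general idea of shrinking a classifier's confidence for a class when the sample lies far from that class's labeled points, and then thresholding the calibrated confidence to decide whether to accept a pseudo-label.

```python
import math


def class_density(x, points, bandwidth=1.0):
    """Gaussian kernel density estimate at x from the labeled points of one class."""
    if not points:
        return 0.0
    dim = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for p in points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(points) * norm)


def calibrate_confidence(probs, x, labeled_by_class, alpha=0.5, bandwidth=1.0):
    """Shrink each class probability where that class's labeled density at x is low.

    labeled_by_class[c] holds the labeled feature vectors of class c.
    alpha (hypothetical knob) controls how strongly density affects confidence:
    alpha=0 leaves the classifier output unchanged.
    """
    calibrated = []
    for c, p in enumerate(probs):
        pts = labeled_by_class[c]
        dens = class_density(x, pts, bandwidth)
        # Reference density: the highest density seen at the class's own labeled points.
        ref = max((class_density(q, pts, bandwidth) for q in pts), default=0.0)
        prior = min(dens / ref, 1.0) if ref > 0 else 0.0  # scaled to [0, 1]
        calibrated.append((1.0 - alpha) * p + alpha * p * prior)
    return calibrated


def select_pseudo_label(probs, x, labeled_by_class, threshold=0.8, alpha=0.5):
    """Return the predicted class if its calibrated confidence passes the threshold, else None."""
    calibrated = calibrate_confidence(probs, x, labeled_by_class, alpha=alpha)
    best = max(range(len(calibrated)), key=lambda c: calibrated[c])
    return best if calibrated[best] >= threshold else None


# Toy example: two well-separated classes in 2D.
labeled = [
    [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2)],  # class 0
    [(5.0, 5.0), (5.2, 4.9), (4.9, 5.1)],   # class 1
]
classifier_probs = [0.9, 0.1]  # the same raw confidence for both query points

near = calibrate_confidence(classifier_probs, (0.05, 0.1), labeled)
far = calibrate_confidence(classifier_probs, (2.5, 2.5), labeled)
print("near cluster:", near)
print("between clusters:", far)
```

Because the calibration acts only on the classifier's output probabilities, the same wrapper could in principle sit on top of either a neural network or a GBDT, which is consistent with the versatility claim above. In this sketch the query point between the two clusters ends up with a lower calibrated confidence than the one inside the class 0 cluster, even though the raw classifier confidence is identical.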