Abstract: Tabular data is one of the most widely used data modalities, encompassing numerous datasets with substantial amounts of unlabeled data. Despite this prevalence, there is a notable lack of simple and versatile methods for utilizing unlabeled data in the tabular domain, where both gradient-boosting decision trees and neural networks are employed. In this context, self-training has gained traction due to its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous confidence. Several solutions have been proposed to handle this problem, but they often compromise the inherent advantages of self-training, resulting in limited applicability in the tabular domain. To address this issue, we explore a novel direction of reliable confidence in self-training contexts and conclude that self-training can be improved by ensuring that the confidence, which represents the value of the pseudo-label, aligns with the cluster assumption. In this regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which enhances existing self-training algorithms at a negligible cost while maintaining simplicity and versatility. Concretely, CAST calibrates confidence by regularizing the classifier's confidence based on local density for each class in the labeled training data, resulting in lower confidence for pseudo-labels in low-density regions. Extensive empirical evaluations on up to 21 real-world datasets confirm not only the superior performance of CAST but also its robustness in various setups in self-training contexts.

What problem does this paper attempt to address?

This paper addresses how to improve the reliability of pseudo-labels when self-training on tabular data, and thereby the overall performance of the self-training algorithm. Specifically, the paper focuses on:

1. **The noise problem of pseudo-labels**: Existing self-training methods rely on the classifier's confidence to generate pseudo-labels, but this confidence may be unreliable due to the classifier's bias or overfitting. The resulting pseudo-labels contain noise, which in turn degrades the final model's performance.
2. **Limitations of existing solutions**: To deal with noisy pseudo-labels, some studies modify the self-training algorithm or the model architecture, but these methods often introduce additional computational overhead and are not compatible with Gradient Boosting Decision Trees (GBDTs), which limits their application to tabular data.
3. **Lack of simple and effective methods suitable for tabular data**: Although neural networks and GBDTs are both widely used on tabular data, there is currently no method that both maintains the simplicity and versatility of self-training and effectively utilizes unlabeled data.

To solve these problems, the paper proposes a new self-training method, CAST (Cluster-Aware Self-Training), which calibrates confidence according to the cluster assumption to make pseudo-labels more reliable. Specifically, CAST improves self-training in the following ways:

- **Calibrating confidence based on local density**: CAST adjusts the classifier's confidence according to the local density of each class in the labeled training data, so that pseudo-labels in high-density regions keep higher confidence while pseudo-labels in low-density regions receive lower confidence.
- **Maintaining the simplicity and versatility of self-training**: CAST does not need to modify the self-training algorithm or the model architecture and can be applied to both neural networks and GBDTs, which keeps the method general and easy to use.

In this way, CAST can significantly improve the performance and robustness of self-training algorithms on tabular data without significantly increasing the computational cost.
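
The density-based calibration described above can be illustrated with a minimal sketch. It is not the paper's exact formulation, which this summary does not give. It assumes a Gaussian kernel as the per-class density estimate, a simple multiplicative regularizer, and a fixed confidence threshold. The function names (`class_densities`, `regularize_confidence`, `select_pseudo_labels`) and the parameters `bandwidth`, `alpha`, and `threshold` are all hypothetical and chosen only for illustration.

```python
import math


def class_densities(x, labeled_X, labeled_y, n_classes, bandwidth=1.0):
    """Relative density of x under each class's labeled points.

    Assumption: an unnormalized Gaussian kernel averaged per class, so every
    value lies in [0, 1]. Any other density estimate could be swapped in.
    """
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for xi, yi in zip(labeled_X, labeled_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        sums[yi] += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        counts[yi] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def regularize_confidence(probs, densities, alpha=1.0):
    """Scale each class probability by density**alpha.

    The result is deliberately not renormalized: a point far from every
    labeled cluster ends up with low confidence for all classes.
    """
    return [p * (d ** alpha) for p, d in zip(probs, densities)]


def select_pseudo_labels(unlabeled_X, predict_proba, labeled_X, labeled_y,
                         n_classes, threshold=0.6, bandwidth=1.0, alpha=1.0):
    """Return (index, pseudo_label, confidence) for confident samples only.

    predict_proba is any callable mapping one sample to class probabilities,
    so the underlying classifier (GBDT, neural network, ...) is left untouched.
    """
    selected = []
    for idx, x in enumerate(unlabeled_X):
        probs = predict_proba(x)
        dens = class_densities(x, labeled_X, labeled_y, n_classes, bandwidth)
        conf = regularize_confidence(probs, dens, alpha)
        label = max(range(n_classes), key=lambda c: conf[c])
        if conf[label] >= threshold:
            selected.append((idx, label, conf[label]))
    return selected


if __name__ == "__main__":
    # Two small labeled clusters, one per class (toy data).
    labeled_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                 (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
    labeled_y = [0, 0, 0, 1, 1, 1]

    # A stand-in classifier that is overconfident everywhere.
    def overconfident(x):
        return [0.95, 0.05] if x[0] + x[1] < 3.0 else [0.05, 0.95]

    # The middle point lies in a low-density region between the clusters.
    unlabeled_X = [(0.05, 0.05), (1.4, 1.4), (3.05, 3.05)]
    picked = select_pseudo_labels(unlabeled_X, overconfident,
                                  labeled_X, labeled_y, n_classes=2)
    print([(i, label) for i, label, _ in picked])  # [(0, 0), (2, 1)]
```

With the raw classifier confidence of 0.95, all three unlabeled points would pass the 0.6 threshold. After the density scaling in this sketch, the point between the two clusters falls below the threshold and is not pseudo-labeled. Because only the confidence values are adjusted, the self-training loop and the classifier itself stay unchanged, which mirrors the model-agnostic property the summary attributes to CAST.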

CAST: Cluster-Aware Self-Training for Tabular Data via Reliable Confidence

FGCM: Noisy Label Learning via Fine-Grained Confidence Modeling

DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning

Improving self-training under distribution shifts via anchored confidence with theoretical guarantees

Knowledge Based Cluster Ensemble for Cancer Discovery from Biomolecular Data

A Semi-Supervised Self-Training Ensemble Method Based on Clustering

Confident Learning: Estimating Uncertainty in Dataset Labels

Overcoming Overconfidence for Active Learning

Adaptive Teaching with Shared Classifier for Knowledge Distillation

Distributionally robust self-supervised learning for tabular data

Active Clustering Ensemble With Self-Paced Learning

A practical approach to novel class discovery in tabular data

Exploring the combination of self and mutual teaching for tabular-data-related semi-supervised regression

AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler

Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularized Self-Training

Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias

Crowd-Certain: Label Aggregation in Crowdsourced and Ensemble Learning Classification

CSAL: Self-adaptive Labeling based Clustering Integrating Supervised Learning on Unlabeled Data

Addressing Selection Bias in Computerized Adaptive Testing: A User-Wise Aggregate Influence Function Approach