Progressive Feature Upgrade in Semi-supervised Learning on Tabular Domain

Morteza Mohammady Gharasuie, Fenjiao Wang
DOI: https://doi.org/10.48550/arXiv.2212.00892
2022-12-02
Abstract: Recent semi-supervised and self-supervised methods have shown great success in the image and text domains by utilizing augmentation techniques. Despite such success, it is not easy to transfer it to tabular domains. Domain-specific transformations from image and language are hard to adapt to tabular data because the tabular domain mixes different data types (continuous and categorical). A few semi-supervised works on the tabular domain have focused on proposing new augmentation techniques for tabular data. These approaches may have shown some improvement on datasets with low-cardinality categorical data. However, the fundamental challenges have not been tackled. The proposed methods either do not apply to datasets with high cardinality or do not use an efficient encoding of categorical data. We propose using a conditional probability representation and an efficient progressive feature upgrading framework to effectively learn representations for tabular data in semi-supervised applications. Extensive experiments show the superior performance of the proposed framework and its potential application in semi-supervised settings.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges of applying semi-supervised learning methods to tabular data. Existing semi-supervised methods have achieved remarkable success in the image and text fields, largely thanks to domain-specific data augmentation techniques. These techniques are difficult to apply directly to tabular data, because tabular data mixes different data types (continuous and categorical) and lacks the explicit structure of images or text. In addition, high-cardinality categorical features make existing representations such as one-hot encoding inefficient and impractical.

The main contribution of the paper is a new representation method, Conditional Probability Representation (CPR), together with a framework for progressively upgrading features. Using pseudo-labels to update CPR improves the model's performance on semi-supervised learning tasks. The method solves the representation problem for high-cardinality categorical data and also strengthens the model's learning ability through dynamic updates of the feature representation, especially on large-scale datasets.

### Specific problems solved by the paper:

1. **Efficient representation of high-cardinality categorical data**: Traditional one-hot encoding leads to a dimension explosion on high-cardinality categorical data, while CPR provides a fixed-dimension representation whose size depends only on the number of target labels, not on the number of categorical values.
2. **Using pseudo-labels to improve model training**: By using pseudo-labels to update CPR, the model can train on more data, which improves generalization and prediction accuracy.
3. **Adapting to the mixture of different data types**: Tabular data usually contains a mixture of continuous and categorical data. CPR converts categorical data into a numerical representation through conditional probabilities, making it easier to process together with other data types.

### Main methods and techniques:

- **Conditional Probability Representation (CPR)**: Map each value of a categorical feature to the probability estimate or expected value of the target attribute, generating a fixed-dimension vector representation.
- **Progressive upgrading framework**: Continuously update CPR during training and use pseudo-labels to improve the feature representation, thereby enhancing model performance.
- **Pseudo-label selection mechanism**: Introduce multiple mechanisms to filter out inaccurate pseudo-labels and reduce the impact of noise on model training.

### Experimental results:

The paper verifies the effectiveness of the proposed method through multiple experiments. The results show that the model using CPR and the progressive upgrading framework outperforms traditional methods on multiple tabular datasets, especially when dealing with high-cardinality categorical data.

In conclusion, by proposing a new data representation method and a progressive upgrading framework, this paper addresses the challenges of semi-supervised learning in the tabular data domain and provides new ideas and methods for related research.
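
The CPR encoding and the pseudo-label-driven upgrade described above can be illustrated with a minimal NumPy sketch for the classification case. This is not the authors' implementation. The function names (`fit_cpr`, `transform_cpr`, `upgrade_cpr`), the additive smoothing constant `alpha`, the uniform fallback for unseen values, and the single confidence `threshold` (standing in for the paper's multiple selection mechanisms) are all assumptions made for illustration.

```python
import numpy as np

def fit_cpr(values, labels, n_classes, alpha=1.0):
    """Estimate P(label | categorical value) for one categorical column.

    alpha is an additive-smoothing constant (an assumption, not from the paper).
    Returns a dict mapping each observed value to a length-n_classes vector.
    """
    values = np.asarray(values)
    labels = np.asarray(labels, dtype=int)
    table = {}
    for v in np.unique(values).tolist():
        counts = np.bincount(labels[values == v], minlength=n_classes).astype(float)
        table[v] = (counts + alpha) / (counts.sum() + alpha * n_classes)
    return table

def transform_cpr(values, table, n_classes):
    """Encode each value as its class-probability vector.

    Values missing from the table fall back to a uniform vector (assumption).
    The output width is n_classes regardless of the column's cardinality.
    """
    uniform = np.full(n_classes, 1.0 / n_classes)
    return np.stack([table.get(v, uniform) for v in np.asarray(values).tolist()])

def upgrade_cpr(lab_values, lab_labels, unl_values, unl_probs, n_classes, threshold=0.9):
    """Refit CPR on labeled rows plus confidently pseudo-labeled unlabeled rows.

    unl_probs holds the current model's predicted class probabilities for the
    unlabeled rows; rows whose top probability is below threshold are dropped.
    """
    unl_probs = np.asarray(unl_probs, dtype=float)
    confident = unl_probs.max(axis=1) >= threshold
    pseudo_labels = unl_probs.argmax(axis=1)[confident]
    values = np.concatenate([np.asarray(lab_values), np.asarray(unl_values)[confident]])
    labels = np.concatenate([np.asarray(lab_labels, dtype=int), pseudo_labels])
    return fit_cpr(values, labels, n_classes)

# Toy usage: one categorical column, two target classes.
lab_values, lab_labels = ["a", "a", "b", "b"], [0, 0, 1, 0]
table = fit_cpr(lab_values, lab_labels, n_classes=2)
encoded = transform_cpr(["a", "c"], table, n_classes=2)

# One upgrade round using hypothetical model predictions on unlabeled rows.
unl_values = ["b", "b", "c"]
unl_probs = [[0.05, 0.95], [0.6, 0.4], [0.02, 0.98]]
upgraded = upgrade_cpr(lab_values, lab_labels, unl_values, unl_probs, n_classes=2)
```

In a full training loop, one would presumably retrain the model on the re-encoded features after each upgrade and repeat; that loop, and the expected-value variant for regression targets, are omitted here.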