Abstract: Semi-supervised learning (SSL) is an important theme in machine learning, in which we have a few labeled samples and many unlabeled samples. In this paper, for SSL in a regression problem, we consider a method of incorporating information on unlabeled samples into kernel functions. As a typical implementation, we employ Gaussian kernels whose centers are labeled and unlabeled input samples. Since the number of coefficients is larger than the number of labeled samples in this setting, this is an over-parameterized regression problem. Ridge regression is a typical estimation method under this setting. In this paper, alternatively, we consider applying the minimum norm least squares (MNLS), which is known as a helpful tool for understanding deep learning behavior, although it may not be application oriented. Then, in applying the MNLS for SSL, we establish several methods based on feature extraction/dimension reduction in the SVD (singular value decomposition) representation of a Gram-type matrix that appears in the over-parameterized regression problem. The methods are thresholding according to singular value magnitude with cross validation, hard-thresholding with cross validation, universal thresholding, and bridge thresholding. The first one is equivalent to a method using a well-known low-rank approximation of a Gram-type matrix. We refer to these methods as SVD regression methods. In experiments on real data, clear superiority of the proposed SVD regression methods over ridge regression methods was observed for some datasets. Also, for some datasets, incorporating information on unlabeled input samples into kernels was found to be clearly effective.
What problem does this paper attempt to address?
The paper addresses how to use unlabeled data to improve regression performance in semi-supervised learning (SSL). Specifically, it applies minimum norm least squares (MNLS) in the over-parameterized setting and performs feature extraction and dimension reduction through the singular value decomposition (SVD) of a Gram-type matrix, so that unlabeled inputs contribute to the regression model.
### Background and Motivation of the Paper
- **Semi-supervised Learning**: Semi-supervised learning is a machine-learning setting in which only a small amount of data is labeled while a large amount is unlabeled. It has been widely studied for tasks such as image classification, but has seen relatively few applications in regression.
- **Over-parameterization Problem**: When kernel functions are centered on both labeled and unlabeled inputs, the number of coefficients exceeds the number of labeled samples, so the regression problem is over-parameterized. Ridge regression is the typical estimator in this setting; the paper considers MNLS as an alternative.
- **Application of MNLS**: MNLS is mainly known as a tool for analyzing phenomena such as "double descent" and "benign overfitting" in deep learning, and is rarely used as a practical estimation method. The paper applies MNLS to semi-supervised learning to improve the performance of regression tasks.
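The over-parameterized setting described above can be written out as follows. This is a sketch under assumed notation, not the paper's own: $n$ labeled inputs, $u$ unlabeled inputs, $m = n + u$ kernel centers $c_j$, a bandwidth $\sigma$ in the common $2\sigma^2$ parameterization, and the thin SVD $G = \sum_k s_k u_k v_k^\top$ of rank $r$.

```latex
% Model: Gaussian kernels centered on labeled and unlabeled inputs
f(x) = \sum_{j=1}^{m} w_j \exp\!\left(-\frac{\lVert x - c_j \rVert^2}{2\sigma^2}\right),
\qquad m = n + u > n

% Gram-type design matrix evaluated at the n labeled inputs x_i
G \in \mathbb{R}^{n \times m}, \qquad
G_{ij} = \exp\!\left(-\frac{\lVert x_i - c_j \rVert^2}{2\sigma^2}\right)

% Ridge regression with regularization parameter \lambda > 0
\hat{w}_{\mathrm{ridge}} = G^\top \left(G G^\top + \lambda I_n\right)^{-1} y

% Minimum norm least squares: the least squares solution of smallest norm
\hat{w}_{\mathrm{MNLS}} = G^{+} y = \sum_{k=1}^{r} \frac{u_k^\top y}{s_k}\, v_k
```

The MNLS solution is the limit of the ridge solution as $\lambda \to 0^{+}$. The SVD form makes clear why dropping or shrinking individual terms of the sum acts as feature extraction: each term is one component of the Gram-type matrix.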
### Contributions of the Paper
- **Proposing a New Semi-supervised Learning Scheme**: The paper proposes a semi-supervised learning scheme that builds the regression model as a linear combination of Gaussian kernel functions. The centers of these kernel functions include both labeled and unlabeled input samples.
- **Over-parameterized Regression Method**: The paper uses MNLS for regression in the over-parameterized setting and proposes several SVD-based methods: thresholding according to singular value magnitude with cross validation (equivalent to a low-rank approximation of the Gram-type matrix), Hard-Thresholding (HT) with cross validation, Universal Thresholding (UT), and Bridge-Thresholding (BT).
- **Experimental Verification**: The paper runs experiments on four datasets from the UCI Machine Learning Repository to verify the effectiveness of the proposed methods. The results show that on some datasets the SVD regression methods perform better than ridge regression.
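The scheme in the bullets above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gaussian_design` and `svd_regression`, the synthetic data, the bandwidth `sigma=0.3`, and the relative tolerance `1e-10` are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_design(X_eval, centers, sigma):
    """G[i, j] = exp(-||X_eval[i] - centers[j]||^2 / (2 * sigma^2))."""
    sq_dist = ((X_eval[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def svd_regression(G, y, rank=None):
    """Minimum norm least squares weights built from the `rank` largest
    singular components of G (all numerically nonzero ones if rank is None)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    if rank is None:
        # Assumed tolerance: ignore components below 1e-10 of the largest one.
        rank = int((s > s[0] * 1e-10).sum())
    coef = (U[:, :rank].T @ y) / s[:rank]
    return Vt[:rank].T @ coef

# Synthetic data (hypothetical): 20 labeled and 60 unlabeled 2-D inputs.
rng = np.random.default_rng(0)
X_lab = rng.uniform(-1, 1, size=(20, 2))
X_unl = rng.uniform(-1, 1, size=(60, 2))
y_lab = np.sin(3 * X_lab[:, 0]) + 0.1 * rng.standard_normal(20)

# Kernel centers are the labeled AND unlabeled inputs: 80 weights, 20 labels.
centers = np.vstack([X_lab, X_unl])
G = gaussian_design(X_lab, centers, sigma=0.3)   # shape (20, 80)

w_full = svd_regression(G, y_lab)                # plain MNLS
w_low = svd_regression(G, y_lab, rank=5)         # keep 5 largest components

# Predict at a new input with either weight vector.
x_new = np.array([[0.2, -0.4]])
pred_low = gaussian_design(x_new, centers, sigma=0.3) @ w_low
```

Keeping fewer components gives a weight vector with a smaller norm and a larger training residual, which is the trade-off the thresholding methods are meant to tune.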
### Experimental Results
- **Performance Differences Across Datasets**: The relative performance of the regression methods depends on the dataset. For example, on the GPU and Auction datasets ridge regression performs better, while on the Energy and Yacht datasets the SVD regression methods perform better.
- **Influence of Sample Size**: When the number of training samples is small (for example, 50), the standard deviation of the prediction error is large for all methods, indicating that too few samples can lead to unstable results. When the sample size is large enough (for example, 200), all methods stably produce better predictions.
- **Importance of Feature Selection**: For some datasets, the SVD regression methods improve prediction performance more stably by keeping only the components with larger singular values.
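Selecting components by singular value magnitude requires choosing how many to keep, which the abstract says is done with cross validation. The sketch below shows one way this could look; the K-fold layout, the function name `cv_select_rank`, the synthetic data, and the candidate ranks are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def cv_select_rank(G, y, ranks, n_folds=5, seed=0):
    """Pick the number of leading singular components by K-fold cross validation.

    G is the (n_labeled x n_centers) kernel design matrix. Held-out rows lose
    their labels but their inputs may still serve as kernel centers, just as
    unlabeled inputs do.
    """
    n = G.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    errors = np.zeros(len(ranks))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        U, s, Vt = np.linalg.svd(G[train], full_matrices=False)
        proj = U.T @ y[train]
        for k, r in enumerate(ranks):
            # Minimum norm solution restricted to the r largest components.
            w = Vt[:r].T @ (proj[:r] / s[:r])
            errors[k] += np.sum((G[held_out] @ w - y[held_out]) ** 2)
    errors /= n   # mean squared validation error per candidate rank
    return ranks[int(np.argmin(errors))], errors

# Synthetic 1-D example (hypothetical): 40 labeled and 100 unlabeled inputs.
rng = np.random.default_rng(1)
X_lab = rng.uniform(-1, 1, size=(40, 1))
X_unl = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X_lab[:, 0]) + 0.1 * rng.standard_normal(40)

centers = np.vstack([X_lab, X_unl])
sq_dist = ((X_lab[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-sq_dist / (2 * 0.3 ** 2))   # shape (40, 140)

ranks = list(range(1, 13))
best_rank, cv_errors = cv_select_rank(G, y, ranks)
```

Every candidate rank must not exceed the number of training rows in a fold (32 here), since that bounds the number of singular components available.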
In conclusion, the paper proposes a semi-supervised regression method that places kernels on both labeled and unlabeled inputs and applies MNLS with SVD-based feature extraction and dimension reduction in the resulting over-parameterized setting. Depending on the dataset, this makes effective use of unlabeled data and improves regression performance over ridge regression.