Abstract: This article addresses the problem of classification based on both labeled and unlabeled data, where we assume that the density function for labeled data differs from that for unlabeled data. We propose a semi-supervised logistic regression model for classification, combined with the technique of covariate shift adaptation. Unknown parameters in the proposed models are estimated by regularization with the EM algorithm. A crucial issue in the modeling process is the choice of tuning parameters in our semi-supervised logistic models. To select these parameters, a model selection criterion is derived from an information-theoretic approach. Numerical studies show that our modeling procedure performs well in various cases.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to use labeled data and unlabeled data for effective classification when their probability density functions are different. Specifically, the paper proposes a semi-supervised logistic regression model. This model can estimate model parameters through covariate shift adaptation techniques and regularization methods when the labeled data and unlabeled data come from different distributions. In addition, the paper also proposes an information-theory-based method to select the tuning parameters in the model to optimize the model performance.

### Main Contributions

1. **Model Construction**: Proposed a semi-supervised logistic regression model applicable when the probability density functions of labeled data and unlabeled data are different.
2. **Parameter Estimation**: Used the EM algorithm and regularization methods to estimate the unknown parameters in the model.
3. **Model Selection**: Introduced an information-theory-based model selection criterion for selecting the tuning parameters in the model.
4. **Numerical Experiments**: Verified the effectiveness of the proposed method through Monte Carlo simulation and benchmark data set analysis.

### Key Formulas

- **Weighted Log-Likelihood Function**:

  \[
  \begin{aligned}
  \ell^*(w; \gamma_1, \gamma_2)
  ={}& \sum_{\alpha=1}^{n_1}\left(\frac{q_{\text{unlabel}}(x_{\alpha})}{q_{\text{label}}(x_{\alpha})}\right)^{\gamma_1}\left[y_{\alpha}w^Tx_{\alpha}^*-\log\left(1+\exp(w^Tx_{\alpha}^*)\right)\right] \\
  &+\sum_{\alpha=n_1+1}^{n}\left(\frac{q_{\text{label}}(x_{\alpha})}{q_{\text{unlabel}}(x_{\alpha})}\right)^{\gamma_2}\left[t_{\alpha}w^Tx_{\alpha}^*-\log\left(1+\exp(w^Tx_{\alpha}^*)\right)\right]
  \end{aligned}
  \]

  where \(\gamma_1\) and \(\gamma_2\) are tuning parameters, and \(q_{\text{label}}(x)\) and \(q_{\text{unlabel}}(x)\) are the density functions of labeled data and unlabeled data respectively.

- **Regularized Log-Likelihood Function**:

  \[
  \ell_{\lambda}^*(w; \gamma_1, \gamma_2)=\ell^*(w; \gamma_1, \gamma_2)-\frac{n_1\lambda}{2}w^TKw
  \]

  where \(\lambda\) is a regularization parameter, \(K = \text{diag}(0, I_p)\) is a \((p+1)\times(p+1)\) matrix, and \(I_p\) is a \(p\)-dimensional identity matrix.

- **Model Selection Criterion**:

  \[
  \text{GIC}=-2\sum_{\alpha=1}^{n_1}\left(\frac{q_{\text{unlabel}}(x_{\alpha})}{q_{\text{label}}(x_{\alpha})}\right)^{\gamma_1}\log f(y_{\alpha}\mid x_{\alpha};\hat{w})+2\,\text{tr}\left\{Q(\hat{w})R^{-1}(\hat{w})\right\}
  \]

  where \(Q(\hat{w})\) and \(R(\hat{w})\) are matrices derived from the model parameters.

### Experimental Results
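As a small illustration of the Key Formulas (not one of the paper's experiments), the regularized weighted log-likelihood can be evaluated in a few lines of NumPy. This is a minimal sketch under stated assumptions: \(x_{\alpha}^*\) is taken to be \(x_{\alpha}\) with a leading 1 for the intercept (which would match \(K\) leaving the first coefficient unpenalized), \(t_{\alpha}\) is taken to be a soft label in \([0, 1]\) for each unlabeled point, and the density ratios are passed in as precomputed arrays because the summary does not say how they are estimated. The function name `weighted_reg_loglik` and all argument names are hypothetical.

```python
import numpy as np


def weighted_reg_loglik(w, X_lab, y, X_unl, t, ratio_lab, ratio_unl,
                        gamma1, gamma2, lam):
    """Evaluate the regularized weighted log-likelihood for a given w.

    Assumptions (not confirmed by the source):
      * x* = (1, x) so that w has length p + 1 and w[0] is the intercept;
      * t holds soft labels in [0, 1] for the unlabeled points;
      * ratio_lab[a] = q_unlabel(x_a) / q_label(x_a) on labeled points,
        ratio_unl[a] = q_label(x_a) / q_unlabel(x_a) on unlabeled points.
    """
    def design(X):
        # Prepend a column of ones for the intercept term.
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def bernoulli_terms(Xs, targets):
        eta = Xs @ w
        # logaddexp(0, eta) is a numerically stable log(1 + exp(eta)).
        return targets * eta - np.logaddexp(0.0, eta)

    labeled = np.sum(ratio_lab ** gamma1 * bernoulli_terms(design(X_lab), y))
    unlabeled = np.sum(ratio_unl ** gamma2 * bernoulli_terms(design(X_unl), t))

    n1, p = X_lab.shape
    K = np.diag(np.r_[0.0, np.ones(p)])  # diag(0, I_p): intercept unpenalized
    penalty = 0.5 * n1 * lam * (w @ K @ w)
    return labeled + unlabeled - penalty


# Toy inputs, invented purely for illustration.
X_lab = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
X_unl = np.array([[0.5, 0.5], [1.5, 1.0]])
t = np.array([0.3, 0.8])
ratio_lab = np.array([0.5, 1.0, 2.0])
ratio_unl = np.array([2.0, 0.5])

value = weighted_reg_loglik(np.array([0.1, 0.5, -0.5]), X_lab, y, X_unl, t,
                            ratio_lab, ratio_unl,
                            gamma1=0.5, gamma2=0.5, lam=0.1)
```

Setting `gamma1 = gamma2 = 0` switches the importance weights off, so the sketch reduces to an ordinary ridge-penalized logistic log-likelihood over all points; this makes a convenient sanity check when experimenting with the tuning parameters.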

Semi-supervised logistic discrimination via labeled data and unlabeled data from different sampling distributions
