Semi-supervised logistic discrimination via labeled data and unlabeled data from different sampling distributions

Shuichi Kawano
DOI: https://doi.org/10.1002/sam.11204
2012-10-13
Abstract: This article addresses the problem of classification method based on both labeled and unlabeled data, where we assume that a density function for labeled data is different from that for unlabeled data. We propose a semi-supervised logistic regression model for classification problem along with the technique of covariate shift adaptation. Unknown parameters involved in proposed models are estimated by regularization with EM algorithm. A crucial issue in the modeling process is the choices of tuning parameters in our semi-supervised logistic models. In order to select the parameters, a model selection criterion is derived from an information-theoretic approach. Some numerical studies show that our modeling procedure performs well in various cases.
Machine Learning, Methodology
What problem does this paper attempt to address?
This paper addresses how to use labeled and unlabeled data together for classification when the two are drawn from different probability density functions. Specifically, it proposes a semi-supervised logistic regression model whose parameters are estimated with covariate shift adaptation and regularization when the labeled and unlabeled data come from different distributions. The paper also derives an information-theoretic criterion for selecting the model's tuning parameters.

### Main Contributions

1. **Model Construction**: Proposed a semi-supervised logistic regression model applicable when the probability density functions of labeled data and unlabeled data are different.
2. **Parameter Estimation**: Used the EM algorithm and regularization methods to estimate the unknown parameters in the model.
3. **Model Selection**: Introduced an information-theory-based model selection criterion for selecting the tuning parameters in the model.
4. **Numerical Experiments**: Verified the effectiveness of the proposed method through Monte Carlo simulation and benchmark data set analysis.

### Key Formulas

- **Weighted Log-Likelihood Function**:
  \[
  \ell^*(w; \gamma_1, \gamma_2)
  = \sum_{\alpha = 1}^{n_1} \left(\frac{q_{\text{unlabel}}(x_{\alpha})}{q_{\text{label}}(x_{\alpha})}\right)^{\gamma_1} \left[y_{\alpha} w^T x_{\alpha}^* - \log(1 + \exp(w^T x_{\alpha}^*))\right]
  + \sum_{\alpha = n_1 + 1}^{n} \left(\frac{q_{\text{label}}(x_{\alpha})}{q_{\text{unlabel}}(x_{\alpha})}\right)^{\gamma_2} \left[t_{\alpha} w^T x_{\alpha}^* - \log(1 + \exp(w^T x_{\alpha}^*))\right]
  \]
  where \(\gamma_1\) and \(\gamma_2\) are tuning parameters, and \(q_{\text{label}}(x)\) and \(q_{\text{unlabel}}(x)\) are the density functions of the labeled data and the unlabeled data, respectively.
- **Regularized Log-Likelihood Function**:
  \[
  \ell_{\lambda}^*(w; \gamma_1, \gamma_2) = \ell^*(w; \gamma_1, \gamma_2) - \frac{n_1 \lambda}{2} w^T K w
  \]
  where \(\lambda\) is a regularization parameter, \(K = \text{diag}(0, I_p)\) is a \((p + 1) \times (p + 1)\) matrix, and \(I_p\) is a \(p\)-dimensional identity matrix.
- **Model Selection Criterion**:
  \[
  \text{GIC} = -2 \sum_{\alpha = 1}^{n_1} \left(\frac{q_{\text{unlabel}}(x_{\alpha})}{q_{\text{label}}(x_{\alpha})}\right)^{\gamma_1} \log f(y_{\alpha} | x_{\alpha}; \hat{w}) + 2\,\text{tr}\left\{Q(\hat{w}) R^{-1}(\hat{w})\right\}
  \]
  where \(Q(\hat{w})\) and \(R(\hat{w})\) are matrices derived from the model parameters.

### Experimental Results
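
For readers who want to experiment with the objective listed under Key Formulas, the following is a minimal NumPy sketch of the regularized weighted log-likelihood. It is an illustration under stated assumptions, not the paper's implementation. The function name `regularized_weighted_loglik` and its argument names are hypothetical. The sketch assumes that \(x_{\alpha}^*\) is the input vector with a leading 1 for the intercept (suggested by \(K\) being \((p + 1) \times (p + 1)\) with a zero in the intercept position), that \(t_{\alpha}\) is a soft label in \([0, 1]\) for each unlabeled point, and that the density ratios are supplied by the caller. In the toy usage the ratios come from two known Gaussians, which would not be available in practice.

```python
import numpy as np


def regularized_weighted_loglik(w, X_lab, y_lab, X_unl, t_unl,
                                ratio_lab, ratio_unl, gamma1, gamma2, lam):
    """Evaluate the regularized weighted log-likelihood at w.

    ratio_lab: q_unlabel(x) / q_label(x) at the labeled points (assumed given)
    ratio_unl: q_label(x) / q_unlabel(x) at the unlabeled points (assumed given)
    t_unl:     soft labels in [0, 1] for the unlabeled points (assumption)
    """
    n1, p = X_lab.shape
    # Assumption: x* is x augmented with a leading 1 for the intercept.
    Xs_lab = np.hstack([np.ones((n1, 1)), X_lab])
    Xs_unl = np.hstack([np.ones((X_unl.shape[0], 1)), X_unl])

    eta_lab = Xs_lab @ w
    eta_unl = Xs_unl @ w

    # y * eta - log(1 + exp(eta)); logaddexp(0, eta) is a stable log(1 + e^eta).
    ll_lab = y_lab * eta_lab - np.logaddexp(0.0, eta_lab)
    ll_unl = t_unl * eta_unl - np.logaddexp(0.0, eta_unl)

    weighted = (np.sum(ratio_lab ** gamma1 * ll_lab)
                + np.sum(ratio_unl ** gamma2 * ll_unl))

    # K = diag(0, I_p): the intercept is left unpenalized.
    K = np.diag(np.concatenate([[0.0], np.ones(p)]))
    penalty = 0.5 * n1 * lam * (w @ K @ w)
    return weighted - penalty


def gauss_pdf(x, mu):
    """Density of N(mu, 1), used only to build toy density ratios."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)


# Toy usage: labeled inputs from N(0, 1), unlabeled inputs from N(1, 1).
rng = np.random.default_rng(0)
X_lab = rng.normal(0.0, 1.0, size=(20, 1))
X_unl = rng.normal(1.0, 1.0, size=(30, 1))
y_lab = (X_lab[:, 0] > 0.5).astype(float)
t_unl = np.full(30, 0.5)  # uninformative soft labels, for illustration only

ratio_lab = gauss_pdf(X_lab[:, 0], 1.0) / gauss_pdf(X_lab[:, 0], 0.0)
ratio_unl = gauss_pdf(X_unl[:, 0], 0.0) / gauss_pdf(X_unl[:, 0], 1.0)

value = regularized_weighted_loglik(
    np.array([0.0, 1.0]), X_lab, y_lab, X_unl, t_unl,
    ratio_lab, ratio_unl, gamma1=0.5, gamma2=0.5, lam=0.1)
print(value)
```

Setting `gamma1 = gamma2 = 0` turns every weight into 1, so the density ratios drop out and the objective reduces to an ordinary penalized logistic log-likelihood over all points. Larger exponents give the density ratios more influence.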