Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models

Xinwei Zhang, Zhiqiang Tan
DOI: https://doi.org/10.48550/arXiv.1906.07882
2019-06-19
Abstract: Consider semi-supervised learning for classification, where both labeled and unlabeled data are available for training. The goal is to exploit both datasets to achieve higher prediction accuracy than just using labeled data alone. We develop a semi-supervised logistic learning method based on exponential tilt mixture models, by extending a statistical equivalence between logistic regression and exponential tilt modeling. We study maximum nonparametric likelihood estimation and derive novel objective functions which are shown to be Fisher consistent. We also propose regularized estimation and construct simple and highly interpretable EM algorithms. Finally, we present numerical results which demonstrate the advantage of the proposed methods compared with existing methods.
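The equivalence mentioned in the abstract can be written out as follows. This is a sketch in assumed notation, not copied from the paper: $g_0$ and $g_1$ denote the densities of $x$ in classes 0 and 1, $\pi$ the class-1 proportion in the labeled data, and $\rho$ the class-1 proportion in the unlabeled data. By Bayes' rule, a logistic model for $P(y = 1 \mid x)$ is the same as an exponential tilt between the two class densities, and the unlabeled data then follow a mixture of the two.

```latex
% Assumed notation: g_0, g_1 are the class-conditional densities of x,
% \pi is the class-1 proportion in the labeled data, and \rho is the
% class-1 proportion in the unlabeled data (possibly different from \pi).
\begin{align*}
P(y = 1 \mid x)
  &= \frac{\exp(\beta_0 + \beta_1^{\top} x)}{1 + \exp(\beta_0 + \beta_1^{\top} x)}
  \quad\Longleftrightarrow\quad
  \frac{g_1(x)}{g_0(x)} = \exp(\beta_0^{c} + \beta_1^{\top} x),
  \qquad \beta_0^{c} = \beta_0 - \log\frac{\pi}{1 - \pi}, \\
g_{\text{unlabeled}}(x)
  &= (1 - \rho)\, g_0(x) + \rho\, g_1(x)
   = g_0(x)\,\bigl\{ 1 - \rho + \rho \exp(\beta_0^{c} + \beta_1^{\top} x) \bigr\}.
\end{align*}
```

The second line shows why unlabeled data carry information about the slope $\beta_1$: the same tilt parameters appear in the marginal density of the unlabeled $x$.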
Machine Learning
What problem does this paper attempt to address?
This paper addresses three key problems in semi-supervised classification:

1. **Improve prediction accuracy**: Combine a large amount of unlabeled data with a small amount of labeled data to build a classifier that is more accurate than one trained on the labeled data alone, given a limited labeling budget.
2. **Fisher consistency**: Existing semi-supervised classification methods are usually not Fisher consistent, meaning that optimizing their objective functions does not necessarily lead to the true conditional probability function or the Bayes classifier. The paper proposes new objective functions that are shown to be Fisher consistent and that achieve higher estimation efficiency by using unlabeled data than methods using only labeled data.
3. **Relaxation of class-proportion assumptions**: Most existing methods assume that the class proportions are the same in the unlabeled and labeled data, an assumption that often fails in practice. The proposed method allows the class proportions in the unlabeled data to differ from those in the labeled data and estimates them as unknown parameters, which makes the model more flexible.

### Main contributions of the paper

1. **Derivation of new objective functions**: The new objective functions are Fisher consistent and use the unlabeled data to improve estimation efficiency over labeled-only methods.
2. **Regularized estimation and EM algorithms**: The paper proposes regularized estimation and constructs simple, interpretable EM algorithms. The resulting methods show advantages in the numerical experiments.
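To make the idea of treating the unlabeled class proportion as an unknown parameter concrete, here is a minimal sketch of the classical EM procedure for adjusting a classifier to new class priors. It is not the paper's profile SLR or direct SLR algorithm, which also update the regression coefficients. The sketch holds a fitted classifier fixed and estimates only the proportion. The function names, the simulated one-dimensional Gaussian data, and the use of the true posterior `sigmoid(2 * x)` as the classifier are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def em_unlabeled_proportion(probs, pi_labeled, n_iter=200, tol=1e-10):
    """Estimate the class-1 proportion rho in unlabeled data by EM.

    probs: P(y=1 | x) for each unlabeled point, from a classifier that is
        calibrated to the labeled-data class-1 proportion pi_labeled.
    The classifier itself is held fixed (an assumption of this sketch).
    """
    rho = pi_labeled  # start from the labeled-data proportion
    for _ in range(n_iter):
        # E-step: re-weight each posterior from prior pi_labeled to prior rho.
        post = []
        for p in probs:
            w1 = p * rho / pi_labeled
            w0 = (1.0 - p) * (1.0 - rho) / (1.0 - pi_labeled)
            post.append(w1 / (w1 + w0))
        # M-step: the updated proportion is the mean posterior responsibility.
        new_rho = sum(post) / len(post)
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return rho


def simulate_unlabeled(n, rho, seed=0):
    """Hypothetical data: class 1 ~ N(1, 1), class 0 ~ N(-1, 1)."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        y = 1 if rng.random() < rho else 0
        xs.append(rng.gauss(1.0 if y else -1.0, 1.0))
    return xs


if __name__ == "__main__":
    # With balanced labeled data (pi = 0.5) the true posterior for these two
    # Gaussians is sigmoid(2 * x); the unlabeled data use rho = 0.8 instead.
    xs = simulate_unlabeled(5000, rho=0.8)
    probs = [sigmoid(2.0 * x) for x in xs]
    print(em_unlabeled_proportion(probs, pi_labeled=0.5))
```

Holding the classifier fixed keeps the example short. A full semi-supervised fit would alternate this proportion update with an update of the logistic coefficients.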
### Results of numerical experiments

The paper reports experiments on 15 benchmark datasets (11 UCI datasets and 4 SSL benchmark datasets), comparing the proposed methods (profile SLR and direct SLR) with two supervised methods (ridge logistic regression, RLR, and SVM) and two semi-supervised methods (entropy regularization, ER, and transductive SVM, TSVM). The proposed methods show their clearest advantage when the class proportions differ between the two data sources. In particular, under the "Flip Prop" scheme, in which the class proportions in the unlabeled data differ from those in the labeled data, the average accuracy of the proposed methods is significantly higher than that of the other methods.

### Conclusion

By introducing new objective functions and EM algorithms, the paper addresses both the lack of Fisher consistency and the equal-class-proportion assumption in semi-supervised classification, and it improves the prediction accuracy of the classifier. The methods perform well in the numerical experiments, especially when the class proportions in the unlabeled and labeled data differ.