Abstract: Consider semi-supervised learning for classification, where both labeled and unlabeled data are available for training. The goal is to exploit both datasets to achieve higher prediction accuracy than just using labeled data alone. We develop a semi-supervised logistic learning method based on exponential tilt mixture models, by extending a statistical equivalence between logistic regression and exponential tilt modeling. We study maximum nonparametric likelihood estimation and derive novel objective functions which are shown to be Fisher consistent. We also propose regularized estimation and construct simple and highly interpretable EM algorithms. Finally, we present numerical results which demonstrate the advantage of the proposed methods compared with existing methods.
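
The equivalence between logistic regression and exponential tilt modeling mentioned in the abstract can be illustrated with Bayes' rule. The notation below is an assumption made for this sketch and is not taken from the paper: $p_0, p_1$ are the class-conditional densities of $x$, $\rho$ is the class-1 proportion in the labeled data, and $\pi$ is the (possibly different) class-1 proportion in the unlabeled data.

```latex
% Logistic model in the labeled population, where P(y = 1) = \rho:
P(y = 1 \mid x) = \frac{\exp(\beta_0^c + \beta^\top x)}{1 + \exp(\beta_0^c + \beta^\top x)}

% Bayes' rule turns the posterior odds into a density ratio (an exponential tilt of p_0):
\frac{p_1(x)}{p_0(x)}
  = \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} \cdot \frac{1 - \rho}{\rho}
  = \exp(\beta_0 + \beta^\top x),
\qquad
\beta_0 = \beta_0^c - \log\frac{\rho}{1 - \rho}

% Unlabeled data then follow a two-component mixture with tilted components:
p_u(x) = (1 - \pi)\, p_0(x) + \pi\, p_1(x)
       = p_0(x)\,\bigl\{ 1 - \pi + \pi \exp(\beta_0 + \beta^\top x) \bigr\}
```

Under these assumptions the slope $\beta$ is shared by the logistic and tilt formulations, and only the intercept depends on the class proportion, which is why the unlabeled proportion $\pi$ can be treated as a separate unknown.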

What problem does this paper attempt to address?

This paper aims to solve several key problems in semi-supervised classification:

1. **Improve prediction accuracy**: Use a large amount of unlabeled data and a small amount of labeled data to build a more accurate classifier, so as to achieve higher classification accuracy under a limited budget of labeled data.
2. **Fisher consistency problem**: Existing semi-supervised classification methods are usually not Fisher-consistent, which means that optimizing the objective functions of these methods does not necessarily converge to the true conditional probability function or Bayes classifier. The paper proposes a new objective function, which is proved to be Fisher-consistent and has higher estimation efficiency when using unlabeled data than methods using only labeled data.
3. **Relaxation of class-proportion assumptions**: Most existing methods assume that the class proportions in unlabeled data and labeled data are the same. However, this assumption often does not hold in practical applications. The method proposed in the paper allows the class proportions in unlabeled data to be different from those in labeled data. By estimating these proportions as unknown parameters, the flexibility and adaptability of the model are improved.

### Main contributions of the paper

1. **Derivation of new objective functions**: New objective functions are proposed, which are proved to be Fisher-consistent and have higher estimation efficiency when using unlabeled data than methods using only labeled data.
2. **Regularized estimation and EM algorithm**: A regularized estimation method is proposed, and a simple and easy-to-understand EM algorithm is constructed. These algorithms show significant advantages in numerical experiments.
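
To make the idea of estimating the unlabeled class proportion with EM more concrete, here is a minimal Python sketch. It is not the paper's profile SLR or direct SLR algorithm. It is a simplified two-stage stand-in: fit a ridge logistic regression on the labeled data only, then run EM for the unlabeled class-1 proportion, assuming the class-conditional distributions are shared between labeled and unlabeled data. The function names (`fit_logistic`, `em_class_proportion`, `demo`) and the synthetic Gaussian data are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    # Clip the logit so np.exp cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic(X, y, l2=1e-3, n_iter=50):
    """Ridge-penalized logistic regression via Newton's method.

    Returns w = (intercept, coefficients); the intercept is not penalized.
    """
    n, d = X.shape
    A = np.hstack([np.ones((n, 1)), X])
    penalty = l2 * np.eye(d + 1)
    penalty[0, 0] = 0.0
    w = np.zeros(d + 1)
    for _ in range(n_iter):
        p = sigmoid(A @ w)
        grad = A.T @ (p - y) / n + penalty @ w
        hess = (A * (p * (1.0 - p))[:, None]).T @ A / n + penalty
        w = w - np.linalg.solve(hess, grad)
    return w


def em_class_proportion(logit_u, rho, n_iter=500, tol=1e-10):
    """EM for the class-1 proportion pi of the unlabeled data.

    logit_u: labeled-data logits evaluated at the unlabeled points.
    rho:     class-1 fraction in the labeled data.
    """
    pi = rho
    for _ in range(n_iter):
        # E-step: class-1 posterior after moving the intercept from the
        # labeled prior odds to the current unlabeled prior odds.
        shift = np.log(pi / (1.0 - pi)) - np.log(rho / (1.0 - rho))
        tau = sigmoid(logit_u + shift)
        # M-step: the proportion is the average posterior.
        new_pi = float(np.clip(tau.mean(), 1e-6, 1.0 - 1e-6))
        converged = abs(new_pi - pi) < tol
        pi = new_pi
        if converged:
            break
    return pi


def demo(seed=0, n_lab=1000, n_unl=2000, pi_true=0.8):
    """Synthetic check (assumed setup): balanced labels, skewed unlabeled data."""
    rng = np.random.default_rng(seed)
    # Labeled sample: roughly balanced classes, Gaussians centered at -1 and +1.
    y = rng.integers(0, 2, size=n_lab)
    X = rng.normal(size=(n_lab, 2)) + (2 * y - 1)[:, None]
    # Unlabeled sample: same class-conditionals, different class proportion.
    y_u = (rng.random(n_unl) < pi_true).astype(int)
    X_u = rng.normal(size=(n_unl, 2)) + (2 * y_u - 1)[:, None]

    w = fit_logistic(X, y.astype(float))
    logit_u = w[0] + X_u @ w[1:]
    return em_class_proportion(logit_u, rho=y.mean())


if __name__ == "__main__":
    print("estimated unlabeled class-1 proportion:", demo())
```

In this sketch the unlabeled data only inform the class proportion. The methods summarized above go further and use the unlabeled data to estimate the regression coefficients as well, which is where the claimed efficiency gain comes from.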
### Results of numerical experiments

The paper conducts experiments on 15 benchmark datasets (including 11 UCI datasets and 4 SSL benchmark datasets), comparing the performance of the proposed methods (profile SLR and direct SLR) with two supervised methods (ridge logistic regression RLR and SVM) and two semi-supervised methods (entropy regularization ER and transductive SVM TSVM). The experimental results show that in the case of different class proportions, the proposed methods show significant advantages. Especially in the "Flip Prop" scheme, that is, when the class proportions in unlabeled data are different from those in labeled data, the average accuracy of the proposed methods is significantly higher than that of other methods.

### Conclusion

By introducing new objective functions and EM algorithms, the paper successfully solves the Fisher-consistency problem and the class-proportion assumption problem in semi-supervised classification, and improves the prediction accuracy of the classifier. These methods perform well in numerical experiments, especially in the case where the class proportions in unlabeled data and labeled data are different.

Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models

On semi-supervised estimation using exponential tilt mixture models

Gaussian mixture density modeling and decomposition with weighted likelihood

Model-based clustering and classification using mixtures of multivariate skewed power exponential distributions

Semi-supervised logistic discrimination via labeled data and unlabeled data from different sampling distributions

Semi-supervised inference for nonparametric logistic regression

Semisupervised Robust Modeling of Multimode Industrial Processes for Quality Variable Prediction Based on Student's T Mixture Model.

Semi-supervised Mixture of Latent Factor Analysis Models with Application to Online Key Variable Estimation

Supervised Latent Dirichlet Allocation with a Mixture of Sparse Softmax

A Semi-Supervised Learning Algorithm on Gaussian Mixture with Automatic Model Selection

Efficient semi-supervised inference for logistic regression under case-control studies

A Mixture of Multiple Linear Classifiers with Sample Weight and Manifold Regularization

The Infinite Student's T-Mixture for Robust Modeling

Unsupervised Classification Based on Penalized Maximum Likelihood of Gaussian Mixture Models

Analysis of High-dimensional Gaussian Labeled-unlabeled Mixture Model via Message-passing Algorithm

Semi-supervised Logistic Regression Via Manifold Regularization

Semi-Supervised Learning of Noisy Mixture of Experts Models

Nonparametric semi-supervised learning of class proportions

Learning Mixed Multinomial Logits with Provable Guarantees

Semi-supervised Multi-View Maximum Entropy Discrimination with Expectation Laplacian Regularization

Unsupervised Learning of Mixture Regression Models for Longitudinal Data.