On semi-supervised estimation using exponential tilt mixture models

Ye Tian, Xinwei Zhang, Zhiqiang Tan
2023-11-15
Abstract: Consider a semi-supervised setting with a labeled dataset of binary responses and predictors and an unlabeled dataset with only the predictors. Logistic regression is equivalent to an exponential tilt model in the labeled population. For semi-supervised estimation, we develop further analysis and understanding of a statistical approach using exponential tilt mixture (ETM) models and maximum nonparametric likelihood estimation, while allowing that the class proportions may differ between the unlabeled and labeled data. We derive asymptotic properties of ETM-based estimation and demonstrate improved efficiency over supervised logistic regression in a random sampling setup and an outcome-stratified sampling setup previously used. Moreover, we reconcile such efficiency improvement with the existing semiparametric efficiency theory when the class proportions in the unlabeled and labeled data are restricted to be the same. We also provide a simulation study to numerically illustrate our theoretical findings.
Machine Learning
What problem does this paper attempt to address?
The paper asks how, in a semi-supervised learning (SSL) setting, the extra information in unlabeled data can improve on supervised methods that use only labeled data. Specifically, for binary responses, it studies how exponential tilt mixture (ETM) models combined with maximum nonparametric likelihood estimation can improve estimation efficiency under different sampling setups. It pays particular attention to the case where the class proportions in the unlabeled and labeled data may differ, and it relates the resulting efficiency gains to the existing semiparametric efficiency theory.

### Main contributions of the paper

1. **Further analysis of the ETM approach**: The paper develops further analysis and understanding of a statistical approach based on exponential tilt mixture models. The approach allows the class proportion in the unlabeled data to differ from that in the labeled data, while the conditional distributions are assumed to be the same.
2. **Analysis of asymptotic properties**: The paper derives the asymptotic properties of ETM-based estimation and compares it with supervised logistic regression. The results show improved efficiency in both a random sampling setup and an outcome-stratified sampling setup.
3. **Reconciliation with existing theory**: The paper reconciles the efficiency improvement with the existing semiparametric efficiency theory when the class proportions in the unlabeled and labeled data are restricted to be the same, and it provides a simulation study that numerically illustrates the theoretical findings.
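To make the setup concrete, the model can be written out as follows. This is a sketch in assumed notation, not necessarily the paper's: $g_0$ and $g_1$ denote the densities of the predictors $x$ in classes 0 and 1, $\rho$ is the class-1 proportion in the labeled population, and $\rho^*$ is the class-1 proportion in the unlabeled data.

$$
\begin{aligned}
g_1(x) &= \exp(\beta_0 + \beta_1^{\top} x)\, g_0(x), \\
P(y = 1 \mid x) &= \frac{\rho\, g_1(x)}{\rho\, g_1(x) + (1 - \rho)\, g_0(x)}
 = \frac{\exp(\beta_0^{c} + \beta_1^{\top} x)}{1 + \exp(\beta_0^{c} + \beta_1^{\top} x)},
 \qquad \beta_0^{c} = \beta_0 + \log\frac{\rho}{1 - \rho}, \\
p_{u}(x) &= (1 - \rho^*)\, g_0(x) + \rho^*\, g_1(x)
 = \left\{ 1 - \rho^* + \rho^* \exp(\beta_0 + \beta_1^{\top} x) \right\} g_0(x).
\end{aligned}
$$

The first line is the exponential tilt model, the second follows from Bayes' rule and shows why it is equivalent to logistic regression in the labeled population (same slope $\beta_1$, shifted intercept), and the third is the mixture density of the unlabeled predictors, in which $\rho^*$ need not equal $\rho$.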
### Key concepts

- **Exponential Tilt Model**: A statistical model that relates two distributions through a density ratio of exponential form. In the labeled population it is equivalent to logistic regression.
- **Maximum Nonparametric Likelihood Estimation**: An estimation method that maximizes the likelihood without assuming a specific parametric form for the underlying distribution.
- **Semi-Supervised Learning (SSL)**: A machine learning approach that uses labeled data together with unlabeled data to improve on methods that use labeled data alone.

### Paper structure

- **Introduction**: Introduces the background and motivation of semi-supervised learning, as well as the importance and application scenarios of the research.
- **Exponential Tilt Model and Logistic Regression**: Details the exponential tilt model and its equivalence to logistic regression.
- **ETM Model in Random Sampling Settings**: Discusses the properties and estimation methods of the ETM model in random sampling settings.
- **ETM Model in Outcome-Stratified Sampling Settings**: Discusses the properties and estimation methods of the ETM model in outcome-stratified sampling settings.
- **Theoretical Analysis and Simulation Studies**: Examines the efficiency improvement of the ETM model through theoretical analysis and simulation experiments.

### Conclusion

By further developing the ETM approach, the paper provides a way to use unlabeled data effectively when the class proportions of the unlabeled and labeled data may differ. Through theoretical analysis and a simulation study, it shows that ETM-based estimation can improve efficiency over supervised logistic regression in multiple sampling setups, including when the class proportions differ. These results offer a new perspective on semi-supervised estimation.
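The equivalence between logistic regression and the exponential tilt model, and the mixture form of the unlabeled predictor density, can be checked numerically in a toy case. The following is a minimal sketch, not the paper's estimator or simulation design: it assumes one predictor with Gaussian class-conditional densities N(`mu0`, 1) and N(`mu1`, 1), for which the tilt holds exactly, and all names and parameter values (`mu0`, `mu1`, `rho_labeled`, `rho_unlabeled`) are hypothetical choices for illustration.

```python
import numpy as np


def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


# Assumed (hypothetical) class-conditional densities of the predictor:
# x | y=0 ~ N(mu0, 1) and x | y=1 ~ N(mu1, 1).
mu0, mu1 = 0.0, 1.5

# Tilt parameters implied by the two Gaussians, so that
# g1(x) = exp(beta0 + beta1 * x) * g0(x) holds exactly.
beta1 = mu1 - mu0
beta0 = -(mu1 ** 2 - mu0 ** 2) / 2.0


def tilt(x):
    """Density ratio g1(x) / g0(x) under the exponential tilt model."""
    return np.exp(beta0 + beta1 * x)


def posterior_bayes(x, rho):
    """P(y = 1 | x) from Bayes' rule with class-1 proportion rho."""
    num = rho * normal_pdf(x, mu1)
    return num / (num + (1.0 - rho) * normal_pdf(x, mu0))


def posterior_logistic(x, rho):
    """The same probability written as a logistic regression in x."""
    intercept = beta0 + np.log(rho / (1.0 - rho))
    return 1.0 / (1.0 + np.exp(-(intercept + beta1 * x)))


def unlabeled_density(x, rho_star):
    """ETM form of the unlabeled predictor density: a tilted mixture of g0."""
    return normal_pdf(x, mu0) * (1.0 - rho_star + rho_star * tilt(x))


x = np.linspace(-8.0, 10.0, 20001)
rho_labeled, rho_unlabeled = 0.5, 0.2  # class proportions may differ

# 1) Logistic regression and the tilt model give the same P(y = 1 | x).
max_gap = np.max(
    np.abs(posterior_bayes(x, rho_labeled) - posterior_logistic(x, rho_labeled))
)

# 2) The ETM density equals the two-component mixture of g0 and g1.
mixture = (1.0 - rho_unlabeled) * normal_pdf(x, mu0) + rho_unlabeled * normal_pdf(x, mu1)
mixture_gap = np.max(np.abs(unlabeled_density(x, rho_unlabeled) - mixture))

# 3) The ETM density integrates to about one (simple Riemann sum on the grid).
total_mass = np.sum(unlabeled_density(x, rho_unlabeled)) * (x[1] - x[0])

print("max |Bayes - logistic|:", max_gap)
print("max |ETM - mixture|:", mixture_gap)
print("total mass of unlabeled density:", total_mass)
```

The unlabeled density depends on the same tilt parameters `beta0` and `beta1` as the labeled-data model, with its own class proportion. This is why unlabeled predictors carry information about the regression coefficients in this setup.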