Abstract: The case-control sampling design serves as a pivotal strategy in mitigating the imbalanced structure observed in binary data. We consider the estimation of a non-parametric logistic model with the case-control data supplemented by external summary information. The incorporation of external summary information ensures the identifiability of the model. We propose a two-step estimation procedure. In the first step, the external information is utilized to estimate the marginal case proportion. In the second step, the estimated proportion is used to construct a weighted objective function for parameter training. A deep neural network architecture is employed for functional approximation. We further derive the non-asymptotic error bound of the proposed estimator. Following this, the convergence rate is obtained and is shown to reach the optimal speed of the non-parametric regression estimation. Simulation studies are conducted to evaluate the theoretical findings of the proposed method. A real data example is analyzed for illustration.
What problem does this paper attempt to address?
The paper addresses how to estimate a non-parametric logistic regression model from imbalanced binary-classification data when only a case-control sample and external summary information are available. Specifically, it shows how external summary information resolves the identifiability problem that arises under the case-control design, and it proposes a two-step estimation method to improve the accuracy of the resulting estimates.
### Background and Motivation
1. **Unbalanced Data Structure**:
- In many practical applications, binary-classification data suffers from severe class imbalance: one class has far fewer samples than the other. For example, in medical diagnosis, diseased cases (positives) are usually much rarer than healthy controls (negatives).
- This imbalance makes traditional maximum likelihood estimation (MLE) of the model parameters unstable, especially for the intercept term in the linear logistic regression model.
2. **Case-Control Study Design**:
- The case-control study design is a commonly used remedy. Cases and controls are sampled separately at random from their respective groups, which makes the sample more balanced between the two classes and alleviates the problems caused by imbalanced data.
- However, the data set resulting from this design is an artificially biased sample, and this bias needs to be considered during analysis.
3. **Model Identifiability**:
- Under the case-control study design, the intercept term of the model cannot be estimated from the case-control sample alone, because this biased sample does not determine the intercept uniquely.
- To overcome this problem, the paper proposes to use external summary information to assist in the estimation, so that the model parameters become identifiable.
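The identifiability issue can be made concrete with the standard odds calculation for case-control sampling. The notation here is an assumption introduced for illustration and is not taken from the paper: $f(x)$ is the population log-odds, $\pi = P(Y=1)$ is the marginal case proportion, $n_1$ and $n_0$ are the numbers of sampled cases and controls, and $S=1$ indicates inclusion in the sample.

$$
\operatorname{logit} P(Y=1 \mid X=x, S=1)
= f(x) + \log\frac{n_1}{n_0} - \log\frac{\pi}{1-\pi}
$$

The case-control sample therefore only reveals $f(x)$ shifted by a constant that depends on $\pi$. Without outside knowledge of $\pi$, any additive shift of $f$ fits the sample equally well, which is why an external estimate of the marginal case proportion restores identifiability.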
### Solutions
1. **Two-Step Estimation Method**:
- **First Step**: Use external summary information to estimate the marginal case proportion. Specifically, obtain summary statistics (such as the mean) of some covariates from external data sources, and then use this information to estimate the marginal case proportion.
- **Second Step**: Based on the estimated marginal case proportion, construct a weighted objective function for parameter training. The objective function uses inverse-probability weighting to correct the bias of the case-control sample.
2. **Deep Neural Network**:
- Use a multi-layer perceptron (MLP) for function approximation to capture the complex relationships in the non-parametric logistic regression model. The deep neural network structure can alleviate the curse of dimensionality that arises with high-dimensional data.
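The two steps and the network can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the moment-matching form of step 1 (a single scalar covariate whose external mean is known), the specific weights in step 2, and all function names (`estimate_case_proportion`, `ipw_weights`, `mlp_forward`, `weighted_log_loss`) are hypothetical choices made here for clarity.

```python
import math


def estimate_case_proportion(external_mean, case_values, control_values):
    """Step 1 (assumed moment-matching form): solve
    external_mean = pi * mean(cases) + (1 - pi) * mean(controls) for pi."""
    m1 = sum(case_values) / len(case_values)
    m0 = sum(control_values) / len(control_values)
    if m1 == m0:
        raise ValueError("covariate mean must differ between cases and controls")
    pi = (external_mean - m0) / (m1 - m0)
    # Clip so the estimate stays strictly inside (0, 1).
    return min(max(pi, 1e-6), 1.0 - 1e-6)


def ipw_weights(pi, n_cases, n_controls):
    """Step 2 weights: population class share divided by sample class share."""
    n = n_cases + n_controls
    return pi / (n_cases / n), (1.0 - pi) / (n_controls / n)


def mlp_forward(x, layers):
    """ReLU MLP returning a scalar logit f(x).
    `layers` is a list of (W, b) pairs, where W is a list of weight rows."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:  # no activation on the output layer
            h = [max(0.0, v) for v in h]
    return h[0]


def weighted_log_loss(logits, labels, w_case, w_control):
    """Weighted negative Bernoulli log-likelihood, averaged over the sample."""
    total = 0.0
    for f, y in zip(logits, labels):
        # Numerically stable log(1 + exp(f)).
        softplus = max(f, 0.0) + math.log1p(math.exp(-abs(f)))
        w = w_case if y == 1 else w_control
        total += w * (softplus - y * f)
    return total / len(labels)


# Toy usage with made-up numbers (not data from the paper).
pi_hat = estimate_case_proportion(1.2, case_values=[3.0, 3.0], control_values=[1.0, 1.0])
w_case, w_control = ipw_weights(pi_hat, n_cases=2, n_controls=2)
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.5])]
logits = [mlp_forward(x, layers) for x in ([1.0, -1.0], [0.0, 0.0])]
loss = weighted_log_loss(logits, [1, 0], w_case, w_control)
```

In this sketch, cases are down-weighted and controls up-weighted whenever cases are over-represented in the sample relative to the population, so minimizing the weighted loss targets the population log-odds rather than the shifted case-control log-odds. In practice the network parameters would be trained by gradient descent on this loss with a deep learning library.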
### Theoretical Results
1. **Non-Asymptotic Error Bound**:
- The paper derives a non-asymptotic upper bound on the estimation error of the proposed estimator.
- From this bound, the consistency and convergence rate of the estimator are established, and the rate is shown to match the optimal rate of classical non-parametric regression estimation.
2. **Simulation Study**:
- The effectiveness of the theoretical results is verified through extensive simulation studies. The results show that the estimation method using external summary information can significantly reduce the estimation bias in various situations, especially in the case of unbalanced samples.
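For reference, the optimal rate mentioned under the non-asymptotic error bound can be stated in its classical form. The symbols are assumptions introduced here, since this summary does not give the paper's exact conditions: $f_0$ is the true function, assumed $\beta$-Hölder smooth on a $d$-dimensional domain, $\hat f_n$ is an estimator based on $n$ observations, and the error is measured in squared $L_2$ norm.

$$
\inf_{\hat f_n} \sup_{f_0} \; \mathbb{E}\,\lVert \hat f_n - f_0 \rVert_{L_2}^2 \;\asymp\; n^{-\frac{2\beta}{2\beta + d}}
$$

The exponent shrinks as $d$ grows, which is the curse of dimensionality in rate form. The paper's stated rate may differ from this expression by logarithmic factors or through structural assumptions on $f_0$; those details are not recoverable from this summary.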
### Practical Applications
1. **Real-Data Example**:
- The paper conducts an empirical analysis using the Adult dataset from the UCI Machine Learning Repository. This dataset is drawn from 1994 US Census data, and its main purpose is to predict whether an individual's annual income exceeds $50,000.
- The results show that the estimation method using external summary information is superior to the traditional method without using external information in prediction performance.
### Conclusion
By introducing external summary information, the paper solves the identifiability problem of the non-parametric logistic regression model under the case-control study design and proposes an effective two-step estimation method. The method is supported by rigorous theoretical guarantees and also performs well in practical applications. Future research can explore extensions to other biased-sampling scenarios.