Robust model-based estimation for binary outcomes in genomics studies

Suyoung Park, Alexander E. Lipka, Daniel J. Eck
DOI: https://doi.org/10.48550/arXiv.2110.15189
2021-10-28
Abstract: In quantitative genetics, statistical modeling techniques are used to facilitate advances in the understanding of which genes underlie agronomically important traits and have enabled the use of genome-wide markers to accelerate genetic gain. The logistic regression model is a statistically optimal approach for quantitative genetics analysis of binary traits. To encourage more widespread use of the logistic model in such analyses, efforts need to be made to address separation, which occurs whenever a specific combination of predictors can perfectly predict the value of a binary trait. Data separation is especially prevalent in applications where the number of predictors is near the sample size. In this study we motivate a logistic model that is robust to separation, and we develop a novel prediction procedure for this robust model that is appropriate when separation exists. We show that this robust model offers superior inferences and comparable predictions to existing approaches while remaining true to the logistic model. This is an improvement to previously existing approaches which treats separation as a modeling shortcoming and not an antagonistic data configuration. Previous approaches, therefore, change the modeling paradigm to consider separation that, before our robust model exists, is problematic to logistic models. Our comparisons are conducted on several didactic examples and a genomics study on the kernel color in maize. The ensuing analyses reaffirm the billed superior inferences and comparable predictive performance of our robust model. Therefore, our approach provides scientists with an appropriate statistical modeling framework for analyses involving agronomically important binary traits.
Methodology, Applications
### What problem does this paper attempt to solve?

This paper addresses the separation problem that arises when analyzing binary traits in genomics research. Specifically, the authors are concerned that the traditional logistic regression model fails when certain combinations of predictor variables perfectly predict the binary outcome. This phenomenon is called **complete separation** or **quasi-complete separation**, and it is especially common in applications where the number of predictor variables is close to the sample size.

#### Main problems and challenges

1. **Separation problem**:
   - In the presence of separation, the coefficient estimates of the logistic regression model become unstable or infinite, so reasonable marker effect estimates cannot be obtained.
   - Commonly used statistical software usually does not diagnose this problem correctly, especially in genomic prediction applications where the number of predictor variables may far exceed the sample size.
2. **Limitations of existing methods**:
   - Existing methods (such as bias-correction and Bayesian methods) handle separation by changing the modeling paradigm. They treat separation as a modeling shortcoming and work around it, rather than treating it as a property of the data configuration.
   - Although these methods can improve prediction performance to some extent, they remain deficient for inference.

#### Goals of the paper

The paper proposes a new robust logistic regression model and an accompanying prediction framework to deal with the separation problem. Specific goals include:

- Developing a logistic regression model that remains valid in the presence of separation.
- Proposing a new prediction method that gives reliable predictions in the presence of separation.
- Verifying, on multiple examples and real data sets (such as a genomics study of maize kernel color), that the new method offers superior inference and comparable prediction relative to existing approaches.

#### Method overview

1. **Robust logistic regression model**:
   - The model is based on maximum likelihood estimation (MLE) and, in the presence of separation, provides one-sided confidence intervals instead of the traditional two-sided confidence intervals.
   - Separation is detected and handled using the method of Eck and Geyer [2021].
2. **Prediction framework**:
   - For a new data point, when separation is present, combine the point with the observed data and fit two logistic regression models, one assuming the new point's response is 0 and the other assuming it is 1.
   - Use model averaging to combine the predictions of the two models into the final prediction.
   - Compute the model-averaged estimate and classify it using the optimal cut-off value.

#### Experimental verification

The paper evaluates the new method on multiple examples (such as complete separation, quasi-complete separation, and a quadratic logistic regression model) and on real data sets (such as an endometrial cancer study and maize genomic data). The results show that the new method gives superior inference and comparable predictive performance relative to existing methods, and that it remains robust when separation is present.

### Summary

The main contribution of this paper is a robust logistic regression model and prediction framework that can effectively deal with the separation problem, providing a more reliable statistical tool for binary trait analysis in genomics research.
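The separation problem described above can be made concrete with a small numerical experiment. The sketch below is not the authors' robust model or the Eck and Geyer [2021] procedure. It only illustrates, under simplifying assumptions (a single predictor, no intercept, plain gradient ascent), why the ordinary maximum likelihood estimate has no finite value under complete separation. The function names and the toy data are hypothetical and chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_completely_separated(xs, ys):
    """One-predictor check: does some threshold on x split the 0s from the 1s?

    Assumes both classes occur in ys.
    """
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


def log_likelihood(beta, xs, ys):
    """Logistic log-likelihood for the no-intercept model P(y=1|x) = sigmoid(beta*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta * x
        # log(1 + exp(eta)) computed without overflow
        total += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total


def fit_slope(xs, ys, steps, lr=0.5):
    """Plain gradient ascent on the log-likelihood; returns the slope after `steps`."""
    beta = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(beta * x)) * x for x, y in zip(xs, ys))
        beta += lr * grad
    return beta


# Hypothetical toy data: the first set is completely separated at x = 0,
# the second adds two points that break the separation.
separated = ([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
overlapping = ([-2.0, -1.0, 1.0, 2.0, -0.5, 0.5], [0, 0, 1, 1, 1, 0])

for name, (xs, ys) in [("separated", separated), ("overlapping", overlapping)]:
    print(name, "| completely separated:", is_completely_separated(xs, ys))
    for steps in (100, 10_000):
        beta = fit_slope(xs, ys, steps)
        ll = log_likelihood(beta, xs, ys)
        print(f"  steps={steps:>6}  slope={beta:.3f}  loglik={ll:.6f}")
```

On the separated data the slope keeps growing as more iterations are run while the log-likelihood creeps toward zero, so the optimizer never settles. On the overlapping data the slope stabilizes at a finite value. This is the behavior that makes two-sided Wald-type intervals unusable under separation and motivates the one-sided intervals mentioned in the method overview.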