Abstract: In quantitative genetics, statistical modeling techniques are used to facilitate advances in the understanding of which genes underlie agronomically important traits and have enabled the use of genome-wide markers to accelerate genetic gain. The logistic regression model is a statistically optimal approach for quantitative genetics analysis of binary traits. To encourage more widespread use of the logistic model in such analyses, efforts need to be made to address separation, which occurs whenever a specific combination of predictors can perfectly predict the value of a binary trait. Data separation is especially prevalent in applications where the number of predictors is near the sample size. In this study we motivate a logistic model that is robust to separation, and we develop a novel prediction procedure for this robust model that is appropriate when separation exists. We show that this robust model offers superior inferences and comparable predictions to existing approaches while remaining true to the logistic model. This is an improvement over existing approaches, which treat separation as a modeling shortcoming rather than an antagonistic data configuration. Previous approaches therefore change the modeling paradigm to accommodate separation, which, before our robust model, was problematic for logistic models. Our comparisons are conducted on several didactic examples and a genomics study on kernel color in maize. The ensuing analyses reaffirm the claimed superior inferences and comparable predictive performance of our robust model. Therefore, our approach provides scientists with an appropriate statistical modeling framework for analyses involving agronomically important binary traits.


### What problem does this paper attempt to solve?

This paper aims to solve the separation problem encountered when analyzing binary traits in genomics research. Specifically, the authors are concerned that the traditional logistic regression model fails when certain combinations of predictor variables perfectly predict the binary outcome. This phenomenon is called **complete separation** or **quasi-complete separation**, and it is especially common in applications where the number of predictor variables is close to the sample size.

#### Main problems and challenges

1. **Separation problem**:
   - In the presence of separation, the coefficient estimates of the logistic regression model become unstable or infinite, so reasonable marker effect estimates cannot be obtained.
   - Commonly used statistical software usually cannot correctly diagnose this problem, especially in genomic prediction applications where the number of predictor variables may far exceed the sample size.
2. **Limitations of existing methods**:
   - Existing methods (such as bias-correction and Bayesian methods) deal with separation by changing the modeling paradigm. They do not solve the problem at its root; they work around the data configuration that separation produces.
   - Although these methods can improve prediction performance to a certain extent, they still have deficiencies in inference.

#### Goals of the paper

The paper proposes a new robust logistic regression model and its prediction framework to deal with the separation problem. Specific goals include:

- Developing a logistic regression model that remains valid in the presence of separation.
- Proposing a new prediction method that gives reliable predictions in the presence of separation.
- Verifying the superior inference and comparable prediction performance of the new method on multiple examples and real data sets (such as a genomics study of maize kernel color).

#### Method overview

1. **Robust logistic regression model**:
   - The model is based on maximum likelihood estimation (MLE) and, in the presence of separation, provides one-sided confidence intervals instead of the traditional two-sided confidence intervals.
   - Separation is detected and handled through the method of Eck and Geyer [2021].
2. **Prediction framework**:
   - For a new data point affected by separation, combine it with the observed data and fit two logistic regression models (one assumes the new data point's response is 0, the other assumes it is 1).
   - Use model averaging to combine the predictions of the two models into the final prediction.
   - Calculate the model-averaged estimate and use the optimal cut-off value for classification.

#### Experimental verification

The paper verifies the effectiveness of the new method on multiple examples (such as complete separation, quasi-complete separation, and a quadratic logistic regression model) and real data sets (such as an endometrial cancer study and maize genomic data). The results show that the new method offers superior inference and comparable prediction performance relative to existing methods, and that it remains robust when separation is present.

### Summary

The main contribution of this paper is a robust logistic regression model and prediction framework that can effectively deal with the separation problem, thus providing a more reliable statistical tool for binary trait analysis in genomics research.
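To make the separation problem and the two-fit prediction idea concrete, here is a minimal Python sketch for a single predictor. It is illustrative only and is not the paper's implementation: the function names (`separation_type`, `fit_logistic`, `predict_averaged`), the toy data, the plain gradient-ascent fit with a fixed number of steps, and the equal averaging weights are all assumptions made for this example. The paper itself relies on the method of Eck and Geyer [2021], which is not reproduced here.

```python
import math


def separation_type(x, y):
    """Classify separation for ONE non-constant predictor x and 0/1 labels y."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        return "one class only"
    # Try both orientations: class 0 below class 1, then class 1 below class 0.
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) < min(high):
            return "complete"        # a threshold splits the classes perfectly
        if max(low) == min(high):
            return "quasi-complete"  # perfect split except for ties at the boundary
    return "none"


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, steps, lr=0.5):
    """Plain gradient ascent on the logistic log-likelihood (intercept + slope)."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            resid = yi - sigmoid(b0 + b1 * xi)
            g0 += resid
            g1 += resid * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def predict_averaged(x, y, x_new, steps=500):
    """Fit once with x_new labeled 0 and once labeled 1, then average.

    Equal weights are an ASSUMPTION of this sketch, not the paper's weighting.
    """
    probs = []
    for label in (0, 1):
        b0, b1 = fit_logistic(x + [x_new], y + [label], steps)
        probs.append(sigmoid(b0 + b1 * x_new))
    return 0.5 * sum(probs)


# Toy data (hypothetical): every negative x has y = 0, every positive x has y = 1.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

kind = separation_type(x, y)
# Under complete separation the likelihood keeps improving as the slope grows,
# so running the optimizer longer yields an ever larger slope estimate.
slope_short = fit_logistic(x, y, steps=200)[1]
slope_long = fit_logistic(x, y, steps=2000)[1]
p_avg = predict_averaged(x, y, x_new=0.5)
```

On the toy data the slope estimate keeps growing with the number of optimizer steps, which is the "unstable or infinite" behavior described above. The averaged prediction stays strictly between 0 and 1 even though each augmented fit is itself separated.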

Robust model-based estimation for binary outcomes in genomics studies
