Zero-inflated generalized extreme value regression model for binary data and application in health study

Aba Diop, El Hadji Deme, Aliou Diop
DOI: https://doi.org/10.48550/arXiv.2105.00482
2021-05-02
Abstract: The logistic regression model is widely used in many studies to investigate the relationship between a binary response variable $Y$ and a set of potential predictors $\mathbf X$. The binary response may represent, for example, the occurrence of some outcome of interest ($Y=1$ if the outcome occurred and $Y=0$ otherwise). When the dependent variable $Y$ represents a rare event, the logistic regression model shows relevant drawbacks. To overcome these drawbacks, we propose the Generalized Extreme Value (GEV) regression model. In particular, we suggest the quantile function of the GEV distribution as the link function, so that our attention is focused on the tail of the response curve for values close to one. A sample of observations is said to contain a cure fraction when a proportion of the study subjects (the so-called cured individuals, as opposed to the susceptibles) cannot experience the outcome of interest. One problem arising then is that it is usually unknown which subjects are cured and which are susceptible, unless the outcome of interest has been observed. In these settings, a logistic regression analysis of the relationship between $\mathbf X$ and $Y$ among the susceptibles is no longer straightforward. We develop a maximum likelihood estimation procedure for this problem, based on the joint modeling of the binary response of interest and the cure status. We investigate the identifiability of the resulting model. Then, we conduct a simulation study to investigate its finite-sample behavior, and apply the model to real data.
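The two ingredients named in the abstract, a GEV-based response curve and a joint likelihood with a cure fraction, can be illustrated with a small numerical sketch. It assumes the response function \(\pi(x) = 1 - \text{GEV}(-x'\beta; \tau)\) with a standard GEV distribution (location 0, scale 1) and a logit model for the probability of being susceptible. The function names (`gev_response`, `logistic`, `zi_gev_negloglik`) and the probability clipping are illustrative choices, not the authors' implementation.

```python
import numpy as np


def gev_response(eta, tau):
    """P(Y = 1 | susceptible) = 1 - GEV(-eta; tau), with eta = x'beta.

    Assumes a standard GEV distribution (location 0, scale 1).
    """
    eta = np.asarray(eta, dtype=float)
    if tau == 0.0:
        # Gumbel limit: GEV(x; 0) = exp(-exp(-x)), so GEV(-eta; 0) = exp(-exp(eta)).
        return 1.0 - np.exp(-np.exp(eta))
    # Outside the GEV support (1 - tau * eta <= 0) the base is clipped to 0,
    # which yields probability 1 for tau > 0 and probability 0 for tau < 0.
    base = np.maximum(1.0 - tau * eta, 0.0)
    with np.errstate(divide="ignore"):
        return 1.0 - np.exp(-np.power(base, -1.0 / tau))


def logistic(eta):
    """Standard logistic function, used here for the susceptibility probability."""
    return 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))


def zi_gev_negloglik(beta, theta, tau, X, Z, y):
    """Negative log-likelihood of the zero-inflated GEV model (sketch).

    P(Y = 1 | x, z) = gev_response(x'beta, tau) * logistic(z'theta).
    """
    pi = gev_response(X @ beta, tau)
    p_susceptible = logistic(Z @ theta)
    # Clipping is a numerical safeguard added for this sketch only.
    prob_one = np.clip(pi * p_susceptible, 1e-12, 1.0 - 1e-12)
    return -np.sum(y * np.log(prob_one) + (1.0 - y) * np.log1p(-prob_one))


if __name__ == "__main__":
    eta = np.array([-1.0, 1.0])
    # The logit curve is symmetric: pi(-eta) + pi(eta) = 1.
    print("logit:", logistic(eta), "sum =", logistic(eta).sum())
    # The GEV curve is not: the two probabilities do not sum to 1.
    gev = gev_response(eta, tau=-0.25)
    print("GEV  :", gev, "sum =", gev.sum())
```

In practice the negative log-likelihood would be handed to a numerical optimizer over \((\beta, \theta, \tau)\); the choice of optimizer and of starting values is left open here.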
Methodology
What problem does this paper attempt to address?
The main problem this paper addresses is the poor performance of the traditional logistic regression model on rare events, that is, when very few observations of the binary response variable equal 1. Specifically:

- **Underestimation of the probability of rare events**: When the dependent variable represents a rare event, the traditional logistic regression model underestimates the probability that the event occurs.
- **Problems with the symmetry assumption**: The logistic regression model assumes that the response curve relating the covariates to the probability is symmetric. In actual data this symmetry may not hold, especially when the two response categories contain very different numbers of observations.
- **Zero-inflation problem**: In some cases, part of the sample consists of individuals who will never experience the outcome of interest (the cured individuals), while the remaining individuals (the susceptibles) may experience it. Since it is usually unknown which individuals are cured and which are susceptible, analyzing the relationship between the covariates \(X\) and the response variable \(Y\) with logistic regression becomes more complicated.

To solve these problems, the authors propose a new model: the generalized extreme value (GEV) regression model with a cure fraction. This model uses the quantile function of the GEV distribution as the link function, focusing on the tail of the response curve for values close to 1, in order to better fit rare-event data. The model also uses a logit link function for the cure fraction, which yields a zero-inflated generalized extreme value regression model.

### Mathematical formula representation

1. **Cumulative distribution function of the GEV distribution**:
   \[ G(x|\mu, \sigma, \tau) = \begin{cases} \exp\left[-\left\{1 + \tau \frac{x - \mu}{\sigma}\right\}^{-\frac{1}{\tau}}\right] & \text{if } \tau \neq 0 \\ \exp\left[-\exp\left(-\frac{x - \mu}{\sigma}\right)\right] & \text{if } \tau = 0 \end{cases} \]
   where \(\mu \in \mathbb{R}\) is the location parameter, \(\sigma \in \mathbb{R}^+\) is the scale parameter, and \(\tau \in \mathbb{R}\) is the shape parameter.
2. **Conditional infection probability**:
   \[ \pi(x_i) = P(Y_i = 1 | X_i = x_i, S_i = 1) = 1 - \text{GEV}(-x_i'\beta; \tau) \]
   Here \(S_i = 1\) indicates that individual \(i\) is susceptible and \(S_i = 0\) that the individual is cured, and \(\text{GEV}(\cdot\,; \tau)\) denotes the GEV cumulative distribution function with \(\mu = 0\) and \(\sigma = 1\). If \(S_i = 0\), then \(P(Y_i = 1 | X_i = x_i, S_i = 0) = 0\).
3. **Logit model for the susceptibility status**:
   \[ \alpha(z_i) = \log\left(\frac{P(S_i = 1 | Z_i = z_i)}{1 - P(S_i = 1 | Z_i = z_i)}\right) = z_i' \theta \]
4. **Likelihood function of the joint model**:
   \[ L_n(\psi) = \prod_{i = 1}^n \left[ \left(1 - \exp\left[-\left(1 - \tau \beta' X_i\right)^{-\frac{1}{\tau}}\right]\right) \frac{e^{\theta' Z_i}}{1 + e^{\theta' Z_i}} \right]^{Y_i} \left[ 1 - \left(1 - \exp\left[-\left(1 - \tau \beta' X_i\right)^{-\frac{1}{\tau}}\right]\right) \frac{e^{\theta' Z_i}}{1 + e^{\theta' Z_i}} \right]