Abstract:ABSTRACT When building regression models for multivariate abundance data in ecology, it is important to allow for the fact that the species are correlated with each other. Moreover, there is often evidence species exhibit some degree of homogeneity in their responses to each environmental predictor, and that most species are informed by only a subset of predictors. We propose a generalized estimating equation (GEE) approach for simultaneous homogeneity pursuit (ie, grouping species with similar coefficient values while allowing differing groups for different covariates) and variable selection in regression models for multivariate abundance data. Using GEEs allows us to straightforwardly account for between-response correlations through a (reduced-rank) working correlation matrix. We augment the GEE with both adaptive fused lasso- and adaptive lasso-type penalties, which aim to cluster the species-specific coefficients within each covariate and encourage differing levels of sparsity across the covariates, respectively. Numerical studies demonstrate the strong finite sample performance of the proposed method relative to several existing approaches for modeling multivariate abundance data. Applying the proposed method to presence–absence records collected along the Great Barrier Reef in Australia reveals both a substantial degree of homogeneity and sparsity in species-environmental relationships. We show this leads to a more parsimonious model for understanding the environmental drivers of seabed biodiversity, and results in stronger out-of-sample predictive performance relative to methods that do not accommodate such features.
What problem does this paper attempt to address?
This paper attempts to solve two key problems faced when constructing regression models for multivariate abundance data in ecology:
1. **Correlation between species**: In ecological research, the responses among different species are usually correlated. For example, the symbiotic or competitive relationships between certain organisms will cause their responses to environmental factors to show a certain degree of association. Therefore, this correlation must be taken into account during the modeling process.
2. **Homogeneity and variable selection**: Many species show a certain degree of homogeneity (i.e., similarity) in their responses to the same environmental predictor, and most species are only affected by some of the predictors. Specifically:
- **Homogeneity Pursuit**: Different species may have similar response coefficients for the same environmental variable. Grouping species with similar responses can simplify the model and improve the model's explanatory power.
- **Variable Selection**: Not all environmental variables affect all species. Through variable selection, it is possible to identify which environmental variables are important for specific species, thereby constructing a more concise and effective model.
To solve the above problems, the author proposes a method based on Generalized Estimating Equations (GEE), called HPGEE (Homogeneity Pursuit and Variable Selection in GEE). This method not only takes into account the correlation between species but also combines adaptive fused Lasso and adaptive Lasso - type penalty terms to achieve homogeneity pursuit and variable selection simultaneously. The formula is expressed as follows:
- The basic form of the Generalized Estimating Equation (GEE) is:
\[
S(B)=\sum_{i = 1}^{N}S_i(B)=\sum_{i = 1}^{N}D_i^{\top}V_i^{-1}(y_i-\mu_i)=0
\]
where \(D_i = W_iX_i\), \(X_i=I_J\otimes x_i^{\top}\), \(W_i\) is a diagonal matrix containing elements \(\left\{\frac{1}{g'(\eta_{i1})},\frac{1}{g'(\eta_{i2})},\ldots,\frac{1}{g'(\eta_{iJ})}\right\}\), \(\eta_i = (\eta_{i1},\eta_{i2},\ldots,\eta_{iJ})^{\top}\), \(\mu_i = (\mu_{i1},\mu_{i2},\ldots,\mu_{iJ})^{\top}\).
- On this basis, adaptive Lasso and fused Lasso - type penalty terms are introduced to achieve homogeneity pursuit and variable selection:
\[
S_{HPEG}(\Upsilon)=S(\Upsilon)-NJ\lambda\sum_{k = 2}^{P}\kappa_k\sum_{j = 1}^{J}w_{jk}\text{sgn}(\upsilon_{jk})=0
\]
where \(\kappa_k\geq0\) are constants, and the weights \(w_{jk}\) are defined as follows:
\[
w_{1k}=\frac{1}{|\tilde{\upsilon}_{1k}|}=\frac{1}{|\tilde{\beta}_{(|1|),k}|}
\]
\[
w_{jk}=\frac{1}{|\tilde{\upsilon}_{jk}|}\times\frac{1}{|\sum_{j' = 2}^{J}\tilde{\upsilon}_{j'k}|}=\frac{1}{|\tilde{\beta}_{(j),k}-\tilde{\beta}_{(j - 1),k}|}\times\frac{1}{|\tilde{\beta}_{(J),k}-\tilde{\beta}_{(J - 1),k}|}
\]