Abstract:ABSTRACT When building regression models for multivariate abundance data in ecology, it is important to allow for the fact that the species are correlated with each other. Moreover, there is often evidence species exhibit some degree of homogeneity in their responses to each environmental predictor, and that most species are informed by only a subset of predictors. We propose a generalized estimating equation (GEE) approach for simultaneous homogeneity pursuit (ie, grouping species with similar coefficient values while allowing differing groups for different covariates) and variable selection in regression models for multivariate abundance data. Using GEEs allows us to straightforwardly account for between-response correlations through a (reduced-rank) working correlation matrix. We augment the GEE with both adaptive fused lasso- and adaptive lasso-type penalties, which aim to cluster the species-specific coefficients within each covariate and encourage differing levels of sparsity across the covariates, respectively. Numerical studies demonstrate the strong finite sample performance of the proposed method relative to several existing approaches for modeling multivariate abundance data. Applying the proposed method to presence–absence records collected along the Great Barrier Reef in Australia reveals both a substantial degree of homogeneity and sparsity in species-environmental relationships. We show this leads to a more parsimonious model for understanding the environmental drivers of seabed biodiversity, and results in stronger out-of-sample predictive performance relative to methods that do not accommodate such features.

What problem does this paper attempt to address?

This paper attempts to solve two key problems faced when constructing regression models for multivariate abundance data in ecology: 1. **Correlation between species**: In ecological research, the responses among different species are usually correlated. For example, the symbiotic or competitive relationships between certain organisms will cause their responses to environmental factors to show a certain degree of association. Therefore, this correlation must be taken into account during the modeling process. 2. **Homogeneity and variable selection**: Many species show a certain degree of homogeneity (i.e., similarity) in their responses to the same environmental predictor, and most species are only affected by some of the predictors. Specifically: - **Homogeneity Pursuit**: Different species may have similar response coefficients for the same environmental variable. Grouping species with similar responses can simplify the model and improve the model's explanatory power. - **Variable Selection**: Not all environmental variables affect all species. Through variable selection, it is possible to identify which environmental variables are important for specific species, thereby constructing a more concise and effective model. To solve the above problems, the author proposes a method based on Generalized Estimating Equations (GEE), called HPGEE (Homogeneity Pursuit and Variable Selection in GEE). This method not only takes into account the correlation between species but also combines adaptive fused Lasso and adaptive Lasso - type penalty terms to achieve homogeneity pursuit and variable selection simultaneously. The formula is expressed as follows: - The basic form of the Generalized Estimating Equation (GEE) is: \[ S(B)=\sum_{i = 1}^{N}S_i(B)=\sum_{i = 1}^{N}D_i^{\top}V_i^{-1}(y_i-\mu_i)=0 \] where \(D_i = W_iX_i\), \(X_i=I_J\otimes x_i^{\top}\), \(W_i\) is a diagonal matrix containing elements \(\left\{\frac{1}{g'(\eta_{i1})},\frac{1}{g'(\eta_{i2})},\ldots,\frac{1}{g'(\eta_{iJ})}\right\}\), \(\eta_i = (\eta_{i1},\eta_{i2},\ldots,\eta_{iJ})^{\top}\), \(\mu_i = (\mu_{i1},\mu_{i2},\ldots,\mu_{iJ})^{\top}\). - On this basis, adaptive Lasso and fused Lasso - type penalty terms are introduced to achieve homogeneity pursuit and variable selection: \[ S_{HPEG}(\Upsilon)=S(\Upsilon)-NJ\lambda\sum_{k = 2}^{P}\kappa_k\sum_{j = 1}^{J}w_{jk}\text{sgn}(\upsilon_{jk})=0 \] where \(\kappa_k\geq0\) are constants, and the weights \(w_{jk}\) are defined as follows: \[ w_{1k}=\frac{1}{|\tilde{\upsilon}_{1k}|}=\frac{1}{|\tilde{\beta}_{(|1|),k}|} \] \[ w_{jk}=\frac{1}{|\tilde{\upsilon}_{jk}|}\times\frac{1}{|\sum_{j' = 2}^{J}\tilde{\upsilon}_{j'k}|}=\frac{1}{|\tilde{\beta}_{(j),k}-\tilde{\beta}_{(j - 1),k}|}\times\frac{1}{|\tilde{\beta}_{(J),k}-\tilde{\beta}_{(J - 1),k}|} \]

Homogeneity pursuit and variable selection in regression models for multivariate abundance data

Semiparametric Expectile Regression for High-dimensional Heavy-tailed and Heterogeneous Data

How Data Heterogeneity Affects Innovating Knowledge and Information in Gene Identification: A Statistical Learning Perspective

Multi-species distribution modeling using penalized mixture of regressions

Modelling heterogeneity in the classification process in multi‐species distribution models can improve predictive performance

Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis

Multivariate Bayesian variable selection with application to multi-trait genetic fine mapping

Joint Mean and Correlation Regression Models for Multivariate Data

GEE-assisted Forward Regression for Spatial Latent Variable Models

On GEE for Mean-Variance-Correlation Models: Variance Estimation and Model Selection

Sparse Variable Selection on High Dimensional Heterogeneous Data With Tree Structured Responses

Heterogeneous quantile regression for longitudinal data with subgroup structures

Hierarchical selection of genetic and gene by environment interaction effects in high-dimensional mixed models

Variable selection for varying multi-index coefficients models with applications to synergistic GxE interactions

High-dimensional variable selection accounting for heterogeneity in regression coefficients across multiple data sources

Interpretable and predictive models based on high-dimensional data in ecology and evolution

Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets

Individualized Multi-directional Variable Selection

A Generalized Heterogeneity Model for Spatial Interpolation

Regression‐based heterogeneity analysis to identify overlapping subgroup structure in high‐dimensional data

Identifying Gene-environment interactions with robust marginal Bayesian variable selection