Abstract:<p>To understand how organisms adapt to their environment, a gene‐environmental association (GEA) analysis is commonly conducted. GEA methods based on mixed models, such as linear latent factor mixed models (LFMM) and LFMM2, have grown in popularity for their robust performance in terms of power and computational speed. However, it is unclear how the assumption of a Gaussian distribution for the response variables influences model performance. In this paper, we develop a generalized linear model (GLM) that allows for non‐Gaussian distribution in the genotypic response variables, and treatment of multiallelic nucleotide polymorphisms. Moreover, this multinomial logistic regression model (MLR) is combined with an admixture‐based model or principal components analysis to correct for population structure (MLR‐ADM and MLR‐PC). Using simulations, we evaluate the type I error, false discovery rates (FDR), and power to detect selected SNPs, to guide model choice and best practices. With genomic control, MLR‐PC and LFMM2 have similar type I error, FDRs, and power when analyzing biallelic SNPs, while dramatically outperforming models not accounting for population structure. Differences in performance occur under continuous population structure where MLR‐PC outperforms LFMM/LFMM2, especially when a larger number of clusters or triallelic SNPs are analyzed. The Human Genome Diversity Project (HGDP) dataset shows that both MLR‐PC and LFMM2 control the inflation of P‐values. Analysis of the 1000 Genome Project Phase III dataset illustrates that MLR‐PC and LFMM2 produce consistent results for most significant SNPs, while MLR‐PC discovered additional SNPs corresponding to certain genes, suggesting MLR‐PC may be a useful alternative to GEA inference.</p>

What problem does this paper attempt to address?

This paper evaluates and compares how well different models detect adaptive single-nucleotide polymorphisms (SNPs), focusing on the performance differences between linear models (LMs) and generalized linear models (GLMs) applied to genotype data. Specifically, the research addresses four aspects:

1. **Impact of model assumptions**: Traditional linear models (such as LFMM and LFMM2) assume that genotype data follow a Gaussian distribution. This assumption may not match discrete genotype data and can lead to statistical inference problems such as systematic bias and loss of power. The study therefore develops a GLM that allows non-Gaussian genotype response variables and handles multiallelic nucleotide polymorphisms.
2. **Population structure correction**: To correct for the effect of population structure on gene-environment association (GEA) analysis, the study combines the multinomial logistic regression (MLR) model with either principal components analysis (PCA) or an admixture-based model, yielding two methods, MLR-PC and MLR-ADM.
3. **Performance evaluation**: Using simulations and real data, the study evaluates how well these models control the type I error rate and the false discovery rate (FDR), and how much power they have to detect selected SNPs. In particular, it examines how sample size, the effect size of the environmental covariate, and the complexity of the population structure influence model performance.
4. **Triallelic data processing**: As the sample size and sequencing depth of population genomic datasets increase, triallelic SNPs are becoming more common. The study therefore also examines the performance of the MLR methods on triallelic data.
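The three evaluation criteria named above can be made concrete with a small sketch. This is an illustration only, not the authors' evaluation code: the function name `evaluate_calls`, the fixed threshold `alpha`, and the toy P-values are all assumptions introduced here, and the paper may threshold P-values differently (for example after a multiple-testing correction).

```python
def evaluate_calls(p_values, is_selected, alpha=0.05):
    """Empirical type I error, FDR, and power for one simulated dataset.

    p_values    : one P-value per SNP
    is_selected : True where the SNP was simulated under selection
    alpha       : hypothetical significance threshold (an assumption)
    """
    if len(p_values) != len(is_selected):
        raise ValueError("p_values and is_selected must have the same length")

    rejected = [p <= alpha for p in p_values]
    # True positives: rejected SNPs that really are under selection.
    true_pos = sum(r and s for r, s in zip(rejected, is_selected))
    # False positives: rejected SNPs that are neutral.
    false_pos = sum(r and not s for r, s in zip(rejected, is_selected))
    n_selected = sum(is_selected)
    n_neutral = len(is_selected) - n_selected
    n_rejected = true_pos + false_pos

    return {
        # Fraction of neutral SNPs that were wrongly rejected.
        "type_I_error": false_pos / n_neutral if n_neutral else float("nan"),
        # Fraction of rejections that are false (0 when nothing is rejected).
        "fdr": false_pos / n_rejected if n_rejected else 0.0,
        # Fraction of selected SNPs that were detected.
        "power": true_pos / n_selected if n_selected else float("nan"),
    }


# Toy example: six SNPs, the first two simulated under selection.
result = evaluate_calls(
    p_values=[0.001, 0.20, 0.03, 0.70, 0.04, 0.90],
    is_selected=[True, True, False, False, False, False],
)
print(result)
```

In a simulation study these quantities would typically be averaged over many replicate datasets before models are compared.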
In summary, this paper aims to provide more effective tools and guidance for gene-environment association analysis by introducing generalized linear models and their extensions (MLR-PC and MLR-ADM) and evaluating them rigorously, especially in the presence of complex population structure.

### Key formulas

1. **Binomial model for genotype generation**:
\[ G_{i\ell} \sim \text{Binom}(2, \pi_{i\ell}), \quad \pi_{i\ell} = \Phi(\beta_\ell X_i + V_\ell^T U_i) \]
where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution.

2. **Multivariate normal model for the environmental covariate and population structure terms**:
\[ (U_i, X_i)^T \sim N(0, S) \]
where $S$ is a $(K + 1)\times(K + 1)$ covariance matrix.

3. **Likelihood ratio test statistic**:
\[ \lambda=\frac{\sup_{\theta\in\Theta_0}\mathcal{L}(\theta)}{\sup_{\theta\in\Theta_1}\mathcal{L}(\theta)}, \quad - 2\ln\lambda\sim\chi^2_{J - 1} \]
where $\mathcal{L}(\theta)$ is the likelihood and the $\chi^2_{J - 1}$ distribution holds approximately under the null hypothesis.

With these models and formulas, researchers can detect adaptive SNPs more accurately while controlling for the effect of population structure.
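The formulas above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the paper's simulation code: the helper names (`cholesky`, `simulate_genotypes`, `chi2_sf`, `lrt_pvalue`), the example covariance matrix, the effect sizes, the log-likelihood values, and the choice of 2 degrees of freedom are all hypothetical. The sketch assumes the last coordinate of the multivariate normal draw is $X_i$, matching the ordering $(U_i, X_i)$.

```python
import math
import random


def cholesky(S):
    """Lower-triangular L with L L^T = S, for a symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def std_normal_cdf(z):
    """Phi(z), the standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_genotypes(S, beta, V, n_individuals, rng):
    """Draw (U_i, X_i) ~ N(0, S), then G_il ~ Binom(2, Phi(beta_l X_i + V_l^T U_i)).

    S    : (K+1) x (K+1) covariance matrix; the last coordinate is X_i
    beta : per-locus environmental effect sizes
    V    : per-locus loading vectors, each of length K
    """
    K = len(S) - 1
    L = cholesky(S)
    X, G = [], []
    for _ in range(n_individuals):
        z = [rng.gauss(0.0, 1.0) for _ in range(K + 1)]
        # Correlate the independent normals: ux = L z has covariance S.
        ux = [sum(L[r][c] * z[c] for c in range(r + 1)) for r in range(K + 1)]
        U_i, X_i = ux[:K], ux[K]
        row = []
        for b, v in zip(beta, V):
            pi = std_normal_cdf(b * X_i + sum(vk * uk for vk, uk in zip(v, U_i)))
            # Binom(2, pi) as the sum of two Bernoulli(pi) draws.
            row.append(sum(rng.random() < pi for _ in range(2)))
        X.append(X_i)
        G.append(row)
    return X, G


def chi2_sf(x, df):
    """Upper tail P(chi2_df > x) for integer df >= 1, via closed-form series."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0  # k = 0 term of sum_k h^k / k!
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return math.exp(-h) * total
    total = math.erfc(math.sqrt(h))
    term = math.sqrt(h) / math.gamma(1.5)  # k = 1 term: h^(1/2) / Gamma(3/2)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-h) * term
        term *= h / (k + 0.5)
    return total


def lrt_pvalue(loglik_null, loglik_alt, df):
    """-2 ln(lambda) = 2 * (loglik_alt - loglik_null), compared to chi2 with df."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(stat, df)


rng = random.Random(0)
S = [[1.0, 0.3], [0.3, 1.0]]  # hypothetical covariance with K = 1
X, G = simulate_genotypes(S, beta=[0.0, 0.5], V=[[0.2], [0.2]],
                          n_individuals=200, rng=rng)
# Hypothetical maximized log-likelihoods; df plays the role of J - 1.
stat, p = lrt_pvalue(loglik_null=-210.0, loglik_alt=-205.0, df=2)
print(G[0], stat, p)
```

Here the first locus has `beta = 0.0` (neutral) and the second has a nonzero effect, so fitting a model to each column of `G` and passing the maximized log-likelihoods to `lrt_pvalue` would give one test per SNP. The model-fitting step itself is omitted from this sketch.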

A comprehensive analysis comparing linear and generalized linear models in detecting adaptive SNPs
