Abstract: Fitting generalised linear models (GLMs) with more than one predictor has become the standard method of analysis in evolutionary and behavioural research. Often, GLMs are used for exploratory data analysis, where one starts with a complex full model including interaction terms and then simplifies by removing non-significant terms. While this approach can be useful, it is problematic if significant effects are interpreted as if they arose from a single a priori hypothesis test. This is because model selection involves cryptic multiple hypothesis testing, a fact that has only rarely been acknowledged or quantified. We show that the probability of finding at least one ‘significant’ effect is high, even if all null hypotheses are true (e.g. 40% when starting with four predictors and their two-way interactions). This probability is close to theoretical expectations when the sample size (N) is large relative to the number of predictors including interactions (k). In contrast, type I error rates strongly exceed even those expectations when model simplification is applied to models that are over-fitted before simplification (low N/k ratio). The increase in false-positive results arises primarily from an overestimation of effect sizes among significant predictors, leading to upward-biased effect sizes that often cannot be reproduced in follow-up studies (‘the winner's curse’). Despite having their own problems, full model tests and P value adjustments can be used as a guide to how frequently type I errors arise by sampling variation alone. We favour the presentation of full models, since they best reflect the range of predictors investigated and ensure that non-significant results are also represented in a balanced way.
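The 40% figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative rather than the authors' own code: it assumes the k tests are independent and each is run at a nominal alpha of 0.05, and the function names (`n_terms`, `familywise_error`, `sidak_alpha`) are hypothetical. The Šidák-style per-test threshold is shown only as one example of a P value adjustment; the abstract does not say which adjustment the authors used.

```python
from math import comb


def n_terms(n_predictors: int) -> int:
    """Count main effects plus all two-way interactions (the k of the abstract)."""
    return n_predictors + comb(n_predictors, 2)


def familywise_error(k: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha) over k tests when every null is true.

    Assumes the k tests are independent (an idealisation).
    """
    return 1.0 - (1.0 - alpha) ** k


def sidak_alpha(k: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at alpha
    under the same independence assumption."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


k = n_terms(4)  # 4 main effects + 6 two-way interactions = 10 terms
print(k, round(familywise_error(k), 3))  # 10 0.401
print(round(sidak_alpha(k), 4))
```

With four predictors and their two-way interactions there are 10 terms, so the expected chance of at least one nominally significant result is about 40% even when no effect exists. This is the "theoretical expectation" that, according to the abstract, is exceeded when N/k is low.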

How much should we trust R² and adjusted R²: evidence from regressions in top economics journals and Monte Carlo simulations

R² Around the World: New Theory and New Tests

With random regressors, least squares inference is robust to correlated errors with unknown correlation structure

Tyranny-of-the-Minority Regression Adjustment in Randomized Experiments

Machine-Learning Tests for Effects on Multiple Outcomes

Evaluating two small-sample corrections for fixed-effects standard errors and inferences in multilevel models with heteroscedastic, unbalanced, clustered data

A Monte Carlo experiment: The distributions of OLS parameter estimators of time series regression

Goodness-of-fit Testing in Linear Regression Models

Robustness of fit indices to outliers and leverage observations in structural equation modeling

The R² Ridge Trace in 2SLS Regression Estimation

Predictive ability tests with possibly overlapping models

Improving Estimation Efficiency via Regression-Adjustment in Covariate-Adaptive Randomizations with Imperfect Compliance

Correction of overfitting bias in regression models

R² should not be used to describe behavioral‐economic discounting and demand models

R-Squared, Noise, and Stock Returns

Heteroskedasticity-robust inference in linear regression models with many covariates

Randomization-based joint central limit theorem and efficient covariate adjustment in stratified $2^K$ factorial experiments

Statistical Inference for a Robust Measure of Multiple Correlation

Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse

Covariate adjustment in randomized experiments with missing outcomes and covariates

How Much Should We Trust Staggered Difference-In-Differences Estimates?