Abstract: In ecology, as in other research fields, efficient sampling for population estimation often drives sample designs toward unequal probability sampling, such as stratified sampling. Design-based statistical analysis tools are appropriate for seamless integration of sample design into the statistical analysis. However, after a sampling design has been implemented, it is also common and necessary to use the resulting datasets to address questions that, in many cases, were not considered during the sampling design phase. Such questions may require model-based statistical tools such as multiple regression, quantile regression, or regression tree analysis. These model-based tools may require data from simple random samples to ensure unbiased estimation, which can be problematic when analyzing data from unequal probability designs. Despite numerous method-specific tools available to properly account for sampling design, too often in the analysis of ecological data the sample design is ignored and the consequences are not properly considered. We demonstrate here that violating the assumption of equal probability sampling can lead to biased parameter estimates in ecological research. In addition to the set of tools already available for researchers to properly account for sampling design in model-based analysis, we introduce inverse probability bootstrapping (IPB). Inverse probability bootstrapping is an easily implemented method for obtaining equal probability re-samples from a probability sample, from which unbiased model-based estimates can be made. We demonstrate the potential for bias in model-based analyses that ignore sample inclusion probabilities, and the effectiveness of IPB sampling in eliminating this bias, using both simulated and actual ecological data. For illustration, we considered three model-based analysis tools: linear regression, quantile regression, and boosted regression tree analysis. In all models, using both simulated and actual ecological data, we found inferences to be biased, sometimes severely, when sample inclusion probabilities were ignored, while IPB sampling effectively produced unbiased parameter estimates.
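The resampling idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that an IPB re-sample is drawn with replacement from the original sample with selection probability proportional to the inverse of each unit's inclusion probability, and that model estimates are averaged over re-samples. The function names (`ipb_resample`, `ipb_estimate`, `ols_slope`), the number of re-samples, and the simulated population (a linear relationship, with high-response units oversampled) are all hypothetical choices made for illustration.

```python
import numpy as np


def ipb_resample(inclusion_probs, rng, size=None):
    """Draw indices with replacement, with probability proportional to 1/pi_i."""
    pi = np.asarray(inclusion_probs, dtype=float)
    if np.any(pi <= 0) or np.any(pi > 1):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weights = 1.0 / pi
    p = weights / weights.sum()          # normalise to a probability vector
    n = len(pi) if size is None else size
    return rng.choice(len(pi), size=n, replace=True, p=p)


def ipb_estimate(x, y, inclusion_probs, fit, n_boot=500, seed=0):
    """Average a model-based estimate `fit(x, y)` over IPB re-samples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = ipb_resample(inclusion_probs, rng)
        estimates.append(fit(x[idx], y[idx]))
    return np.mean(estimates, axis=0)


def ols_slope(x, y):
    # np.polyfit returns coefficients highest degree first: [slope, intercept]
    return np.polyfit(x, y, 1)[0]


# Hypothetical simulated population: y = 1 + 2x + noise.
rng = np.random.default_rng(42)
N = 20000
x_pop = rng.uniform(0, 10, N)
y_pop = 1.0 + 2.0 * x_pop + rng.normal(0, 4, N)

# Unequal probability design: two strata defined by the response,
# with the high-response stratum heavily oversampled.
high = y_pop > np.median(y_pop)
n_high, n_low = 450, 50
idx_high = rng.choice(np.flatnonzero(high), n_high, replace=False)
idx_low = rng.choice(np.flatnonzero(~high), n_low, replace=False)
sample = np.concatenate([idx_high, idx_low])

# Inclusion probability of each sampled unit = stratum sample size / stratum size.
pi = np.where(high[sample], n_high / high.sum(), n_low / (~high).sum())

x, y = x_pop[sample], y_pop[sample]
print("true slope:          2.0")
print("naive OLS slope:    ", ols_slope(x, y))
print("IPB-averaged slope: ", ipb_estimate(x, y, pi, ols_slope))
```

Because `ipb_estimate` takes the fitting routine as an argument, the same loop could in principle wrap a quantile regression or a boosted regression tree fit; only `fit` would change. The intuition for the weighting is that a unit's chance of entering the original sample (roughly `pi`) times its chance of being drawn in a re-sample (proportional to `1/pi`) is approximately constant across units.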

Learning de-biased regression trees and forests from complex samples

Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Using Inverse Probability Bootstrap Sampling to Eliminate Sample Induced Bias in Model Based Analysis of Unequal Probability Samples

Building Consistent Regression Trees From Complex Sample Data

Infinite random forests for imbalanced classification tasks

Statistical Learning from Biased Training Samples

Analysis of purely random forests bias

Bias-corrected Random Forests in Regression

Inference with Mondrian Random Forests

Learning from a Biased Sample

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

Asymptotic Properties of High-Dimensional Random Forests

Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests

Systematic Bias in Sample Inference and its Effect on Machine Learning

Learning Optimal and Fair Decision Trees for Non-Discriminative Decision-Making

Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Debiased Causal Tree: Heterogeneous Treatment Effects Estimation with Unmeasured Confounding

Statistical Advantages of Oblique Randomized Decision Trees and Forests

Undersmoothing Causal Estimators With Generative Trees

Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and Its Variance Estimate

RieszNet and ForestRiesz: Automatic Debiased Machine Learning with Neural Nets and Random Forests