Unweighted regression models perform better than weighted regression techniques for respondent-driven sampling data: results from a simulation study

Lisa Avery,Nooshin Rotondi,Constance McKnight,Michelle Firestone,Janet Smylie,Michael Rotondi
DOI: https://doi.org/10.1186/s12874-019-0842-5
2019-10-29
BMC Medical Research Methodology
Abstract:
Background: It is unclear whether weighted or unweighted regression is preferred in the analysis of data derived from respondent driven sampling. Our objective was to evaluate the validity of various regression models, with and without weights and with various controls for clustering in the estimation of the risk of group membership from data collected using respondent-driven sampling (RDS).
Methods: Twelve networked populations, with varying levels of homophily and prevalence, based on a known distribution of a continuous predictor were simulated using 1000 RDS samples from each population. Weighted and unweighted binomial and Poisson general linear models, with and without various clustering controls and standard error adjustments were modelled for each sample and evaluated with respect to validity, bias and coverage rate. Population prevalence was also estimated.
Results: In the regression analysis, the unweighted log-link (Poisson) models maintained the nominal type-I error rate across all populations. Bias was substantial and type-I error rates unacceptably high for weighted binomial regression. Coverage rates for the estimation of prevalence were highest using RDS-weighted logistic regression, except at low prevalence (10%) where unweighted models are recommended.
Conclusions: Caution is warranted when undertaking regression analysis of RDS data. Even when reported degree is accurate, low reported degree can unduly influence regression estimates. Unweighted Poisson regression is therefore recommended.
health care sciences & services
What problem does this paper attempt to address?
This paper asks whether weighted or unweighted regression models perform better when analyzing data from respondent-driven sampling (RDS). Specifically, it evaluates how well different regression models (weighted and unweighted binomial and Poisson generalized linear models) estimate the risk of group membership, taking into account clustering controls and standard error adjustments.

### Research Background
RDS is a modified form of snowball sampling used to measure disease prevalence in hard-to-reach "hidden" populations such as men who have sex with men, sex workers, and drug users. RDS data differ from simple random sampling data in two important ways. First, sampling is not random: some participants have a higher probability of being selected, depending on the size of their personal network. Second, observations are not independent, because the data may be clustered within recruiters or seeds. These characteristics make RDS data challenging to analyze. In particular, how to handle the correlation between participants and the non-random sampling in regression analysis has not been settled.

### Methods
The study simulated 12 networked populations with different levels of homophily and prevalence and drew 1000 RDS samples from each population. The models compared include unweighted and weighted binomial and Poisson generalized linear models, fitted with different clustering controls and standard error adjustments.

### Results
- **Type I Error Rate**: The unweighted log-link (Poisson) models maintained the nominal Type I error rate in all populations. Weighted binomial regression, by contrast, showed substantial bias and unacceptably high Type I error rates.
- **Coverage**: When estimating prevalence, RDS-weighted logistic regression gave the highest coverage rates, except at low prevalence (10%), where unweighted models are recommended.
- **Bias**: The Poisson regression models show large bias in both the mean and the median, but the bias is more consistent than that of the binomial models.
- **Accuracy**: Prediction accuracy does not depend on the homophily level of the population, but it decreases as the outcome prevalence increases. The unweighted binomial model that includes the outcome of the participant's recruiter as a predictor has the highest accuracy, followed by the ordinary unweighted binomial model.

### Conclusions
Caution is required when performing regression analysis on RDS data. Even when the reported network size (degree) is accurate, a low reported degree can unduly influence regression estimates. The unweighted Poisson regression model is therefore recommended. The paper thus offers guidance for regression analysis of RDS data, particularly on choosing an appropriate statistical model.
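
The concern about low reported degree can be illustrated with a small sketch. It fits a log-link Poisson model to a binary outcome twice, once unweighted and once with case weights, and compares the resulting risk ratios. Several things here are assumptions rather than details from the paper: the weights are taken to be the inverse of reported degree (in the style of the RDS-II estimator), the ten-person dataset is invented for illustration, the predictor is binary rather than continuous, and `fit_log_link_poisson` is a hypothetical helper written for this example, not the authors' code.

```python
import math

def fit_log_link_poisson(x, y, w=None, max_iter=50, tol=1e-12):
    """Fit log E[y] = b0 + b1 * x by Newton-Raphson.

    w holds optional case weights; None means an unweighted fit.
    Returns (b0, b1); exp(b1) is the estimated risk ratio for binary y.
    """
    w = [1.0] * len(y) if w is None else w
    # Start at the intercept-only solution: log of the (weighted) mean outcome.
    b0 = math.log(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    b1 = 0.0
    for _ in range(max_iter):
        # Accumulate the score vector (g) and the information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            mu = math.exp(b0 + b1 * xi)
            resid = wi * (yi - mu)
            g0 += resid
            g1 += resid * xi
            h00 += wi * mu
            h01 += wi * mu * xi
            h11 += wi * mu * xi * xi
        # Solve the 2x2 Newton system h * step = g in closed form.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Invented toy sample (not from the paper): exposure x, outcome y, reported degree.
x      = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y      = [0, 0, 0, 1, 0, 1, 1, 0, 0, 1]
degree = [10, 12, 8, 1, 9, 11, 10, 9, 12, 10]  # one respondent reports degree 1

# Assumption: inverse-degree weights, as in RDS-II style weighting.
weights = [1.0 / d for d in degree]

_, b1_unweighted = fit_log_link_poisson(x, y)
_, b1_weighted = fit_log_link_poisson(x, y, weights)
rr_unweighted = math.exp(b1_unweighted)
rr_weighted = math.exp(b1_weighted)

print(f"unweighted risk ratio: {rr_unweighted:.3f}")
print(f"weighted risk ratio:   {rr_weighted:.3f}")
```

In this toy sample the unweighted risk ratio is 3 (risks of 0.6 versus 0.2), while the weighted risk ratio falls below 1. The single respondent with a reported degree of 1 carries roughly ten times the weight of anyone else, which is enough to reverse the direction of the estimate. The sketch reports point estimates only; the standard error adjustments compared in the study are not reproduced here.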