Benchmarking commercial healthcare claims data

Alex Dahlen, Yaowei Deng, Vivek Charu
DOI: https://doi.org/10.1101/2024.08.19.24312249
2024-08-20
Abstract:
Importance: Commercial healthcare claims datasets represent a sample of the US population that is biased along socioeconomic/demographic lines; depending on the target population of interest, results derived from these datasets may not generalize. Rigorous comparisons of claims-derived results to ground-truth data that quantify this bias are lacking.
Objectives: (1) To quantify the extent and variation of the bias associated with commercial healthcare claims data with respect to different target populations; (2) To evaluate how socioeconomic/demographic factors may explain the magnitude of the bias.
Design: This is a retrospective observational study. Healthcare claims data come from the Merative MarketScan Commercial Database; reference data for comparison come from the State Inpatient Databases (SID) and the US Census. We considered three target populations, each aged 18-64 years: (1) all Americans; (2) Americans with health insurance; (3) Americans with commercial health insurance.
Participants: We analyzed inpatient discharge records of patients aged 18-64 years, occurring between 01/01/2019 and 12/31/2019 in five states: California, Iowa, Maryland, Massachusetts, and New Jersey.
Outcomes: We estimated rates of the 250 most common inpatient procedures from claims data and from reference data for each target population, and compared the two estimates.
Results: The average rate of inpatient discharges per 100 person-years was 5.39 in the claims data (95% CI: [5.37, 5.40]) and 7.003 (95% CI: [7.002, 7.004]) in the reference data for all Americans, corresponding to a 23.1% underestimate from claims. We found large variation in the extent of relative bias across inpatient procedures, including 22.8% of procedures that were underestimated by more than a factor of 2. There was a significant relationship between socioeconomic/demographic factors and the magnitude of bias: procedures that disproportionately occur in disadvantaged neighborhoods were more underestimated in claims data (R^2=51.6%, p < 0.001). When the target population was restricted to commercially insured Americans, the bias decreased substantially (3.2% of procedures were biased by more than a factor of 2), but some variation across procedures remained.
Conclusions and relevance: Naive use of healthcare claims data to derive estimates for the underlying US population can be severely biased. The extent of bias is at least partially explained by neighborhood-level socioeconomic factors.
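The relative-bias arithmetic behind the Results can be illustrated with a minimal Python sketch. It assumes relative bias is defined as (claims rate - reference rate) / reference rate, which the abstract does not state explicitly; the function names `relative_bias` and `underestimated_by_factor` are hypothetical and are not taken from the paper.

```python
def relative_bias(claims_rate: float, reference_rate: float) -> float:
    """Signed relative bias of the claims estimate versus the reference.

    Assumed definition (not confirmed by the abstract):
    (claims - reference) / reference. Negative values mean underestimation.
    """
    return (claims_rate - reference_rate) / reference_rate


def underestimated_by_factor(claims_rate: float, reference_rate: float,
                             factor: float = 2.0) -> bool:
    """True when the claims rate is smaller than reference_rate / factor."""
    return claims_rate * factor < reference_rate


# Rounded rates per 100 person-years, as quoted in the abstract
# (target population: all Americans aged 18-64 years).
claims_rate = 5.39
reference_rate = 7.003

bias = relative_bias(claims_rate, reference_rate)
print(f"Relative bias: {bias:.1%}")
print("Underestimated by more than a factor of 2:",
      underestimated_by_factor(claims_rate, reference_rate))
```

With the rounded rates quoted here, the sketch gives an underestimate of about 23.0%, close to the reported 23.1%; the small gap is presumably due to rounding of the published rates. The overall discharge rate is not itself underestimated by more than a factor of 2, although the abstract reports that 22.8% of individual procedures are.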