Biases in Race and Ethnicity Introduced by Filtering Electronic Health Records for 'Complete Data'

Jose Miguel Acitores Cortina, Yasaman Fatapour, Michael Zietz, Kathleen LaRow Brown, Undina Gisladottir, Danner Peter, Oliver John Bear Don't Walk IV, Aditi Kuchi, Apoorva Srinivasan, Hongyu Liu, Jacob S. Berkowitz, Kevin Tsang, Sophia Kivelson, Nadine Friedrich, Nicholas P Tatonetti
DOI: https://doi.org/10.1101/2024.10.04.24314914
2024-10-07
Abstract: Integrated clinical databases from national biobanks have advanced the capacity for disease research. Data quality and completeness filters are used when building clinical cohorts to address limitations of data missingness. However, these filters may unintentionally introduce systemic biases when they are correlated with race and ethnicity. In this study, we examined the race/ethnicity biases introduced by applying common filters to four clinical records databases. We applied 19 filters commonly used in electronic health records research, covering the availability of demographics, medication records, visit details, observation periods, and other data types, and evaluated the effect of each filter on the distribution of self-reported race and ethnicity. This assessment was performed across four databases comprising approximately 12 million patients. Applying the observation period filter led to a substantial reduction in data availability across all races and ethnicities in all four datasets. However, among the groups examined, data availability in the white group remained consistently higher than in other racial groups after applying each filter. Conversely, the Black/African American group was the most impacted by each filter in three of the datasets: the Cedars-Sinai dataset, UK-Biobank, and the Columbia University dataset. Our findings underscore the importance of using only necessary filters, as they might disproportionately affect data availability for minoritized racial and ethnic populations. Researchers must consider these unintentional biases when performing data-driven research and explore techniques to minimize the impact of these filters, such as probabilistic methods or the use of machine learning and artificial intelligence.
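The comparison described in the abstract, how much of each group remains after a filter is applied, can be illustrated with a minimal sketch. This is not the authors' code: the record layout, the `race_ethnicity` and `observation_days` field names, the group labels, and the 365-day threshold are all assumptions made for illustration, and the records are toy values rather than study data.

```python
from collections import Counter


def retention_by_group(patients, keep, group_key="race_ethnicity"):
    """Return, for each group, the fraction of its patients that pass `keep`."""
    before = Counter(p[group_key] for p in patients)
    after = Counter(p[group_key] for p in patients if keep(p))
    return {group: after[group] / n for group, n in before.items()}


# Hypothetical toy records; field names and values are illustrative only.
patients = [
    {"race_ethnicity": "A", "observation_days": 400},
    {"race_ethnicity": "A", "observation_days": 800},
    {"race_ethnicity": "B", "observation_days": 100},
    {"race_ethnicity": "B", "observation_days": 500},
    {"race_ethnicity": "B", "observation_days": 30},
    {"race_ethnicity": "B", "observation_days": 90},
]


def has_one_year_observation(patient):
    # Assumed stand-in for an observation period filter (threshold is hypothetical).
    return patient["observation_days"] >= 365


rates = retention_by_group(patients, has_one_year_observation)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} of patients retained")
```

A filter that is uncorrelated with race and ethnicity would give roughly equal retention fractions across groups; large gaps between groups are the kind of disparity the study reports.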
Health Informatics