Improving generalization of machine learning-identified biomarkers using causal modelling with examples from immune receptor diagnostics

Milena Pavlović, Ghadi S. Al Hajj, Chakravarthi Kanduri, Johan Pensar, Mollie E. Wood, Ludvig M. Sollid, Victor Greiff, Geir K. Sandve
DOI: https://doi.org/10.1038/s42256-023-00781-8
IF: 23.8
2024-01-25
Nature Machine Intelligence
Abstract: Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here we argue that a causal perspective improves the identification of these challenges and formalizes their relation to the robustness and generalization of machine learning-based diagnostics. To make the discussion concrete, we focus on a specific, recently established high-dimensional biomarker: adaptive immune receptor repertoires (AIRRs). Through simulations, we illustrate how major biological and experimental factors of the AIRR domain may influence the learned biomarkers. In conclusion, we argue that causal modelling improves machine learning-based biomarker robustness by identifying stable relations between variables and guiding the adjustment of the relations and variables that vary between populations.
Keywords: computer science, artificial intelligence, interdisciplinary applications