Machine learning predicts liver cancer risk from routine clinical data: a large population-based multicentric study

Jan Clusmann, Paul-Henry Koop, David Y. Zhang, Felix van Haag, Omar S. M. El Nahhas, Tobias Seibel, Laura Zigutyte, Apichat Kaewdech, Julien Calderaro, Frank Tacke, Tom Luedde, Daniel Truhn, Tony Bruns, Kai Markus Schneider, Jakob N. Kather, Carolin V. Schneider
DOI: https://doi.org/10.1101/2024.11.03.24316662
2024-11-04
Abstract: Background and aims: Hepatocellular carcinoma (HCC) is a highly fatal tumor, for which early detection and risk stratification are crucial, yet remain challenging. We aimed to develop an interpretable machine-learning framework for HCC risk stratification based on routinely collected clinical data. Methods: We leverage data obtained from over 900,000 individuals and 983 cases of HCC across two large-scale population-based cohorts: the UK Biobank study and the "All Of Us Research Program". For all of these patients, clinical data from timepoints years before diagnosis of HCC was available. We integrate data modalities including demographics, electronic health records, lifestyle, routine blood tests, genomics and metabolomics to offer a unique, multi-modal perspective on HCC risk. Results: Our random-forest-based model significantly outperforms all publicly available state-of-the-art risk scores, with an AUROC of 0.88 for both internal and external test sets. We demonstrate robustness of our model across ethnic subgroups, a major advance over previous models with variable performance by ethnicity. Further, we perform extensive feature-importance analysis, showcasing our approach as an interpretable framework. We provide all model weights and an open-source web calculator to facilitate further validation of our model. Conclusion: Our study presents a robust and interpretable machine-learning framework for HCC risk stratification, which offers the potential to improve early detection and could ultimately reduce disease burden through targeted interventions.
Oncology
What problem does this paper attempt to address?
The paper aims to improve early detection and risk stratification of hepatocellular carcinoma (HCC). Specifically, the authors develop an interpretable machine-learning framework that predicts an individual's liver cancer risk from routine clinical data. The core problems and their background are as follows:

### 1. **Background and Objectives**

- **The lethality of hepatocellular carcinoma (HCC)**: HCC is a highly lethal tumor. Early detection and risk stratification are crucial but remain challenging.
- **Limitations of existing screening methods**: Current screening protocols mainly target patients with diagnosed cirrhosis and rely on resource-intensive imaging, so they are not broadly applicable in practice. They also fail to fully account for multiple risk factors (such as lifestyle, past medical history, and blood test results), so many patients are diagnosed with HCC at an advanced stage.
- **The increase in metabolic dysfunction-associated steatotic liver disease (MASLD)**: With MASLD and MASLD-related HCC cases rising, a more widely applicable and efficient screening strategy is urgently needed.

### 2. **Research Objectives**

- **Develop a machine-learning model**: Use data from large-scale population cohorts (the UK Biobank and the "All Of Us Research Program") to develop a machine-learning model that integrates multiple data modalities (demographics, electronic health records, lifestyle, routine blood tests, genomics, and metabolomics) to predict HCC risk.
- **Improve the early detection rate**: Make early detection more efficient through more accurate risk stratification, thereby improving patient prognosis.
- **Reduce healthcare inequalities**: Provide a cost-effective, easy-to-implement method, especially for resource-limited settings.

### 3. **Specific Problems**

- **How to integrate multi-modal data**: Construct a comprehensive risk assessment model by combining different types of clinical data.
- **Improve the interpretability of the model**: Ensure that the model not only predicts well but can also explain its predictions, helping clinicians understand which factors matter most for risk assessment.
- **Verify the generalization ability of the model**: Ensure that the model performs consistently across ethnic and gender groups, avoiding the unstable performance that existing models show across populations.

### 4. **Expected Results**

- **Improve predictive performance**: The model reaches an AUROC of 0.88 on both the internal and external test sets, significantly better than existing publicly available risk scores.
- **Wide applicability**: The model is robust across subgroups and is especially suited to high-risk populations not covered by existing screening methods.
- **Open-source tools**: All model weights and an open-source web calculator are provided for further validation and application.

In conclusion, the paper develops a machine-learning-based HCC risk prediction model to improve early detection and risk stratification, with the ultimate goal of reducing the disease burden of HCC and improving patient survival.
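
The summary reports performance as AUROC and stresses consistent performance across subgroups. As a rough illustration of what those two checks involve, the sketch below computes AUROC as the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case, then repeats the calculation within each subgroup. It is not the authors' code: the function names (`auroc`, `auroc_by_subgroup`), the group labels, and the toy labels and scores are all invented for illustration, and the scores stand in for the output of any risk model.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as P(score of a case > score of a non-case); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_subgroup(labels, scores, groups):
    """Compute AUROC separately for each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in buckets.items()}


# Hypothetical toy data: 1 = later diagnosed with HCC, 0 = not diagnosed.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # model risk scores
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]   # placeholder subgroups

print(auroc(labels, scores))                      # 0.875
print(auroc_by_subgroup(labels, scores, groups))  # {'A': 0.75, 'B': 1.0}
```

The pairwise loop is quadratic in cohort size, so it only suits small examples; on a cohort of the size described here a rank-based implementation would be needed. A large gap between subgroup values, as in this toy data, is the kind of variability by ethnicity that the paper reports for earlier risk scores.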