Knowledge-Informed Machine Learning for Cancer Diagnosis and Prognosis: A review

Lingchao Mao, Hairong Wang, Leland S. Hu, Nhan L Tran, Peter D Canoll, Kristin R Swanson, Jing Li
2024-01-12
Abstract: Cancer remains one of the most challenging diseases to treat in the medical field. Machine learning has enabled in-depth analysis of rich multi-omics profiles and medical imaging for cancer diagnosis and prognosis. Despite these advancements, machine learning models face challenges stemming from limited labeled sample sizes, the intricate interplay of high-dimensionality data types, the inherent heterogeneity observed among patients and within tumors, and concerns about interpretability and consistency with existing biomedical knowledge. One approach to surmount these challenges is to integrate biomedical knowledge into data-driven models, which has proven potential to improve the accuracy, robustness, and interpretability of model results. Here, we review the state-of-the-art machine learning studies that adopted the fusion of biomedical knowledge and data, termed knowledge-informed machine learning, for cancer diagnosis and prognosis. Emphasizing the properties inherent in four primary data types including clinical, imaging, molecular, and treatment data, we highlight modeling considerations relevant to these contexts. We provide an overview of diverse forms of knowledge representation and current strategies of knowledge integration into machine learning pipelines with concrete examples. We conclude the review article by discussing future directions to advance cancer research through knowledge-informed machine learning.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address the following key issues:

1. **Tumor Heterogeneity and Individual Differences**: A major challenge in cancer treatment is the high heterogeneity of tumors, both between patients and within a single tumor. This heterogeneity limits the effectiveness of traditional "one-size-fits-all" treatment approaches, so there is a need for models that can accurately describe the spatial landscape of tumors and support personalized treatment.
2. **Data Annotation and Sample Size Limitations**: Large, high-quality training and testing datasets are crucial to the performance of machine learning models. In practice, however, annotated tumor samples are hard to obtain in large numbers, because the biopsy samples from each patient are limited in number and in the locations they cover. This limits the ability of machine learning models to learn the complete spatial landscape of tumors from data alone.
3. **Integration of Multimodal, High-Dimensional Data**: Cancer diagnosis and prognosis often require analyzing several types of data, including clinical, imaging, molecular, and treatment data. These data are usually high-dimensional while sample sizes are relatively small, so integrating them effectively to produce clinical predictions is a significant challenge.
4. **Model Interpretability and Consistency**: Although deep learning models perform well on many tasks, they are often considered "black box" models whose decision processes are difficult to understand and verify. This limits their credibility and practicality as clinical decision support tools, so improving the interpretability of models and their consistency with existing biomedical knowledge is another important direction.

To address these challenges, the paper reviews approaches that incorporate biomedical knowledge into machine learning models, termed knowledge-informed machine learning. Using domain knowledge to regularize the model's learning process is one such strategy, and it can improve the accuracy, robustness, and interpretability of the model.
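
The idea of using domain knowledge to regularize learning can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the prior knowledge is available as a graph of related features (for example, genes known to interact) and uses a graph-Laplacian penalty, one common way to encode such knowledge, which encourages linked features to receive similar coefficients. The function names, the edge list, and the synthetic data are all hypothetical.

```python
import numpy as np


def laplacian_from_edges(n_features, edges):
    """Build the graph Laplacian L = D - A of an undirected feature graph."""
    A = np.zeros((n_features, n_features))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A


def fit_knowledge_regularized(X, y, L, lam=1.0, ridge=1e-3):
    """Minimize ||Xw - y||^2 + lam * w^T L w + ridge * ||w||^2 in closed form."""
    n_features = X.shape[1]
    A = X.T @ X + lam * L + ridge * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


def graph_penalty(w, L):
    """w^T L w: sum of squared coefficient differences over linked features."""
    return float(w @ L @ w)


# Synthetic, made-up data: few samples relative to the number of features.
rng = np.random.default_rng(0)
n_samples, n_features = 15, 6
# Hypothetical prior knowledge: features 0-1-2 are related, as are 3-4.
edges = [(0, 1), (1, 2), (3, 4)]
w_true = np.array([1.0, 1.0, 1.0, -1.0, -1.0, 0.0])
X = rng.normal(size=(n_samples, n_features))
y = X @ w_true + 0.5 * rng.normal(size=n_samples)

L = laplacian_from_edges(n_features, edges)
w_plain = fit_knowledge_regularized(X, y, L, lam=0.0)      # data only
w_informed = fit_knowledge_regularized(X, y, L, lam=10.0)  # data + knowledge

print("graph penalty, data only:         ", graph_penalty(w_plain, L))
print("graph penalty, knowledge-informed:", graph_penalty(w_informed, L))
```

With `lam=0.0` the fit relies on the data alone. Raising `lam` pulls the coefficients of linked features toward each other, so the fitted model stays closer to the assumed prior structure. Whether this helps in practice depends on how well the graph reflects the true biology.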