Integrating Heterogeneous Gene Expression Data through Knowledge Graphs for Improving Diabetes Prediction

Rita T. Sousa, Heiko Paulheim
2024-04-23
Abstract: Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of diverse data types, namely gene expression data. While gene expression data can provide valuable insights, challenges arise from the fact that the sample sizes in expression datasets are usually limited, and the data from different datasets with different gene expressions cannot be easily combined.
Machine Learning
What problem does this paper attempt to address?
This paper addresses two obstacles to machine learning-based diabetes prediction from gene expression data: individual datasets have small sample sizes, and datasets covering different gene expressions cannot be combined directly. The paper proposes integrating multiple gene expression datasets with domain knowledge, such as protein functions and interactions, into a single knowledge graph (KG). KG embedding techniques then generate vector representations that serve as input to a classifier. The reported experiments show that integrating multiple datasets and domain knowledge in this way improves diabetes prediction performance.
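The pipeline described above (build a KG from several datasets plus domain knowledge, embed it, classify on the embeddings) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy patients, genes, and labels are made up, the rule that links a patient to a gene when expression exceeds a threshold is hypothetical, a truncated SVD of a two-hop proximity matrix stands in for a real KG embedding method, and a 1-nearest-neighbour rule stands in for the classifier.

```python
import numpy as np

# Hypothetical toy data: two datasets that measure different genes.
dataset_a = {"pA1": {"GENE1": 2.1, "GENE2": 0.2},
             "pA2": {"GENE1": 0.1, "GENE2": 1.9}}
dataset_b = {"pB1": {"GENE2": 0.3, "GENE3": 2.5},
             "pB2": {"GENE2": 2.2, "GENE3": 0.1}}

# Hypothetical domain knowledge: a shared function and an interaction.
domain_triples = [("GENE1", "hasFunction", "F_insulin"),
                  ("GENE3", "hasFunction", "F_insulin"),
                  ("GENE1", "interactsWith", "GENE3")]


def build_triples(datasets, background, threshold=1.0):
    """Merge all datasets and the background knowledge into one triple list.

    Assumption: a patient is linked to a gene when its expression value
    exceeds `threshold`.
    """
    triples = list(background)
    for data in datasets:
        for patient, expression in data.items():
            for gene, value in expression.items():
                if value > threshold:
                    triples.append((patient, "expresses", gene))
    return triples


def embed(triples, dim=2):
    """Stand-in embedding: truncated SVD of a two-hop proximity matrix."""
    nodes = sorted({n for s, _, o in triples for n in (s, o)})
    index = {n: i for i, n in enumerate(nodes)}
    adj = np.zeros((len(nodes), len(nodes)))
    for s, _, o in triples:
        adj[index[s], index[o]] = adj[index[o], index[s]] = 1.0
    proximity = adj + adj @ adj  # direct neighbours plus two-hop paths
    u, sing, _ = np.linalg.svd(proximity)
    vectors = u[:, :dim] * np.sqrt(sing[:dim])
    return {n: vectors[index[n]] for n in nodes}


def predict(embeddings, labelled, query):
    """Return the label of the labelled patient closest to `query`."""
    nearest = min(
        labelled,
        key=lambda p: np.linalg.norm(embeddings[p] - embeddings[query]),
    )
    return labelled[nearest]


triples = build_triples([dataset_a, dataset_b], domain_triples)
embeddings = embed(triples)
train_labels = {"pA1": 1, "pA2": 0}  # made-up labels for dataset A
predictions = {p: predict(embeddings, train_labels, p) for p in dataset_b}
print(predictions)
```

In this toy graph, pA1 and pB1 share no measured gene above the threshold, yet they end up with matching embeddings because GENE1 and GENE3 are connected through the background triples. That is the kind of cross-dataset link the KG integration is meant to provide.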