DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives

Pengfei Hu, Chang Lu, Fei Wang, Yue Ning
2024-10-26
Abstract: Electronic Health Records (EHR) have revolutionized healthcare data management and prediction in the field of AI and machine learning. Accurate predictions of diagnoses and medications significantly mitigate health risks and provide guidance for preventive care. However, EHR-driven models often have a limited grasp of medical-domain knowledge and mostly rely on a single, simple ontology. In addition, due to the missing features and incomplete disease coverage of EHR, most studies focus only on basic analysis of conditions and medications. We propose DualMAR, a framework that enhances EHR prediction tasks through both individual observation data and public knowledge bases. First, we construct a bi-hierarchical Diagnosis Knowledge Graph (KG) using verified public clinical ontologies and augment this KG via Large Language Models (LLMs). Second, we design a new proxy-task learning on lab results in EHR for pretraining, which further enhances KG representation and patient embeddings. By retrieving radial and angular coordinates in polar space, DualMAR enables accurate predictions based on rich hierarchical and semantic embeddings from the KG. Experiments also demonstrate that DualMAR outperforms state-of-the-art models, validating its effectiveness in EHR prediction and KG integration in medical domains.
Machine Learning, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
This paper addresses several key problems in prediction tasks that use Electronic Health Record (EHR)-driven models. Specifically:

1. **Limited understanding of medical domain knowledge**: Existing EHR-driven models usually rely on a single, simple ontology and struggle to capture the complex knowledge of the medical field.
2. **Missing data and incomplete disease coverage**: Because EHR data has missing features and incomplete disease coverage, most studies are limited to basic analysis of conditions and medications.
3. **Lack of utilization of laboratory test results**: Existing methods usually ignore key information such as laboratory test results, which are crucial for accurate diagnosis and treatment recommendations.

To solve these problems, the paper proposes a framework named DualMAR, which enhances EHR prediction tasks by combining individual observation data and public knowledge bases. The main contributions of DualMAR include:

- **"Knowledge Scholar" module**: A bi-hierarchical diagnosis knowledge graph (KG) is constructed from public clinical ontologies and augmented by a large language model (LLM). This module uses a polar-coordinate-space projection to capture semantic and hierarchical information.
- **"Local Expert" module**: A new proxy-task learning method is designed, which uses laboratory results in EHR for pre-training, thereby enhancing patient embedding representations.
- **Dual-expertise perspective**: An encoder-decoder architecture is adopted. The embeddings from the "Knowledge Scholar" are used as prior knowledge, and these representations are continuously refined by the "Local Expert", ultimately achieving more accurate predictions.

### Formula Summary

1. **KG Fusion Formula**:
   \[ GH = GM \cup GN, \quad \tilde{GH} = \text{NORMALIZE}(GH) \]
   where \(GM\) and \(GN\) are the knowledge graphs generated from the existing database and from the LLM, respectively.
2. **Polar-Coordinate-Space Embedding Formula**:
   \[ h_r \odot r_r = t_r, \quad (h_a + r_a) \bmod 2\pi = t_a \]
   \[ d_r(h_r, t_r) = \|h_r \odot r_r - t_r\|_2, \quad d_a(h_a, t_a) = \|\sin((h_a + r_a - t_a)/2)\|_1 \]
   \[ d(h, t) = \alpha \, d_r(h_r, t_r) + \beta \, d_a(h_a, t_a) \]
   Here the subscripts \(r\) and \(a\) denote the radial and angular parts of the head \(h\), relation \(r\), and tail \(t\) embeddings.
3. **Loss Function**:
   \[ L = -\log \sigma(\gamma - d(h, t)) - \sum_{i = 1}^{n} \log \sigma(d(h'_i, t'_i) - \gamma) \]
   where \((h'_i, t'_i)\) are the \(n\) negative samples and \(\gamma\) is the margin.
4. **Attention Mechanism**:
   \[ z_i = \tanh(W_c x_i), \quad r_\tau = \tanh(W_v \sigma(W_u v_\tau)) \]
   \[ \alpha_i^\tau = \frac{\exp(z_i)}{\sum_{j = 1}^{n} \exp(z_j)}, \quad \beta_\tau = \frac{\exp(r_\tau)}{\sum_{\tau' = 1}^{T} \exp(r_{\tau'})} \]
5. **Downstream-Task Loss**:
   \[ L_j = \frac{1}{|Y|} \sum_{i = 1}^{|Y|} \text{BCE}(\hat{y}_i, y_i), \quad Y = \{L_1, L_2, L_3\} \]
   \[ L_i = \text{BCE}(\hat{y}_i, y_i), \quad i = 1, 2, 3 \]

Through these methods, DualMAR can...
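As a rough illustration of the polar-coordinate distance and the margin loss in the formula summary, the following NumPy sketch evaluates \(d(h, t)\) and \(L\) for hand-made vectors. It is not the authors' implementation: the function names (`polar_distance`, `margin_loss`), the default values of `alpha`, `beta`, and `gamma`, and the toy embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used inside the loss."""
    return 1.0 / (1.0 + np.exp(-x))


def polar_distance(h_r, h_a, r_r, r_a, t_r, t_a, alpha=1.0, beta=0.5):
    """d(h, t) = alpha * d_r + beta * d_a, following the summary formulas.

    alpha and beta defaults are illustrative assumptions.
    """
    # Radial part: L2 norm of (h_r * r_r - t_r), elementwise product.
    d_r = np.linalg.norm(h_r * r_r - t_r, ord=2)
    # Angular part: L1 norm of sin((h_a + r_a - t_a) / 2).
    # The half-angle sine makes phases that differ by 2*pi count as equal.
    d_a = np.linalg.norm(np.sin((h_a + r_a - t_a) / 2.0), ord=1)
    return alpha * d_r + beta * d_a


def margin_loss(d_pos, d_negs, gamma=6.0):
    """L = -log s(gamma - d_pos) - sum_i log s(d_neg_i - gamma).

    gamma is the margin; its default here is an assumption.
    """
    d_negs = np.asarray(d_negs, dtype=float)
    pos_term = -np.log(sigmoid(gamma - d_pos))
    neg_term = -np.sum(np.log(sigmoid(d_negs - gamma)))
    return pos_term + neg_term


# Toy triple that satisfies both constraints exactly:
# h_r * r_r = t_r and (h_a + r_a) mod 2*pi = t_a.
h_r, r_r = np.array([1.0, 2.0]), np.array([2.0, 0.5])
t_r = h_r * r_r
h_a, r_a = np.array([0.1, 6.0]), np.array([0.5, 1.0])
t_a = np.mod(h_a + r_a, 2.0 * np.pi)

d_true = polar_distance(h_r, h_a, r_r, r_a, t_r, t_a)
loss = margin_loss(d_true, d_negs=[12.0, 12.0])
```

A triple that satisfies both constraints gets a distance of (numerically) zero, so its positive term in the loss is small, while negative samples are pushed to distances beyond the margin `gamma`.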