Abstract:

Objective: With the increasing amount and growing variety of healthcare data, multimodal machine learning that supports integrated modeling of structured and unstructured data is an increasingly important tool for clinical machine learning tasks. However, it is non-trivial to manage differences in the dimensionality, volume, and temporal characteristics of data modalities in the context of a shared target task. Furthermore, data availability can vary substantially across patients, yet existing multimodal modeling methods typically assume data completeness and lack a mechanism to handle missing modalities.

Methods: We propose a Transformer-based fusion model with modality-specific tokens that summarize the corresponding modalities, achieving effective cross-modal interaction while accommodating missing modalities in the clinical context. The model is further refined by inter-modal, inter-sample contrastive learning to improve the representations for better predictive performance. We denote the model Attention-based cRoss-MOdal fUsion with contRast (ARMOUR). We evaluate ARMOUR using two input modalities (structured measurements and unstructured text), six clinical prediction tasks, and two evaluation regimes that either include or exclude samples with missing modalities.

Results: Our model shows improved performance over unimodal and multimodal baselines in both evaluation regimes, that is, whether patients with missing input modalities are included or excluded. Contrastive learning improves the representation power and is shown to be essential for better results. The simple setup of modality-specific tokens enables ARMOUR to handle patients with missing modalities and allows comparison with existing unimodal benchmark results.

Conclusion: We propose a multimodal model for robust clinical prediction that achieves improved performance while accommodating patients with missing modalities. This work could inspire future research on the effective incorporation of multiple, more complex modalities of clinical data into a single model.
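
As a rough illustration of how per-modality summary tokens can be fused while tolerating a missing modality, the sketch below applies a single masked scaled dot-product attention step over one summary vector per modality. It is not the ARMOUR implementation: the function name `masked_attention_fuse`, the single query vector, and the toy two-dimensional tokens are all assumptions made for illustration, and the contrastive objective is not shown.

```python
import math

def masked_attention_fuse(query, modality_tokens, present):
    """Fuse per-modality summary tokens with one attention step.

    Hypothetical helper, not the authors' code. `modality_tokens` holds one
    summary vector per modality; `present[i]` is False when modality i is
    missing for this patient.
    """
    if not any(present):
        raise ValueError("at least one modality must be present")
    d = len(query)
    # Scaled dot-product scores; a missing modality gets -inf so that
    # the softmax assigns it exactly zero weight.
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) if ok else -math.inf
        for tok, ok in zip(modality_tokens, present)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # math.exp(-inf) == 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over present modalities only, so placeholder values in
    # a missing modality's slot never reach the fused representation.
    fused = [
        sum(w * tok[i] for w, tok, ok in zip(weights, modality_tokens, present) if ok)
        for i in range(d)
    ]
    return fused, weights

# Toy example: index 0 = structured measurements, index 1 = clinical text.
query = [1.0, 0.0]
tokens = [[2.0, 0.0], [0.0, 2.0]]
fused_full, w_full = masked_attention_fuse(query, tokens, [True, True])
fused_partial, w_partial = masked_attention_fuse(query, tokens, [True, False])
```

When the text modality is flagged as missing, all attention weight falls on the measurement token, so the same code path serves both complete and incomplete patients. A full Transformer would typically achieve the same effect with a key padding mask rather than a hand-written softmax.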
