Abstract: Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still at an early stage, and it is unclear which pretraining strategies are most effective given the diversity of physiological signals. This is partly due to challenges in multimodal health data: obtaining data across many patients is difficult and costly, inter-subject variability is high, and modalities are often heterogeneously informative across downstream tasks. Here, we explore these challenges in the PhysioNet 2018 dataset. We use a masked autoencoding objective to pretrain a multimodal model. We show that the model learns representations that can be linearly probed for a diverse set of downstream tasks. We hypothesize that cross-modal reconstruction objectives are important for successful multimodal training, as they encourage the model to integrate information across modalities. We demonstrate that modality dropout in the input space improves performance across downstream tasks. We also find that late-fusion models pretrained with contrastive learning objectives are less effective across multiple tasks. Finally, we analyze the model's representations, showing that, with our pretraining strategy, attention weights become more cross-modal and temporally aligned. The learned embeddings also become more distributed in terms of the modalities encoded by each unit. Overall, our work demonstrates the utility of multimodal foundation models with health data, even across diverse physiological data sources. We further argue that explicit methods for inducing cross-modality may enhance multimodal pretraining strategies.
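
To make the idea of modality dropout in the input space concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the modality names (`eeg`, `ecg`, `emg`), the choice to zero out dropped signals rather than replace them with learned mask tokens, and the `drop_prob` parameter are all assumptions made for illustration. The point it shows is that whole modalities are hidden from the model's input while the reconstruction target remains the full, uncorrupted sample, so the hidden signals must be predicted from the remaining ones.

```python
import random


def modality_dropout(sample, drop_prob=0.5, rng=None):
    """Hide whole modalities at random, always keeping at least one.

    sample: dict mapping a (hypothetical) modality name to a list of floats.
    Returns (masked_sample, dropped_names). Dropped modalities are zeroed,
    which is an assumption; a real model might use learned mask tokens.
    """
    rng = rng or random.Random()
    names = sorted(sample)
    dropped = [name for name in names if rng.random() < drop_prob]
    if len(dropped) == len(names):
        # Keep one modality visible so the model still has some input.
        dropped.remove(rng.choice(dropped))
    masked = {
        name: [0.0] * len(values) if name in dropped else list(values)
        for name, values in sample.items()
    }
    return masked, dropped


def reconstruction_loss(prediction, target):
    """Mean squared error over every modality, including dropped ones.

    Scoring against the uncorrupted target is what makes the objective
    cross-modal: dropped signals can only be inferred from visible ones.
    """
    total, count = 0.0, 0
    for name, values in target.items():
        for predicted, true in zip(prediction[name], values):
            total += (predicted - true) ** 2
            count += 1
    return total / count
```

In a training loop, the model would receive `masked` as input and its output would be scored with `reconstruction_loss` against the original `sample`.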

Promoting cross-modal representations to improve multimodal foundation models for physiological signals
