Modality-Agnostic fMRI Decoding of Vision and Language

Mitja Nikolaus, Milad Mozafari, Nicholas Asher, Leila Reddy, Rufin VanRullen
2024-03-18
Abstract: Previous studies have shown that it is possible to map brain activation data of subjects viewing images onto the feature representation space of not only vision models (modality-specific decoding) but also language models (cross-modal decoding). In this work, we introduce and use a new large-scale fMRI dataset (~8,500 trials per subject) of people watching both images and text descriptions of such images. This novel dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing, irrespective of the modality (image or text) in which the stimulus is presented. We train and evaluate such decoders to map brain signals onto stimulus representations from a large range of publicly available vision, language and multimodal (vision+language) models. Our findings reveal that (1) modality-agnostic decoders perform as well as (and sometimes even better than) modality-specific decoders; (2) modality-agnostic decoders mapping brain data onto representations from unimodal models perform as well as decoders relying on multimodal representations; and (3) while language and low-level visual (occipital) brain regions are best at decoding text and image stimuli, respectively, high-level visual (temporal) regions perform well on both stimulus types.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses modality-agnostic decoding of brain activity: training a single decoder that predicts which stimulus a subject is viewing, irrespective of whether that stimulus is presented as an image or as a text description. To do so, the researchers use a new large-scale fMRI dataset (~8,500 trials per subject) recorded while subjects viewed images and text descriptions of such images. By training and evaluating modality-agnostic decoders, they investigate three questions:

1. **Performance of the Modality-Agnostic Decoder**: Whether a modality-agnostic decoder matches or even exceeds the performance of modality-specific decoders (e.g., decoders targeting only vision or only language).
2. **Multimodal vs. Unimodal Representations**: Whether decoders that map brain data onto features from unimodal models (vision or language models) perform as well as decoders that rely on features from multimodal (vision+language) models.
3. **Decoding Ability of Different Brain Regions**: How different brain regions (low-level visual, high-level visual, and language-related areas) perform when decoding image and text stimuli, and which regions perform well on both stimulus types.

Through these analyses, the paper aims to advance the understanding of how the brain processes and integrates information from different modalities and to provide new tools and methods for future neuroscience research across modalities.
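
As a rough illustration of the difference between a modality-agnostic and a modality-specific decoder, the sketch below fits linear decoders on synthetic data. The summary does not state which regression method or evaluation metric the authors use, so the ridge regression, the pairwise cosine-similarity accuracy, the simulated "voxel" responses, and all names and parameter values here are assumptions made for illustration only.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def pairwise_accuracy(pred, target):
    """Share of ordered pairs (i, j), i != j, where pred[i] is closer
    (by cosine similarity) to target[i] than to target[j]."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    targ_n = target / np.linalg.norm(target, axis=1, keepdims=True)
    sim = pred_n @ targ_n.T                    # sim[i, j] = cos(pred_i, target_j)
    correct = sim.diagonal()[:, None] > sim    # diagonal entries compare as False
    n = len(pred)
    return correct.sum() / (n * (n - 1))

# Hypothetical sizes; the real dataset and feature spaces are far larger.
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_dims = 400, 50, 100, 16

def simulate(features, mixing, noise=0.5):
    """Simulated brain responses: a linear mix of stimulus features plus noise."""
    return features @ mixing + noise * rng.standard_normal((len(features), mixing.shape[1]))

# Assumption: image and text responses share a common component
# and each has a modality-specific component.
shared = rng.standard_normal((n_dims, n_voxels))
mix_img = shared + 0.5 * rng.standard_normal((n_dims, n_voxels))
mix_txt = shared + 0.5 * rng.standard_normal((n_dims, n_voxels))

feat_train = rng.standard_normal((n_train, n_dims))  # stand-in for model features
feat_test = rng.standard_normal((n_test, n_dims))

X_img_train, X_txt_train = simulate(feat_train, mix_img), simulate(feat_train, mix_txt)
X_img_test, X_txt_test = simulate(feat_test, mix_img), simulate(feat_test, mix_txt)

# Modality-agnostic decoder: one model trained on pooled image and text trials.
W_agnostic = fit_ridge(np.vstack([X_img_train, X_txt_train]),
                       np.vstack([feat_train, feat_train]), alpha=10.0)
# Modality-specific decoder: trained on image trials only.
W_image_only = fit_ridge(X_img_train, feat_train, alpha=10.0)

results = {
    "agnostic / image trials": pairwise_accuracy(X_img_test @ W_agnostic, feat_test),
    "agnostic / text trials": pairwise_accuracy(X_txt_test @ W_agnostic, feat_test),
    "image-only / image trials": pairwise_accuracy(X_img_test @ W_image_only, feat_test),
    "image-only / text trials": pairwise_accuracy(X_txt_test @ W_image_only, feat_test),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Chance level for this pairwise metric is 0.5. The numbers printed here reflect only the simulated mixing assumptions and say nothing about the paper's actual results.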