Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance

Jaehoon Joo, Taejin Jeong, Seongjae Hwang
2024-09-19
Abstract: Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the reconstruction of visual stimulus images from functional magnetic resonance imaging (fMRI) data. It uses multi-modal guidance (text guidance, visual guidance, and image layout) to improve the detail accuracy and semantic consistency of the reconstructed images. Specifically, the paper proposes a method called Brain-Streams, which builds on the idea that perceptual and semantic information are processed in different brain regions. By mapping fMRI signals from these regions to corresponding embeddings, it provides multi-level, multi-modal guidance to a Latent Diffusion Model (LDM) and thereby achieves more precise reconstruction of the visual stimuli.
The main contributions of the paper are:
1. Proposing a new fMRI-to-image reconstruction framework, Brain-Streams, which extracts three levels of guidance (high-level, mid-level, and low-level) from specific brain regions to provide multi-modal guidance to the LDM.
2. Going beyond image reconstruction by generating detailed semantic captions, refined by a large language model (LLM), that further guide the LDM during reconstruction.
3. Achieving state-of-the-art performance in visual stimulus reconstruction on the NSD dataset with the methods above.
In summary, this research aims to improve the quality of complex natural images reconstructed from fMRI data by combining multi-modal information, in particular precise text guidance, with a focus on capturing small objects, ambiguous shapes, and semantic nuances.
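The region-to-embedding mapping summarized above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a plain closed-form ridge regression as the mapping (the paper's actual model is not specified here), uses synthetic random data in place of real fMRI recordings, and all names, region sizes, and embedding dimensions (`fit_ridge`, `predict_guidance`, `n_voxels_semantic`, `text_dim`, and so on) are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


rng = np.random.default_rng(0)

# Hypothetical sizes: trials, voxels per region, embedding dimensions.
n_trials = 200
n_voxels_semantic, n_voxels_perceptual = 50, 40
text_dim, visual_dim = 16, 12

# Synthetic stand-ins for fMRI responses from two groups of brain regions.
fmri_semantic = rng.standard_normal((n_trials, n_voxels_semantic))
fmri_perceptual = rng.standard_normal((n_trials, n_voxels_perceptual))

# Synthetic target embeddings: a hidden linear map plus a little noise.
true_w_text = rng.standard_normal((n_voxels_semantic, text_dim))
true_w_visual = rng.standard_normal((n_voxels_perceptual, visual_dim))
text_emb = fmri_semantic @ true_w_text + 0.1 * rng.standard_normal((n_trials, text_dim))
visual_emb = fmri_perceptual @ true_w_visual + 0.1 * rng.standard_normal((n_trials, visual_dim))

# One regressor per stream: semantic regions -> text embedding,
# perceptual regions -> visual embedding.
w_text = fit_ridge(fmri_semantic, text_emb)
w_visual = fit_ridge(fmri_perceptual, visual_emb)


def predict_guidance(semantic_signal, perceptual_signal):
    """Return the two predicted guidance embeddings for a batch of trials."""
    return {
        "text": semantic_signal @ w_text,
        "visual": perceptual_signal @ w_visual,
    }


guidance = predict_guidance(fmri_semantic[:5], fmri_perceptual[:5])
text_error = float(np.mean(np.abs(guidance["text"] - text_emb[:5])))
```

In a full pipeline the predicted embeddings would be passed as conditioning inputs to a diffusion model; that step is omitted here because it depends on the specific LDM and its interface.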