Abstract: In the pursuit of understanding the intricacies of the human brain's visual processing, reconstructing dynamic visual experiences from brain activity emerges as a challenging yet fascinating endeavor. While recent advances have succeeded in reconstructing static images from non-invasive brain recordings, translating continuous brain activity into video remains underexplored. In this work, we introduce NeuroCine, a novel dual-phase framework targeting the inherent challenges of decoding fMRI data, such as noise, spatial redundancy and temporal lags. The framework proposes spatial masking and temporal interpolation-based augmentation for contrastive learning of fMRI representations, and a diffusion model enhanced by dependent prior noise for video generation. Tested on a publicly available fMRI dataset, our method shows promising results, outperforming the previous state-of-the-art models by a notable margin of ${20.97\%}$, ${31.00\%}$ and ${12.30\%}$ respectively on decoding the brain activities of three subjects in the fMRI dataset, as measured by SSIM. Additionally, our attention analysis suggests that the model aligns with existing brain structures and functions, indicating its biological plausibility and interpretability.
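
The first-phase ideas named in the abstract (spatial masking, temporal interpolation, and a contrastive objective) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the mask ratio, the interpolation weight, the temperature, the InfoNCE form of the contrastive loss, and the use of raw augmented signals in place of a learned encoder's output.

```python
import numpy as np

def spatial_mask(fmri, mask_ratio, rng):
    """Zero out a random subset of voxels (the same voxels in every frame)."""
    n_voxels = fmri.shape[-1]
    n_masked = int(round(mask_ratio * n_voxels))
    masked_idx = rng.choice(n_voxels, size=n_masked, replace=False)
    out = fmri.copy()
    out[..., masked_idx] = 0.0
    return out

def temporal_interpolate(fmri, weight):
    """Linearly blend each frame with its successor (last frame blends with itself)."""
    nxt = np.concatenate([fmri[1:], fmri[-1:]], axis=0)
    return (1.0 - weight) * fmri + weight * nxt

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a and row i of z_b form the positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
fmri = rng.standard_normal((8, 1000))  # synthetic data: 8 frames x 1000 voxels

# Two augmented views of the same recording (hypothetical settings).
view_a = spatial_mask(fmri, mask_ratio=0.5, rng=rng)
view_b = temporal_interpolate(fmri, weight=0.3)

# In a real pipeline an fMRI encoder would map each view to an embedding;
# here the augmented signals stand in for those embeddings.
loss = info_nce(view_a, view_b)
```

Training an encoder to keep the two views of the same frame close while pushing other frames apart is one way to make the representation robust to voxel dropout and to small temporal shifts.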

What problem does this paper attempt to address?

The problem this paper attempts to address is the reconstruction of vivid videos from human brain activity, specifically from functional magnetic resonance imaging (fMRI) recordings. While significant progress has been made in recent years in reconstructing static images from non-invasive brain recordings, there is still limited research on translating continuous brain activity into video. The paper introduces a new two-stage framework called NeuroCine, which aims to tackle issues inherent to decoding fMRI data, such as noise, spatial redundancy, and temporal lag. Specifically, the main contributions of the paper include:

1. **Two-stage framework**: The first stage augments fMRI data through spatial masking and temporal interpolation, and trains an fMRI encoder to be robust to the perturbations these augmentations introduce. The second stage uses the trained fMRI encoder to guide a video diffusion model to generate videos, while introducing dependent prior noise to compensate for the low signal-to-noise ratio of fMRI data.
2. **Innovative techniques**: Contrastive learning is used to learn fMRI representations, and a diffusion model is used to generate videos. Together, these techniques help transform complex and noisy fMRI data into meaningful visual reconstructions.
3. **Experimental validation**: Tests conducted on a publicly available fMRI dataset show that NeuroCine outperforms the previous state-of-the-art models in decoding the brain activity of three subjects, with improvements of 20.97%, 31.00%, and 12.30% respectively (measured by the SSIM metric).
4. **Biological interpretability**: Analysis of the model's attention suggests that it aligns with known brain structures and functions, indicating the biological plausibility and interpretability of the method.

In summary, this paper aims to advance the field of neural decoding and visual reconstruction by reconstructing videos from human brain activity, combining non-invasive neuroimaging with contrastive representation learning and diffusion-based video generation.
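
The summary does not say how the dependent prior noise of the second stage is constructed. One common way to make diffusion noise temporally dependent across video frames is a mixed-noise model, in which every frame shares one noise component and adds its own independent component. The sketch below shows that construction only as an assumption; the function name, the `alpha` parameter, and the latent shape are hypothetical and are not taken from the paper.

```python
import numpy as np

def mixed_prior_noise(n_frames, frame_shape, alpha, rng):
    """Per-frame Gaussian noise with a component shared across all frames.

    Each element keeps unit variance, and the same element in any two
    frames has correlation alpha**2 / (1 + alpha**2).
    """
    shared_var = alpha**2 / (1.0 + alpha**2)
    indep_var = 1.0 / (1.0 + alpha**2)
    # One shared draw, broadcast over the frame axis.
    shared = np.sqrt(shared_var) * rng.standard_normal((1, *frame_shape))
    # A separate draw for every frame.
    indep = np.sqrt(indep_var) * rng.standard_normal((n_frames, *frame_shape))
    return shared + indep

rng = np.random.default_rng(0)
# Hypothetical latent video: 6 frames, each a 4 x 64 x 64 latent.
noise = mixed_prior_noise(n_frames=6, frame_shape=(4, 64, 64), alpha=1.0, rng=rng)
```

With `alpha = 0` the frames receive fully independent noise, as in a standard image diffusion prior; larger values of `alpha` make the frames start from increasingly similar noise, which tends to favor temporally consistent outputs.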

NeuroCine: Decoding Vivid Video Sequences from Human Brain Activities
