Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity

Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, Huiguang He
2024-05-06
Abstract: Reconstructing human dynamic vision from brain activity is a challenging task with great scientific significance. The difficulty stems from two primary issues: (1) vision-processing mechanisms in the brain are highly intricate and not fully understood, making it challenging to directly learn a mapping between fMRI and video; (2) the temporal resolution of fMRI is significantly lower than that of natural videos. To overcome these issues, this paper proposes a two-stage model named Mind-Animator, which achieves state-of-the-art performance on three public datasets. Specifically, during the fMRI-to-feature stage, we decouple semantic, structural, and motion features from fMRI through fMRI-vision-language tri-modal contrastive learning and sparse causal attention. In the feature-to-video stage, these features are merged into videos by an inflated Stable Diffusion. Through permutation tests, we substantiate that the reconstructed video dynamics are indeed derived from fMRI rather than being hallucinations of the generative model. Additionally, the visualization of voxel-wise and ROI-wise importance maps confirms the neurobiological interpretability of our model.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem of reconstructing dynamic visual stimuli (videos) from functional magnetic resonance imaging (fMRI) signals. It focuses on two main difficulties:

1. **Complex and Not Fully Understood Visual Processing Mechanisms**: The brain's visual processing mechanisms are highly complex and not yet fully understood, which makes it difficult to directly learn a mapping from fMRI signals to video.
2. **Low Temporal Resolution of fMRI**: The temporal resolution of fMRI is significantly lower than that of natural videos, so the two signals are substantially mismatched in the time dimension.

To overcome these issues, the researchers propose a two-stage model called Mind-Animator. In the first stage, it decouples semantic, structural, and motion features from fMRI signals; in the second stage, it merges these features into video using an inflated Stable Diffusion model. Permutation tests were conducted to verify that the motion information in the reconstructed videos originates from the fMRI signals rather than being a hallucination of the generative model. Finally, voxel-level and ROI-level importance maps support the neurobiological interpretability of the model.
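
The general idea behind a permutation test like the one mentioned above can be sketched in Python. This is only an illustrative sketch, not the paper's procedure: the summary does not state the test statistic or the shuffling scheme, so the `similarity` score, the toy feature vectors, and the function names below are all hypothetical. The sketch asks whether reconstructions match their own ground truth better than they match randomly re-paired ground truth.

```python
import random
from statistics import mean


def similarity(a, b):
    """Hypothetical score: negative mean absolute difference (higher is better)."""
    return -mean(abs(x - y) for x, y in zip(a, b))


def permutation_p_value(recon, truth, n_perm=1000, seed=0):
    """One-sided permutation test on the pairing between recon and truth.

    Returns the observed mean similarity and the fraction of shuffled
    pairings that score at least as well (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = mean(similarity(r, t) for r, t in zip(recon, truth))

    indices = list(range(len(truth)))
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(indices)  # break the correspondence between pairs
        null_score = mean(similarity(r, truth[j]) for r, j in zip(recon, indices))
        if null_score >= observed:
            at_least_as_good += 1

    p_value = (at_least_as_good + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (assumed for illustration): 8 samples, each a 3-dimensional feature vector.
truth = [[float(i), float(2 * i), float(3 * i)] for i in range(8)]
recon = [list(t) for t in truth]  # a "perfect" reconstruction

observed, p_value = permutation_p_value(recon, truth)
print(f"observed score = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value indicates that the correct pairing scores better than almost all shuffled pairings, which is the kind of evidence one would want before attributing reconstructed dynamics to the input signal rather than to the generator's prior.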