Abstract:

BACKGROUND AND OBJECTIVE: Most studies have used neural activity evoked by linguistic stimuli, such as phrases or sentences, to decode language structure. However, the human brain more commonly perceives the outside world through non-linguistic stimuli such as natural images, so relying only on linguistic stimuli cannot fully capture the information the brain perceives. To address this, an end-to-end model is needed that maps visual neural activity evoked by non-linguistic stimuli to the visual content.

METHODS: Inspired by the success of the Transformer network in neural machine translation and of the convolutional neural network (CNN) in computer vision, we construct a CNN-Transformer hybrid language decoding model, trained end to end, that decodes functional magnetic resonance imaging (fMRI) signals evoked by natural images into descriptive texts about the visual stimuli. Specifically, a two-layer 1D CNN first extracts a semantic sequence from the multi-time visual neural activity; the model then encodes this sequence into a multi-level abstract representation and decodes that representation, step by step, into an English sentence.

RESULTS: Experimental results show that the decoded texts are semantically consistent with the corresponding ground-truth annotations. Additionally, by varying the number of encoding and decoding layers and modifying the original positional encoding of the Transformer, we found that this task requires a specific Transformer architecture.

CONCLUSIONS: The results indicate that the proposed model can decode visual neural activity evoked by natural images into sentence-level descriptive text about the visual stimuli. It may therefore serve in the future as a computer-aided tool that helps neuroscientists understand the neural mechanisms of visual information processing in the human brain.
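As a rough illustration of the front end described in the Methods, the following minimal sketch applies two 1D convolutions along the time axis of a multi-time fMRI recording and adds a positional encoding to the resulting sequence before it would be passed to a Transformer encoder. This is not the authors' implementation: the function names, kernel weights, and toy input sizes are hypothetical, the convolution shares one temporal kernel across all features for brevity, and the positional encoding shown is the standard sinusoidal form from the original Transformer rather than the modified variant the abstract mentions.

```python
import math


def conv1d_time(seq, kernel, bias=0.0):
    """Valid 1D convolution along time, followed by ReLU.

    seq:    list of T feature vectors (each a list of D floats)
    kernel: list of K temporal weights, shared across features
            (a simplification assumed for this sketch)
    """
    n_steps, k_size = len(seq), len(kernel)
    out = []
    for t in range(n_steps - k_size + 1):
        row = []
        for d in range(len(seq[0])):
            s = bias + sum(kernel[k] * seq[t + k][d] for k in range(k_size))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def sinusoidal_positional_encoding(length, dim):
    """Standard sinusoidal encoding: sin on even indices, cos on odd ones."""
    pe = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def extract_sequence(fmri_frames, kernel1, kernel2):
    """Two stacked temporal convolutions, then add positional encoding."""
    h = conv1d_time(fmri_frames, kernel1)
    h = conv1d_time(h, kernel2)
    pe = sinusoidal_positional_encoding(len(h), len(h[0]))
    return [[x + p for x, p in zip(h_row, pe_row)]
            for h_row, pe_row in zip(h, pe)]


# Toy input: 6 time points, 4 voxel features each (hypothetical sizes).
frames = [[0.1 * (t + d) for d in range(4)] for t in range(6)]
sequence = extract_sequence(frames, [0.5, 0.5], [1.0, -0.5])
```

Each valid convolution with a kernel of length 2 shortens the sequence by one step, so the six input time points yield a four-step sequence here. In a full model, a Transformer encoder-decoder would consume this sequence and generate the sentence token by token.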

BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models

MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model

Brain Captioning: Decoding human brain activity into images and text

Decoding Visual Experience and Mapping Semantics through Whole-Brain Analysis Using fMRI Foundation Models

BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding

Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction

MindGPT: Interpreting What You See with Non-invasive Brain Recordings

Bridging the Semantic Latent Space Between Brain and Machine: Similarity is All You Need

Mind captioning: Evolving descriptive text of mental content from human brain activity

Describing Semantic Representations of Brain Activity Evoked by Visual Stimuli

UniBrain: Unify Image Reconstruction and Captioning All in One Diffusion Model from Human Brain Activity

DreamCatcher: Revealing the Language of the Brain with fMRI using GPT Embedding

Brain2Word: Decoding Brain Activity for Language Generation

Modality-Agnostic fMRI Decoding of Vision and Language

BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity

A dual‐channel language decoding from brain activity with progressive transfer training

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

A CNN-transformer hybrid approach for decoding visual neural activity into text

Brain decoding: toward real-time reconstruction of visual perception

NeuroCine: Decoding Vivid Video Sequences from Human Brain Activties

LLM4Brain: Training a Large Language Model for Brain Video Understanding