Abstract: Electroencephalography (EEG)-based visual perception reconstruction has become an important area of research. Neuroscientific studies indicate that humans can decode imagined 3D objects by perceiving or imagining various visual information, such as color, shape, and rotation. Existing EEG-based visual decoding methods typically focus only on the reconstruction of 2D visual stimulus images and face various challenges in generation quality, including inconsistencies in texture, shape, and color between the visual stimuli and the reconstructed images. This paper proposes an EEG-based 3D object reconstruction method with style consistency and diffusion priors. The method consists of an EEG-driven multi-task joint learning stage and an EEG-to-3D diffusion stage. The first stage uses a neural EEG encoder based on regional semantic learning, employing a multi-task joint learning scheme that includes a masked EEG signal recovery task and an EEG-based visual classification task. The second stage introduces a latent diffusion model (LDM) fine-tuning strategy with style-conditioned constraints and a neural radiance field (NeRF) optimization strategy. This strategy explicitly embeds semantic- and location-aware latent EEG codes and combines them with visual stimulus maps to fine-tune the LDM. The fine-tuned LDM serves as a diffusion prior, which, combined with the style loss of visual stimuli, is used to optimize NeRF for generating 3D objects. Finally, through experimental validation, we demonstrate that this method can effectively use EEG data to reconstruct 3D objects with style consistency.
### What problems does this paper attempt to solve?
This paper addresses the challenges of reconstructing 3D objects from electroencephalogram (EEG) signals, in particular how to keep the reconstructed 3D objects consistent in style with the visual stimuli. Existing EEG-based visual decoding methods focus mainly on reconstructing 2D visual stimulus images and suffer from generation-quality problems, such as inconsistent texture, shape, and color between the stimuli and the reconstructions.
#### Main problems include:
1. **Reconstructing 3D objects from EEG signals**: Most of the existing research focuses on the reconstruction of 2D images, and the reconstruction of 3D objects has not been deeply explored yet.
2. **Style consistency**: The reconstructed 3D objects need to be consistent in style with the original visual stimuli, including visual features such as color and shape.
3. **Semantic and location awareness**: In order to better capture the semantic information in EEG signals, a method that can understand regional semantic features needs to be developed.
4. **Generation quality**: The generated 3D objects need to be accurate in geometric structure and also closely consistent in visual style with the original stimulus images.
To solve these problems, the authors propose a new framework that combines multi-task joint learning with a diffusion model and achieves high-quality 3D object reconstruction through neural radiance field (NeRF) optimization. The method is divided into two stages:
- **First stage**: Use the neural EEG encoder for multi-task joint learning, including the masked EEG signal recovery task and the EEG-based visual classification task, to capture regional semantic features.
- **Second stage**: Introduce the latent diffusion model (LDM) fine-tuning strategy and the NeRF optimization strategy: fine-tune the LDM with style constraints and visual stimulus maps, then use the fine-tuned LDM as a diffusion prior, together with a style loss, to optimize the NeRF and generate style-consistent 3D objects.
According to the authors' experiments, this method can effectively use EEG data to reconstruct 3D objects with style consistency, advancing research on EEG-based visual reconstruction.
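To make the first stage's masked EEG signal recovery task more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the patch-based masking, the `mask_ratio` and `patch_len` parameters, the zero fill value, and the use of mean squared error on masked positions are all assumptions chosen for illustration, and the function names `mask_eeg` and `recovery_loss` are hypothetical.

```python
import random


def mask_eeg(signal, mask_ratio=0.5, patch_len=4, seed=0):
    """Zero out a random subset of fixed-length time patches in one EEG channel.

    Assumption: masking is applied per contiguous time patch; the summary does
    not say how the paper chooses what to mask.
    """
    rng = random.Random(seed)
    n_patches = len(signal) // patch_len
    n_masked = int(n_patches * mask_ratio)
    masked_patches = set(rng.sample(range(n_patches), n_masked))

    masked = list(signal)
    mask = [False] * len(signal)
    for p in masked_patches:
        for i in range(p * patch_len, (p + 1) * patch_len):
            masked[i] = 0.0   # hidden from the encoder
            mask[i] = True    # remembered so the loss can target it
    return masked, mask


def recovery_loss(original, reconstructed, mask):
    """Mean squared error computed over masked positions only (an assumption)."""
    errs = [(o - r) ** 2 for o, r, m in zip(original, reconstructed, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


# Toy usage: a 16-sample "channel"; a real encoder would predict the hidden patches.
signal = [float(i + 1) for i in range(16)]
masked, mask = mask_eeg(signal, mask_ratio=0.5, patch_len=4, seed=0)
print(sum(mask), recovery_loss(signal, masked, mask))
```

In a joint-learning setup, a recovery term of this kind would be optimized together with a classification loss on the same encoder output; how the two are weighted is not stated in this summary.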
### Formula summary
- **Diffusion model loss function**:
\[
L_{\text{ldm}}=\mathbb{E}_{z, \epsilon \sim \mathcal{N}(0,1), t}\left[\left\|\epsilon-\epsilon_{\theta}(z_{t}, t, \tau_{\theta}(y))\right\|_{2}^{2}\right]
\]
- **Regional semantic loss function**:
\[
L_{\text{region}} =-\frac{1}{N}\sum_{i = 1}^{N}\sum_{k = 1}^{M}p_{i,k}\cdot\log(\hat{p}_{i,k})
\]
- **Comprehensive loss function**:
\[
L_{\text{ldm-region}}=\lambda_{\text{ldm}}L_{\text{ldm}}+\lambda_{\text{region}}L_{\text{region}}
\]
These losses guide the training of the model, with the aim that the generated 3D objects are consistent with the original stimulus images in both geometric structure and visual style.
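As a rough illustration of how the three losses fit together numerically, the following plain-Python sketch evaluates each one on toy data. It makes several assumptions that are not in the source: the noise prediction \(\epsilon_{\theta}\) is passed in as precomputed lists rather than produced by a denoising network, the expectation in \(L_{\text{ldm}}\) is approximated by a sample mean, \(N\) and \(M\) are read as the number of rows and columns of the probability tables \(p\) and \(\hat{p}\), and the default weights are placeholders. The function names are hypothetical.

```python
import math


def ldm_loss(eps, eps_pred):
    """Sample-mean estimate of L_ldm: average of ||eps - eps_pred||_2^2."""
    per_sample = [
        sum((e - p) ** 2 for e, p in zip(e_vec, p_vec))
        for e_vec, p_vec in zip(eps, eps_pred)
    ]
    return sum(per_sample) / len(per_sample)


def region_loss(p, p_hat, tiny=1e-12):
    """L_region: cross-entropy summed over the M columns, averaged over the N rows."""
    total = 0.0
    for target_row, pred_row in zip(p, p_hat):
        # max(q, tiny) guards against log(0); this clamp is an implementation choice.
        total += sum(t * math.log(max(q, tiny)) for t, q in zip(target_row, pred_row))
    return -total / len(p)


def combined_loss(l_ldm, l_region, lambda_ldm=1.0, lambda_region=1.0):
    """L_ldm-region: weighted sum of the two terms (default weights are placeholders)."""
    return lambda_ldm * l_ldm + lambda_region * l_region


# Toy usage with two noise samples and two rows of class probabilities.
eps = [[1.0, 0.0], [0.0, 2.0]]
eps_pred = [[0.0, 0.0], [0.0, 0.0]]
p = [[1.0, 0.0], [0.0, 1.0]]
p_hat = [[0.5, 0.5], [0.25, 0.75]]

l_ldm = ldm_loss(eps, eps_pred)
l_region = region_loss(p, p_hat)
print(l_ldm, l_region, combined_loss(l_ldm, l_region, lambda_region=0.5))
```

The weights \(\lambda_{\text{ldm}}\) and \(\lambda_{\text{region}}\) trade off denoising accuracy against the regional semantic term; the values used in the paper are not given in this summary.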