Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li, Penghong Wang, Ruiqin Xiong, Xiaopeng Fan
2024-07-11
Abstract: Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with Transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages the temporal and semantic information from different time steps to generate robust representations. A time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose a global-local pooling (GLP) operation that combines max and average pooling. Furthermore, the thresholds of the spiking neurons are dynamically adjusted based on semantic and temporal cues. Integrating the temporal and semantic information extracted by SNNs and Transformers is difficult because a straightforward bilinear model greatly increases the number of parameters. To address this, we introduce a temporal-semantic Tucker fusion module, which achieves multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. Our experimental results demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art performance on three benchmark datasets. The harmonic mean (HM) improvements on VGGSound, UCF101, and ActivityNet are around 15.4%, 3.9%, and 14.9%, respectively.
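The parameter argument behind the Tucker fusion module can be illustrated with a small NumPy sketch. It is not the authors' implementation: the feature sizes, the ranks, and the names `tucker_fuse` and `full_bilinear_tensor` are hypothetical, and the inputs are plain vectors standing in for SNN and Transformer features. The sketch only shows that a Tucker-factored bilinear map computes the same second-order interaction as the full three-way tensor it represents, with far fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_s, d_t, d_o = 64, 64, 32   # SNN feature, Transformer feature, fused output
r_s, r_t, r_o = 8, 8, 8      # Tucker ranks (assumed, not from the paper)

# Tucker factors: three factor matrices and a small core tensor.
W_s = rng.standard_normal((d_s, r_s))
W_t = rng.standard_normal((d_t, r_t))
W_o = rng.standard_normal((r_o, d_o))
core = rng.standard_normal((r_s, r_t, r_o))


def tucker_fuse(s, t):
    """Fuse two feature vectors through the factored bilinear map."""
    s_proj = s @ W_s                                    # shape (r_s,)
    t_proj = t @ W_t                                    # shape (r_t,)
    z = np.einsum("i,j,ijk->k", s_proj, t_proj, core)   # shape (r_o,)
    return z @ W_o                                      # shape (d_o,)


def full_bilinear_tensor():
    """Rebuild the full (d_s, d_t, d_o) tensor that the factors represent."""
    return np.einsum("ai,bj,ijk,ko->abo", W_s, W_t, core, W_o)


# Parameter counts: unfactored bilinear tensor vs. Tucker factors.
full_params = d_s * d_t * d_o
tucker_params = W_s.size + W_t.size + W_o.size + core.size
print(full_params, tucker_params)
```

With these illustrative sizes the unfactored tensor needs 131072 parameters while the Tucker factors need 1792, yet every pair of input dimensions still interacts through the core tensor.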
Multimedia, Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to address several key challenges in Audio-Visual Zero-Shot Learning (AV-ZSL). Specifically, it attempts to solve the following problems:

1. **Time Steps**: Most existing Spiking Neural Networks (SNNs) obtain the final output through a fixed time step, which not only ignores the importance of different layers in encoding time series but also leads to significant fluctuations in SNN performance.
2. **Spiking Redundancy**: There is redundancy in the output of SNNs. Noisy spikes are highly correlated in the time and space dimensions and are closely related to the spike firing frequency and neuron location. Finding the balance point between the spike neuron firing frequency and accuracy is crucial for reducing the redundancy of SNNs.
3. **Output Heterogeneity**: The output data distributions of SNNs and Transformers differ significantly: SNNs produce binary spike sequences, whereas Transformers produce floating-point features. Efficiently integrating features from these different distributions is crucial for unleashing the potential of SNNs.

To address these challenges, the authors propose a new Spiking Tucker Fusion Transformer (STFT), with the following main contributions:

- **Proposing a new STFT model**: STFT effectively couples SNNs and Transformers, combines temporal and semantic information at different time steps, and generates robust representations.
- **Introducing the Temporal-Semantic Tucker Fusion Module**: This module achieves multi-scale fusion of the outputs of SNNs and Transformers while maintaining full second-order interactions. This helps to integrate temporal and semantic information effectively and provides a comprehensive audio-visual data representation.
- **Dynamically adjusting the thresholds of spiking neurons**: The thresholds are adjusted based on semantic and temporal cues, which reduces spike noise and improves the robustness of the model.
- **Global-Local Pooling (GLP)**: GLP combines max pooling and average pooling to guide the formation of the input membrane potential and to generate input features based on global and local characteristics.

Through these innovations, STFT outperforms existing methods on three benchmark datasets (VGGSound, UCF101, and ActivityNet), increasing the Harmonic Mean (HM) by around 15.4%, 3.9%, and 14.9%, respectively.
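The idea of combining max and average pooling can be sketched in a few lines of NumPy. This is only a minimal illustration under assumptions: the summary does not say how the two pooled signals are combined, so the convex blend with weight `alpha`, the non-overlapping `window`, and the function name `global_local_pool` are all hypothetical.

```python
import numpy as np


def global_local_pool(x, window=2, alpha=0.5):
    """Blend max pooling and average pooling over non-overlapping time windows.

    x      : array of shape (time, features)
    window : number of time steps per pooling window (illustrative)
    alpha  : weight on the max-pooled branch (illustrative, not from the paper)
    """
    x = np.asarray(x, dtype=float)
    steps = (x.shape[0] // window) * window        # drop any ragged tail
    blocks = x[:steps].reshape(-1, window, x.shape[1])
    max_part = blocks.max(axis=1)                  # keeps the strongest response
    avg_part = blocks.mean(axis=1)                 # keeps the smoothed context
    return alpha * max_part + (1.0 - alpha) * avg_part


features = np.array([[1.0, 2.0],
                     [3.0, 0.0],
                     [5.0, 5.0],
                     [1.0, 1.0]])
pooled = global_local_pool(features, window=2, alpha=0.5)
print(pooled)  # [[2.5 1.5], [4. 4.]] as a 2x2 array
```

Averaging alone would dilute isolated strong responses, while max pooling alone would pass noisy peaks through unchanged; a blend is one simple way to trade the two off before the values drive a membrane potential.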