Abstract: Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with Transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages the temporal and semantic information from different time steps to generate robust representations. A time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose global-local pooling (GLP), which combines the max and average pooling operations. Furthermore, the thresholds of the spiking neurons are dynamically adjusted based on semantic and temporal cues. Integrating the temporal and semantic information extracted by SNNs and Transformers is difficult because a straightforward bilinear model greatly increases the number of parameters. To address this, we introduce a temporal-semantic Tucker fusion module, which achieves multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. Our experimental results demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art performance on three benchmark datasets. The harmonic mean (HM) improvements on VGGSound, UCF101, and ActivityNet are around 15.4%, 3.9%, and 14.9%, respectively.
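The parameter argument in the abstract can be made concrete with a small sketch. The code below is not the authors' module. It is a minimal, dependency-free illustration of a Tucker-factorised bilinear fusion. All names (`project`, `tucker_fuse`, `param_counts`, `rand_matrix`) and all dimensions and ranks are hypothetical, and the SNN feature is assumed to have already been reduced to a fixed-length vector (here a random 0/1 vector stands in for it).

```python
import random

def project(x, W):
    """Multiply vector x (length d) by matrix W (d rows, r columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def tucker_fuse(s, t, W_s, W_t, core, W_o):
    """Fuse an SNN-derived feature s with a Transformer feature t.

    The full bilinear tensor T[a][b][o] is never built. It is implied by
    the Tucker factors: T = core x1 W_s x2 W_t x3 W_o.
    """
    s_proj = project(s, W_s)  # length r_s
    t_proj = project(t, W_t)  # length r_t
    r_o = len(core[0][0])
    # Every pair (s_proj[i], t_proj[j]) interacts through the core tensor,
    # so second-order interactions are kept in the reduced space.
    z = [sum(s_proj[i] * t_proj[j] * core[i][j][k]
             for i in range(len(s_proj)) for j in range(len(t_proj)))
         for k in range(r_o)]
    return project(z, W_o)    # length d_o

def param_counts(d_s, d_t, d_o, r_s, r_t, r_o):
    """Parameter count of a full bilinear tensor vs. its Tucker factors."""
    full = d_s * d_t * d_o
    tucker = d_s * r_s + d_t * r_t + r_s * r_t * r_o + r_o * d_o
    return full, tucker

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes, chosen only to keep the example fast.
rng = random.Random(0)
d_s, d_t, d_o, r_s, r_t, r_o = 6, 5, 4, 3, 3, 2
W_s = rand_matrix(d_s, r_s, rng)
W_t = rand_matrix(d_t, r_t, rng)
W_o = rand_matrix(r_o, d_o, rng)
core = [rand_matrix(r_t, r_o, rng) for _ in range(r_s)]  # shape r_s x r_t x r_o

s = [float(rng.random() < 0.3) for _ in range(d_s)]  # stand-in for a spike-derived feature
t = [rng.uniform(-1, 1) for _ in range(d_t)]         # stand-in for a Transformer feature
fused = tucker_fuse(s, t, W_s, W_t, core, W_o)

print(len(fused), param_counts(512, 512, 256, 32, 32, 32))
```

With the illustrative sizes passed to `param_counts` (two 512-dimensional inputs, a 256-dimensional output, and ranks of 32), a full bilinear tensor would need 67,108,864 parameters, while the Tucker factors need 73,728. The fused output still depends on every pairwise product of the projected inputs.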

What problem does this paper attempt to address?

This paper aims to address several key challenges in Audio-Visual Zero-Shot Learning (AV-ZSL). Specifically, it attempts to solve the following problems:

1. **Time Steps**: Most existing Spiking Neural Networks (SNNs) obtain the final output through a fixed time step, which not only ignores the importance of different layers in encoding time series but also leads to significant fluctuations in SNN performance.
2. **Spiking Redundancy**: There is redundancy in the output of SNNs. Noisy spikes are highly correlated in the time and space dimensions and are closely related to the spike firing frequency and neuron location. Finding the balance point between the spike firing frequency and accuracy is crucial for reducing the redundancy of SNNs.
3. **Output Heterogeneity**: The output data distributions of SNNs and Transformers differ significantly: binary spike sequences and floating-point features, respectively. Efficiently integrating features from these different distributions is crucial for unleashing the potential of SNNs.

To address these challenges, the authors propose a new Spiking Tucker Fusion Transformer (STFT), with the following main contributions:

- **Proposing a new STFT model**: STFT effectively couples SNNs and Transformers, combines temporal and semantic information at different time steps, and generates robust representations.
- **Introducing the Temporal-Semantic Tucker Fusion Module**: This module achieves multi-scale fusion of the outputs of SNNs and Transformers while maintaining full second-order interactions. This helps to integrate temporal and semantic information effectively and to provide a comprehensive audio-visual data representation.
- **Dynamically adjusting the thresholds of spiking neurons**: The thresholds of spiking neurons are adjusted dynamically based on semantic and temporal cues, which reduces spike noise and improves the robustness of the model.
- **Global-Local Pooling (GLP)**: GLP combines the max-pooling and average-pooling operations to guide the formation of the input membrane potential and to generate input features based on global and local characteristics.

Through these innovations, STFT outperforms existing methods on three benchmark datasets (VGGSound, UCF101, and ActivityNet), increasing the Harmonic Mean (HM) by around 15.4%, 3.9%, and 14.9%, respectively.
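The pooling and threshold ideas listed above can be illustrated with a short sketch. It is not the paper's formulation. The weighted blend of max and average pooling, the parameter `alpha`, and the spike-triggered threshold rule are all assumptions swapped in for illustration (the paper adjusts thresholds from semantic and temporal cues, which this toy rule does not model). The function names `global_local_pool` and `lif_dynamic_threshold` are hypothetical.

```python
def global_local_pool(x, window=2, alpha=0.5):
    """Blend max pooling and average pooling over non-overlapping windows.

    Assumption: the two pooled values are mixed with a fixed weight alpha.
    """
    pooled = []
    for start in range(0, len(x) - window + 1, window):
        chunk = x[start:start + window]
        pooled.append(alpha * max(chunk) + (1 - alpha) * sum(chunk) / len(chunk))
    return pooled

def lif_dynamic_threshold(currents, base_threshold=1.0, decay=0.5, adapt=0.5):
    """Leaky integrate-and-fire neuron with a simple adaptive threshold.

    Assumption: the threshold rises by `adapt` after each spike and relaxes
    back toward `base_threshold` on silent steps. With adapt=0 this reduces
    to a fixed-threshold neuron.
    """
    v, threshold, spikes = 0.0, base_threshold, []
    for current in currents:
        v = decay * v + current       # leaky integration of the input
        if v >= threshold:
            spikes.append(1)
            v = 0.0                   # hard reset after firing
            threshold += adapt        # firing again immediately gets harder
        else:
            spikes.append(0)
            threshold = base_threshold + decay * (threshold - base_threshold)
    return spikes

# Pooled features act as the input current that drives the membrane potential.
features = [0.2, 1.0, 0.4, 0.6, 1.4, 0.2, 0.0, 0.8]
pooled = global_local_pool(features)
spikes = lif_dynamic_threshold(pooled)

# Under a constant strong input, the adaptive threshold fires less often.
fixed = lif_dynamic_threshold([1.2] * 6, adapt=0.0)
adaptive = lif_dynamic_threshold([1.2] * 6, adapt=0.5)
print(pooled, spikes, sum(fixed), sum(adaptive))
```

In this toy setting the fixed-threshold neuron fires on every step of the constant input, while the adaptive one fires on every other step. That is one simple way to trade firing frequency against redundancy.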

Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning
