Abstract: Scene recognition aims to automatically comprehend scenes and is widely used in fields such as autonomous driving, intelligent security, and robotics. Current research predominantly employs local audio feature extractors, so the extracted features cannot capture long-range contextual characteristics. Moreover, most studies assume that the features of each modality are equally important. Our work addresses both limitations: it introduces a long-range audio feature extractor and employs a self-attention module to re-weight the features of different modalities. We propose a visual-audio fusion model based on a self-attention-based graph convolutional neural network (SAGCN). In this model, we introduce an attention-based cross-modal learning module into a structured multi-modal fusion network and integrate the features extracted from the different modalities to achieve precise scene recognition. The proposed model achieves an accuracy of 93.1% on a standard multi-modal scene recognition dataset, the TAU dataset. Compared with standard early and late fusion methods, prediction accuracy improves by 1.4% and 10%, respectively. Compared with state-of-the-art (SOTA) methods, SAGCN exceeds the TAU baseline and the attentional graph convolutional network on the TAU dataset by 8.3% and 1.5%, respectively, and achieves 95.0% accuracy on the UCF101 dataset, outperforming the evolved loss method by 1.2% and the cross-modal deep clustering method by 0.8%. The code is available at https://github.com/submission1234/SAGCN.
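As a rough illustration of the re-weighting idea mentioned in the abstract, the sketch below applies plain scaled dot-product self-attention across one feature vector per modality. It is not the SAGCN implementation: the function name `self_attention_reweight`, the toy `visual` and `audio` vectors, and the use of identity query/key/value projections (no learned weights, no graph convolution) are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention_reweight(features):
    """Re-weight modality features with scaled dot-product self-attention.

    features: list of equal-length vectors, one per modality (hypothetical input).
    Returns (reweighted, weights), where weights[i][j] is how much
    modality i attends to modality j.
    Assumption: queries, keys, and values are the raw features themselves;
    a trained model would use learned projection matrices instead.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    weights, reweighted = [], []
    for query in features:
        scores = [dot(query, key) / scale for key in features]
        row = softmax(scores)
        weights.append(row)
        # Each output is a convex combination of all modality vectors.
        mixed = [
            sum(row[j] * features[j][t] for j in range(len(features)))
            for t in range(dim)
        ]
        reweighted.append(mixed)
    return reweighted, weights


# Toy feature vectors standing in for visual and audio embeddings.
visual = [0.9, 0.1, 0.4, 0.2]
audio = [0.2, 0.8, 0.1, 0.5]
fused, attn = self_attention_reweight([visual, audio])
```

Each row of `attn` sums to 1, so the modalities no longer contribute with fixed, equal importance; the weights depend on the features themselves.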

Multi-modal Attention Mechanisms in LSTM and Its Application to Acoustic Scene Classification

Acoustic scene classification by feed forward neural network with class dependent attention mechanism

Event-Based Multimodal Spiking Neural Network with Attention Mechanism

Data Independent Sequence Augmentation Method for Acoustic Scene Classification.

Learning Multimodal Attention LSTM Networks for Video Captioning.

Multi-stream Network With Temporal Attention For Environmental Sound Classification

Spatio-Temporal Attention Pooling for Audio Scene Classification

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

High-Resolution Attention Network with Acoustic Segment Model for Acoustic Scene Classification

Multi-layer Attention Mechanism for Speech Keyword Recognition

Speech Emotion Recognition Using Multi-hop Attention Mechanism

Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network

Future Context Attention for Unidirectional LSTM Based Acoustic Model

Bidirectional LSTM with attention mechanism and convolutional layer for text classification

Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling.

Jointly Trained Sequential Labeling and Classification by Sparse Attention Neural Networks.

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention

Audio-visual scene recognition using attention-based graph convolutional model

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification