Abstract: Scene recognition aims to automatically comprehend scenes and is widely used in fields such as autonomous driving, intelligent security, and robotics. Current research predominantly employs local audio feature extractors, so the extracted features cannot capture long-range contextual characteristics. Moreover, most studies assume that the features of each modality are equally important. Our work addresses both limitations: it introduces a long-range audio feature extractor and employs a self-attention module to re-weight the features of different modalities. We propose a visual-audio fusion model based on a self-attention-based graph convolutional neural network (SAGCN). In this model, we introduce an attention-based cross-modal learning module into a structured multi-modal fusion network and integrate the features extracted from different modalities to achieve precise scene recognition. The proposed model achieves an accuracy of 93.1% on the TAU dataset, a standard multi-modal scene recognition dataset. Compared with standard early and late fusion methods, prediction accuracy improves by 1.4% and 10%, respectively. Compared with state-of-the-art methods, SAGCN exceeds the TAU baseline and the attentional graph convolutional network on the TAU dataset by 8.3% and 1.5%, respectively, and achieves 95.0% accuracy on the UCF101 dataset, outperforming the evolved loss method by 1.2% and the cross-modal deep clustering method by 0.8%. The code is available at https://github.com/submission1234/SAGCN.
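
The idea of re-weighting modality features with self-attention can be illustrated with a minimal NumPy sketch. This is not the authors' SAGCN implementation: the function name `self_attention_reweight`, the feature dimension, the random projection matrices, and the concatenation-based fusion are all assumptions chosen for illustration. It uses plain scaled dot-product attention over one audio and one visual feature vector and omits the graph convolution entirely.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_reweight(features, w_q, w_k, w_v):
    """Hypothetical helper: scaled dot-product self-attention over modalities.

    features: array of shape (n_modalities, d), one row per modality.
    Returns the re-weighted features (n_modalities, d_v) and the
    attention matrix (n_modalities, n_modalities).
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1
    return attn @ v, attn


# Stand-in features; a real system would take these from the audio and
# visual feature extractors (assumed here, not specified by the abstract).
rng = np.random.default_rng(0)
d = 8
audio_feat = rng.normal(size=d)
visual_feat = rng.normal(size=d)
features = np.stack([audio_feat, visual_feat])  # shape (2, d)

# Randomly initialised projections stand in for learned parameters.
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

reweighted, attn = self_attention_reweight(features, w_q, w_k, w_v)

# Assumed fusion step: concatenate the re-weighted modality features.
fused = reweighted.reshape(-1)  # shape (2 * d,)
```

Each row of `attn` shows how strongly one modality attends to the other, so the modalities no longer contribute with fixed, equal importance. A classifier would then be trained on `fused`.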

Improved Fusion of Visual and Semantic Representations by Gated Co-Attention for Scene Text Recognition.

Scene Text Recognition Via Gated Cascade Attention

Gaussian Constrained Attention Network for Scene Text Recognition

Efficient Scene Text Detection with Textual Attention Tower

Scene Graph Based Fusion Network For Image-Text Retrieval

Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling.

Deep Neural Network with Attention Model for Scene Text Recognition.

SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering

Audio-visual scene recognition using attention-based graph convolutional model

Hierarchical Refined Attention for Scene Text Recognition.

CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification

Learning and Fusing Multi-Scale Representations for Accurate Arbitrary-Shaped Scene Text Recognition.

Scene Text Recognition with Cascade Attention Network.

Scene Text Recognition with Temporal Convolutional Encoder

Attention-based Feature Decomposition-Reconstruction Network for Scene Text Detection

Scene Text Recognition from Two-Dimensional Perspective

Aligning Where to See and What to Tell: Image Caption with Region-Based Attention and Scene Factorization

TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Reading Scene Text with Attention Convolutional Sequence Modeling