Multimodal scene classification for encoder-assisted videos

HUANG Tianyang, HOU Yuanbo, LI Shengchen, SHAO Xi
DOI: https://doi.org/10.14132/j.cnki.1673-5439.2023.01.013
2023-01-01
Abstract: To address the low accuracy of multimodal scene classification, this paper proposes a multimodal scene classification method assisted by a mutual encoder. First, the audio branch extracts features from the input audio and applies a self-attention mechanism to obtain attention information, while the image branch extracts frames from the video and computes their features with ResNet50. Second, the extracted features of the two modalities are fed into the mutual encoder, which fuses them by extracting the hidden-layer features of each modality. The fused features are then combined with the attention mechanism to assist the video features. In this model, the mutual encoder serves as an auxiliary system for feature fusion. Experiments on the DCASE2021 Challenge Task 1B dataset show that the mutual encoder improves classification accuracy.
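The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify layer sizes, the form of the mutual encoder, or how the fused features assist the video features. The sketch assumes that the mutual encoder projects each modality into a shared hidden space and concatenates the results, and that the fused features produce temporal attention weights used to pool the video features. All function names (`self_attention`, `mutual_encode`, `classify`, `init_params`), the hidden size, and the random weights are hypothetical. The 2048-dimensional video features match the pooled output of ResNet50, and 10 classes match the DCASE2021 Task 1B label set.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x has shape (T, d): T time steps, d features per step.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])  # (T, T)
    return softmax(scores, axis=-1) @ x      # (T, d)


def mutual_encode(audio, video, w_a, w_v):
    """Assumed mutual encoder: hidden-layer features of each modality, concatenated."""
    h_a = np.tanh(audio @ w_a)                  # (T, hidden)
    h_v = np.tanh(video @ w_v)                  # (T, hidden)
    return np.concatenate([h_a, h_v], axis=-1)  # (T, 2 * hidden)


def init_params(rng, d_a, d_v, hidden, n_classes):
    """Small random weights standing in for trained parameters."""
    return {
        "w_a": 0.01 * rng.standard_normal((d_a, hidden)),
        "w_v": 0.01 * rng.standard_normal((d_v, hidden)),
        "w_g": 0.01 * rng.standard_normal((2 * hidden, 1)),
        "w_out": 0.01 * rng.standard_normal((d_v, n_classes)),
    }


def classify(audio, video, params):
    """Return class probabilities for one clip.

    audio: (T, d_a) audio features; video: (T, d_v) per-frame image features.
    """
    attended_audio = self_attention(audio)
    fused = mutual_encode(attended_audio, video, params["w_a"], params["w_v"])
    # Assumption: fused features yield attention weights over time
    # that reweight (assist) the video features before classification.
    weights = softmax(fused @ params["w_g"], axis=0)  # (T, 1)
    pooled = (weights * video).sum(axis=0)            # (d_v,)
    return softmax(pooled @ params["w_out"])          # (n_classes,)
```

With random stand-in features for a clip of 10 frames, `classify(audio, video, init_params(np.random.default_rng(0), 64, 2048, 32, 10))` returns a length-10 probability vector. The weights are untrained, so the output only demonstrates the data flow.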