A Video Description Model with Improved Attention Mechanism

Feiyan Huang, Shangyou Zeng, Jie Ke, Songtong Lei, JinJin Wang
DOI: https://doi.org/10.1088/1742-6596/2384/1/012015
2022-12-08
Journal of Physics: Conference Series
Abstract: Video description generation is the automatic generation of textual descriptions of videos by a computer; it lies at the intersection of computer vision and natural language processing. Traditional attention mechanisms extract video features inadequately, make the model complex, and yield low-quality descriptions. To address these problems, this paper proposes a video description model with an improved attention mechanism. The model follows an encoder-decoder structure: Inception-v4 serves as the encoder to extract features, and a lightweight coordinate attention (CA) module is introduced into the attention mechanism, which improves feature extraction and reduces model complexity. The important features extracted in this way are fed to the decoder, a long short-term memory (LSTM) network, which generates the sentence describing the video. The model is validated on the MSVD dataset using several evaluation metrics (BLEU, ROUGE-L, CIDEr, METEOR). The experimental results show that the proposed model achieves better scores across these metrics and further improves video description performance.
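The abstract does not spell out how the coordinate attention module is wired into the model, so the following is only a minimal, dependency-free sketch of the general coordinate attention idea: pool a feature map along each spatial direction, mix channels, and use the two resulting gates to reweight the map. The function names, the weight arguments (`w_shared`, `w_h`, `w_w`), and the use of a plain ReLU in place of whatever normalisation and activation a trained module would use are all assumptions made for illustration, not the authors' implementation. A real model would use a tensor library and learned weights.

```python
import math


def _sigmoid(value):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def conv1x1(weight, feats):
    """Mix channels at every position (what a 1x1 convolution does).

    weight: [out_channels][in_channels]
    feats:  [in_channels][length]  ->  [out_channels][length]
    """
    length = len(feats[0])
    return [
        [sum(w * feats[c][p] for c, w in enumerate(row)) for p in range(length)]
        for row in weight
    ]


def coordinate_attention(x, w_shared, w_h, w_w):
    """Reweight a feature map x of shape [C][H][W] (hypothetical helper).

    w_shared: [M][C] shared channel-mixing weights (M = reduced channels)
    w_h, w_w: [C][M] weights producing the height and width gates
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])

    # 1. Direction-aware pooling: average over width, then over height.
    pooled_h = [[sum(x[c][i]) / width for i in range(height)]
                for c in range(channels)]
    pooled_w = [[sum(x[c][i][j] for i in range(height)) / height
                 for j in range(width)] for c in range(channels)]

    # 2. Concatenate both descriptors and apply the shared transform.
    #    ReLU is a simplification chosen for this sketch.
    joint = [pooled_h[c] + pooled_w[c] for c in range(channels)]
    mid = [[max(0.0, v) for v in row] for row in conv1x1(w_shared, joint)]

    # 3. Split back into the two directions and compute sigmoid gates.
    mid_h = [row[:height] for row in mid]
    mid_w = [row[height:] for row in mid]
    gate_h = [[_sigmoid(v) for v in row] for row in conv1x1(w_h, mid_h)]
    gate_w = [[_sigmoid(v) for v in row] for row in conv1x1(w_w, mid_w)]

    # 4. Scale each position by its row gate and its column gate.
    return [
        [[x[c][i][j] * gate_h[c][i] * gate_w[c][j] for j in range(width)]
         for i in range(height)]
        for c in range(channels)
    ]
```

Because both gates depend only on per-row and per-column averages, the extra cost is small compared with the feature map itself, which is consistent with the abstract's description of the module as lightweight. In a full pipeline the reweighted features would come from the Inception-v4 encoder and be passed on to the LSTM decoder; that wiring is not shown here.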