Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning

Hongjun Chen, Hengyi Li, Xueqin Wu
DOI: https://doi.org/10.1145/3380625.3380669
2020-01-17
Abstract: Video captioning describes the objects in a video, their attributes, and their relationships in natural language. It remains a very challenging research topic in computing and multimedia. This paper uses deep learning to extract video frame features, motion information, and video sequence features. It studies two multimodal feature fusion methods, feature cascade and model weighted average fusion, and then examines how the resulting captions are evaluated. The experimental results show that the model using weighted average fusion scores higher than the feature cascade method on every evaluation metric. The feature extraction and multimodal fusion methods presented in this paper have some practical value for video captioning applications.
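The abstract does not say how the two fusion methods are implemented. The sketch below only illustrates the general distinction, under assumptions that are not from the paper: "feature cascade" is taken to mean concatenating the per-modality feature vectors before a single caption model, and "model weighted average fusion" is taken to mean averaging the word-probability outputs of separate per-modality models. The function names, weights, and toy values are hypothetical.

```python
# Illustrative sketch only; the paper's actual architecture is not given
# in the abstract. All names and numbers here are hypothetical.

def cascade_features(*features):
    """Early fusion (assumed): concatenate per-modality feature vectors."""
    fused = []
    for vector in features:
        fused.extend(vector)
    return fused


def weighted_average_fusion(distributions, weights):
    """Late fusion (assumed): weighted mean of per-model word probabilities."""
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    size = len(distributions[0])
    if any(len(d) != size for d in distributions):
        raise ValueError("all distributions must share one vocabulary size")
    return [
        sum(w * d[i] for d, w in zip(distributions, weights)) / total
        for i in range(size)
    ]


# Toy inputs standing in for frame, motion, and sequence features.
frame_feat = [0.2, 0.7]
motion_feat = [0.5]
sequence_feat = [0.1, 0.9]
fused_vector = cascade_features(frame_feat, motion_feat, sequence_feat)

# Toy next-word distributions from two hypothetical single-modality models.
p_frame_model = [0.6, 0.4]
p_motion_model = [0.2, 0.8]
fused_probs = weighted_average_fusion([p_frame_model, p_motion_model], [1.0, 3.0])
```

With concatenation, one model sees a single longer vector (five values in this toy case). With weighted averaging, each modality keeps its own model and only the output distributions are combined, so the weights can reflect how much each model is trusted.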