Deep Relationship Analysis in Video with Multimodal Feature Fusion

Fan Yu, DanDan Wang, Beibei Zhang, Tongwei Ren
DOI: https://doi.org/10.1145/3394171.3416303
2020-01-01
Abstract: In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long-duration video. Specifically, a long video is split into scenes, and the entities in each scene are tracked. Text, audio, and visual features are extracted from each scene to predict the relationships between the entities in that scene. These relationships form a knowledge graph of the video, which can be used to answer queries about the video. Experimental results show that our method performs well for deep video understanding on the HLVU dataset.
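The last two steps described in the abstract (per-scene relationship predictions forming a knowledge graph that is then queried) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fusion operator or how predictions from different scenes are combined, so concatenation and majority voting are assumptions here, and the function names, entity names, and relation labels are all hypothetical.

```python
from collections import Counter, defaultdict


def fuse_features(text_vec, audio_vec, visual_vec):
    """Fuse per-scene modality features by simple concatenation.

    Assumption: the abstract does not state the fusion operator;
    concatenation is used only as the simplest stand-in.
    """
    return list(text_vec) + list(audio_vec) + list(visual_vec)


def build_knowledge_graph(scene_predictions):
    """Aggregate per-scene (subject, relation, object) triples into a graph.

    Assumption: each ordered entity pair keeps the relation predicted
    in the largest number of scenes (majority vote).
    """
    votes = defaultdict(Counter)
    for scene in scene_predictions:
        for subj, rel, obj in scene:
            votes[(subj, obj)][rel] += 1
    # Keep the most frequently predicted relation for every entity pair.
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


def query_relation(graph, subj, obj):
    """Return the relation from subj to obj, or None if the pair is unknown."""
    return graph.get((subj, obj))


# Hypothetical predictions for three scenes of one video.
scenes = [
    [("Anna", "friend_of", "Ben")],
    [("Anna", "friend_of", "Ben"), ("Ben", "works_for", "Carl")],
    [("Anna", "colleague_of", "Ben")],
]
graph = build_knowledge_graph(scenes)
print(query_relation(graph, "Anna", "Ben"))
print(query_relation(graph, "Ben", "Carl"))
```

In this sketch the graph is a plain dictionary keyed by ordered entity pairs, which keeps lookups trivial; a real system would likely store confidence scores and allow several relations per pair.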