Fusing Temporally Distributed Multi-Modal Semantic Clues for Video Question Answering.

Fuwei Zhang, Ruomei Wang, Songhua Xu, Fan Zhou
DOI: https://doi.org/10.1109/ICME51207.2021.9428225
2021-01-01
Abstract: Video Question Answering (VideoQA) is an intriguing topic that is attracting increasing interest across the broad AI community. Yet VideoQA is a difficult task. An algorithm that competently tackles this task needs to be able to: 1) extract the rich semantics supplied in each modality of a video and incorporate them across modalities, and 2) identify and integrate such multi-modal semantics from pertinent moments of a video, which may or may not be temporally adjacent or nearby, while filtering out irrelevant or even distracting portions of the video, to yield the most precise and sensible semantic context for executing the QA task. In response to these requirements, this paper proposes a novel deep VideoQA solution comprising two modules: a multi-modal semantic clue extraction module, driven by a series of deep networks, each dedicated to digesting signals of a distinct modality type, which provides the first capability; and a multi-modal temporal QA module, empowered by a deep graph attention network, which provides the second capability. Comprehensive experiments on publicly available benchmark data validate the advantages of the new solution.
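The two capabilities named in the abstract can be illustrated with a toy sketch. It is not the paper's model: a single question-guided attention step stands in for the deep graph attention network, concatenation stands in for cross-modal fusion, and every name and number below (`fuse_modalities`, `attend_over_clips`, the feature values) is a hypothetical chosen for illustration. The point it shows is that attention weights can select relevant clips even when they are not temporally adjacent.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(appearance, motion, subtitle):
    # Assumption: fusion by plain concatenation of per-clip feature vectors.
    return appearance + motion + subtitle


def attend_over_clips(clip_nodes, question):
    """Score every clip against the question and pool them by attention weight.

    All clips are scored, regardless of their position in time, so relevant
    but non-adjacent moments can both receive high weight.
    """
    scale = math.sqrt(len(question))
    scores = [dot(node, question) / scale for node in clip_nodes]
    weights = softmax(scores)
    dim = len(clip_nodes[0])
    context = [
        sum(w * node[i] for w, node in zip(weights, clip_nodes))
        for i in range(dim)
    ]
    return weights, context


# Four clips in temporal order; clips 0 and 3 carry the relevant signal,
# clips 1 and 2 are distractors. Values are made up for the example.
clips = [
    fuse_modalities([2.0, 0.0], [0.0, 0.0], [2.0, 0.0]),
    fuse_modalities([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
    fuse_modalities([0.0, 1.0], [0.0, 0.0], [0.0, 1.0]),
    fuse_modalities([2.0, 0.0], [0.0, 0.0], [2.0, 0.0]),
]
question = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

weights, context = attend_over_clips(clips, question)
print([round(w, 3) for w in weights])
```

In this toy setup the first and last clips receive the largest weights, while the two middle clips are down-weighted rather than discarded. A learned model would replace the hand-written vectors with network outputs and the dot product with trained attention parameters.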