VLG: General Video Recognition with Web Textual Knowledge

Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, Limin Wang
DOI: https://doi.org/10.1007/s11263-024-02081-z
2024-01-01
Abstract: Video recognition (action recognition) in an open world is challenging because it must handle different settings such as closed-set, long-tail, few-shot, and open-set. Most existing works address each setting separately with different frameworks. These separate investigations ignore the possibility of knowledge sharing across settings and stymie progress in video recognition as well as its application in the real world. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) task of solving recognition problems of different settings within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark, called Kinetics-Text, to facilitate research on GVR. This dataset covers the four settings mentioned above and provides multi-source text descriptions for all action classes, so that external textual knowledge from the Internet can be utilized. Second, inspired by the flexibility of language representation, we analyse the correspondence between a video and the text descriptions of its category and present a unified visual-linguistic framework (VLG) to solve the problem of GVR with an effective two-stage training paradigm. Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then employs a flexible bi-modal attention head to integrate high-level semantic concepts under different settings. Extensive results show that our VLG achieves state-of-the-art performance under all four settings, which demonstrates the effectiveness and generalization ability of the proposed framework. We hope our work is a step towards general video recognition and can serve as a baseline for future research. Code and datasets have been released at https://github.com/MCG-NJU/VLG.