VideoGLUE: Video General Understanding Evaluation of Foundation Models
Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
2023-12-02
Abstract: We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released at: https://github.com/tensorflow/models/tree/master/official/projects/videoglue
Computer Vision and Pattern Recognition