A CLIP-Enhanced Method for Video-Language Understanding

Guohao Li, Feng He, Zhifan Feng
DOI: https://doi.org/10.48550/arXiv.2110.07137
2021-10-14
Abstract: This technical report summarizes our method for the Video-And-Language Understanding Evaluation (VALUE) challenge (https://value-benchmark.github.io/challenge_2021.html). We propose a CLIP-Enhanced method to incorporate image-text pretrained knowledge into downstream video-text tasks. Combined with several other improved designs, our method outperforms the state of the art by $2.4\%$ ($57.58$ to $60.00$) in Meta-Ave score on the VALUE benchmark.
Computer Vision and Pattern Recognition, Computation and Language