Unleash the Potential of CLIP for Video Highlight Detection

Donghoon Han, Seunghyeon Seo, Eunhwan Park, Seong-Uk Nam, Nojun Kwak
2024-04-02
Abstract: Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potentials across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our innovative saliency pooling technique, we have achieved the state-of-the-art performance in the highlight detection task, the QVHighlight Benchmark, to the best of our knowledge.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper explores how a pre-trained multimodal model, specifically CLIP, can be used for video highlight detection. The authors propose Highlight-CLIP (HL-CLIP), which fine-tunes CLIP's multimodal encoder and combines it with a novel saliency pooling technique to detect highlight segments in video. Their motivation is that existing multimodal models perform well on zero-shot text-image matching but fall short on tasks that require spatiotemporal understanding, such as video highlight detection; they therefore aim to improve performance on such tasks by integrating temporal and spatial knowledge into the model. Saliency pooling is applied at the inference stage and further improves highlight detection performance without additional training. According to the authors, HL-CLIP achieves state-of-the-art performance on the QVHighlight benchmark, which suggests that a fine-tuned pre-trained multimodal model alone can be effective and competitive for this task.
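The summary does not spell out how saliency pooling is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each frame's saliency is the cosine similarity between a frame embedding and a text-query embedding (as CLIP-style encoders would produce), and that pooling means averaging each score with its temporal neighbors at inference time. The function names, the `window` parameter, and the toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def frame_saliency(frame_embeddings, query_embedding):
    """Assumed scoring rule: one query-to-frame similarity per frame."""
    return [cosine_similarity(f, query_embedding) for f in frame_embeddings]


def pool_saliency(scores, window=3):
    """Assumed pooling rule: mean over a centered temporal window.

    The window is truncated at the start and end of the video, so edge
    frames are averaged over fewer neighbors. No parameters are learned,
    which is why this step needs no additional training.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    pooled = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        pooled.append(sum(scores[lo:hi]) / (hi - lo))
    return pooled


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs (hypothetical).
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    query = [1.0, 0.0]
    raw = frame_saliency(frames, query)
    print(pool_saliency(raw, window=3))
```

Averaging over neighbors smooths isolated spikes in the per-frame scores, which is one plausible reason a pooling step could help at inference time; the actual window size and pooling operator used by HL-CLIP should be taken from the paper itself.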