Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

2024-03-01

Abstract: The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the difficulty of collecting large-scale, high-quality video-text data. Existing video-language datasets suffer from inaccurate descriptions, complex temporal content, low resolution, and watermarks. The researchers introduce Panda-70M, a large-scale dataset of 70 million video clips, each paired with an automatically generated caption. Candidate captions are produced by multiple cross-modality teacher models that draw on different inputs, such as the textual video description, subtitles, individual frames, and the video itself. A video-to-text retrieval model, fine-tuned on a small subset in which the best caption for each video was selected manually, then picks the best candidate as the annotation. Experiments show that models trained on Panda-70M score substantially better on the majority of metrics in video captioning, video-text retrieval, and text-driven video generation. Additionally, the authors train a student model to learn from the teacher models through knowledge distillation, which makes captioning more efficient.
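
The caption-selection step can be illustrated with a minimal sketch. It assumes that a retrieval model has already embedded the clip and each teacher's candidate caption into a shared vector space, and that the best caption is the one with the highest cosine similarity to the clip. The function names, the toy embeddings, and the use of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_best_caption(clip_embedding, candidates):
    """Return the candidate caption whose embedding is closest to the clip.

    candidates: list of (caption_text, caption_embedding) pairs, one per
    teacher model. The embeddings are assumed to come from a hypothetical
    retrieval model; here they are hand-written toy vectors.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    best_text, _ = max(candidates, key=lambda c: cosine(clip_embedding, c[1]))
    return best_text


# Toy example: three teachers propose captions for one clip.
clip = [0.9, 0.1, 0.0]
candidates = [
    ("a person talks to the camera", [0.1, 0.9, 0.0]),
    ("a panda eats bamboo", [0.8, 0.2, 0.1]),
    ("text on a black screen", [0.0, 0.0, 1.0]),
]
best = select_best_caption(clip, candidates)
print(best)
```

In this toy setup the second caption is selected because its embedding points in nearly the same direction as the clip embedding. A real pipeline would replace the hand-written vectors with the outputs of the fine-tuned retrieval model.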
