MoS2: Mixture of Scale and Shift Experts for Text-Only Video Captioning

Heng Jia, Yunqiu Xu, Linchao Zhu, Guang Chen, Yufei Wang, Yi Yang
DOI: https://doi.org/10.1145/3664647.3680686
2024-01-01
Abstract: Video captioning is a challenging task and typically requires paired video-text data for training. However, manually annotating coherent textual descriptions for videos is laborious and time-consuming. To address this challenge, we propose a novel approach that enhances video captioning using only synthetic text data. Leveraging the exceptional text generation capabilities of large language models (LLMs), we produce high-quality and diverse video captions tailored to the target domain. Our approach employs a two-stage prompting strategy: we first prompt GPT-4 with few-shot target-domain captions to create a set of high-quality captions, and then continue prompting with the generated captions to acquire large-scale synthetic data. To effectively utilize these captions, we introduce Mixture of Scale and Shift experts (MoS2), an efficient adaptation method for pre-trained captioning models. MoS2 employs lightweight routing networks to estimate probability distributions over a collection of scale and shift experts, dynamically allocating tokens to the appropriate experts. This dynamic adjustment mechanism enhances the model's ability to handle data variations and mitigates the distribution shift between synthetic and real captions. Moreover, our method reduces the number of learnable parameters, facilitating more efficient adaptation. Our method achieves superior performance with only synthetic text data, narrowing the gap between zero-shot and fine-tuned models and reducing the dependency on paired data from the target domain.
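The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear router without bias, a soft (softmax-weighted) mixture of experts rather than hard token assignment, and per-channel scale and shift vectors applied to a single token feature. The class name `ScaleShiftMixture` and all parameter names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ScaleShiftMixture:
    """Hypothetical soft mixture of K scale/shift experts for d-dim tokens."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one weight vector per expert (assumed linear, no bias).
        self.router = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts start as the identity transform: scale 1, shift 0.
        self.scales = [[1.0] * dim for _ in range(num_experts)]
        self.shifts = [[0.0] * dim for _ in range(num_experts)]

    def route(self, token):
        """Return a probability distribution over experts for one token."""
        logits = [sum(w * x for w, x in zip(row, token)) for row in self.router]
        return softmax(logits)

    def __call__(self, token):
        probs = self.route(token)
        dim = len(token)
        # Mix the expert parameters with the routing probabilities.
        scale = [sum(p * s[i] for p, s in zip(probs, self.scales))
                 for i in range(dim)]
        shift = [sum(p * b[i] for p, b in zip(probs, self.shifts))
                 for i in range(dim)]
        # Apply the token-specific affine modulation.
        return [x * s + b for x, s, b in zip(token, scale, shift)]


layer = ScaleShiftMixture(dim=4, num_experts=3)
token = [0.5, -1.0, 2.0, 0.0]
output = layer(token)  # equals the input (up to rounding) at initialization
```

In this sketch the only trainable quantities would be the router weights and the expert scale and shift vectors, a small count next to a full captioning model. That matches the abstract's claim of fewer learnable parameters only under the added assumption that the pre-trained model itself stays frozen.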