Video captioning using transformer-based GAN

DOI: https://doi.org/10.1007/s11042-024-19247-z
IF: 2.577
2024-04-24
Multimedia Tools and Applications
Abstract: Video captioning is the process of automatically generating natural language descriptions of video content. Historically, most video captioning methods have extended Sequence-to-Sequence (Seq2Seq) models. However, such approaches are limited by the sequential nature of captions, which leads to less accurate output. To address this limitation, this paper introduces a novel end-to-end architecture for video captioning that combines a conditional Wasserstein Generative Adversarial Network (cWGAN) with a transformer model. The proposed architecture consists of two modules: feature extraction and caption generation. The feature extraction module produces an encoded feature vector representing the video content, while the caption generation module generates human-readable captions from that encoded feature vector. To the best of our knowledge, this is the first architecture for generative video captioning that integrates a transformer model with a GAN. The proposed model is evaluated with the BLEU-4, METEOR, ROUGE-L, and CIDEr metrics on two datasets, MSVD (BLEU-4 = 61.2, METEOR = 41.6) and MSR-VTT (BLEU-4 = 61.2, METEOR = 31.1). Compared with state-of-the-art approaches, the results demonstrate the effectiveness of combining a transformer with a generative model for producing accurate and human-readable captions.
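The abstract names a conditional Wasserstein GAN but does not give its training objective. As a rough illustration only, the sketch below shows the standard Wasserstein critic and generator losses computed from critic scores. The function names and the example scores are hypothetical, and the paper's actual losses, conditioning scheme, and any regularization (for example a gradient penalty) are not specified here and may differ.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Standard Wasserstein critic loss (to be minimized).

    Each score is assumed to be the critic's output for one
    (video feature, caption) pair; conditioning on the video is
    assumed to happen inside the critic itself.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores):
    """Standard Wasserstein generator loss: raise the critic's score on generated captions."""
    return -fmean(fake_scores)


# Hypothetical critic scores, purely for illustration.
real_scores = [0.9, 1.1, 1.0]   # reference captions paired with their videos
fake_scores = [0.2, -0.1, 0.2]  # generated captions paired with the same videos

c_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
print(f"critic loss: {c_loss:.3f}, generator loss: {g_loss:.3f}")
```

In this formulation the critic is trained to widen the gap between scores on reference and generated captions, while the generator (here, the transformer-based caption module) is trained to close it.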
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
What problem does this paper attempt to address?