Utilizing Text-based Augmentation to Enhance Video Captioning

Shanhao Li, Bang Yang, Yuexian Zou
DOI: https://doi.org/10.1109/icaibd55127.2022.9820499
2022-01-01
Abstract: Video captioning (VC) is a challenging cross-modality task that requires a model to capture the visual information in a video and automatically generate captions accordingly. The literature shows that Transformer-based deep neural networks (DNNs) achieve state-of-the-art results. Without exception, such DNN models are data-hungry, which hinders the development of VC models, since large-scale VC training datasets are far more costly to build than datasets for other tasks such as image recognition or neural machine translation. As a result, data augmentation is a valuable approach to improving the performance of VC models. In this work, we propose two text-based augmentation methods that enlarge VC datasets in order to develop better VC models. Our basic idea is that, given a video, different people may give different descriptions, and together these lead to a better understanding of the video. From another point of view, language is flexible and versatile in expression, a property that can be used to augment training corpora. Specifically, pre-trained Transformer-based language models, i.e., PEGASUS from Google and the WMT19 translator from FAIR, are employed to generate “new captions.” Various ways of selecting suitable captions, as well as various training strategies, are also explored to make the most of data augmentation. Extensive experiments are conducted on the mainstream VC training datasets MSVD and MSR-VTT. Encouragingly, our data augmentation method consistently boosts both LSTM-based and Transformer-based VC models, improving CIDEr by 3.8 on average and by up to 7.9.
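
The caption-augmentation idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state how captions are selected, so the duplicate filter and the per-video cap used here are assumptions, and the names `augment_captions`, `normalize`, and `toy_rewrite` are hypothetical. In practice, the `rewrite` callable would presumably wrap a paraphrasing model such as PEGASUS or a round-trip translation with a WMT19 translator; a toy rule stands in for it here so that the sketch runs without downloading any model.

```python
from typing import Callable, Dict, List


def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace so trivially different captions compare equal."""
    return " ".join(caption.lower().split())


def augment_captions(
    captions: Dict[str, List[str]],
    rewrite: Callable[[str], List[str]],
    max_new_per_video: int = 5,
) -> Dict[str, List[str]]:
    """Add rewritten captions per video, skipping duplicates (assumed selection rule)."""
    augmented: Dict[str, List[str]] = {}
    for video_id, originals in captions.items():
        seen = {normalize(c) for c in originals}
        new_captions: List[str] = []
        for caption in originals:
            for candidate in rewrite(caption):
                key = normalize(candidate)
                # Drop empty outputs and anything already present for this video.
                if not key or key in seen:
                    continue
                seen.add(key)
                new_captions.append(candidate)
        # Cap the number of added captions (an assumed hyperparameter).
        augmented[video_id] = originals + new_captions[:max_new_per_video]
    return augmented


def toy_rewrite(caption: str) -> List[str]:
    """Stand-in for a paraphraser or back-translator (hypothetical, rule-based)."""
    return [caption.replace("a man", "a person"), caption.upper()]


demo = augment_captions({"v1": ["a man is cooking", "a dog runs"]}, toy_rewrite)
print(demo["v1"])
```

Passing the rewriter in as a callable keeps the selection logic independent of whichever language model produces the candidates, so paraphrasing and back-translation can share the same filtering step.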