A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning.

Xuenan Xu,Heinrich Dinkel,Mengyue Wu,Kai Yu
2020-01-01
Abstract:Audio captioning aims at generating a natural sentence to describe the content in an audio clip. This paper proposes the use of a powerful CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy, reinforcement learning is also investigated for generating richer and more accurate captions. Our approach significantly improves against the baseline model on all shown metrics achieving a relative improvement of at least 34%. Results indicate that our proposed CRNNGRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, the performance is further boosted to 0.223. In the DCASE challenge Task 6 we ranked fourth based on SPIDEr, second on 5 metrics including BLEU, ROUGE-L and METEOR, without ensemble or data augmentation while maintaining a small model size (only 5 Million parameters).
What problem does this paper attempt to address?