Multimodal Image Captioning Through Combining Reinforced Cross Entropy Loss And Stochastic Deprecation

Xi Meng, Hao Kong, Dongqi Tang, Tong Lu
DOI: https://doi.org/10.1109/ICME.2019.00229
2019-01-01
Abstract: Recently, Cross Entropy Loss (CEL) has proven useful in encoder-decoder based multimodal image captioning; however, it still suffers from an inconsistency between the optimization objective and the evaluation metrics. In this paper, we propose a new approach for multimodal image captioning. It consists of 1) Reinforced Cross Entropy Loss (RCEL), which maximizes the probability of ground truth captions while optimizing evaluation metrics directly, and 2) Stochastic Deprecation (SD), which automatically selects high-quality ground truth sentences without losing the diversity of the corpus. The proposed RCEL and SD are generic and can improve existing natural language generation models, and combining them (RCEL-SD) achieves the best result. Experimental results on the benchmark MSCOCO dataset show that the proposed RCEL-SD outperforms CEL on all 7 evaluation metrics across three recent image captioning models.
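
The abstract does not give the RCEL formula, so the following is only a minimal illustrative sketch of the general idea of blending a cross-entropy term with a reward-driven term. It is not the authors' formulation. The function names, the `weight` parameter, the scalar `reward` (standing in for an evaluation metric score), and the `baseline` are all assumptions introduced for illustration.

```python
import math


def cross_entropy(token_probs):
    """Negative log-likelihood of a caption, given the model's
    probability for each of its tokens."""
    return -sum(math.log(p) for p in token_probs)


def mixed_loss(gt_token_probs, sampled_token_probs, reward, baseline, weight=0.5):
    """Hypothetical blend of a cross-entropy term and a REINFORCE-style term.

    gt_token_probs:      model probabilities of the ground truth caption tokens
    sampled_token_probs: model probabilities of the tokens in a sampled caption
    reward:              assumed metric score of the sampled caption
    baseline:            assumed reference score used to reduce variance
    weight:              assumed mixing coefficient in [0, 1]
    """
    ce = cross_entropy(gt_token_probs)
    # REINFORCE surrogate: -(reward - baseline) * log p(sampled caption)
    log_p_sampled = sum(math.log(p) for p in sampled_token_probs)
    rl = -(reward - baseline) * log_p_sampled
    return (1 - weight) * ce + weight * rl
```

With `weight=0` this sketch reduces to plain cross entropy on the ground truth caption; larger values shift the objective toward the metric-driven term, which is one common way to address a mismatch between a training loss and evaluation metrics.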