Variational Transformer: A Framework Beyond the Trade-off Between Accuracy and Diversity for Image Captioning

Longzhen Yang, Lianghua He, Die Hu, Yihang Liu, Yitao Peng, Hongzhou Chen, Mengchu Zhou
DOI: https://doi.org/10.1109/tnnls.2024.3440872
IF: 14.255
2024-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract: Accuracy and diversity are two critical, quantifiable performance metrics in the generation of natural and semantically accurate captions. When efforts are made to enhance one of them, the other suffers because of the inherently conflicting and complex relationship between them. In this study, we demonstrate that the suboptimal accuracy levels derived from human annotations are unsuitable for machine-generated captions. To boost diversity while maintaining high accuracy, we propose an innovative variational transformer (VaT) framework. By integrating an “invisible information prior (IIP)” and an “auto-selectable Gaussian mixture model (AGMM)”, we enable its encoder to learn precise linguistic information and object relationships in various scenes, thus ensuring high accuracy. By incorporating the “range-median reward (RMR)” baseline into it, we preserve a wider range of candidates with higher rewards during the reinforcement-learning-based training process, thereby guaranteeing outstanding diversity. Experimental results indicate that our method improves accuracy and diversity simultaneously, by up to 1.1% and 4.8%, respectively, over the state of the art. Furthermore, our approach comes closest to human annotations in semantic retrieval, scoring 50.3 versus the human score of 50.6. Thus, the method can be readily put into industrial use.