Improved image captioning with subword units training and transformer

Cai Qiang, Li Jing, Li Haisheng, Zuo Min
DOI: https://doi.org/10.3772/j.issn.1006-6748.2020.02.011
2020-01-01
Abstract: Image captioning models typically operate with a fixed vocabulary, but captioning is an open-vocabulary problem. Existing work handles out-of-vocabulary words by labeling them as unknown in a dictionary. In addition, the recurrent neural network (RNN) and its variants used in the captioning task have become a bottleneck for generation quality and training time. To address these two problems, a simpler but more effective approach is proposed for generating open-vocabulary captions, and the long short-term memory (LSTM) unit is replaced with a transformer as the decoder for better caption quality and less training time. The effectiveness of different word segmentation vocabularies and the generation improvement of the transformer over the LSTM are discussed, and the improved models are shown to achieve state-of-the-art performance on the MSCOCO2014 image captioning task over a back-off dictionary baseline model.
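The contrast between a fixed vocabulary and subword units can be illustrated with a minimal sketch. The abstract does not name the segmentation algorithm, so the greedy longest-match rule, the `@@` continuation marker, the toy vocabularies, and all function names below are assumptions for illustration only, not the authors' method.

```python
def encode_fixed(words, vocab, unk="<unk>"):
    """Fixed-vocabulary baseline: any word outside `vocab` becomes `unk`."""
    return [w if w in vocab else unk for w in words]


def segment_subwords(word, subwords, cont="@@"):
    """Split `word` into subword units by greedy longest match.

    Falls back to single characters when no listed unit matches, so
    every word can be represented. All pieces except the last carry
    the continuation marker `cont` (an assumed convention).
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return [p + cont for p in pieces[:-1]] + pieces[-1:]


def encode_open(words, vocab, subwords):
    """Open-vocabulary encoding: keep known words, segment the rest."""
    out = []
    for w in words:
        out.extend([w] if w in vocab else segment_subwords(w, subwords))
    return out


# Toy data, hypothetical and hand-picked for the example.
vocab = {"a", "man", "is"}
subwords = {"skate", "board", "ing"}
caption = ["a", "man", "is", "skateboarding"]

fixed = encode_fixed(caption, vocab)
opened = encode_open(caption, vocab, subwords)
```

With the fixed vocabulary the rare word collapses to a single unknown token, while the subword encoding keeps it recoverable as a sequence of smaller units that a decoder could generate one at a time.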