Multiple Transformer Mining for VizWiz Image Caption

Xuchao Gong, Hongji Zhu, Yongliang Wang, Biaolong Chen, Aixi Zhang, Fangxun Shu, Si Liu
2021-01-01
Abstract: This paper proposes a multiple transformer mining algorithm (MTMA) for the VizWiz image captioning task. MTMA combines grid image feature extraction, OCR, and object detectors to describe the image content effectively. The captioning models are trained with Self-Critical Sequence Training (SCST), and semantic similarity aggregation is applied in the post-processing phase. Ensembling is also used in multi-modal feature fusion and in post-caption generation to further improve performance. As a result, the proposed algorithm outperforms other methods, reaching 94.06 CIDEr.
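
For readers unfamiliar with SCST, the following is a minimal sketch of the general self-critical objective for a single caption, not the authors' implementation. It assumes the reward (for example, a CIDEr score) has already been computed for a sampled caption and for the greedily decoded caption; the function name `scst_loss` and the numeric values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def scst_loss(sample_logprobs: Sequence[float],
              sample_reward: float,
              greedy_reward: float) -> float:
    """Self-critical loss for one sampled caption.

    sample_logprobs: log-probability of each token in the sampled caption.
    sample_reward:   reward of the sampled caption (e.g. CIDEr).
    greedy_reward:   reward of the greedy caption, used as the baseline.
    """
    # The greedy caption's reward acts as the baseline, so only samples
    # that beat the model's own greedy output are reinforced.
    advantage = sample_reward - greedy_reward
    # REINFORCE-style loss: minimizing it raises the log-probability of
    # samples with positive advantage and lowers it for negative ones.
    return -advantage * sum(sample_logprobs)


# Hypothetical values for illustration only.
logprobs = [-0.5, -1.0, -0.5]
loss = scst_loss(logprobs, sample_reward=1.2, greedy_reward=0.8)
print(loss)
```

When the sampled caption scores the same as the greedy one, the advantage is zero and the caption contributes nothing to the loss. In practice the loss would be computed on differentiable tensors and averaged over a batch.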