Modeling Local and Global Contexts for Image Captioning

Peng Yao, Jiangyun Li, Longteng Guo, Jing Liu
DOI: https://doi.org/10.1109/icme46284.2020.9102935
2020-01-01
Abstract: Image captioning aims to first observe an image, most notably the objects involved, which are highly context-dependent, and then depict it with a natural-language description. However, most current models use only isolated object vectors as image representations, ignoring the contexts among them. In this paper, we introduce a Local-Global Context (LGC) network, which endows the independent object features with short-range perception (local contexts) and long-range dependence (global contexts). The LGC network can be viewed as a feature refiner that helps the caption decoder reason about novel objects and verbal words. The local contexts are modeled with 1-D group convolution over adjacent objects, strengthening the local connections. Furthermore, a self-attention mechanism is used to model the global contexts by correlating all the local contexts. Extensive experiments on the MSCOCO dataset demonstrate that the LGC network can be easily plugged into almost any neural captioning model and significantly improves model performance.
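The two stages described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`group_conv1d`, `self_attention`, `lgc_refine`), the zero padding, the odd kernel size, and the use of the local features directly as queries, keys, and values (with no learned projections or multiple heads) are all assumptions made for illustration. The abstract also does not say how objects are ordered, so "adjacent" here simply means neighboring rows of the input list.

```python
import math


def group_conv1d(x, weight, groups):
    """1-D group convolution along the object axis (local contexts).

    x:      N objects, each a list of D_in floats.
    weight: D_out filters, each of shape (D_in // groups) x k, with k odd.
    Zero padding keeps the number of objects unchanged (an assumption).
    """
    n, d_in = len(x), len(x[0])
    d_out, in_per_group, k = len(weight), len(weight[0]), len(weight[0][0])
    assert d_in % groups == 0 and d_out % groups == 0
    assert in_per_group == d_in // groups and k % 2 == 1
    out_per_group = d_out // groups
    pad = k // 2
    out = [[0.0] * d_out for _ in range(n)]
    for i in range(n):                      # each object position
        for o in range(d_out):              # each output channel
            g = o // out_per_group          # group this channel belongs to
            acc = 0.0
            for c in range(in_per_group):   # only channels of the same group
                for t in range(k):          # neighboring objects in the window
                    j = i + t - pad
                    if 0 <= j < n:
                        acc += weight[o][c][t] * x[j][g * in_per_group + c]
            out[i][o] = acc
    return out


def self_attention(h):
    """Scaled dot-product self-attention over all objects (global contexts).

    Queries, keys, and values are the inputs themselves (an assumption).
    """
    n, d = len(h), len(h[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        scores = [scale * sum(a * b for a, b in zip(h[i], h[j])) for j in range(n)]
        m = max(scores)                     # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        out.append([sum(attn[j] * h[j][c] for j in range(n)) for c in range(d)])
    return out


def lgc_refine(x, weight, groups):
    """Hypothetical refiner: local contexts first, then global contexts."""
    local = group_conv1d(x, weight, groups)
    return self_attention(local)


# Toy usage: 3 objects with 2-dimensional features, 2 groups, kernel size 3.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernel = [[[0.25, 0.5, 0.25]], [[0.25, 0.5, 0.25]]]
refined = lgc_refine(features, kernel, groups=2)
```

In this sketch the refined output has the same number of objects as the input, which is consistent with the abstract's claim that the network can sit between an object detector and a caption decoder as a drop-in feature refiner.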