Generating captions without looking beyond objects

Hendrik Heuer, Christof Monz, Arnold W. M. Smeulders
DOI: https://doi.org/10.48550/arXiv.1610.03708
2016-10-12
Computer Vision and Pattern Recognition
Abstract: This paper explores new evaluation perspectives for image captioning and introduces a noun translation task that achieves comparable image caption generation performance by translating from a set of nouns to captions. This implies that in image captioning, all word categories other than nouns can be evoked by a powerful language model without sacrificing performance on n-gram precision. The paper also investigates lower and upper bounds on how much individual word categories in the captions contribute to the final BLEU score. Large potential improvements exist for nouns, verbs, and prepositions.
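The n-gram precision mentioned in the abstract can be illustrated with a minimal sketch. It computes the clipped (modified) n-gram precision that BLEU is built on, for a single reference, and compares a full caption with one whose nouns are masked. This is not the paper's evaluation code: the example captions, the `<unk>` placeholder, and the function names are assumptions made for illustration, and the brevity penalty and the geometric mean over n-gram orders that full BLEU applies are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical captions, not taken from the paper's data.
reference = "a man rides a horse on the beach".split()
full_caption = "a man rides a horse on a beach".split()
# Same caption with every noun replaced by a placeholder token.
nouns_masked = "a <unk> rides a <unk> on a <unk>".split()

for n in (1, 2):
    print(n,
          modified_precision(full_caption, reference, n),
          modified_precision(nouns_masked, reference, n))
```

In this toy case, masking the nouns lowers the unigram precision from 0.875 to 0.5 and leaves only one matching bigram, which gives a rough sense of how the contribution of a single word category to an n-gram score can be probed.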