Pre-trained CNNs as Feature-Extraction Modules for Image Captioning

Muhammad Abdelhadie Al-Malla, Assef Jafar, Nada Ghneim
DOI: https://doi.org/10.5565/rev/elcvia.1436
2022-05-11
Electronic Letters on Computer Vision and Image Analysis
Abstract: In this work, we present a thorough experimental study of feature extraction using Convolutional Neural Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72 experiments on 12 image classification CNNs pre-trained on the ImageNet [29] dataset. The features are extracted from the last layer after removing the fully connected layer and are fed into the captioning model. We use a unified captioning model with a fixed vocabulary size across all experiments to study the effect of changing the CNN feature extractor on image captioning quality. The scores are calculated using the standard image captioning metrics. We find a strong relationship between the model structure and the image captioning dataset, and show that VGG models yield the lowest captioning quality among the tested CNNs when used as feature extractors. Finally, we recommend a set of pre-trained CNNs for each image captioning evaluation metric one may want to optimise, and relate our results to previous work. To our knowledge, this work is the most comprehensive comparison of feature extractors for image captioning.