CaptionNet: Automatic End-to-End Siamese Difference Captioning Model with Attention

Ariyo Oluwasanmi, Muhammad Umar Aftab, Eatedal Alabdulkreem, Bulbula Kumeda, Edward Y. Baagyere, Zhiquang Qin
DOI: https://doi.org/10.1109/access.2019.2931223
IF: 3.9
2019-01-01
IEEE Access
Abstract: Several deep learning techniques have been studied intensively for captioning tasks, making it possible to understand and describe both simple and complex images in text. Building on this work, this paper proposes a multimodal end-to-end siamese difference captioning model (SDCM) that automatically generates a natural language description of the differences in an image pair. The proposed supervised learning model combines several deep learning techniques to explore the practicability of capturing, aligning, and computing the disparities between two image features in order to produce a corresponding language model probability distribution. First, a deep siamese convolutional neural network extracts the feature vector discrepancies of an image pair. An attention mechanism then detects the salient regions of the feature vector, which allows a bidirectional long short-term memory decoder to generate a matching, semantically associated textual sequence. The model is evaluated on the Spot-the-Diff baseline dataset, which consists of pairs of images and their corresponding captions. The results indicate that the proposed model is highly competitive with the state of the art.
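The pipeline described in the abstract (shared-weight encoding of both images, a feature difference, then attention over regions) can be illustrated with a toy sketch. This is not the authors' implementation: the linear-plus-ReLU encoder, the dot-product attention scoring, the `query` vector standing in for a decoder state, and all sizes are assumptions made for illustration, and the bidirectional LSTM decoder is omitted entirely.

```python
import math
import random


def encode(image, weights):
    """Shared (siamese) encoder: the same linear map + ReLU is applied to every region."""
    return [
        [max(0.0, sum(w * x for w, x in zip(row, region))) for row in weights]
        for region in image
    ]


def feature_difference(feats_a, feats_b):
    """Element-wise difference between the two encoded images, region by region."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(feats_a, feats_b)]


def soft_attention(diff, query):
    """Dot-product soft attention over regions (an assumed scoring function)."""
    scores = [sum(d * q for d, q in zip(region, query)) for region in diff]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(diff[0])
    context = [sum(a * region[i] for a, region in zip(alphas, diff)) for i in range(dim)]
    return alphas, context


rng = random.Random(0)
# Hypothetical sizes: 5 regions per image, 3 input dims, 4 feature dims.
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
image_a = [[rng.random() for _ in range(3)] for _ in range(5)]
image_b = [region[:] for region in image_a]
image_b[2] = [v + 1.0 for v in image_b[2]]  # only region 2 differs between the images

diff = feature_difference(encode(image_a, weights), encode(image_b, weights))
query = [1.0] * 4  # hypothetical stand-in for a decoder hidden state
alphas, context = soft_attention(diff, query)
```

Because both images pass through the same weights, regions that are identical in the two inputs produce an all-zero difference, so only the changed region contributes to the context vector that a decoder would consume.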