Multi-modal Remote Sensing Image Description Based on Word Embedding and Self-Attention Mechanism

Yuan Wang, Kuerban Alifu, Hongbing Ma, Junli Li, Umut Halik, Yalong Lv
DOI: https://doi.org/10.1109/isass.2019.8757726
2019-01-01
Abstract: Traditional multi-modal models are relatively weak at describing complex image content when describing and identifying target objects in microwave images, and the sentences they generate are relatively simple. This paper proposes a multi-modal remote sensing semantic description and recognition method that combines a self-attention mechanism with the Ngram2vec word embedding technique. First, Ngram2vec is used to mine the semantic information and context features shared by the pixel to be identified and its adjacent pixels within the neighborhood window. Second, a self-attention mechanism is introduced to further learn the internal structure of all pixels in the neighborhood window and generate a multidimensional representation. Finally, to avoid losing information as it passes between layers, densely connected networks (DenseNets) are used to integrate the information flow, and a multi-layer independently recurrent neural network (IndRNN) is added between densely connected modules to address the vanishing gradient problem. Experimental results show that this method outperforms traditional deep learning methods in image description and recognition.
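The self-attention step over a neighborhood window can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes standard scaled dot-product self-attention, and the function name `window_self_attention`, the 5x5 window, the 8 feature channels, and the randomly initialised projection matrices are all hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_self_attention(window, w_q, w_k, w_v):
    """Scaled dot-product self-attention over every pixel in a window.

    window: array of shape (height, width, channels), one feature
            vector per pixel (hypothetical input layout).
    w_q, w_k, w_v: projection matrices of shape (channels, d).
    Returns the attended features, shape (height * width, d), and the
    attention weights, shape (height * width, height * width).
    """
    height, width, channels = window.shape
    # Flatten the window so that each pixel becomes one sequence element.
    x = window.reshape(height * width, channels)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each pixel scores its similarity to every pixel in the window.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of all pixels' value vectors.
    return weights @ v, weights


# Illustrative data only: a random 5x5 window with 8 channels.
rng = np.random.default_rng(0)
window = rng.normal(size=(5, 5, 8))
d = 4  # assumed projection size
w_q, w_k, w_v = (rng.normal(size=(8, d)) for _ in range(3))
out, weights = window_self_attention(window, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, so every pixel's output is a convex combination of the value vectors of all pixels in the window, which is one way a representation can capture the window's internal structure.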