Label-Guided Cross-Modal Attention Network for Multi-Label Aerial Image Classification
Ying Chen, Ding Zhang, Tao Han, Xiaoliang Meng, Mianxin Gao, Teng Wang
DOI: https://doi.org/10.1109/lgrs.2024.3388568
IF: 5.343
2024-04-30
IEEE Geoscience and Remote Sensing Letters
Abstract: Multi-label aerial image classification is a fundamental yet complex task in remote sensing interpretation that aims to identify multiple labels in a single image. In this letter, we propose a label-guided cross-modal attention (L-GCMA) network, which introduces a novel approach to enrich the semantic information of labels and uses a multi-head attention module to extract diverse features. The proposed method consists of two components before the cross-modal attention. First, the visual features of the image are obtained using a transformer encoder. Second, to capture the rich semantic relationships of the scene, we design a label-sentence mapping attention (L-SMA) module. This module applies word embedding encoding to the labels and BERT encoding to the sentence prompts, followed by multi-head attention to extract comprehensive interclass and intraclass relationships for the labels, yielding label-scene text features. Subsequently, by treating the text features as the query, the visual features and text features are combined using cross-modal attention. This progressive integration narrows the semantic gap between vision and text, facilitating accurate label recognition. Experiments on the UCM and AID multi-label datasets demonstrate the superior performance of our L-GCMA, which surpasses state-of-the-art methods with mean average precision (mAP) scores of 99.10% (UCM) and 85.96% (AID).
engineering, electrical & electronic; imaging science & photographic technology; remote sensing; geochemistry & geophysics
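The abstract describes combining the two modalities by treating the text features as the query in cross-modal attention. The following is a minimal sketch of that general idea, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, and runs on made-up toy vectors. The names `cross_attention`, `label_queries`, and `patch_tokens` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, visual_tokens):
    """Single-head scaled dot-product attention (illustrative assumption).

    text_queries:  one vector per label (the query side).
    visual_tokens: one vector per image patch (used as keys and values).
    Returns (attended, weights): one attended visual vector per label,
    and the attention weights over patches for each label.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for query in text_queries:
        # Similarity between this label query and every visual token.
        scores = [
            scale * sum(q * k for q, k in zip(query, token))
            for token in visual_tokens
        ]
        w = softmax(scores)
        # Weighted sum of the visual tokens, one coordinate at a time.
        mixed = [
            sum(w_i * token[j] for w_i, token in zip(w, visual_tokens))
            for j in range(dim)
        ]
        attended.append(mixed)
        weights.append(w)
    return attended, weights


# Toy inputs (made up): 2 label queries, 3 visual tokens, dimension 4.
label_queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patch_tokens = [
    [4.0, 0.0, 1.0, 0.0],
    [0.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
attended, weights = cross_attention(label_queries, patch_tokens)
```

In this toy setup each label query attends most strongly to the patch token it aligns with, so the output holds one label-specific visual summary per label. A practical model would more likely use a multi-head attention layer with learned projections, followed by a per-label classifier.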