PECR: Parameter-Efficient Transfer Learning with Cross-Modal Representation Learning for Remote Sensing Visual Question Answering

Pengfei Li, Jinlong He, Gang Liu, Shenjun Zhong
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446146
2024-01-01
Abstract: Remote sensing (RS) visual question answering (VQA) aims to provide accurate answers to questions about RS images. Transformer-based models have become increasingly popular for RS VQA tasks. As model sizes keep growing, however, full-parameter training becomes prohibitively costly. Moreover, most current RS VQA methods focus primarily on improving unimodal image encoding and pay little attention to cross-modal interactions between visual and textual features. In this paper, we propose the Parameter-Efficient transfer learning with Cross-modal Representation learning model (PECR) for RS VQA tasks. Specifically, we introduce adapter-based parameter-efficient transfer learning techniques into the visual encoder and initialize them with weights pre-trained on large-scale RS images. Furthermore, we use a cross-attention mechanism in the cross-modal fusion module to merge the context representations of images and text, facilitating cross-modal representation learning. Experimental results demonstrate that our approach outperforms previous state-of-the-art methods on both the RSVQA-LR and RSVQA-HR datasets. We also show that training only the adapter parameters yields performance comparable to full-parameter training while significantly reducing the model training cost.
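The two mechanisms named in the abstract, adapters and cross-attention, can be illustrated with a minimal sketch. This is not the PECR implementation: the abstract does not specify the adapter layout, the attention direction, or any dimensions. The code assumes a generic bottleneck adapter (down-projection, ReLU, up-projection, residual connection) and a single-head scaled dot-product cross-attention without learned projections, in which text tokens act as queries over image tokens. All function names and toy values are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter(x, W_down, W_up):
    """Generic bottleneck adapter (an assumed design, not taken from the paper).

    Computes x + W_up @ relu(W_down @ x). Only W_down and W_up would be
    trained; the surrounding encoder weights would stay frozen.
    """
    hidden = [max(0.0, h) for h in matvec(W_down, x)]
    delta = matvec(W_up, hidden)
    return [a + b for a, b in zip(x, delta)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    Queries come from one modality (assumed here: text tokens), keys and
    values from the other (assumed here: image tokens). Learned Q/K/V
    projections are omitted for brevity.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy example: 4-dimensional features, adapter bottleneck of size 2.
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
W_up_zero = [[0.0, 0.0] for _ in range(4)]  # zero up-projection: adapter is the identity
adapted = adapter(x, W_down, W_up_zero)     # equals x

text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [adapted, [0.0, 1.0, 0.0, 1.0], [2.0, 0.0, 1.0, 0.0]]
fused = cross_attention(text_tokens, image_tokens, image_tokens)
```

With a zero up-projection the adapter leaves its input unchanged, so inserting it does not disturb the frozen encoder before any training. Each fused text token is a convex combination of the image tokens, weighted by query-key similarity.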