Abstract: Visual grounding has broad application prospects in the field of remote sensing. Current state-of-the-art methods are predominantly based on the transformer architecture, using multi-head self-attention in multi-modal encoders to integrate visual and textual features. However, they typically rely on a single fusion approach, which may limit the model’s capacity to learn intricate correlations between textual semantics and visual information. Moreover, they do not establish a direct dependency between features and bounding box representations, which confines the fused features to the conventional object detection paradigm. Consequently, the interactions between regression results and encoded features are constrained. To address these limitations, we adopt a generative paradigm that directly generates a sequence of discrete coordinates in an auto-regressive manner, exploring the interaction between direct regression features and encoded multi-modal features. In addition, we propose a novel multi-stage synergistic aggregation module that acquires multi-modal features at multiple scales by aggregating visual and textual contexts, improving overall performance. We validate our framework on the DIOR-RSVG dataset and compare it with existing methods, achieving a noteworthy improvement in accuracy. The proposed approach presents a promising direction for advancing visual grounding techniques in remote sensing applications. The related code and weights are available at https://github.com/waynamigo/MSAM.
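
To make the idea of a "discrete coordinate sequence" concrete, the following is a minimal sketch of one common way to turn a bounding box into integer tokens and back, which an auto-regressive decoder could then emit one token at a time. It is not taken from the MSAM code: the function names `box_to_tokens` and `tokens_to_box`, the bin count `NUM_BINS = 1000`, and the `(x1, y1, x2, y2)` box layout are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed number of quantization bins per coordinate (hypothetical value;
# the abstract does not state the vocabulary size).
NUM_BINS = 1000


def box_to_tokens(box: Sequence[float], width: float, height: float,
                  num_bins: int = NUM_BINS) -> List[int]:
    """Quantize an (x1, y1, x2, y2) pixel box into four integer tokens."""
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        # Normalize to [0, 1] and clamp so out-of-image values stay valid.
        frac = min(max(value / scale, 0.0), 1.0)
        # Map to a bin index in [0, num_bins - 1].
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens: Sequence[int], width: float, height: float,
                  num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Map four integer tokens back to pixel coordinates (bin centers)."""
    scales = (width, height, width, height)
    return tuple((t + 0.5) / num_bins * s for t, s in zip(tokens, scales))


if __name__ == "__main__":
    tokens = box_to_tokens((100, 200, 400, 600), width=800, height=800)
    print(tokens)
    print(tokens_to_box(tokens, width=800, height=800))
```

Under this scheme the reconstruction error is bounded by one bin width per coordinate, so the bin count trades vocabulary size against localization precision.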

Visual Analysis for Multi-Spectral Images Comparisons

A Collaborative Visual Analysis System for Communication Pattern Discovery

Interactive and Collaborative Visual Analysis on Traffic Sensor Data.

Visual Bird Watcher: Interactive Visual Analysis on Bird Distribution and Migration

Visual Analysis for Wildlife Preserve Based on Multi-systems

STAD-HD: Spatial Temporal Anomaly Detection for Heterogeneous Data through Visual Analytics

Visual analytics support for collecting and correlating evidence for intelligence analysis.

A Visual Analysis Approach for Community Detection of Multi-Context Mobile Social Networks

Visual Analytics of the Spatio-temporal Multidimensional Air Monitoring Data

The Realistic Fusion of Multi-spectral Images

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation.

Salient Object Detection In Hyperspectral Imagery

Behavior Analysis Through Collaborative Visual Exploration on Trajectory Data.

SpectrumVA: Visual Analysis of Astronomical Spectra for Facilitating Classification Inspection

Visual Analysis of Multivariate Time Series of Static and Mobile Sensors.

Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning

Temporal Pattern Analysis and Source Detection Through Visual Analysis on Multi-Dimensional Time Series Data.

A Vision Sensor Network to Study Viewers' Visible Behavior of Art Appreciation.

Autogrouped Sparse Representation for Visual Analysis

Multi-Stage Synergistic Aggregation Network for Remote Sensing Visual Grounding

Visual Analysis for Microblog Topic Modeling