Abstract: Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents them from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both a camera and a 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform box-level and mask-level visual grounding simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, and boasts ultra-low power consumption for long endurance.

What problem does this paper attempt to address?

The paper aims to achieve low-power, efficient multi-task visual grounding on Unmanned Surface Vehicles (USVs). Existing multi-sensor visual grounding models are too complex to deploy in practice, so the paper proposes a lightweight multi-task model named NanoMVG. The model locates specific objects by guiding a camera and a 4D millimeter-wave radar with natural language, and it performs box-level and mask-level visual grounding at the same time. NanoMVG targets waterway perception in harsh environments and is designed for very low power consumption, which supports the long-endurance, continuous monitoring that USVs require.

### Main Contributions

1. **NanoMVG Model**: A multi-task, low-power model designed for USVs that runs in real time on embedded edge devices and performs visual grounding by combining image and radar data.
2. **Efficient Three-Modal Dynamic Fusion Module (TMDF)**: Integrates information from three modalities (image, radar, and text) to achieve global, synchronous semantic alignment and cross-modal fusion.
3. **Lightweight Mixture-of-Experts Module (EN-MoE)**: Adaptively allocates edge and neighborhood features according to the different requirements of the detection and segmentation tasks, which improves performance.

### Technical Details

- **Input and Output**: NanoMVG accepts three inputs (RGB images, 2D radar maps, and text prompts) and generates two outputs (predicted object masks and bounding boxes).
- **Three-Modal Dynamic Fusion (TMDF)**: Uses a simplified cross-attention mechanism to dynamically align and construct sensor features conditioned on the text, efficiently integrating image and radar data.
- **Edge-Neighborhood Mixture-of-Experts (EN-MoE)**: Refines the shared features through adaptive weighting so that the detection and segmentation tasks each receive sufficiently distinct representations.
- **Prediction Head**: Adopts an anchor-free, center-point-based REC (referring expression comprehension) head and a re-parameterized RES (referring expression segmentation) head to reduce redundant computation and speed up inference.

### Experimental Results

- **Performance Comparison**: NanoMVG achieves highly competitive performance on the WaterVG dataset and strikes a good balance between low power consumption and accuracy.
- **Power Consumption and Inference Speed**: Compared with other models, NanoMVG consumes less power and runs faster, and it achieves real-time inference on embedded devices.
- **Ablation Experiment**: The ablations verify the effectiveness of EN-MoE and TMDF; both components provide a better trade-off between accuracy and power consumption.

### Conclusion

The NanoMVG model proposed in the paper provides an effective solution for low-power interactive waterway perception on USVs. It achieves competitive performance while also performing well in power consumption and computational efficiency. The model is expected to be useful in practical applications, especially in scenarios that require long-term autonomous operation.
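The adaptive weighting described for EN-MoE under Technical Details can be illustrated with a toy example. The sketch below is not the paper's implementation: the function names, the two-expert layout, the feature values, and the fixed gate scores are all assumptions made for illustration. In a real model the gate scores would be learned and input-dependent, and which task favors which expert is not stated in the summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_features, gate_logits):
    """Blend per-expert feature vectors using softmax gate weights.

    expert_features: dict mapping expert name -> feature vector (hypothetical).
    gate_logits: dict mapping expert name -> gate score (hypothetical).
    Returns the blended vector and the normalized weight per expert.
    """
    names = sorted(expert_features)
    weights = softmax([gate_logits[name] for name in names])
    dim = len(expert_features[names[0]])
    mixed = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(expert_features[name]):
            mixed[i] += weight * value
    return mixed, dict(zip(names, weights))


# Toy shared features produced by two hypothetical experts.
experts = {
    "edge": [1.0, 0.0, 2.0],
    "neighborhood": [0.0, 4.0, 2.0],
}

# Assumed, fixed gate scores per task (illustration only): here segmentation
# leans on the edge expert and detection on the neighborhood expert.
task_gates = {
    "segmentation": {"edge": 2.0, "neighborhood": 0.0},
    "detection": {"edge": 0.0, "neighborhood": 2.0},
}

task_features = {}
task_weights = {}
for task, gates in task_gates.items():
    task_features[task], task_weights[task] = mix_experts(experts, gates)
```

Each task ends up with its own blend of the same shared expert outputs, which is the sense in which a mixture-of-experts layer can hand detection and segmentation different representations without duplicating the backbone.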

NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar

WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar

Assisting the Visually Impaired: Multitarget Warning Through Millimeter Wave Radar and RGB-depth Sensors

USV-Tracker: A Novel USV Tracking System for Surface Investigation with Limited Resources

A Novel Unmanned Surface Vehicle with 2D-3D Fused Perception and Obstacle Avoidance Module

A Multi-modality Sensor System for Unmanned Surface Vehicle

A Method Integrating Human Visual Attention and Consciousness of Radar and Vision Fusion for Autonomous Vehicle Navigation

A Millimeter-Wave Radar-Aided Vision Detection Method for Water Surface Small Object Detection

Real-Time Volumetric Perception for Unmanned Surface Vehicles Through Fusion of Radar and Camera

Mask-VRDet: A Robust Riverway Panoptic Perception Model Based on Dual Graph Fusion of Vision and 4D Mmwave Radar

Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

Achelous: A Fast Unified Water-surface Panoptic Perception Framework based on Fusion of Monocular Camera and 4D mmWave Radar

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar

MS-VRO: A Multi-Stage Visual-Millimeter Wave Radar Fusion Odometry

Marine$\mathcal{X}$: Design and Implementation of Unmanned Surface Vessel for Vision Guided Navigation

MetaVG: A Meta-Learning Framework for Visual Grounding

Immersive virtual simulation system design for the guidance, navigation and control of unmanned surface vehicles

Deep Visual Waterline Detection for Inland Marine Unmanned Surface Vehicles

A 3D Object Detection Based on Multi-Modality Sensors of USV