NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar

Runwei Guan, Jianan Liu, Liye Jia, Haocheng Zhao, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy Smith, Yutao Yue
2024-08-30
Abstract: Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents them from being deployed on USVs in real-world settings. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both camera and 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, and boasts ultra-low power consumption for long endurance.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses low-power, efficient multi-task visual grounding on Unmanned Surface Vehicles (USVs). Existing multi-sensor visual grounding models are too complex to deploy in practice, so the paper proposes a lightweight multi-task model named NanoMVG. The model uses natural language to guide a camera and a 4D millimeter-wave radar to locate specific objects, and it performs box-level and mask-level visual grounding simultaneously. NanoMVG is aimed at waterway perception, particularly in harsh environments, and its very low power consumption allows long-duration operation, which supports the continuous monitoring that USVs require.

### Main Contributions

1. **The NanoMVG Model**: A multi-task, low-power model designed specifically for USVs. It runs in real time on embedded edge devices and performs visual grounding by combining image and radar data.
2. **Efficient Three-Modal Dynamic Fusion Module (TMDF)**: Integrates information from three modalities (image, radar, and text) to achieve global, synchronous semantic alignment and cross-modal fusion.
3. **Lightweight Mixture-of-Experts Module (EN-MoE)**: Adaptively allocates edge and neighborhood features according to the different requirements of the detection and segmentation tasks, significantly improving performance.

### Technical Details

- **Input and Output**: NanoMVG accepts three inputs (RGB images, 2D radar maps, and text prompts) and generates two outputs (predicted object masks and bounding boxes).
- **Three-Modal Dynamic Fusion (TMDF)**: Uses a simplified cross-attention mechanism to dynamically align and construct sensor features conditioned on the text, efficiently integrating image and radar data.
- **Edge-Neighborhood Mixture-of-Experts (EN-MoE)**: Refines the shared features through adaptive weighting so that the detection and segmentation tasks each receive sufficiently distinct representations.
- **Prediction Head**: Uses an anchor-free, center-point-based REC head (box-level) and a re-parameterized RES head (mask-level) to reduce redundant computation and speed up inference.

### Experimental Results

- **Performance Comparison**: NanoMVG achieves highly competitive performance on the WaterVG dataset while balancing accuracy against low power consumption.
- **Power Consumption and Inference Speed**: Compared with other models, NanoMVG has clear advantages in power consumption and inference speed, and it achieves real-time inference on embedded devices.
- **Ablation Experiments**: These verify the effectiveness of EN-MoE and TMDF; both components provide a better trade-off between accuracy and power consumption.

### Conclusion

The NanoMVG model proposed in the paper provides an effective solution for low-power, interactive waterway perception on USVs. It achieves competitive grounding performance while remaining efficient in both power consumption and computation. The model is expected to be useful in practical applications, especially in scenarios that require long-term autonomous operation.
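
### Illustrative Sketch: Task-Specific Expert Weighting

The summary describes EN-MoE only at a high level: shared features are reweighted adaptively so that detection and segmentation receive different representations. The sketch below illustrates that general mixture-of-experts gating idea in plain Python. It is not the paper's implementation. Every name here (`edge_expert`, `neighborhood_expert`, `gate_weights`, `mix_for_task`), the 1D feature vector, the two toy experts, and the gate matrices are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_expert(features):
    """Hypothetical expert: 3-tap local average with clamped borders."""
    n = len(features)
    return [
        (features[max(i - 1, 0)] + features[i] + features[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def edge_expert(features):
    """Hypothetical expert: residual after smoothing, which highlights abrupt changes."""
    smooth = neighborhood_expert(features)
    return [raw - avg for raw, avg in zip(features, smooth)]


def gate_weights(features, gate_matrix):
    """One logit per expert (dot product of a gate row with the features), then softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate_matrix]
    return softmax(logits)


def mix_for_task(features, gate_matrix):
    """Blend the expert outputs using task-specific gate weights."""
    weights = gate_weights(features, gate_matrix)
    outputs = [edge_expert(features), neighborhood_expert(features)]
    mixed = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(features))
    ]
    return mixed, weights


# Toy shared feature with a step in the middle (assumed data, not from the paper).
shared = [0.0, 0.0, 1.0, 1.0]

# Assumed gate parameters: row 0 scores the edge expert, row 1 the neighborhood expert.
detection_gate = [[0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
segmentation_gate = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

det_features, det_weights = mix_for_task(shared, detection_gate)
seg_features, seg_weights = mix_for_task(shared, segmentation_gate)
```

With these assumed gates, the detection branch leans on the edge expert and the segmentation branch on the neighborhood expert, so the two tasks end up with different mixtures of the same shared feature. In a trained model the gate parameters would be learned instead of set by hand.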