Abstract: Sensing technology is widely used for comprehending the physical world, with numerous modalities explored in past decades. While there has been considerable work on multi-modality learning, existing approaches all require that data from all modalities be paired. How to leverage multi-modality data with only partial pairings remains an open problem. To tackle this challenge, we introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, and the training strategies. Babel serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. To overcome the scarcity of complete paired data, the key idea of Babel is to transform the N-modality alignment into a series of two-modality alignments by devising an expandable network architecture. This concept is realized via a series of novel techniques, including the pre-trained modality tower, which capitalizes on available single-modal networks, and the adaptive training strategy, which balances the contribution of the newly incorporated modality against the previously established modality alignment. Evaluation demonstrates Babel's outstanding performance on eight human activity recognition datasets compared with various baselines, e.g., the top multi-modal sensing framework, single-modal sensing networks, and multi-modal large language models. Babel not only effectively fuses multiple available modalities (up to 22% accuracy increase), but also enhances the performance of individual modalities (12% average accuracy improvement). Case studies also highlight exciting application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging) and bridging LLMs for sensing comprehension.
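
The idea of reducing an N-modality alignment to a series of two-modality alignments can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expansion_order` and the list of paired datasets below are hypothetical, chosen only to show how partially paired data could still connect all six modalities one pair at a time.

```python
from collections import deque


def expansion_order(paired_datasets, start):
    """Plan pairwise alignment steps that grow an aligned set one modality at a time.

    paired_datasets: iterable of (modality_a, modality_b) tuples, each meaning
        "some dataset contains paired samples of these two modalities".
    start: the modality the network begins with.
    Returns a list of (new_modality, anchor_modality) steps; each step is a
    two-modality alignment against a modality that is already aligned.
    """
    # Undirected graph: an edge means paired data exists for the two modalities.
    neighbors = {}
    for a, b in paired_datasets:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    aligned = {start}
    steps = []
    queue = deque([start])
    while queue:
        anchor = queue.popleft()
        # Sort for a deterministic order; any order of reachable modalities works.
        for candidate in sorted(neighbors.get(anchor, ())):
            if candidate not in aligned:
                aligned.add(candidate)
                steps.append((candidate, anchor))
                queue.append(candidate)
    return steps


# Hypothetical pairings (assumed for illustration, not taken from the paper).
pairs = [
    ("IMU", "video"),
    ("video", "depth"),
    ("IMU", "Wi-Fi"),
    ("depth", "mmWave"),
    ("depth", "LiDAR"),
]
steps = expansion_order(pairs, start="IMU")
for new_modality, anchor in steps:
    print(f"align {new_modality} with {anchor}")
```

Under these assumed pairings, no dataset contains all six modalities, yet five two-modality steps are enough to bring every modality into the shared alignment.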

MTA: Multimodal Task Alignment for BEV Perception and Captioning

MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation

MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Center-enhanced video captioning model with multimodal semantic alignment

BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment

Enhancing the Alignment Between Target Words and Corresponding Frames for Video Captioning

Learning Video-Text Aligned Representations for Video Captioning

Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

Video Captioning with Guidance of Multimodal Latent Topics

ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Vision-and-Language Navigation via Latent Semantic Alignment Learning

Advancing Multi-Modal Sensing Through Expandable Modality Alignment

BEVUDA: Multi-geometric Space Alignments for Domain Adaptive BEV 3D Object Detection

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models