Abstract: The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet today's solutions are unable to deliver robust, high-accuracy tracking across multiple agents in practical, everyday environments, a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, which prevents them from catering simultaneously to the multiple dimensions of accuracy, robustness (diverse environmental conditions), and scalability (multiple agents). In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, passive/relative (e.g., visual odometry) and active/absolute (e.g., infrastructure-assisted RF localization), offers a key first layer of diversity that brings scalability, while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. RoVaR is an embodiment of such a dual-layer diversity approach: it intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal RoVaR's multi-dimensional benefits in terms of tracking accuracy (median of 15 cm), robustness (in unseen environments), and light weight (runs in real time on mobile platforms such as Jetson Nano/TX2), enabling practical multi-agent immersive applications in everyday environments.
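To make the first layer of diversity concrete, the sketch below shows one generic way a relative source (odometry-style displacements) and an absolute source (RF-style position fixes) can complement each other: a scalar-gain Kalman-style filter in which dead reckoning accumulates uncertainty and each absolute fix pulls the estimate back. This is not RoVaR's fusion method, which the abstract describes only as combining algorithmic and data-driven techniques. The names `FusedTrack`, `predict`, and `correct`, the isotropic variance model, and all numeric values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FusedTrack:
    """Hypothetical 2D position estimate with an isotropic variance (m^2)."""
    x: float
    y: float
    var: float


def predict(track: FusedTrack, dx: float, dy: float, odom_var: float) -> FusedTrack:
    """Apply a relative displacement (e.g., from odometry).

    The position moves by (dx, dy) and the uncertainty grows by odom_var,
    which models drift accumulating in a relative-only tracker.
    """
    return FusedTrack(track.x + dx, track.y + dy, track.var + odom_var)


def correct(track: FusedTrack, rf_x: float, rf_y: float, rf_var: float) -> FusedTrack:
    """Blend in an absolute position fix (e.g., from RF localization).

    The gain is the fraction of total uncertainty owned by the current
    estimate, so a noisy fix (large rf_var) moves the estimate only slightly.
    """
    gain = track.var / (track.var + rf_var)
    return FusedTrack(
        track.x + gain * (rf_x - track.x),
        track.y + gain * (rf_y - track.y),
        (1.0 - gain) * track.var,
    )


if __name__ == "__main__":
    # Illustrative values only: four relative steps, then one absolute fix.
    track = FusedTrack(0.0, 0.0, 0.01)
    for _ in range(4):
        track = predict(track, dx=0.5, dy=0.0, odom_var=0.02)
    track = correct(track, rf_x=2.1, rf_y=0.05, rf_var=0.09)
    print(track)
```

The design point is that neither source suffices alone in this toy model: the relative updates are smooth but their variance grows without bound, while the absolute fixes are bounded in error but noisy, and the gain trades one against the other.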

Audio-Visual Variational Fusion for Multi-Person Tracking with Robots

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information

Fast-Tracker 2.0: Improving Autonomy of Aerial Tracking with Active Vision and Human Location Regression

Audio-Visual Bimodal Combination-Based Speaker Tracking Method for Mobile Robot

Multi-person Multi-Camera Tracking for Live Stream Videos Based on Improved Motion Model and Matching Cascade

Real-time 3D Human Tracking for Mobile Robots with Multisensors

Audio-visual multi-person tracking and identification for smart environments

Multi-modal Tracking of People Using Laser Scanners and Video Camera

Tracking by segmentation with future motion estimation applied to person-following robots

Multi-features Guided Robust Visual Tracking

Adaptive Multi-Pedestrian Tracking by Multi-Sensor: Track-to-Track Fusion Using Monocular 3D Detection and MMW Radar

Vision-Guided Robot Hearing

Online Multi-Object Tracking from A Bird's-Eye View by Fusion of Millimeter-Wave Radar and Vision

A novel tracking system for human following robots with fusion of MMW radar and monocular vision

Real-Time Visual Tracking and Identification for a Team of Homogeneous Humanoid Robots

RoVaR: Robust Multi-agent Tracking through Dual-layer Diversity in Visual and RF Sensor Fusion

When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous

Human–robot non-verbal interaction empowered by real-time auditory and visual multiple-talker tracking

Visual Perception for Multiple Human–Robot Interaction From Motion Behavior