Sim-to-Real Vision-depth Fusion CNNs for Robust Pose Estimation Aboard Autonomous Nano-quadcopter

Luca Crupi, Elia Cereda, Alessandro Giusti, Daniele Palossi
2023-08-03
Abstract: Nano-quadcopters are versatile platforms attracting the interest of both academia and industry. Their tiny form factor, i.e., $\sim$10 cm diameter, makes them particularly useful in narrow scenarios and harmless in human proximity. However, these advantages come at the price of ultra-constrained onboard computational and sensorial resources for autonomous operations. This work addresses the task of estimating human pose aboard nano-drones by fusing depth and images in a novel CNN exclusively trained in simulation yet capable of robust predictions in the real world. We extend a commercial off-the-shelf (COTS) Crazyflie nano-drone -- equipped with a 320$\times$240 px camera and an ultra-low-power System-on-Chip -- with a novel multi-zone (8$\times$8) depth sensor. We design and compare different deep-learning models that fuse depth and image inputs. Our models are trained exclusively on simulated data for both inputs, and transfer well to the real world: field testing shows an improvement of 58% and 51% of our depth+camera system w.r.t. a camera-only State-of-the-Art baseline on the horizontal and angular mean pose errors, respectively. Our prototype is based on COTS components, which facilitates reproducibility and adoption of this novel class of systems.
Robotics
What problem does this paper attempt to address?
This paper aims to address the problem of human pose estimation using nano-quadcopters under extremely resource-constrained conditions. Specifically:

- **Research Background**: Nano-quadcopters, due to their small size (approximately 10 cm in diameter), are very suitable for working in confined environments and are harmless when operating near humans. However, these advantages also bring significant limitations in computational power and sensor resources.
- **Target Problem**: The goal of this paper is to achieve robust human pose estimation on such resource-constrained platforms. The authors improve existing camera-only methods by integrating depth information and image data, thereby enhancing the accuracy of pose estimation.
- **Main Contributions**:
  - Designed and analyzed various CNN models that fuse inputs from two complementary sensors (depth sensor and monocular camera);
  - Detailed a training pipeline from simulation to reality, which utilizes aggressive photometric augmentation and balanced label distribution;
  - Provided comprehensive field experiment results, demonstrating the system's performance under real-world conditions and comparing it with various configurations, including state-of-the-art baseline methods.
- **Experimental Results**: In a previously unseen flight arena, the system significantly outperformed the camera-only baseline methods, reducing horizontal pose error and angular pose error by 58% and 51%, respectively.

In summary, this paper proposes a novel method that combines depth information and image data to overcome the challenges of resource-constrained nano-quadcopters, achieving higher accuracy in human pose estimation.
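To make the idea of fusing the two sensors concrete, the sketch below shows one simple way an 8×8 depth grid could be aligned with a 320×240 camera frame: nearest-neighbour upsampling followed by channel stacking (input-level fusion). This is an illustration only, not the authors' architecture; the paper compares several fusion designs that this summary does not detail. The function names `upsample_nearest` and `early_fuse`, the single-channel (grayscale) image, and the sample values are all assumptions made for this example.

```python
from typing import List

Grid = List[List[float]]

IMG_W, IMG_H = 320, 240  # camera resolution stated in the abstract
DEPTH_N = 8              # 8x8 multi-zone depth sensor stated in the abstract


def upsample_nearest(depth: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour upsampling of a small depth grid to (out_h, out_w)."""
    in_h, in_w = len(depth), len(depth[0])
    return [
        [depth[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def early_fuse(image: Grid, depth: Grid) -> List[Grid]:
    """Stack an image and the upsampled depth map as two input channels.

    Assumption: the image is a single-channel (grayscale) grid of rows.
    """
    h, w = len(image), len(image[0])
    return [image, upsample_nearest(depth, h, w)]


# Hypothetical inputs: a flat gray image and a depth grid whose value
# encodes the zone index (row * 8 + column), so the mapping is easy to check.
image = [[0.5] * IMG_W for _ in range(IMG_H)]
depth = [[float(r * DEPTH_N + c) for c in range(DEPTH_N)] for r in range(DEPTH_N)]

fused = early_fuse(image, depth)  # 2 channels, each 240 rows x 320 columns
```

With these resolutions, each depth zone covers a 40×30 px block of the image, so the second channel is piecewise constant. A CNN taking `fused` as input would see both modalities at every pixel; fusing at a later layer instead is an equally plausible design.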