Abstract: Object detection in real-world scenarios with multi-modal inputs is crucial for safety-critical systems such as autonomous driving, security monitoring, and traffic management. Despite significant progress, existing methods still suffer from insufficient fusion, feature loss, and poor performance on images with complex textures and occlusions. In this paper, we propose a novel framework for multi-modal object detection, multi-modal EfficientDet with multi-scale CapsNet (MEDMCN). In MEDMCN, the depth information of the depth image and the texture details of the RGB image are integrated by our residual iterative bi-directional feature pyramid network (ResIBi-FPN) to overcome insufficient fusion and feature loss. In addition, a novel multi-scale CapsNet-based component, EfficientDet-Caps, serves as the detection head of MEDMCN; it allows MEDMCN to focus on the whole-part correlation and the spatial position relationships of entities, improving its performance in real-world scenarios with complex textures and occlusions. Extensive experiments on the MS COCO 2017 and MAVD datasets show that MEDMCN improves average precision (AP) over its baseline by +2.8 AP and +6.9 AP, respectively.

MEDMCN: a Novel Multi-Modal EfficientDet with Multi-Scale CapsNet for Object Detection