A perceptual manipulation system for audio-visual fusion of robots

WANG Yefei,GE Quanbo,LIU Huaping,LU Zhenyu
DOI: https://doi.org/10.11992/tis.202111036
2023-01-01
Abstract:The ability of intelligent robots to function in complex environments has been a longstanding challenge in the field of robotic applications. Referential expressions are frequently utilized for object positioning, making this method a common approach in robot interactions. However, relying on a single visual modality alone is not adequate for all tasks in real-world scenarios. This study proposes a robot perception system based on the fusion of visual and auditory modalities. The system employs a deep learning algorithm model to realize the visual and auditory perceptions of the robot,and it processes natural language and scene information for visual positioning and collects data from 12 types of sound signals for audio recognition. The experimental results indicate that the system integrated into the UR robot has a strong visual positioning ability and audio prediction, and it has successfully carried out an instruction-based audio-visual operation task. The results confirm that audio-visual data has a higher expressive capability than single-modal data.
What problem does this paper attempt to address?