Cross-modal Task Understanding and Execution of Voice-fingertip Reading Instruction by Using Small Family Service Robotic

Zhihui Zhou, Shiqiang Zhu, Kaivuan Zhu, Chao Cheng, Jason Gu
DOI: https://doi.org/10.1109/CBS55922.2023.10115355
2022-01-01
Abstract: Correctly understanding human task instructions is a basic requirement for family service robots to carry out their work. In everyday household scenarios, single-modal voice commands often contain pronouns whose referents are missing, so the robot cannot identify the target object to operate on. This paper proposes a novel cross-modal framework for instruction task understanding and execution that fuses speech and visual information and runs on a computing architecture of robot terminal and cloud server. The speech recognition result is fed into a pre-trained BERT model to obtain the first-level task classification label. The robot then turns its camera toward the location of the sound source. Using a lightweight visual object detection model to locate the target area pointed to by the finger, the robot confirms the visual instruction and the object entity, and obtains the semantic label of the visual entity. The visual entity semantic label and the first-level task classification label are then fused to obtain the second-level subtask classification label. Experimental results confirm that the framework can be used for robot task understanding and execution of cross-modal instructions, and should help promote the application of family service robots.
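The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label names, the detection format, the containment rule for choosing the pointed-to object, and the lookup table are all hypothetical, since the abstract does not specify them. The sketch assumes the first-level task label and the object detections have already been produced upstream.

```python
from typing import Optional

# Hypothetical detection format: (semantic_label, (x_min, y_min, x_max, y_max)).
Detection = tuple[str, tuple[float, float, float, float]]

# Hypothetical fusion table mapping (first-level task label, visual entity label)
# to a second-level subtask label. The paper's real label set is not given here.
SUBTASK_TABLE: dict[tuple[str, str], str] = {
    ("read", "book"): "read_book_page",
    ("read", "word_card"): "read_word_card",
    ("fetch", "cup"): "fetch_cup",
}


def pointed_entity(
    fingertip: tuple[float, float], detections: list[Detection]
) -> Optional[str]:
    """Return the label of the smallest detected box that contains the fingertip."""
    x, y = fingertip
    hits = [
        ((x1 - x0) * (y1 - y0), label)
        for label, (x0, y0, x1, y1) in detections
        if x0 <= x <= x1 and y0 <= y <= y1
    ]
    # Prefer the smallest containing box, assumed to be the most specific object.
    return min(hits)[1] if hits else None


def fuse_labels(task_label: str, entity_label: Optional[str]) -> str:
    """Fuse the first-level task label with the visual entity label."""
    if entity_label is None:
        # No object under the fingertip: the pronoun reference stays unresolved.
        return "ask_for_clarification"
    return SUBTASK_TABLE.get((task_label, entity_label), "unsupported_subtask")


detections = [("book", (0, 0, 100, 100)), ("word_card", (20, 20, 40, 40))]
entity = pointed_entity((30, 30), detections)
subtask = fuse_labels("read", entity)
```

In this sketch a spoken command such as "read this" would yield the hypothetical task label "read", and the fingertip position resolves "this" to the object under the finger. A lookup table is only one possible fusion rule; a learned classifier over both labels would fit the same interface.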