Visual-and-Language Multimodal Fusion for Sweeping Robot Navigation Based on CNN and GRU

Yiping Zhang, Kolja Wilker
DOI: https://doi.org/10.4018/joeuc.338388
2024-02-20
Journal of Organizational and End User Computing
Abstract: Effectively fusing information between the visual and language modalities remains a significant challenge. To achieve deep integration of natural language and visual information, this research introduces a multimodal fusion neural network model that combines visual information (RGB images and depth maps) with language information (natural language navigation instructions). First, the authors use Faster R-CNN and ResNet50 to extract image features, and an attention mechanism to further extract effective information. Second, a GRU model is used to extract language features. Finally, another GRU model fuses the visual-language features and retains the history information in order to give the next action instruction to the robot. Experimental results demonstrate that the proposed method effectively addresses the localization and decision-making challenges for robotic vacuum cleaners.
information science & library science; management; computer science, information systems
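The three stages named in the abstract (attended visual features, a GRU language encoder, and a second GRU that fuses both while carrying history) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no layer sizes, attention form, or action set, so everything concrete here is an assumption. Random vectors stand in for Faster R-CNN/ResNet50 region features and for word embeddings, the attention is plain dot-product attention, the GRU cell omits bias terms, the weights are untrained, and the `ACTIONS` list is hypothetical.

```python
import math
import random

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]  # hypothetical action set

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Minimal GRU cell without biases (one common gate convention)."""
    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        self.Wz = rand_matrix(hidden_size, n, rng)  # update gate
        self.Wr = rand_matrix(hidden_size, n, rng)  # reset gate
        self.Wh = rand_matrix(hidden_size, n, rng)  # candidate state

    def step(self, x, h):
        xh = x + h  # list concatenation [x; h]
        z = [sigmoid(v) for v in matvec(self.Wz, xh)]
        r = [sigmoid(v) for v in matvec(self.Wr, xh)]
        x_rh = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in matvec(self.Wh, x_rh)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def encode_instruction(token_vectors, gru, hidden_size):
    """Run the language GRU over the token vectors; return the final state."""
    h = [0.0] * hidden_size
    for tok in token_vectors:
        h = gru.step(tok, h)
    return h

def attend(regions, query):
    """Dot-product attention of a query over region feature vectors."""
    scores = [sum(a * b for a, b in zip(r, query)) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return context, weights

def navigate_step(regions, lang, fusion_gru, W_out, h_prev):
    """One decision step: attend, fuse with history, pick an action."""
    context, weights = attend(regions, lang)
    h = fusion_gru.step(context + lang, h_prev)  # h carries the history
    logits = matvec(W_out, h)
    best = max(range(len(logits)), key=logits.__getitem__)
    return ACTIONS[best], h, weights

rng = random.Random(0)
DIM = 8  # assumed feature size, shared by both modalities for simplicity
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
lang = encode_instruction(tokens, GRUCell(DIM, DIM, rng), DIM)
fusion_gru = GRUCell(2 * DIM, DIM, rng)
W_out = rand_matrix(len(ACTIONS), DIM, rng)

h = [0.0] * DIM
for _ in range(3):  # three frames, each with four stand-in region features
    regions = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]
    action, h, weights = navigate_step(regions, lang, fusion_gru, W_out, h)
    print(action, [round(w, 3) for w in weights])
```

The point of the sketch is the data flow: the fusion GRU's hidden state `h` is passed from frame to frame, which is how history is retained between successive action decisions. With untrained weights the chosen actions are meaningless.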