Multimodal fusion-powered English speaking robot

Ruiying Pan
DOI: https://doi.org/10.3389/fnbot.2024.1478181
IF: 3.493
2024-11-20
Frontiers in Neurorobotics
Abstract: Introduction: Speech recognition and multimodal learning are two critical areas in machine learning. Current multimodal speech recognition systems often encounter challenges such as high computational demands and model complexity. Methods: To overcome these issues, we propose a novel framework, EnglishAL-Net, a Multimodal Fusion-powered English Speaking Robot. This framework leverages the ALBEF model, optimizing it for real-time speech and multimodal interaction, and incorporates a newly designed text and image editor to fuse visual and textual information. The robot processes dynamic spoken input through the integration of Neural Machine Translation (NMT), enhancing its ability to understand and respond to spoken language. Results and discussion: In the experimental section, we constructed a dataset containing various scenarios and oral instructions for testing. The results show that compared to traditional unimodal processing methods, our model significantly improves both language understanding accuracy and response time. This research not only enhances the performance of multimodal interaction in robots but also opens up new possibilities for applications of robotic technology in education, rescue, customer service, and other fields, holding significant theoretical and practical value.
robotics, computer science, artificial intelligence, neurosciences