GMM Enabled by Multimodal Information Fusion Network for Detection and Motion Planning of Robotic Liquid Pouring

Zhongli Wang, Guohui Tian, Shijie Guo
DOI: https://doi.org/10.1109/TNNLS.2024.3476685
2024-10-21
Abstract: When humans perform pouring tasks, they exhibit consistent accuracy regardless of the liquid type, container, or environmental conditions. This proficiency stems from their ability to use both vision and hearing effectively while also accounting for various other factors. In robotic liquid pouring, however, multimodal information is rarely combined effectively to achieve automatic control. To address this limitation, a multimodal information fusion network (MMFNet) is designed for estimating liquid height and pouring state. The MMFNet employs cross-attention networks and motion features to enhance visual features (VFs). Multimodal transformers are then used to fuse audio features with the enhanced VFs, enabling the MMFNet to estimate both liquid height and pouring state accurately. Finally, the detection results are combined with demonstration learning so that robots learn a pouring motion trajectory encoded by a Gaussian mixture model (GMM). The experimental results demonstrate that MMFNet significantly improves the detection accuracy of liquid height and pouring state. Furthermore, with the GMM enabled by MMFNet, robots can acquire robust pouring motion planning, which enhances their ability to perform pouring tasks.
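The cross-attention step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' MMFNet. It assumes visual features act as queries and motion features supply keys and values, which the abstract does not specify, and every name, shape, and weight matrix below is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention with a residual add.

    queries: (n_q, d) features to be enhanced (assumed: visual tokens)
    context: (n_c, d) features attended to (assumed: motion tokens)
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_c) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    enhanced = queries + weights @ v         # residual connection
    return enhanced, weights


# Hypothetical toy inputs: 5 visual tokens and 7 motion tokens, 8 dims each.
rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(5, d))
motion = rng.normal(size=(7, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

enhanced, weights = cross_attention(visual, motion, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the motion tokens, so every enhanced visual token is its original feature plus a weighted mix of projected motion features. A trained network would learn the projection matrices rather than sample them at random.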