Multi-modal Human-machine Conversation System for Real Physical World

Yitao Chen, Shibo Nie, Mandan Guan, Jie Wang, Ruoyi Du, Dongliang Chang, Kongming Liang, Zhanyu Ma
DOI: https://doi.org/10.1109/mmsp55362.2022.9949120
2022-01-01
Abstract: Enabling machines to process multi-modal information and understand the real physical world is an important step toward free human-machine conversation. However, previous human-machine conversation systems are mostly limited to a single modality (e.g., chat robots), a single round (e.g., visual Q&A), or static visual information (e.g., visual dialogue). To address these limitations, we develop a multi-modal human-machine conversation system for specific application scenarios. The system includes two modules: (i) an interactive visual grounding module that can actively disambiguate users' queries, and (ii) an interactive fine-grained recognition module that can model objects in the 3D environment and actively ask for missing visual information. A video demo of our system in an automobile sales scenario is available at https://drive.google.com/file/d/1IfBsMKq55ryLOZchIT6G-R5CyWnC3Skiew?usp=sharing.
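The abstract does not describe how the interactive visual grounding module disambiguates a query. The toy Python sketch below only illustrates the general idea of asking a clarifying question when several objects match a request. It is not the authors' implementation: the `Candidate` class, the attribute dictionaries, and the functions `matches`, `pick_question`, and `ground` are all hypothetical, and a real system would operate on detector outputs and natural language rather than symbolic attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A detected object with symbolic attributes (hypothetical representation)."""
    name: str
    attributes: dict = field(default_factory=dict)


def matches(candidate, constraints):
    """True if the candidate satisfies every attribute constraint given so far."""
    return all(candidate.attributes.get(k) == v for k, v in constraints.items())


def pick_question(candidates, constraints):
    """Return an unconstrained attribute whose value differs among the candidates."""
    keys = sorted({k for c in candidates for k in c.attributes})
    for key in keys:
        if key in constraints:
            continue
        values = {c.attributes.get(key) for c in candidates}
        if len(values) > 1:
            return key
    return None


def ground(candidates, constraints, answer_fn, max_rounds=5):
    """Ask clarifying questions until at most one candidate remains."""
    constraints = dict(constraints)  # copy so the caller's dict is not mutated
    remaining = [c for c in candidates if matches(c, constraints)]
    for _ in range(max_rounds):
        if len(remaining) <= 1:
            break
        key = pick_question(remaining, constraints)
        if key is None:
            break  # candidates cannot be told apart with the known attributes
        constraints[key] = answer_fn(key)  # the user's reply to "which <key>?"
        remaining = [c for c in remaining if matches(c, constraints)]
    return remaining
```

In this sketch, `answer_fn` stands in for the user: it receives the attribute the system asks about and returns the user's answer. For a query such as "the sedan" with a red and a blue sedan in view, `ground` would ask about `color` and then return the single matching car.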