META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu
DOI: https://doi.org/10.18653/v1/2022.emnlp-main.449
2022-01-01
Abstract: Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current TOD systems usually focus on multi-turn text/speech interaction and then call back-end APIs designed for TODs to perform the task. However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking TOD-specific back-end APIs. Furthermore, we release META-GUI, a dataset for training a Multi-modal convErsaTional Agent on mobile GUI. We also propose a multi-modal action prediction and response model, which shows promising results on META-GUI. The dataset, code, and leaderboard are publicly available.