Generating Explanations for Embodied Action Decision from Visual Observation

Xiaohan Wang, Yuehu Liu, Xinhang Song, Beibei Wang, Shuqiang Jiang
DOI: https://doi.org/10.1145/3581783.3612351
2023-01-01
Abstract: Gaining trust is crucial for embodied agents (such as robots and autonomous vehicles) to collaborate with humans, especially non-experts. The most direct way to reach mutual understanding is through natural language explanation. Existing research considers generating visual explanations for object recognition, while explaining embodied decisions remains unexplored. In this paper, we study generating action decisions and explanations based on visual observation. Unlike explanations for recognition, justifying an action requires showing why it is better than other actions. In addition, an understanding of scene structure is required, since the agent needs to interact with the environment (e.g., navigating, moving objects). We introduce a new dataset, THOR-EAE (Embodied Action Explanation), collected with the AI2-THOR simulator. The dataset consists of over 840,000 egocentric images of indoor embodied observations, annotated with optimal action labels and explanation sentences. For efficient annotation, we develop an explainable decision-making criterion that considers scene layout and action attributes. We propose a graph action justification model that uses graph neural networks to represent obstacle-surroundings relations and justifies actions under the guidance of decision results. Experimental results on the THOR-EAE dataset demonstrate how challenging the dataset is and the effectiveness of the proposed method.