Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space

Motonari Kambara, Komei Sugiura
2023-11-07
Abstract: This paper aims to develop a framework that enables a robot to execute tasks based on visual information, in response to natural language instructions for Fetch-and-Carry with Object Grounding (FCOG) tasks. Although there have been many frameworks, they usually rely on manually given instruction sentences. Therefore, evaluations have only been conducted with fixed tasks. Furthermore, many multimodal language understanding models for the benchmarks only consider discrete actions. To address the limitations, we propose a framework for the full automation of the generation, execution, and evaluation of FCOG tasks. In addition, we introduce an approach to solving the FCOG tasks by dividing them into four distinct subtasks.
Robotics, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to develop a framework that enables robots to perform Fetch-and-Carry with Object Grounding (FCOG) tasks, following natural language instructions and using visual information. Existing frameworks usually rely on manually given instruction sentences, so evaluation has been limited to fixed tasks. In addition, many multimodal language understanding models built for these benchmarks consider only discrete actions, which makes them hard to apply in the real world, where continuous actions are required. To overcome these limitations, the authors propose a fully automated framework for generating, executing, and evaluating FCOG tasks, and they solve the FCOG tasks by dividing them into four distinct subtasks.

Specifically, the goals of this framework include:

- **Automated task generation**: Automatically generate tasks in the simulation environment, including the selection of environment settings, target objects, and destinations.
- **Automated task execution**: The robot executes tasks from natural language instructions using visual information, including navigating to the specified location, identifying the target object, grasping the target object, and placing it at the destination.
- **Automated task evaluation**: Evaluate the task execution process in real time, determine whether the task succeeded or failed, and initialize the environment to start the next task session.

With this approach, the authors hope to improve the practical usefulness of robots in scenarios such as home care, particularly their ability to understand and execute natural language instructions.
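The generate, execute, evaluate cycle described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: every name here (`Task`, `generate_task`, `execute_task`, `evaluate`, `run_sessions`, the `policy` callback, and the example environments and objects) is hypothetical, and treating the four execution steps listed above as the paper's four subtasks is an assumption.

```python
import random
from dataclasses import dataclass

# Assumed decomposition: the four execution steps named in the summary.
SUBTASKS = ("navigate", "identify_target", "grasp", "place")


@dataclass
class Task:
    environment: str
    target_object: str
    destination: str
    instruction: str


def generate_task(rng, environments, objects, destinations):
    """Automated generation: sample a setting, a target, and a destination."""
    env = rng.choice(environments)
    obj = rng.choice(objects)
    dest = rng.choice(destinations)
    # Hypothetical template; the source does not say how instructions are built.
    return Task(env, obj, dest, f"Carry the {obj} to the {dest}.")


def execute_task(task, policy):
    """Automated execution: run subtasks in order, stopping at the first failure."""
    completed = []
    for name in SUBTASKS:
        if not policy(name, task):
            break
        completed.append(name)
    return completed


def evaluate(completed):
    """Automated evaluation: success only if every subtask finished."""
    return len(completed) == len(SUBTASKS)


def run_sessions(n, policy, seed=0):
    """Loop over sessions; a real simulator would be reset after each one."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        task = generate_task(
            rng,
            environments=["room_a", "room_b"],
            objects=["mug", "bottle"],
            destinations=["shelf", "table"],
        )
        results.append(evaluate(execute_task(task, policy)))
        # Environment re-initialization for the next session would go here.
    return results
```

The `policy` callback stands in for whatever model or controller handles each subtask; it receives the subtask name and the task and returns `True` on success. Stopping at the first failed subtask keeps the evaluation a simple all-or-nothing check per session.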