Abstract: Imitation learning and instruction-following are two common approaches to communicate a user's intent to a learning agent. However, as the complexity of tasks grows, it could be beneficial to use both demonstrations and language to communicate with an agent. In this work, we propose a novel setting where an agent is given both a demonstration and a description, and must combine information from both modalities. Specifically, given a demonstration for a task (the source task), and a natural language description of the differences between the demonstrated task and a related but different task (the target task), our goal is to train an agent to complete the target task in a zero-shot setting, that is, without any demonstrations for the target task. To this end, we introduce Language-Aided Reward and Value Adaptation (LARVA) which, given a source demonstration and a linguistic description of how the target task differs, learns to output a reward / value function that accurately describes the target task. Our experiments show that on a diverse set of adaptations, our approach is able to complete more than 95% of target tasks when using template-based descriptions, and more than 70% when using free-form natural language.

What problem does this paper attempt to address?

The paper addresses how to train an agent to complete a target task for which no demonstrations are available, given a demonstration of a related source task and a natural-language description of how the target task differs from it. Specifically, it proposes a new zero-shot setting in which the Language-Aided Reward and Value Adaptation (LARVA) model combines the source demonstration with the language description so that the agent can infer the goal of the target task and complete it. This setting is especially suitable when one or a few demonstrations must be extended to many related tasks, with natural language serving as a more convenient modality for conveying the details of complex tasks.

### Paper Background

Teaching a learning agent to perform new tasks is a core problem in artificial intelligence. Existing methods fall mainly into two categories: imitation learning and instruction-following. In imitation learning, the agent infers the demonstrator's intent from demonstrations of the task and learns a policy from them. However, as tasks grow more complex, providing new demonstrations for every new task becomes impractical. Instruction-following instead conveys the target task to the agent through natural language, but as tasks grow more complex, conveying their details in language also becomes more difficult.

### Paper Goals

The paper proposes a new paradigm that enables the agent to complete the target task under zero-shot conditions by combining demonstrations and natural-language descriptions. Specifically, the goals of the paper are:

1. **Combining Demonstrations and Language**: Use a demonstration of the source task and a natural-language description of the differences between the target task and the source task to enable the agent to infer the goal of the target task.
2. **Zero-shot Task Adaptation**: Enable the agent to complete the target task without any demonstrations of the target task.
3. **Handling Multiple Task Adaptation Situations**: Handle several kinds of adaptation, such as missing steps, extra steps, and exchanged final positions of objects.

### Main Contributions

1. **Proposing the LARVA Model**: The LARVA model predicts the reward or value function of the target task from a demonstration of the source task and a natural-language description.
2. **Experimental Verification**: Experiments verify the effectiveness of the LARVA model across multiple adaptation types. With template-generated descriptions the success rate exceeds 95%, and with free-form natural-language descriptions it exceeds 70%.
3. **Cross-task Adaptation**: The paper demonstrates the generalization ability of the LARVA model across different task types, especially its performance in handling complex task dynamics.

### Experimental Environment

The experiments were conducted in the "Organizer Environment", which contains an organizer with three tiers of shelves, each of which can hold different objects. The tasks involve 2 or 3 objects, and a total of 285,120 states were generated. The action space includes 30 movement actions and a termination action.

### Data Set

The data set includes 6,600 pairs of source and target tasks, of which 6,000 pairs use template-generated descriptions and 600 pairs use natural-language descriptions collected through Amazon Mechanical Turk. The data set is divided into training, validation, and test sets.

### Experimental Results

The LARVA model achieves a success rate of over 95% with template-generated descriptions and over 70% with natural-language descriptions. The paper also verifies the importance of each module through ablation experiments, especially the impact of the goal prediction module on overall performance.

### Conclusion

The LARVA model completes target tasks under zero-shot conditions by combining demonstrations of the source task with natural-language descriptions. This method offers a new approach to cross-task adaptation for complex tasks.
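To make the setting summarized above more concrete, here is a toy Python sketch of the three adaptation types (missing step, extra step, exchanged final positions) and of turning an adapted goal into a reward function. It is not the LARVA model: LARVA learns this mapping from a demonstration and a language description, whereas this sketch hand-codes the adaptation. The goal representation, the helper names `adapt_goal` and `make_reward`, and the object names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a task goal is an ordered list of (object, shelf) placements.
Step = Tuple[str, int]
Goal = List[Step]
State = Dict[str, int]  # object name -> shelf index it currently sits on


def adapt_goal(source: Goal, kind: str, **kw) -> Goal:
    """Apply a hand-coded adaptation to the source goal (hypothetical helper)."""
    target = list(source)
    if kind == "missing_step":    # target omits one step of the source
        target.pop(kw["index"])
    elif kind == "extra_step":    # target adds one step to the source
        target.insert(kw["index"], kw["step"])
    elif kind == "swap_final":    # two objects exchange final positions
        i, j = kw["i"], kw["j"]
        (obj_i, pos_i), (obj_j, pos_j) = target[i], target[j]
        target[i], target[j] = (obj_i, pos_j), (obj_j, pos_i)
    else:
        raise ValueError(f"unknown adaptation: {kind}")
    return target


def make_reward(goal: Goal) -> Callable[[State], float]:
    """Return a sparse reward: 1.0 when every placement in the goal holds."""
    def reward(state: State) -> float:
        return 1.0 if all(state.get(obj) == pos for obj, pos in goal) else 0.0
    return reward


# Hypothetical source goal: book on shelf 0, cup on shelf 2.
source_goal = [("book", 0), ("cup", 2)]
# "Swap the final positions of the book and the cup."
target_goal = adapt_goal(source_goal, "swap_final", i=0, j=1)
reward_fn = make_reward(target_goal)

print(target_goal)                       # [('book', 2), ('cup', 0)]
print(reward_fn({"book": 2, "cup": 0}))  # 1.0 (target goal reached)
print(reward_fn({"book": 0, "cup": 2}))  # 0.0 (still the source goal)
```

In the paper's setting, the adaptation type and its arguments are not given explicitly; they must be inferred from the language description, which is the part a learned model such as LARVA is meant to handle.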

Zero-shot Task Adaptation using Natural Language

RL Zero: Zero-Shot Language to Behaviors without any Supervision

Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation

AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations

Zero-Shot Visual Imitation

Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data

Zero-Shot Compositional Policy Learning via Language Grounding

Zero-Shot Adaptive Transfer for Conversational Language Understanding

Towards Few-shot Coordination: Revisiting Ad-hoc Teamplay Challenge In the Game of Hanabi

Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation

LaDA: Latent Dialogue Action For Zero-shot Cross-lingual Neural Network Language Modeling

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations

Intra-agent speech permits zero-shot task acquisition

Skill Induction and Planning with Latent Language

Language Models as Zero-Shot Trajectory Generators

Zero-Shot Learning of Text Adventure Games with Sentence-Level Semantics

Using Natural Language for Reward Shaping in Reinforcement Learning

Zero-shot Sim2Real Adaptation Across Environments