Abstract: This work explores the zero-shot adaptation capability of semantic skills, semantically interpretable experts' behavior patterns, in cross-domain settings, where a user input in interleaved multi-modal snippets can prompt a new long-horizon task for different domains. In these cross-domain settings, we present a semantic skill translator framework SemTra which utilizes a set of multi-modal models to extract skills from the snippets, and leverages the reasoning capabilities of a pretrained language model to adapt these extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, seq-to-seq translation by the language model transforms the extracted skills into a semantic skill sequence, which is tailored to fit the cross-domain contexts. Skill adaptation focuses on optimizing each semantic skill for the target domain context, through parametric instantiations that are facilitated by language prompting and contrastive learning-based context inferences. This hierarchical adaptation empowers the framework to not only infer a complex task specification in one-shot from the interleaved multi-modal snippets, but also adapt it to new domains with zero-shot learning abilities. We evaluate our framework with Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results clarify the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases, such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations.

What problem does this paper attempt to address?

This paper addresses zero-shot policy adaptation in cross-domain settings, in particular for new long-horizon tasks prompted by multi-modal user input. It proposes SemTra, a semantic skill translator framework that extracts semantic skills from interleaved multi-modal snippets and adapts them to a target domain. SemTra uses a two-level adaptation hierarchy: task adaptation and skill adaptation. During task adaptation, a pretrained language model translates the skills extracted from the snippets into a semantic skill sequence tailored to the cross-domain context. During skill adaptation, each semantic skill is optimized for the target domain context through parametric instantiation, supported by language prompting and contrastive learning-based context inference.

The contributions of SemTra include:

1. Proposing the SemTra framework to handle cross-domain long-horizon tasks for practical problem-solving.
2. Designing a hierarchical adaptation algorithm that uses a pretrained language model for task adaptation and, for skill adaptation, separates semantic skills from domain contexts through parametric instantiation.
3. Evaluating SemTra in multiple simulated environments (Meta-World, Franka Kitchen, RLBench, and CARLA), showing its applicability to practical use cases such as cognitive robots and autonomous driving.

The paper notes that although zero-shot policy adaptation has great potential in key domains such as autonomous driving and robotics, it still faces challenges such as task complexity and environment dynamics. By decomposing adaptation into task-level and skill-level steps, SemTra can carry out tasks in new domains without additional training in those domains.
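The two-level hierarchy described above can be illustrated with a small, purely hypothetical sketch. It is not the authors' implementation: the lookup table stands in for seq-to-seq translation by a pretrained language model, the nearest-prototype rule stands in for a contrastively trained context encoder, and every name here (`ParametricSkill`, `task_adaptation`, `infer_context`, `skill_adaptation`, the skill strings, and the context labels) is an assumption made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ParametricSkill:
    """A semantic skill paired with an inferred domain context (hypothetical type)."""
    name: str
    context: str


def task_adaptation(extracted_skills: Sequence[str],
                    translation_table: Dict[str, List[str]]) -> List[str]:
    """Toy stand-in for language-model translation of extracted skills.

    Each extracted skill is mapped to one or more target-domain skills;
    skills missing from the table are passed through unchanged.
    """
    sequence: List[str] = []
    for skill in extracted_skills:
        sequence.extend(translation_table.get(skill, [skill]))
    return sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def infer_context(observation_embedding: Sequence[float],
                  context_prototypes: Dict[str, Sequence[float]]) -> str:
    """Pick the context whose prototype is most similar to the observation.

    Assumption: a learned encoder would produce these embeddings; here
    they are given directly as plain lists of floats.
    """
    return max(context_prototypes,
               key=lambda name: cosine(observation_embedding,
                                       context_prototypes[name]))


def skill_adaptation(skill_sequence: Sequence[str],
                     observation_embedding: Sequence[float],
                     context_prototypes: Dict[str, Sequence[float]]
                     ) -> List[ParametricSkill]:
    """Instantiate each semantic skill with the inferred domain context."""
    context = infer_context(observation_embedding, context_prototypes)
    return [ParametricSkill(name, context) for name in skill_sequence]


# Usage with made-up skills and contexts.
table = {"push button": ["reach button", "push button"]}
prototypes = {"slow": [1.0, 0.0], "fast": [0.0, 1.0]}
skills = task_adaptation(["open drawer", "push button"], table)
plan = skill_adaptation(skills, [0.2, 0.9], prototypes)
```

The point of the split is that the skill sequence (what to do) is decided independently of the context parameter (how to do it in this domain), so either level can change without retraining the other.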

SemTra: A Semantic Skill Translator for Cross-Domain Zero-Shot Policy Adaptation
