Abstract: Language is highly structured, with syntactic and semantic structures that are, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this dissertation we consider a family of machine learning tasks that aim to learn language structures through grounding. We seek distant supervision from other data sources (i.e., grounds), including but not limited to other modalities (e.g., vision), execution results of programs, and other languages. We demonstrate the potential of this task formulation and advocate for its adoption through three schemes. In Part I, we consider learning syntactic parses through visual grounding. We propose the task of visually grounded grammar induction, present the first models to induce syntactic structures from visually grounded text and speech, and find that visual grounding signals can improve parsing quality over language-only models. As a side contribution, we propose a novel evaluation metric that enables the evaluation of speech parsing without involving text or automatic speech recognition systems. In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures (i.e., programs), significantly improving compositional generalization and few-shot program synthesis. In Part III, we propose methods that learn language structures from annotations in other languages. Specifically, we propose a method that sets a new state of the art on cross-lingual word alignment. We then leverage the learned word alignments to improve the performance of zero-shot cross-lingual dependency parsing by proposing a novel substructure-based projection method that preserves structural knowledge learned from the source language.

Jointly Learning Grounded Task Structures from Language Instruction and Visual Demonstration

Task Learning Through Visual Demonstration and Situated Dialogue

Grounding Language for Robotic Manipulation via Skill Library

Towards Learning from Demonstration System for Parts Assembly: A Graph Based Representation for Knowledge

Grounding Language with Visual Affordances over Unstructured Data

Robot Learning With A Spatial, Temporal, And Causal And-Or Graph

Grounding Language Plans in Demonstrations Through Counterfactual Perturbations

Modeling Long-horizon Tasks as Sequential Interaction Landscapes

Collaborative Language Grounding Toward Situated Human‐Robot Dialogue

Grounding Language Models in Autonomous Loco-manipulation Tasks

Interactive Learning of State Representation through Natural Language Instruction and Explanation

Interactive Robot Learning of Gestures, Language and Affordances

Learning Instruction-Guided Manipulation Affordance via Large Models for Embodied Robotic Tasks

Learning Language Structures through Grounding

Grounding Robot Policies with Visuomotor Language Guidance

Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents

Modular Framework for Visuomotor Language Grounding

Continual Skill and Task Learning via Dialogue

Learning to communicate about shared procedural abstractions

Generalized Grounding Graphs: A Probabilistic Framework for Understanding Grounded Commands

Gated-Attention Architectures for Task-Oriented Language Grounding