Abstract: Cognitive grammar suggests that the acquisition of language grammar is grounded within visual structures. While grammar is an essential representation of natural language, it also exists ubiquitously in vision, where it represents hierarchical part-whole structure. In this work, we study grounded grammar induction of vision and language in a joint learning framework. Specifically, we present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously. We propose a novel contrastive learning framework to guide the joint learning of both modules. To provide a benchmark for the grounded grammar induction task, we collect a large-scale dataset, PARTIT, which contains human-written sentences that describe part-level semantics for 3D objects. Experiments on the PARTIT dataset show that VLGrammar outperforms all baselines in both image grammar induction and language grammar induction. The learned VLGrammar also benefits related downstream tasks: it improves unsupervised image clustering accuracy by 30% and performs well in image retrieval and text retrieval. Notably, the induced grammar generalizes readily to unseen categories. Code and pre-trained models are released at https://github.com/evelinehong/VLGrammar.

A Regularization-based Framework for Bilingual Grammar Induction

Multilingual Grammar Induction with Continuous Language Identification

Unsupervised Discriminative Induction of Synchronous Grammar for Machine Translation

Bilingually-Guided Monolingual Dependency Grammar Induction

Leveraging Grammar Induction for Language Understanding and Generation

Joint Learning of Constituency and Dependency Grammars by Decomposed Cross-Lingual Induction

Duality Regularization for Unsupervised Bilingual Lexicon Induction

A Universal Framework for Inductive Transfer Parsing Across Multi-typed Treebanks

Dependency Induction Through the Lens of Visual Perception

Learning a Grammar Inducer from Massive Uncurated Instructional Videos

VLGrammar: Grounded Grammar Induction of Vision and Language

Re-evaluating the Need for Multimodal Signals in Unsupervised Grammar Induction

Inducing Bilingual Lexica from Non-Parallel Data with Earth Mover's Distance Regularization

Bayesian Constituent Context Model for Grammar Induction

Grammar Induction from Visual, Speech and Text

Grammar induction by MDL-based distributional classification

Dependency Grammar Induction with Neural Lexicalization and Big Training Data

Bayesian Grammar Induction for Language Modeling

A Representation Learning Framework For Multi-Source Transfer Parsing

LLM-based Translation Inference with Iterative Bilingual Understanding

Synchronous Constituent Context Model for Inducing Bilingual Synchronous Structures