Abstract: Human language is full of compositional syntactic structures, and although neural networks have contributed to groundbreaking improvements in computer systems that process language, widely used neural network architectures still exhibit limitations in their ability to process syntax. To address this issue, prior work has proposed adding stack data structures to neural networks, drawing inspiration from theoretical connections between syntax and stacks. However, these methods employ deterministic stacks that are designed to track one parse at a time, whereas syntactic ambiguity, which requires a nondeterministic stack to parse, is extremely common in language. In this dissertation, we remedy this discrepancy by proposing a method of incorporating nondeterministic stacks into neural networks. We develop a differentiable data structure that efficiently simulates a nondeterministic pushdown automaton, representing an exponential number of computations with a dynamic programming algorithm. We incorporate this module into two predominant architectures: recurrent neural networks (RNNs) and transformers. We show that this raises their formal recognition power to arbitrary context-free languages, and also aids training, even on deterministic context-free languages. Empirically, neural networks with nondeterministic stacks learn context-free languages much more effectively than prior stack-augmented models, including a language with theoretically maximal parsing difficulty. We also show that an RNN augmented with a nondeterministic stack is capable of surprisingly powerful behavior, such as learning cross-serial dependencies, a well-known non-context-free pattern. We demonstrate improvements on natural language modeling and provide analysis on a syntactic generalization benchmark. This work represents an important step toward building systems that learn to use syntax in a more human-like fashion.
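
To illustrate why a single deterministic stack is not always enough, the following minimal sketch simulates a nondeterministic pushdown automaton for the even-length palindromes { w wᴿ }, a standard context-free language in which the midpoint must be guessed. This is not the dissertation's method: it tracks an explicit set of (phase, stack) configurations rather than using a differentiable dynamic programming algorithm, and the function name `accepts_even_palindrome` is hypothetical, chosen only for this example.

```python
def accepts_even_palindrome(s: str) -> bool:
    """Simulate a nondeterministic PDA for { w w^R } by tracking all live runs."""
    # A configuration is (phase, stack): phase 0 = still pushing the first half,
    # phase 1 = popping while matching the second half.
    configs = {(0, ())}
    for ch in s:
        next_configs = set()
        for phase, stack in configs:
            if phase == 0:
                # Guess 1: we are still in the first half, so push the symbol.
                next_configs.add((0, stack + (ch,)))
                # Guess 2: the midpoint was just before this symbol, so start popping.
                if stack and stack[-1] == ch:
                    next_configs.add((1, stack[:-1]))
            elif stack and stack[-1] == ch:
                # Popping phase: the symbol must match the top of the stack.
                next_configs.add((1, stack[:-1]))
        configs = next_configs
    # Accept if some run emptied its stack in the popping phase
    # (the empty string is accepted trivially, with w empty).
    return any(stack == () and (phase == 1 or s == "") for phase, stack in configs)


if __name__ == "__main__":
    for word in ["abba", "aa", "abab", "aaa"]:
        print(word, accepts_even_palindrome(word))
```

Each element of `configs` corresponds to one run of the automaton. A deterministic stack would have to commit to a single run, whereas here every guess about the midpoint is kept alive until the input rules it out.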

A Transformer with Stack Attention

Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Nondeterministic Stacks in Neural Networks

Selective Attention Improves Transformer

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Transformers learn variable-order Markov chains in-context

Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

Relaxed Attention for Transformer Models

Agglomerative Attention

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

The Antecedents of Transformer Models

Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention

Transformers are Universal In-context Learners

Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models

Learning Hierarchical Structures with Differentiable Nondeterministic Stacks

Attention is All you Need

Dynamic Evaluation of Transformer Language Models

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Predictive Attention Transformer: Improving Transformer with Attention Map Prediction