Abstract: While recurrent models have been effective in NLP tasks, their performance on context-free languages (CFLs) has been found to be quite weak. Given that CFLs are believed to capture important phenomena such as hierarchical structure in natural languages, this discrepancy in performance calls for an explanation. We study the performance of recurrent models on Dyck-n languages, a particularly important and well-studied class of CFLs. We find that while recurrent models generalize nearly perfectly if the lengths of the training and test strings are from the same range, they perform poorly if the test strings are longer. At the same time, we observe that recurrent models are expressive enough to recognize Dyck words of arbitrary lengths in finite precision if their depths are bounded. Hence, we evaluate our models on samples generated from Dyck languages with bounded depth and find that they are indeed able to generalize to much greater lengths. Since natural language datasets have nested dependencies of bounded depth, this may help explain why recurrent models perform well in modeling hierarchical dependencies in natural language data despite prior works indicating poor generalization performance on Dyck languages. We perform probing studies to support our results and provide comparisons with Transformers.
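
To make the terms "Dyck word" and "depth" in the abstract concrete, here is a minimal Python sketch of a stack-based check. It is an illustration only, not code from the paper: the function name `dyck_depth` and the choice of round and square brackets for Dyck-2 are assumptions made for this example.

```python
# Illustrative sketch (not from the paper): Dyck-2 over two bracket types.
# PAIRS maps each closing bracket to its matching opening bracket.
PAIRS = {")": "(", "]": "["}


def dyck_depth(word, pairs=PAIRS):
    """Return the maximum nesting depth of `word` if it is a Dyck word, else None."""
    openers = set(pairs.values())
    stack = []
    max_depth = 0
    for ch in word:
        if ch in openers:
            # An opening bracket is pushed; the stack height is the current depth.
            stack.append(ch)
            max_depth = max(max_depth, len(stack))
        elif ch in pairs:
            # A closing bracket must match the most recent unmatched opener.
            if not stack or stack.pop() != pairs[ch]:
                return None
        else:
            # Symbol outside the alphabet.
            return None
    # Every opener must have been closed.
    return max_depth if not stack else None


print(dyck_depth("([])[]"))  # well nested, deepest point is "(["
print(dyck_depth("([)]"))    # brackets cross, so not a Dyck word
```

The stack height can grow without bound as strings get longer, which is the part a finite-precision model cannot track in general. Bounding the depth removes that requirement.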

What problem does this paper attempt to address?

The problem this paper attempts to address is the capability and limitations of Recurrent Neural Networks (RNNs) in recognizing Context-Free Languages (CFLs). Specifically, the authors focus on how well RNNs recognize Dyck languages, an important class of context-free languages.

### Background and Motivation

- **Background**: Although recurrent models such as RNNs and LSTMs perform well in natural language processing (NLP) tasks, their performance in recognizing context-free languages (CFLs) is relatively weak. CFLs are believed to capture the hierarchical structure in natural languages, so this performance discrepancy needs to be explained.
- **Motivation**: Studying the ability of RNNs to recognize Dyck languages can help explain their performance in modeling the hierarchical structure of natural languages.

### Main Contributions

1. **Experimental Setup**:
   - The authors considered three Dyck languages: Dyck-2, Dyck-3, and Dyck-4.
   - Three different types of training and validation sets were generated, including randomly sampled strings and depth-limited strings.
2. **Experimental Results**:
   - When the test strings come from the same length range as the training strings and the depth is not limited, the LSTM generalizes well.
   - When the test strings are longer than the training strings, the performance of the LSTM drops significantly.
   - When both the training and test strings are depth-limited, the LSTM can generalize to longer strings.
   - The Transformer struggles to generalize to longer strings in all cases, possibly because it receives unseen positional encodings during testing.
3. **Theoretical Analysis**:
   - The authors constructed an RNN that directly simulates a deterministic pushdown automaton (PDA), demonstrating that RNNs with infinite precision can recognize any deterministic context-free language.
   - Fixed-precision RNNs can recognize strings of arbitrary length when the stack depth is limited.
4. **Probing Experiments**:
   - Probing experiments support these findings by extracting the stack depth and stack elements from the hidden states of the LSTM.

### Discussion

- **LSTM vs Transformer**: The LSTM performs well under depth-limited conditions, while the Transformer has no problem handling inputs within the training length range but generalizes poorly to longer strings.
- **Hierarchical Structure in Natural Language**: Natural language data typically contains nested dependencies of limited depth, which may explain why the LSTM performs well in modeling the hierarchical structure of natural languages.

### Conclusion

- This study characterizes the performance of RNNs in recognizing Dyck languages, in particular showing that the LSTM can generalize well to longer strings under depth-limited conditions. This helps explain why the LSTM performs well on natural language data despite its limitations in recognizing Dyck languages of unbounded depth.
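
The claim that bounded depth makes arbitrary-length recognition possible in fixed precision can be illustrated with a small Python sketch. It is not the authors' RNN construction. It only shows the underlying reason: once the stack height is capped, the set of possible stack contents is finite, so a recognizer needs only finitely many states regardless of string length. The names `bounded_dyck_states` and `accepts_bounded` are hypothetical, chosen for this example.

```python
from itertools import product

# Illustrative alphabet (an assumption for this sketch): Dyck-2.
PAIRS = {")": "(", "]": "["}


def bounded_dyck_states(openers, max_depth):
    """Enumerate every stack content of height <= max_depth (a finite state set)."""
    return [s for d in range(max_depth + 1) for s in product(openers, repeat=d)]


def accepts_bounded(word, pairs, max_depth):
    """Accept `word` iff it is a Dyck word whose nesting depth never exceeds max_depth."""
    openers = set(pairs.values())
    state = ()  # current stack content, encoded as a tuple of openers
    for ch in word:
        if ch in openers:
            if len(state) == max_depth:
                return False  # would exceed the depth bound
            state = state + (ch,)
        elif ch in pairs and state and state[-1] == pairs[ch]:
            state = state[:-1]  # matching closer pops the top opener
        else:
            return False  # mismatched closer or unknown symbol
    return state == ()


# With 2 bracket types and depth <= 3 there are 1 + 2 + 4 + 8 = 15 states.
print(len(bounded_dyck_states("([", 3)))
# Length is unconstrained as long as the depth stays within the bound.
print(accepts_bounded("([])" * 50, PAIRS, 2))
```

In general the state count is the sum of n^d for d from 0 to D, where n is the number of bracket types and D is the depth bound. This grows with D but is independent of string length.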

On the Practical Ability of Recurrent Neural Networks to Recognize Hierarchical Languages
