Hierarchical Attention Encoder Decoder

Asier Mujika
2023-06-02
Abstract: Recent advances in large language models have shown that autoregressive modeling can generate complex and novel sequences that have many real-world applications. However, these models must generate outputs autoregressively, which becomes time-consuming when dealing with long sequences. Hierarchical autoregressive approaches that compress data have been proposed as a solution, but these methods still generate outputs at the original data frequency, resulting in slow and memory-intensive models. In this paper, we propose a model based on the Hierarchical Recurrent Encoder Decoder (HRED) architecture. This model independently encodes input sub-sequences without global context, processes these sequences using a lower-frequency model, and decodes outputs at the original data frequency. By interpreting the encoder as an implicitly defined embedding matrix and using sampled softmax estimation, we develop a training algorithm that can train the entire model without a high-frequency decoder, which is the most memory and compute-intensive part of hierarchical approaches. In a final, brief phase, we train the decoder to generate data at the original granularity. Our algorithm significantly reduces memory requirements for training autoregressive models and it also improves the total training wall-clock time.
Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the computational cost and memory consumption of autoregressive models on long sequences. Current large language models can generate complex and novel sequences, but they must produce output autoregressively, which becomes very time-consuming for long sequences. Hierarchical autoregressive methods that operate on compressed data have been proposed, but they still generate output at the original data frequency, so the resulting models remain slow and memory-intensive.

To solve these problems, the authors propose a new model based on the Hierarchical Recurrent Encoder-Decoder (HRED) architecture: the Hierarchical Attention Encoder-Decoder (HAED). The main contributions include:

1. **Analyzing different components of HRED**: Identifying which parts contribute most to the model's performance.
2. **Improving HRED**: Proposing a new Hierarchical Attention Encoder-Decoder (HAED) architecture that significantly enhances the performance of the original model. The new model replaces the Recurrent Neural Network (RNN) in the encoder with a Multi-Layer Perceptron (MLP) and uses a more advanced Transformer architecture in the main model.
3. **Introducing a learning algorithm**: Developing an Implicit Embedding Matrix (IEM) algorithm that can learn directly on low-frequency targets, significantly reducing computation time and thus allowing larger models to be trained on more data.

Through these improvements, the authors aim to significantly reduce the memory required to train autoregressive models and to shorten the overall training time.
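
To make the "implicit embedding matrix" idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, since this summary does not give the details of the IEM algorithm. It assumes that fixed-length sub-sequences are encoded independently by a tiny MLP-like encoder, and that the low-frequency model is trained with a sampled softmax whose candidate classes are the other sub-sequences in the batch. The names `chunk`, `make_mlp_encoder`, and `sampled_softmax_loss` are hypothetical, as is the choice of in-batch negatives.

```python
import math
import random


def chunk(tokens, k):
    """Split a token list into non-overlapping sub-sequences of length k."""
    assert len(tokens) % k == 0, "sequence length must be a multiple of k"
    return [tokens[i:i + k] for i in range(0, len(tokens), k)]


def make_mlp_encoder(vocab_size, k, dim, seed=0):
    """Hypothetical chunk encoder: one linear layer plus tanh.

    Each chunk is encoded on its own, without global context, so the
    encoder acts like an embedding table indexed by whole chunks that
    is never materialized (an "implicit" embedding matrix).
    """
    rng = random.Random(seed)
    # weights[position][token] is a dim-vector; summing the selected
    # vectors equals a linear layer applied to concatenated one-hots.
    weights = [[[rng.gauss(0.0, 0.1) for _ in range(dim)]
                for _ in range(vocab_size)]
               for _ in range(k)]

    def encode(chunk_tokens):
        pre = [sum(weights[p][t][d] for p, t in enumerate(chunk_tokens))
               for d in range(dim)]
        return [math.tanh(x) for x in pre]

    return encode


def sampled_softmax_loss(predicted, target_chunks, encode):
    """Mean cross-entropy over a sampled set of candidate chunks.

    predicted[i] is the low-frequency model's output vector for step i,
    and target_chunks[i] is the chunk it should predict. The candidates
    are the batch's own target chunks (an assumption of this sketch);
    duplicate chunks in a batch are not handled specially.
    """
    candidates = [encode(c) for c in target_chunks]
    total = 0.0
    for i, h in enumerate(predicted):
        logits = [sum(a * b for a, b in zip(h, e)) for e in candidates]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(predicted)


# Usage: 12 tokens become 4 chunks of length 3.
tokens = [0, 1, 2, 3, 1, 0, 3, 2, 2, 2, 0, 1]
chunks = chunk(tokens, 3)
encode = make_mlp_encoder(vocab_size=4, k=3, dim=8)
# Stand-in for the low-frequency model's outputs: zero vectors, which
# give uniform logits and therefore a loss of log(number of candidates).
predicted = [[0.0] * 8 for _ in chunks]
loss = sampled_softmax_loss(predicted, chunks, encode)
```

The point of the sketch is that the loss is computed once per chunk rather than once per token, and no high-frequency decoder is involved. In a full system, gradients would flow into both the low-frequency model and the encoder, and a decoder would be trained afterwards to emit tokens at the original granularity, as the abstract describes.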