Hierarchical Attention Encoder Decoder

Asier Mujika

2023-06-02

Abstract: Recent advances in large language models have shown that autoregressive modeling can generate complex and novel sequences that have many real-world applications. However, these models must generate outputs autoregressively, which becomes time-consuming when dealing with long sequences. Hierarchical autoregressive approaches that compress data have been proposed as a solution, but these methods still generate outputs at the original data frequency, resulting in slow and memory-intensive models. In this paper, we propose a model based on the Hierarchical Recurrent Encoder Decoder (HRED) architecture. This model independently encodes input sub-sequences without global context, processes these sequences using a lower-frequency model, and decodes outputs at the original data frequency. By interpreting the encoder as an implicitly defined embedding matrix and using sampled softmax estimation, we develop a training algorithm that can train the entire model without a high-frequency decoder, which is the most memory- and compute-intensive part of hierarchical approaches. In a final, brief phase, we train the decoder to generate data at the original granularity. Our algorithm significantly reduces the memory required to train autoregressive models and also improves total training wall-clock time.
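The encoding step described in the abstract (each sub-sequence is encoded on its own, producing a shorter sequence for the lower-frequency model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names `chunk` and `make_encoder`, the sub-sequence length `k`, the toy vocabulary, and the single tanh layer standing in for the encoder are not taken from the paper.

```python
import math
import random

def chunk(tokens, k):
    """Split a token list into consecutive sub-sequences of length k (a ragged tail is dropped)."""
    return [tokens[i:i + k] for i in range(0, len(tokens) - k + 1, k)]

def make_encoder(vocab_size, k, dim, seed=0):
    """Build a toy, randomly initialised encoder (hypothetical stand-in, untrained)."""
    rng = random.Random(seed)
    # One embedding vector per token id.
    emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
    # One linear layer mapping the k concatenated embeddings to a single vector.
    w = [[rng.gauss(0, 1 / math.sqrt(k * dim)) for _ in range(k * dim)] for _ in range(dim)]

    def encode(sub):
        # The encoder sees only its own sub-sequence: no global context.
        x = [v for t in sub for v in emb[t]]
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

    return encode

tokens = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
k = 4
encode = make_encoder(vocab_size=10, k=k, dim=8)
chunks = chunk(tokens, k)
low_freq = [encode(c) for c in chunks]  # one vector per sub-sequence
print(len(tokens), "tokens ->", len(low_freq), "low-frequency steps")
```

Because each sub-sequence is encoded independently, the same sub-sequence always maps to the same vector regardless of where it occurs, which is what allows the encoder to be read as a lookup into an implicitly defined embedding matrix.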

Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the computational cost and memory consumption of autoregressive models on long sequences. Current large language models can generate complex and novel sequences, but they must produce output autoregressively, which becomes very time-consuming for long sequences. Hierarchical autoregressive methods based on compressed data have been proposed, but they still generate output at the original data frequency, which makes them slow and memory-intensive.

To solve these problems, the authors propose a new model based on the Hierarchical Recurrent Encoder-Decoder (HRED) architecture: the Hierarchical Attention Encoder-Decoder (HAED). The main contributions include:

1. **Analyzing different components of HRED**: Identifying which parts contribute most to the model's performance.
2. **Improving HRED**: Proposing a new Hierarchical Attention Encoder-Decoder (HAED) architecture that significantly enhances the performance of the original model. The new model replaces the Recurrent Neural Network (RNN) in the encoder with a Multi-Layer Perceptron (MLP) and uses a Transformer architecture in the main model.
3. **Introducing a learning algorithm**: Developing an Implicit Embedding Matrix (IEM) algorithm that learns directly on low-frequency targets, significantly reducing computation time and thereby allowing larger models to be trained on more data.

Through these improvements, the authors aim to significantly reduce the memory required to train autoregressive models and to shorten overall training time.
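To make the idea of learning on low-frequency targets more concrete, here is a minimal sketch of a sampled-softmax loss in which the "embedding matrix" is never materialised: its rows are simply encoder outputs for whichever sub-sequences are being scored. This is an illustration under stated assumptions, not the paper's IEM algorithm. The function name `sampled_softmax_loss`, the use of a handful of other sub-sequence codes as negatives, and the omission of any correction for the sampling distribution are all choices made for this sketch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_softmax_loss(pred, target_code, negative_codes):
    """Cross-entropy over the target plus sampled negatives.

    pred           -- vector predicted by the low-frequency model (hypothetical)
    target_code    -- encoder output for the true next sub-sequence
    negative_codes -- encoder outputs for sampled wrong sub-sequences
    """
    # Logit for each candidate is its dot product with the prediction;
    # the target is placed at index 0.
    logits = [dot(pred, target_code)] + [dot(pred, c) for c in negative_codes]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[0]

pred = [1.0, 0.0]
target_code = [2.0, 0.0]
negative_codes = [[0.0, 1.0], [-1.0, 0.0]]

loss = sampled_softmax_loss(pred, target_code, negative_codes)
# Treating a poorly matching code as the target gives a larger loss.
swapped = sampled_softmax_loss(pred, negative_codes[1], [target_code, negative_codes[0]])
print(loss < swapped)
```

The loss only needs vectors at the sub-sequence rate, so no token-level decoder appears anywhere in this computation; in the setup the abstract describes, the decoder is trained separately in a brief final phase.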

A Hierarchical Neural Autoencoder for Paragraphs and Documents

Learning Hierarchical Structures On-The-Fly With A Recurrent-Recursive Model For Sequences

Hierarchical Skip Decoding for Efficient Autoregressive Text Generation

Hierarchical Boundary-Aware Neural Encoder for Video Captioning

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning

Fast Decoding in Sequence Models using Discrete Latent Variables

Focused Hierarchical RNNs for Conditional Sequence Processing

Hierarchical Autoregressive Modeling for Neural Video Compression

Hiformer: Sequence Modeling Networks with Hierarchical Attention Mechanisms

Hierarchical Memory Decoder for Visual Narrating

Hierarchically Gated Recurrent Neural Network for Sequence Modeling

Reinforced Decoder: Towards Training Recurrent Neural Networks for Time Series Forecasting

A New Hierarchical Temporal Memory Based on Recurrent Learning Unit

Hierarchical Adversarially Learned Inference

Deep Hierarchical Video Compression

High-Efficiency Neural Video Compression via Hierarchical Predictive Learning

Hierarchical Text Classification as Sub-hierarchy Sequence Generation

Residual Attention Net for Superior Cross-Domain Time Sequence Modeling

Fast Structured Decoding for Sequence Models

Hierarchical Conflict Propagation: Sequence Learning in a Recurrent Deep Neural Network