Abstract: Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adaptively adjusts the inference computational cost to meet a desired inference latency speedup. In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, thanks to the novel offline-tuning property, the inference latency of the model can be adjusted over a wide range of inference speedup selections without any further training. The proposed method is applied to the BERT_base, GPT-2, and Flan-T5 models for evaluation. Extensive experiments show that most of the word-vectors in higher Transformer layers contribute less to the subsequent layers; hence, they can be eliminated to improve the inference latency. Experimental results on extensive sentiment analysis, classification, and text generation tasks, and on regression benchmarks like GLUE, show that the method is effective across various datasets with minimal impact on the input's global context. The method was also evaluated under the instruction tuning paradigm, and its performance was measured using different types of prompting. The proposed method mathematically and experimentally improves the inference latency of BERT_base and GPT-2 by up to 4.8 and 3.72 times, respectively, with less than 0.75% accuracy drop and passable perplexity on average. The suggested approach posits that in Large Language Models (LLMs), although the complete network is necessary for training, it can be truncated during the fine-tuning phase.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to adjust the latency, power consumption, and accuracy of natural language understanding models during inference. Specifically, the authors propose an efficient Transformer architecture that adaptively adjusts the inference computation cost to meet a required inference speedup ratio while maintaining model performance.

### Background and Motivation of the Problem

In recent years, Transformer-based architectures have achieved remarkable success in various natural language processing (NLP) tasks. However, these models face several major problems in practical applications:

1. **Training Difficulty**: Large-scale pre-trained models require a large amount of computational resources.
2. **Inference Latency**: The high computational complexity of the inference stage leads to excessive latency.
3. **Energy Consumption**: Especially in resource-constrained systems such as edge devices, the energy consumption of these models is too high.

These problems make it difficult to deploy Transformer models in practice, especially in scenarios with strict requirements for real-time performance and energy efficiency.

### Core Contributions of the Paper

To solve the above problems, this paper proposes a novel Transformer encoder architecture, which makes inference latency adjustable in the following ways:

- **Attention Context Contribution (ACC) Metric**: It is used to evaluate the importance of word-vectors in each attention layer and eliminate unimportant word-vectors accordingly.
- **Sorting and Elimination Layers**: Sorting and elimination layers are introduced in each encoder layer to reduce the number of effective word-vectors, thereby reducing the number of floating-point operations (FLOPs).
- **Offline Adjustment Feature**: After fine-tuning, the inference latency can be controlled by adjusting hyper-parameters without further training.
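
The scoring, sorting, and elimination steps described above can be illustrated with a minimal NumPy sketch. This summary does not give the exact ACC formula, so the score used here (the mean attention probability each position receives, averaged over heads and query positions) is an assumption, as are the function names `acc_scores` and `eliminate_word_vectors`, the `keep_ratio` parameter, and the choice to always retain the first position.

```python
import numpy as np


def acc_scores(attn_probs):
    """Score each position by the attention it receives.

    attn_probs: array of shape (heads, seq_len, seq_len) holding softmax
    attention probabilities (rows are queries, columns are keys).
    ASSUMPTION: the score is the mean over heads and queries; the paper's
    actual ACC definition may differ.
    """
    return attn_probs.mean(axis=(0, 1))


def eliminate_word_vectors(hidden, attn_probs, keep_ratio):
    """Keep the highest-scoring word-vectors of one layer.

    hidden: array of shape (seq_len, dim).
    keep_ratio: fraction of positions to retain (hypothetical
    hyper-parameter; it can be changed at inference time without
    retraining in this sketch).
    Returns the retained rows (in their original order) and their indices.
    """
    seq_len = hidden.shape[0]
    k = max(1, int(np.ceil(keep_ratio * seq_len)))
    scores = acc_scores(attn_probs).copy()
    # ASSUMPTION: position 0 (e.g. a [CLS]-style token) is never dropped.
    scores[0] = np.inf
    # Sort by descending score, take the top k, then restore sequence order.
    top_k = np.argsort(-scores, kind="stable")[:k]
    keep = np.sort(top_k)
    return hidden[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heads, seq_len, dim = 2, 8, 4
    logits = rng.normal(size=(heads, seq_len, seq_len))
    # Row-wise softmax to obtain attention probabilities.
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    hidden = rng.normal(size=(seq_len, dim))
    for ratio in (1.0, 0.5, 0.25):
        kept, idx = eliminate_word_vectors(hidden, probs, ratio)
        print(ratio, kept.shape, idx.tolist())
```

In this sketch, lowering `keep_ratio` shortens the sequence passed to the next layer, which is what reduces FLOPs; varying it after fine-tuning mirrors the offline adjustment idea, though the paper's actual hyper-parameters are not specified here.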
### Experimental Results

The experimental results show that this method performs well on multiple benchmarks (GLUE, sentiment analysis, classification, and text generation tasks) and can significantly improve inference speed with minimal impact on the input's global context. For example, experiments on the BERT_base and GPT-2 models show that the method speeds up inference by up to 4.8 times and 3.72 times, respectively, with an average accuracy drop of less than 0.75%.

### Summary

In general, this paper aims to solve the latency and power consumption problems in the inference process of natural language understanding models by improving the Transformer architecture, providing a flexible and efficient method to adjust the inference performance of the model.

Latency Adjustable Transformer Encoder for Language Understanding

Efficiently Scaling Transformer Inference

Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

A Multi-Level Framework for Accelerating Training Transformer Models

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs

Decoupled Transformer for Scalable Inference in Open-domain Question Answering

Enhancing Parameter Efficiency in Model Inference Using an Ultralight Inter-Transformer Linear Structure

DTATrans: Leveraging Dynamic Token-Based Quantization with Accuracy Compensation Mechanism for Efficient Transformer Architecture.

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance

Accelerating Attention through Gradient-Based Learned Runtime Pruning

Efficient Fine-Tuning of Compressed Language Models with Learners

Selective Attention Improves Transformer

Breaking Free Transformer Models: Task-specific Context Attribution Promises Improved Generalizability Without Fine-tuning Pre-trained LLMs

HARP: Hesitation-Aware Reframing in Transformer Inference Pass