Abstract: With the tremendous success of large transformer models in natural language understanding, down-sizing them for cost-effective deployments has become critical. Recent studies have explored low-rank weight factorization techniques, which are efficient to train and apply out of the box to any transformer architecture. Unfortunately, the low-rank assumption tends to be over-restrictive and hinders the expressiveness of the compressed model. This paper proposes DSFormer, a simple alternative factorization scheme that expresses a target weight matrix as the product of a small dense matrix and a semi-structured sparse matrix. The resulting approximation is more faithful to the weight distribution in transformers and therefore achieves a stronger efficiency-accuracy trade-off. Another concern with existing factorizers is their dependence on a task-unaware initialization step, which degrades the accuracy of the resulting model. DSFormer addresses this issue through a novel Straight-Through Factorizer (STF) algorithm that jointly learns all the weight factorizations to directly maximize the final task accuracy. Extensive experiments on multiple natural language understanding benchmarks demonstrate that DSFormer obtains up to 40% better compression than the state-of-the-art low-rank factorizers, leading semi-structured sparsity baselines, and popular knowledge distillation approaches. Our approach is also orthogonal to mainstream compressors and offers up to 50% additional compression when added to popular distilled, layer-shared, and quantized transformers. We empirically evaluate the benefits of STF over conventional optimization practices.

What problem does this paper attempt to address?

The paper addresses how to compress large Transformer models (such as BERT) without significantly reducing accuracy, so as to lower their deployment cost and resource consumption. Specifically, it proposes DSFormer, a new factorization scheme that aims to overcome the limitations of existing low-rank factorization methods and to achieve a higher compression rate and a better efficiency-accuracy trade-off.

### Background and Problem

In recent years, large Transformer models (such as BERT) have achieved great success in natural language understanding (NLU) tasks. However, their size leads to high computational and storage costs, which makes them difficult to deploy in resource-constrained environments such as mobile or edge devices. Reducing the size of these models without hurting their performance has therefore become an important research topic.

### Limitations of Existing Methods

Existing compression methods include:

1. **Low-rank factorization**: The number of parameters is reduced by decomposing a weight matrix into two smaller matrices. However, the low-rank assumption is too strict: it limits the expressiveness of the compressed model and leads to performance degradation.
2. **Knowledge distillation**: The knowledge of a large teacher model is transferred to a small student model. Although this method can achieve a high compression ratio, it requires a large amount of training time and a complex architecture search.
3. **Quantization, pruning, parameter sharing**: These methods can reduce model size or inference time, but they usually depend on specific hardware and yield limited gains.

### Solution Proposed in the Paper

To overcome these limitations, the paper proposes DSFormer, a method based on dense-sparse factorization. Its main features are:

1. **More flexible factorization**: DSFormer represents a weight matrix as the product of a small dense matrix and a semi-structured sparse matrix. This decomposition matches the weight distribution in Transformer models more closely than a low-rank one and thus achieves a stronger efficiency-accuracy trade-off.
2. **Task-aware optimization**: Existing factorization methods usually rely on a task-unaware initialization step, which degrades the accuracy of the final model. DSFormer introduces a Straight-Through Factorizer (STF) algorithm that jointly learns all weight factorizations to directly maximize final task accuracy.
3. **Wide applicability**: DSFormer can be used on its own or combined with other mainstream compression methods (such as distillation, layer sharing, and quantization) to further improve the compression rate without significantly affecting prediction quality.

### Experimental Results

On multiple natural language understanding benchmarks, DSFormer obtains up to 40% better compression than state-of-the-art low-rank factorizers, leading semi-structured sparsity baselines, and popular knowledge distillation approaches. It also offers up to 50% additional compression when added to popular distilled, layer-shared, and quantized transformers.

In conclusion, DSFormer targets the deployment cost and resource consumption of large Transformer models in practical applications and offers a new option in the field of model compression.
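The dense-sparse factorization described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the 2:4 sparsity pattern (keep 2 of every 4 consecutive entries), the helper name `nm_mask`, the matrix sizes, and the use of random factors are all assumptions chosen for illustration, since the summary only states that one factor is small and dense and the other is semi-structured sparse. The sketch counts stored values only and ignores the index metadata a sparse format would also need.

```python
import numpy as np

def nm_mask(S, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude entries in every
    group of m consecutive entries along each row (hypothetical helper)."""
    rows, cols = S.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = np.abs(S).reshape(rows, cols // m, m)
    # Indices of the n largest magnitudes inside each group of m.
    top = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
d_out, d_in, k = 8, 16, 4            # illustrative sizes, not from the paper

D = rng.standard_normal((d_out, k))  # small dense factor
S = rng.standard_normal((k, d_in))   # factor to be made semi-structured sparse
mask = nm_mask(S, n=2, m=4)
S_sparse = S * mask

W_hat = D @ S_sparse                 # approximation of a (d_out x d_in) weight

dense_values = d_out * d_in                  # values in the original matrix
factored_values = D.size + int(mask.sum())   # values kept after factorization

# Rank a plain low-rank factorization U (d_out x r) @ V (r x d_in)
# could afford under the same stored-value budget.
low_rank_r = factored_values // (d_out + d_in)

print(dense_values, factored_values, low_rank_r)
```

Under these assumed sizes, the dense-sparse product keeps an inner dimension of 4 while storing the same number of values that a low-rank factorization would spend on rank 2, which is one way to read the claim that the scheme is less restrictive than a low-rank assumption.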

DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Compressing Transformers: Features Are Low-Rank, but Weights Are Not!

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

An Empirical Investigation of Matrix Factorization Methods for Pre-trained Transformers

AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

On Compressing Deep Models by Low Rank and Sparse Decomposition

DynaSlim: Dynamic Slimming for Vision Transformers

Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization

Smartformer: An Intelligent Model Compression Framework for Transformer

Prune Once for All: Sparse Pre-Trained Language Models

Smartformer: an Intelligent Transformer Compression Framework for Time-Series Modeling

Sparse Binary Transformers for Multivariate Time Series Modeling

Compressing Deep Neural Networks With Sparse Matrix Factorization

DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Vision Transformers

Semi-tensor Product-based Tensor Decomposition for Neural Network Compression

Do Efficient Transformers Really Save Computation?

Deep Learning Model Compression with Rank Reduction in Tensor Decomposition

Compression of Recurrent Neural Networks using Matrix Factorization