DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization

Rahul Chand, Yashoteja Prabhu, Pratyush Kumar
2023-12-21
Abstract: With the tremendous success of large transformer models in natural language understanding, down-sizing them for cost-effective deployments has become critical. Recent studies have explored low-rank weight factorization techniques, which are efficient to train and apply out-of-the-box to any transformer architecture. Unfortunately, the low-rank assumption tends to be over-restrictive and hinders the expressiveness of the compressed model. This paper proposes DSFormer, a simple alternative factorization scheme which expresses a target weight matrix as the product of a small dense and a semi-structured sparse matrix. The resulting approximation is more faithful to the weight distribution in transformers and therefore achieves a stronger efficiency-accuracy trade-off. Another concern with existing factorizers is their dependence on a task-unaware initialization step which degrades the accuracy of the resulting model. DSFormer addresses this issue through a novel Straight-Through Factorizer (STF) algorithm that jointly learns all the weight factorizations to directly maximize the final task accuracy. Extensive experiments on multiple natural language understanding benchmarks demonstrate that DSFormer obtains up to 40% better compression than the state-of-the-art low-rank factorizers, leading semi-structured sparsity baselines and popular knowledge distillation approaches. Our approach is also orthogonal to mainstream compressors and offers up to 50% additional compression when added to popular distilled, layer-shared and quantized transformers. We empirically evaluate the benefits of STF over conventional optimization practices.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to compress large Transformer models (such as BERT) without significantly reducing model performance, so as to lower their deployment cost and resource consumption. Specifically, it proposes a new factorization scheme, DSFormer, which aims to overcome the limitations of existing low-rank factorization methods and achieve a higher compression rate and a better efficiency-accuracy trade-off.

### Background and Problem

In recent years, large Transformer models (such as BERT) have achieved great success in natural language understanding (NLU) tasks. However, their size leads to high computational and storage costs, which makes them very difficult to deploy in resource-constrained environments such as mobile or edge devices. Reducing the size of these models without significantly hurting their performance has therefore become an important research topic.

### Limitations of Existing Methods

Existing compression methods include:

1. **Low-rank factorization**: The number of parameters is reduced by decomposing the weight matrix into two smaller matrices. However, the low-rank assumption is too strict; it limits the expressiveness of the compressed model and leads to performance degradation.
2. **Knowledge distillation**: The knowledge of a large teacher model is transferred to a small student model. Although this method can achieve a high compression ratio, it requires a large amount of training time and a complex architecture search.
3. **Quantization, pruning, parameter sharing**: These methods can reduce the model size or inference time, but they usually depend on specific hardware and have limited effect.

### Solution Proposed in the Paper

To overcome these limitations, the paper proposes DSFormer, a new method based on dense-sparse factorization. Its main features are:

1. **More flexible factorization**: DSFormer represents the weight matrix as the product of a small dense matrix and a semi-structured sparse matrix. This decomposition is more faithful to the weight distribution in Transformer models and thus achieves a stronger efficiency-accuracy trade-off.
2. **Task-aware optimization**: Existing factorization methods usually rely on a task-unaware initialization step, which degrades the accuracy of the final model. DSFormer introduces a new Straight-Through Factorizer (STF) algorithm that jointly learns all weight factorizations to directly maximize the final task accuracy.
3. **Wide applicability**: DSFormer can be used on its own or combined with other mainstream compression methods (such as distillation, layer sharing, and quantization) to further improve the compression rate without significantly affecting prediction quality.

### Experimental Results

The experiments cover multiple natural language understanding benchmarks. DSFormer obtains up to 40% better compression than state-of-the-art low-rank factorizers, leading semi-structured sparsity baselines, and popular knowledge distillation approaches. It also offers up to 50% additional compression when added to popular distilled, layer-shared, and quantized transformers.

In conclusion, DSFormer addresses the deployment cost and resource consumption of large Transformer models in practical applications and offers a new, effective approach to model compression.
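
To make the dense-sparse idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that "semi-structured sparse" means a fixed number `k` of non-zeros in every column of the sparse factor; that pattern, the magnitude-based selection, and all function names (`keep_top_k_per_column`, `dense_sparse_values`, `low_rank_values`) are illustrative assumptions, and the paper's exact sparsity pattern and training procedure (STF) may differ. The count of stored values also ignores the cost of storing the indices of the non-zeros.

```python
import random


def matmul(A, B):
    """Plain-Python product of A (m x n) and B (n x p)."""
    n, p = len(B), len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(n)) for j in range(p)] for row in A]


def keep_top_k_per_column(S, k):
    """Zero all but the k largest-magnitude entries in each column of S.

    Assumed sparsity pattern for illustration: exactly k non-zeros per column.
    """
    rows, cols = len(S), len(S[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        top = sorted(range(rows), key=lambda i: abs(S[i][j]), reverse=True)[:k]
        for i in top:
            out[i][j] = S[i][j]
    return out


def dense_sparse_values(d_out, d_in, r, k):
    """Stored values for W ~ D @ S: dense D is d_out x r, S has k non-zeros per column."""
    return d_out * r + k * d_in


def low_rank_values(d_out, d_in, r):
    """Stored values for a rank-r factorization W ~ U @ V."""
    return r * (d_out + d_in)


# Tiny example: a 6 x 8 weight approximated as (6 x 4 dense) @ (4 x 8 sparse).
random.seed(0)
d_out, d_in, r, k = 6, 8, 4, 2
D = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
S_full = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
S = keep_top_k_per_column(S_full, k)
W_approx = matmul(D, S)

# Each column of W_approx is a combination of only k columns of D.
nonzeros_per_column = [sum(1 for i in range(r) if S[i][j] != 0.0) for j in range(d_in)]

# Same value budget on a 768 x 768 matrix, different inner dimension.
print(dense_sparse_values(768, 768, r=192, k=64))  # dense-sparse, inner dim 192
print(low_rank_values(768, 768, r=128))            # low-rank, inner dim 128
```

Under these assumptions, a 768 x 768 matrix factored with inner dimension 192 and 64 non-zeros per sparse column stores the same number of values as a rank-128 factorization. This illustrates how a dense-sparse product can be less restrictive than a low-rank one at a comparable budget, since each output column may pick its own subset of the dense columns.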