OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization

Xiang Meng, Shibal Ibrahim, Kayhan Behdin, Hussein Hazimeh, Natalia Ponomareva, Rahul Mazumder
2024-03-03
Abstract: Structured pruning is a promising approach for reducing the inference costs of large vision and language models. By removing carefully chosen structures, e.g., neurons or attention heads, the improvements from this approach can be realized on standard deep learning hardware. In this work, we focus on structured pruning in the one-shot (post-training) setting, which does not require model retraining after pruning. We propose a novel combinatorial optimization framework for this problem, based on a layer-wise reconstruction objective and a careful reformulation that allows for scalable optimization. Moreover, we design a new local combinatorial optimization algorithm, which exploits low-rank updates for efficient local search. Our framework is time and memory-efficient and considerably improves upon state-of-the-art one-shot methods on vision models (e.g., ResNet50, MobileNet) and language models (e.g., OPT-1.3B -- OPT-30B). For language models, e.g., OPT-2.7B, OSSCAR can lead to $125\times$ lower test perplexity on WikiText with $2\times$ inference time speedup in comparison to the state-of-the-art ZipLM approach. Our framework is also $6\times$ -- $8\times$ faster. Notably, our work considers models with tens of billions of parameters, which is up to $100\times$ larger than what has been previously considered in the structured pruning literature.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper targets the inference cost of large vision and language models, reducing model size and inference time through structured pruning. It focuses on one-shot (post-training) structured pruning, which does not require retraining after pruning. The key challenge is to remove redundant structures (such as neurons or attention heads) efficiently and without significantly degrading the model's performance, so that the gains can be realized on standard deep learning hardware.

### Main Contributions

1. **A new combinatorial optimization framework**: The framework, named OSSCAR, is based on a layer-wise reconstruction objective and a careful reformulation that allows for scalable optimization.
2. **A new local combinatorial optimization algorithm**: The algorithm exploits low-rank updates for efficient local search, reducing computation time and memory usage while maintaining performance.
3. **Application to large-scale models**: OSSCAR handles models with up to 30 billion parameters, which is up to 100 times larger than what was previously considered in the structured pruning literature. Experimental results show that OSSCAR considerably improves on state-of-the-art one-shot methods for vision models (such as ResNet50, MobileNet) and language models (such as OPT-1.3B to OPT-30B).

### Solutions

- **Layer-wise reconstruction objective**: Minimizing the squared error between each layer's output before and after pruning keeps the pruned model's behavior close to that of the original model.
- **Reformulation of the combinatorial optimization problem**: The pruning problem is transformed into a mixed-integer quadratic programming (MIQP) problem and solved efficiently with a local search algorithm.
- **Low-rank updates**: Low-rank matrix updates make each local search step cheap, so that good solutions can be found quickly even in large-scale models.

### Experimental Results

- **Vision models**: For example, on ResNet50, OSSCAR achieved approximately 10% accuracy improvement and 2 times inference acceleration compared to previous state-of-the-art methods.
- **Language models**: For example, on OPT-2.7B, OSSCAR achieved 125 times lower test perplexity on WikiText with a 2 times inference speedup compared to the state-of-the-art ZipLM approach. The framework itself also runs 6 to 8 times faster.

### Conclusion

OSSCAR addresses the one-shot structured pruning problem for large-scale vision and language models through a combinatorial optimization framework and an efficient local search algorithm, improving both the accuracy of the pruned models and their inference efficiency.
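
The layer-wise reconstruction objective and the local search described under Solutions can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a single linear layer whose input neurons (rows of `W`) are pruned, it assumes the Gram matrix of the calibration inputs is positive definite, and it re-solves a least-squares problem for every candidate swap instead of using the low-rank updates the paper relies on. The function names `reconstruction_loss` and `swap_local_search`, the row-norm initialization, and the random calibration data are all hypothetical choices made for illustration.

```python
import numpy as np


def reconstruction_loss(H, W, keep):
    """Best achievable ||X W - X[:, keep] W_new||_F^2 for a fixed kept set.

    H = X.T @ X is the Gram matrix of the calibration inputs (assumed
    positive definite), so X itself is never needed again.
    """
    keep = np.asarray(keep)
    G = H @ W                                   # shape (d_in, d_out)
    H_kk = H[np.ix_(keep, keep)]
    # Least-squares refit of the kept rows: H_kk @ W_new = G[keep]
    W_new = np.linalg.solve(H_kk, G[keep])
    # At the refit optimum the loss is tr(W^T H W) - tr(W_new^T G[keep])
    loss = np.sum(W * G) - np.sum(W_new * G[keep])
    return loss, W_new


def swap_local_search(H, W, k, max_passes=10):
    """Naive swap-based local search over which k input neurons to keep."""
    d_in = H.shape[0]
    # Heuristic start (an assumption): keep the k rows of W with largest norm.
    order = np.argsort(-np.linalg.norm(W, axis=1))
    keep = [int(i) for i in order[:k]]
    best, _ = reconstruction_loss(H, W, keep)
    for _ in range(max_passes):
        improved = False
        for pos in range(k):
            for j in range(d_in):
                if j in keep:
                    continue
                trial = keep.copy()
                trial[pos] = j                  # swap one kept neuron for j
                loss, _ = reconstruction_loss(H, W, trial)
                if loss < best - 1e-9:          # accept only strict improvement
                    keep, best, improved = trial, loss, True
        if not improved:
            break                               # no improving swap: local optimum
    keep = sorted(keep)
    best, W_new = reconstruction_loss(H, W, keep)
    return keep, W_new, best


# Usage on random calibration data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))               # 256 samples, 8 input neurons
W = rng.standard_normal((8, 4))                 # dense layer weights
keep, W_new, loss = swap_local_search(X.T @ X, W, k=5)
print("kept neurons:", keep, "reconstruction loss:", loss)
```

Every candidate swap here costs a fresh linear solve, which is the cost that low-rank updates are meant to avoid at scale; the sketch only shows the structure of the search, not how to make it efficient.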