Abstract: Structured pruning is a promising approach for reducing the inference costs of large vision and language models. By removing carefully chosen structures, e.g., neurons or attention heads, the improvements from this approach can be realized on standard deep learning hardware. In this work, we focus on structured pruning in the one-shot (post-training) setting, which does not require model retraining after pruning. We propose a novel combinatorial optimization framework for this problem, based on a layer-wise reconstruction objective and a careful reformulation that allows for scalable optimization. Moreover, we design a new local combinatorial optimization algorithm, which exploits low-rank updates for efficient local search. Our framework is time and memory-efficient and considerably improves upon state-of-the-art one-shot methods on vision models (e.g., ResNet50, MobileNet) and language models (e.g., OPT-1.3B -- OPT-30B). For language models, e.g., OPT-2.7B, OSSCAR can lead to $125\times$ lower test perplexity on WikiText with $2\times$ inference time speedup in comparison to the state-of-the-art ZipLM approach. Our framework is also $6\times$ -- $8\times$ faster. Notably, our work considers models with tens of billions of parameters, which is up to $100\times$ larger than what has been previously considered in the structured pruning literature.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the inference cost of large vision and language models by reducing their size and inference time through structured pruning. It focuses on the one-shot (post-training) setting, which does not require retraining after pruning. The key challenge is to remove redundant structures (such as neurons or attention heads) efficiently and without significantly degrading accuracy, so that the savings are realized on standard deep learning hardware.

### Main Contributions

1. **Proposed a new combinatorial optimization framework**: The framework is based on a layer-wise reconstruction objective and a careful reformulation that makes the optimization scalable. The framework is named OSSCAR (One-Shot Structured Compression Algorithm).
2. **Designed a new local combinatorial optimization algorithm**: The algorithm exploits low-rank matrix updates for efficient local search, which keeps computation time and memory usage low while maintaining solution quality.
3. **Application to large-scale models**: OSSCAR handles models with up to 30 billion parameters, up to 100× larger than the models previously considered in the structured pruning literature. Experiments cover vision models (such as ResNet50 and MobileNet) and language models (OPT-1.3B to OPT-30B).

### Solutions

- **Layer-wise reconstruction objective**: For each layer, the method minimizes the squared error between the layer's outputs before and after pruning, which keeps the pruned model's behavior close to that of the original model.
- **Reformulation of the combinatorial optimization problem**: The pruning problem is cast as a mixed-integer quadratic programming (MIQP) problem and solved approximately with a local search algorithm.
- **Low-rank matrix update techniques**: Low-rank updates make each local search step cheap, so high-quality solutions can be found quickly even for large models.

### Experimental Results

- **Vision models**: On ResNet50, for example, OSSCAR achieves roughly a 10% accuracy improvement over previous state-of-the-art methods at a 2× inference speedup.
- **Language models**: On OPT-2.7B, for example, OSSCAR can reach 125× lower test perplexity on WikiText than the state-of-the-art ZipLM approach at a 2× inference speedup. The pruning procedure itself is also 6× to 8× faster.

### Conclusion

OSSCAR tackles one-shot structured pruning for large vision and language models with a combinatorial optimization framework and an efficient local search algorithm. Compared with prior one-shot methods, it retains more accuracy at a given inference speedup and prunes faster.
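The layer-wise reconstruction objective and the low-rank update idea summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single dense layer, the random calibration data, the fixed `keep` set, and the helper names `reconstruction_error`, `refit_kept_rows`, and `drop_index_from_inverse` are all assumptions made for illustration. It shows (a) how refitting the kept weights by least squares lowers the output error compared with simply zeroing the pruned rows, and (b) a standard rank-one identity for updating an inverse when one index leaves the support, which is the kind of update that makes a local search step cheap.

```python
import numpy as np


def reconstruction_error(X, W, W_pruned):
    """Squared Frobenius error between original and pruned layer outputs."""
    return float(np.linalg.norm(X @ W - X @ W_pruned) ** 2)


def refit_kept_rows(X, W, keep):
    """Zero the pruned input rows of W and refit the kept rows by least squares."""
    H = X.T @ X                      # Gram matrix of calibration inputs (d_in x d_in)
    G = H @ W                        # right-hand side of the normal equations
    W_new = np.zeros_like(W)
    # Solve H[keep, keep] @ V = G[keep] for the kept rows V.
    W_new[keep] = np.linalg.solve(H[np.ix_(keep, keep)], G[keep])
    return W_new


def drop_index_from_inverse(B, i):
    """Given B = inv(A), return inv(A without row/column i) via a rank-one update."""
    mask = np.arange(B.shape[0]) != i
    return B[np.ix_(mask, mask)] - np.outer(B[mask, i], B[i, mask]) / B[i, i]


rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))    # hypothetical calibration activations
W = rng.standard_normal((8, 4))      # hypothetical dense layer weights
keep = np.array([0, 2, 3, 5, 7])     # arbitrary support, chosen for illustration

# Baseline: keep the original values on the support, zero everything else.
W_zeroed = np.zeros_like(W)
W_zeroed[keep] = W[keep]

# Refit: re-solve for the kept rows so the layer output is matched as well as possible.
W_refit = refit_kept_rows(X, W, keep)

err_zeroed = reconstruction_error(X, W, W_zeroed)
err_refit = reconstruction_error(X, W, W_refit)
print(f"error after zeroing only: {err_zeroed:.3f}")
print(f"error after refitting:    {err_refit:.3f}")
```

In a full method the support itself would be searched over rather than fixed. The rank-one update avoids recomputing an inverse from scratch each time a candidate structure is removed from the support.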

OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

Preserving Deep Representations In One-Shot Pruning: A Hessian-Free Second-Order Optimization Framework

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Isomorphic Pruning for Vision Models

Only Train Once: A One-Shot Neural Network Training And Pruning Framework

Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer

Structured Pruning Learns Compact and Accurate Models

Structured Optimal Brain Pruning for Large Language Models

One-Cycle Pruning: Pruning ConvNets Under a Tight Training Budget

Adaptive Activation-based Structured Pruning

CRISP: Hybrid Structured Sparsity for Class-aware Model Pruning

Structurally Prune Anything: Any Architecture, Any Framework, Any Time

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Multi-Dimensional Pruning: Joint Channel, Layer and Block Pruning with Latency Constraint

Scalable iterative pruning of large language and vision models using block coordinate descent

A Comprehensive Study of Structural Pruning for Vision Models

Structured Pruning of Large Language Models

HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning