Abstract: Although deep neural networks are well-known for their remarkable performance in tackling complex tasks, their hunger for computational resources remains a significant hurdle, posing energy-consumption issues and restricting their deployment on resource-constrained devices, which stalls their widespread adoption. In this paper, we present an optimal transport method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance to minimize the distance between the intermediate feature distributions in the neural network. We show that minimizing this distance enables the complete removal of intermediate layers in the network, with almost no performance loss and without requiring any finetuning. We assess the effectiveness of our method on traditional image classification setups. We commit to releasing the source code upon acceptance of the article.

What problem does this paper attempt to address?

This paper proposes a method called LaCoOT (Layer Collapse through Optimal Transport) that addresses the high computational demand and energy consumption of deep neural networks (DNNs). Although DNNs perform well on complex tasks, their computational requirements limit their use on resource-constrained devices. LaCoOT uses optimal transport theory to reduce the depth of over-parameterized DNNs, thereby alleviating their computational burden. Specifically, the paper introduces a regularization strategy based on the Max-Sliced Wasserstein distance that minimizes the distance between intermediate feature distributions in the network. Minimizing this distance allows intermediate layers to be removed completely, with almost no performance loss and without fine-tuning. The method is evaluated on traditional image classification tasks, and the authors commit to releasing the code once the paper is accepted.

Unlike traditional parameter pruning methods, LaCoOT focuses on reducing network depth, whereas most existing methods are less efficient or cannot directly remove redundant layers while maintaining performance. LaCoOT also operates within a single model rather than training multiple networks: it uses optimal transport tools to quantify and control redundancy in what the network learns, so that the network can identify and remove its least contributing blocks. Because layer collapse is induced during training, several layers can be collapsed at once, which improves efficiency.

The paper also investigates the use of the Wasserstein distance and its sliced variant in deep compression strategies. It points out that this distance effectively quantifies differences between distributions and avoids both fixing the number of layers to prune in advance and relying on ranking-based criteria.
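For reference, the Max-Sliced Wasserstein distance named above is commonly defined as follows. The first two formulas are the standard definition and the standard closed form for one-dimensional empirical distributions. The last formula is only an assumed shape for the regularized objective: the weight \lambda, the sum over blocks \ell, and the pairing of each block's input and output feature distributions (\mu_\ell^{in}, \mu_\ell^{out}) are illustrative assumptions, not details confirmed by the abstract or the summary.

```latex
% Max-Sliced Wasserstein distance between probability measures \mu, \nu on \mathbb{R}^d
\[
\operatorname{Max\text{-}SW}_p(\mu, \nu)
  = \max_{\theta \in \mathbb{S}^{d-1}} W_p\bigl(\theta_{\#}\mu,\; \theta_{\#}\nu\bigr),
\]
% where \theta_{\#}\mu is the distribution of \langle \theta, x \rangle for x \sim \mu.

% 1-D case: two empirical distributions with n samples each, sorted as
% u_{(1)} \le \dots \le u_{(n)} and v_{(1)} \le \dots \le v_{(n)}
\[
W_p^p = \frac{1}{n} \sum_{i=1}^{n} \bigl| u_{(i)} - v_{(i)} \bigr|^p .
\]

% Assumed (hypothetical) form of a regularized training objective
\[
\mathcal{L} = \mathcal{L}_{\text{task}}
  + \lambda \sum_{\ell} \operatorname{Max\text{-}SW}_p\bigl(\mu_{\ell}^{\text{in}},\, \mu_{\ell}^{\text{out}}\bigr).
\]
```

Under this reading, a block whose input and output distributions are driven close together contributes little and becomes a candidate for removal.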
The experimental section demonstrates the effectiveness of LaCoOT on various architectures and datasets, indicating that the method can significantly reduce the computational demand of models while maintaining comparable performance.
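To make the distance used by the regularizer more concrete, here is a minimal NumPy sketch of how a Max-Sliced Wasserstein distance between two batches of features could be estimated. It is not the paper's implementation. The function name `max_sliced_wasserstein` and its parameters are hypothetical, the two batches are assumed to have the same number of samples, and the maximum over directions is approximated by the best of a set of random unit directions, which yields a lower bound on the true maximum rather than the exact value.

```python
import numpy as np


def max_sliced_wasserstein(x, y, n_directions=256, seed=0):
    """Estimate the Max-Sliced Wasserstein-1 distance between two sample sets.

    x, y: arrays of shape (n_samples, n_features) with equal shapes.
    The max over unit directions is approximated by random search
    (an assumption of this sketch, giving a lower bound on the true max).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    assert x.shape == y.shape, "this sketch assumes equally sized batches"

    rng = np.random.default_rng(seed)
    n_features = x.shape[1]

    # Draw random directions and normalize them onto the unit sphere.
    thetas = rng.normal(size=(n_directions, n_features))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)

    # Project both sample sets onto every direction: shape (n_samples, n_directions).
    proj_x = np.sort(x @ thetas.T, axis=0)
    proj_y = np.sort(y @ thetas.T, axis=0)

    # 1-D Wasserstein-1 per direction: mean absolute gap between sorted samples.
    per_direction = np.mean(np.abs(proj_x - proj_y), axis=0)

    # Keep the direction that separates the two distributions the most.
    return float(per_direction.max())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    features_in = rng.normal(size=(64, 8))   # stand-in for a block's input features
    features_out = features_in + 0.5         # stand-in for the block's output features
    print(max_sliced_wasserstein(features_in, features_out))
```

In a training setup, a quantity like this would be computed on activations and added to the task loss, which requires a differentiable implementation in an autodiff framework; the NumPy version above only illustrates the computation.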

LaCoOT: Layer Collapse through Optimal Transport
