Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs

Or Sharir, Anima Anandkumar

2023-07-28

Abstract: Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an incremental computing approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use vector quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification while requiring 12.1X (median) fewer operations for processing sequences of atomic edits.
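The filtering effect of vector quantization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-dimensional codebook, the hidden vectors, and the `quantize` helper are all hypothetical values chosen for illustration. The point is only that a small perturbation of a hidden vector maps to the same codebook index, so anything computed from that index could be reused, while a large change maps to a different index.

```python
import math

def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))

# Hypothetical 4-entry codebook in 2 dimensions (illustrative, not learned).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

hidden_before = (0.9, 0.1)         # hidden vector before an edit
hidden_small_change = (0.85, 0.15) # slightly perturbed by a distant edit
hidden_large_change = (0.1, 0.9)   # substantially changed by the edit

code_before = quantize(hidden_before, CODEBOOK)
code_small = quantize(hidden_small_change, CODEBOOK)
code_large = quantize(hidden_large_change, CODEBOOK)

# The small perturbation is absorbed: the discrete code is unchanged.
print(code_before, code_small, code_large)  # prints: 1 1 2
```

In this toy setting the discrete index acts as a stable key: downstream results keyed on it stay valid as long as the index does not change.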

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper addresses the cost of running deep learning models on dynamic inputs (such as sensor data or user input), with a focus on natural language processing. Existing large language models, such as those based on the Transformer architecture, are rerun from scratch after every document modification. Even when an edit touches only a small amount of text (e.g., a single word), the model recomputes the entire document, which wastes computation. To address this, the authors add Vector Quantization (VQ) layers to a Transformer so that its results can be updated incrementally: by discretizing intermediate values, the VQ layers filter out insignificant changes to hidden values, which allows earlier computation to be reused. Experiments show that VQ-OPT, obtained by adapting the pre-trained OPT-125M language model, achieves accuracy comparable to the original model on document classification while requiring 12.1 times fewer arithmetic operations (median) when processing sequences of atomic edits. The paper also explores how to handle text insertion and deletion and demonstrates potential applications of the method in different scenarios.
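The reuse idea summarized above can be sketched as a toy two-stage pipeline in Python. Everything here is an assumption made for illustration: `dense_mix`, `quantize`, `expensive_stage`, and `run_incremental` are hypothetical names, scalar rounding stands in for vector quantization, and the sketch only counts reuse at the second stage (the first stage is still recomputed in full). It does not reproduce the paper's Transformer algorithm.

```python
def dense_mix(xs, i):
    """Toy dense layer: each output depends mostly on xs[i], slightly on all inputs."""
    return xs[i] + 0.01 * sum(xs) / len(xs)

def quantize(value, step=1.0):
    """Scalar rounding used as a stand-in for vector quantization."""
    return round(value / step)

def expensive_stage(code):
    """Placeholder for a costly downstream computation on a discrete code."""
    return code * code

def run_incremental(xs, cache):
    """Recompute the expensive stage only where the discrete code changed.

    cache maps position -> (code, output) from the previous run.
    Returns the outputs and the number of expensive recomputations.
    """
    outputs, recomputed = [], 0
    for i in range(len(xs)):
        code = quantize(dense_mix(xs, i))
        if i not in cache or cache[i][0] != code:
            cache[i] = (code, expensive_stage(code))
            recomputed += 1
        outputs.append(cache[i][1])
    return outputs, recomputed

cache = {}
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
first_out, first_count = run_incremental(inputs, cache)

# Atomic edit: change a single position. Every dense_mix output shifts
# slightly, but only the edited position crosses a quantization boundary.
edited = list(inputs)
edited[3] = 9.0
second_out, second_count = run_incremental(edited, cache)

# Reference result computed from scratch, for comparison.
scratch_out, _ = run_incremental(edited, {})

print(first_count, second_count)  # prints: 8 1
```

Without the quantization step, every position's value would differ after the edit and nothing could be reused; with it, the incremental result matches the from-scratch result while only one position is recomputed in this toy example.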

Efficient Online Processing with Deep Neural Networks

Vision Transformer Computation and Resilience for Dynamic Inference

Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference

Efficient Deep Learning Inference Based on Model Compression.

Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms

A neural transducer

Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU

Harnessing Neural Unit Dynamics for Effective and Scalable Class-Incremental Learning

Efficient Incremental Training for Deep Convolutional Neural Networks

DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Vision Transformers

Cavs: An Efficient Runtime System For Dynamic Neural Networks

Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks

Efficient Incremental Learning Using Dynamic Correction Vector

Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference

Training Recurrent Neural Networks against Noisy Computations during Inference

Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU

No Multiplication? No Floating Point? No Problem! Training Networks for Efficient Inference

Accelerating Neural Network Inference by Overflow Aware Quantization

GreenLightningAI: An Efficient AI System with Decoupled Structural and Quantitative Knowledge

Adaptive knowledge transfer for class incremental learning