Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs

Or Sharir, Anima Anandkumar
2023-07-28
Abstract: Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an incremental computing approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use vector quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the Transformer architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification while requiring 12.1X (median) fewer operations for processing sequences of atomic edits.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses the cost of running deep learning models on dynamic inputs (such as sensor data or user input), focusing on text revision in natural language processing. Existing large language models, such as those based on the Transformer architecture, are rerun from scratch after every document modification. Even when an edit touches only a small amount of text (e.g., a single word), the model recomputes the entire document, and most of that computation is redundant. To address this, the authors add Vector Quantization (VQ) layers to the Transformer so that results can be updated incrementally: quantization filters out insignificant changes to intermediate values, so unchanged values can be reused instead of recomputed. Experiments show that VQ-OPT, the model obtained by adapting the pre-trained OPT-125M language model, achieves accuracy comparable to the original on document classification while requiring 12.1 times fewer arithmetic operations (median) when processing atomic edits. The paper also explores how to handle text insertions and deletions and demonstrates potential applications of the method in different scenarios.
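The reuse idea can be illustrated with a toy sketch. This is not the authors' VQ-OPT algorithm: it is a single per-position layer in NumPy that assumes a fixed, hand-picked codebook, and the names `quantize` and `IncrementalLayer` are hypothetical. Each input vector is snapped to its nearest codebook entry, and the layer recomputes its output only at positions whose code index changed since the previous call.

```python
import numpy as np

def quantize(x, codebook):
    """Return, for each row of x, the index of the nearest codebook vector."""
    # Squared Euclidean distance between every row and every code vector.
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

class IncrementalLayer:
    """Toy layer (hypothetical): quantize each position, then transform its code."""

    def __init__(self, codebook, weight):
        self.codebook = codebook
        self.weight = weight
        self.codes = None      # code indices seen on the previous call
        self.outputs = None    # cached per-position outputs
        self.recomputed = 0    # positions recomputed on the last call

    def forward(self, x):
        codes = quantize(x, self.codebook)
        if self.codes is None or len(codes) != len(self.codes):
            # First call (or a length change): compute every position.
            changed = np.arange(len(codes))
            self.outputs = np.zeros((len(codes), self.weight.shape[1]))
        else:
            # Only positions whose discrete code differs need new work.
            changed = np.nonzero(codes != self.codes)[0]
        self.outputs[changed] = np.tanh(self.codebook[codes[changed]] @ self.weight)
        self.codes = codes
        self.recomputed = len(changed)
        return self.outputs.copy()

# Assumed toy data: four 2-D code vectors and a random 2x3 weight matrix.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
weight = np.random.default_rng(0).normal(size=(2, 3))
layer = IncrementalLayer(codebook, weight)

x0 = codebook[[0, 1, 2, 3, 1]] + 0.01   # five positions near known codes
layer.forward(x0)
first = layer.recomputed                # all positions computed

x1 = x0 + 0.05                          # small perturbation everywhere
layer.forward(x1)
second = layer.recomputed               # codes unchanged, nothing recomputed

x2 = x1.copy()
x2[2] = [0.9, -1.1]                     # a real edit at one position
out = layer.forward(x2)
third = layer.recomputed                # only the edited position recomputed

# The incremental result matches a full recomputation from scratch.
full = np.tanh(codebook[quantize(x2, codebook)] @ weight)
print(first, second, third, np.allclose(out, full))
```

In this sketch the small perturbation leaves every code index unchanged, so no position is recomputed, while the single real edit triggers work at one position only. The paper applies the same principle inside a Transformer, where the attention layers make the bookkeeping considerably more involved than this per-position example.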