A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

Yeqi Gao, Zhao Song, Weixin Wang, Junze Yin

DOI: https://doi.org/10.48550/arXiv.2309.07418

2023-09-14

Abstract: Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn. $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the entry in the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1= A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm to train the loss function $L(X,Y)$ up to $\epsilon$ that runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by a $b \times c$ matrix, and $\omega\approx 2.37$ denotes the exponent of matrix multiplication.

Data Structures and Algorithms, Machine Learning

What problem does this paper attempt to address?

The paper addresses the optimization of the attention mechanism in large language models (LLMs). Specifically, the authors study the objective function $L(X, Y)$ of a single-layer attention network and give a provable optimization guarantee for it. The objective function is:

\[L(X, Y)=\sum_{j_0 = 1}^{n}\sum_{i_0 = 1}^{d}\left(\left\langle\langle\exp(\mathsf{A}_{j_0}x),{\bf 1}_n\rangle^{-1}\exp(\mathsf{A}_{j_0}x),\,A_3Y_{*,i_0}\right\rangle - b_{j_0,i_0}\right)^2\]

where:

- $\mathsf{A}\in\mathbb{R}^{n^{2}\times d^{2}}$ is the Kronecker product of $A_1\in\mathbb{R}^{n\times d}$ and $A_2\in\mathbb{R}^{n\times d}$.
- $A_3\in\mathbb{R}^{n\times d}$.
- $\mathsf{A}_{j_0}\in\mathbb{R}^{n\times d^{2}}$ is the $j_0$-th block of $\mathsf{A}$.
- $X,Y\in\mathbb{R}^{d\times d}$ are the variables to be learned.
- $B\in\mathbb{R}^{n\times d}$, and $b_{j_0,i_0}\in\mathbb{R}$ is the entry in the $j_0$-th row and $i_0$-th column of $B$.
- $Y_{*,i_0}\in\mathbb{R}^d$ is the $i_0$-th column vector of $Y$.
- $x\in\mathbb{R}^{d^{2}}$ is the vectorized form of $X$.
- ${\bf 1}_n\in\mathbb{R}^n$ is the all-ones vector, so $\langle\exp(\mathsf{A}_{j_0}x),{\bf 1}_n\rangle^{-1}\exp(\mathsf{A}_{j_0}x)$ is the softmax of $\mathsf{A}_{j_0}x$.

The main contribution is an iterative greedy algorithm that trains the loss function $L(X, Y)$ up to precision $\epsilon$ in time $\widetilde{O}((\mathcal{T}_{\mathrm{mat}}(n, n, d)+\mathcal{T}_{\mathrm{mat}}(n, d, d)+d^{2\omega})\log(1 /\epsilon))$. Here, $\mathcal{T}_{\mathrm{mat}}(a, b, c)$ is the time needed to multiply an $a\times b$ matrix by a $b\times c$ matrix, and $\omega\approx2.37$ is the exponent of matrix multiplication. The goal is to solve this attention regression problem, a basic task in optimizing LLMs, in roughly matrix multiplication time and with a provable guarantee.
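To make the objective concrete, here is a minimal NumPy sketch that evaluates $L(X, Y)$ in two ways: directly from the Kronecker-product definition, and in an equivalent matrix form $\| D^{-1}\exp(A_1 X A_2^\top) A_3 Y - B \|_F^2$ with $D$ the diagonal matrix of row sums. The function names are hypothetical, and the sketch assumes a row-major vectorization of $X$ (the ordering under which block $j_0$ of `np.kron(A1, A2)` applied to $x$ gives row $j_0$ of $A_1 X A_2^\top$). It only evaluates the loss; it is not the paper's training algorithm.

```python
import numpy as np

def attention_loss_kron(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) term by term from the Kronecker-product definition."""
    n, d = A1.shape
    A = np.kron(A1, A2)                 # shape (n*n, d*d)
    x = X.reshape(-1)                   # row-major vectorization (assumption)
    total = 0.0
    for j0 in range(n):
        A_j0 = A[j0 * n:(j0 + 1) * n]   # j0-th block, shape (n, d*d)
        u = np.exp(A_j0 @ x)            # exp(A_{j0} x), shape (n,)
        f = u / u.sum()                 # <u, 1_n>^{-1} u, i.e. softmax
        for i0 in range(d):
            total += (f @ (A3 @ Y[:, i0]) - B[j0, i0]) ** 2
    return total

def attention_loss_matrix(A1, A2, A3, B, X, Y):
    """Same loss in matrix form, without materializing the Kronecker product."""
    S = np.exp(A1 @ X @ A2.T)                  # shape (n, n)
    P = S / S.sum(axis=1, keepdims=True)       # row-wise softmax
    return float(np.sum((P @ A3 @ Y - B) ** 2))

# Small random instance; the sizes are arbitrary illustration values.
rng = np.random.default_rng(0)
n, d = 4, 3
A1, A2, A3, B = (rng.standard_normal((n, d)) for _ in range(4))
X = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))

loss_kron = attention_loss_kron(A1, A2, A3, B, X, Y)
loss_matrix = attention_loss_matrix(A1, A2, A3, B, X, Y)
print(np.isclose(loss_kron, loss_matrix))
```

The matrix form never builds the $n^2 \times d^2$ matrix $\mathsf{A}$, which is why its cost is dominated by products of the shapes that appear in the stated running time.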

Cross-layer Attention Sharing for Large Language Models

In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick

Algorithm and Hardness for Dynamic Attention Maintenance in Large Language Models

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

HSR-Enhanced Sparse Attention Acceleration

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

LoLCATs: On Low-Rank Linearizing of Large Language Models

On Speeding Up Language Model Evaluation

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

Self-Selected Attention Span for Accelerating Large Language Model Inference

Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention

EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models

Skipping Computations in Multimodal LLMs

Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models

Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers