A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

Yeqi Gao, Zhao Song, Weixin Wang, Junze Yin
DOI: https://doi.org/10.48550/arXiv.2309.07418
2023-09-14
Abstract: Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$. $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn. $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the entry in the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1= A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm to train the loss function $L(X,Y)$ up to $\epsilon$ that runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by a $b \times c$ matrix, and $\omega\approx 2.37$ denotes the exponent of matrix multiplication.
Data Structures and Algorithms, Machine Learning
What problem does this paper attempt to address?
This paper addresses the optimization of the attention mechanism in large language models (LLMs). Specifically, the authors study the objective function \(L(X, Y)\) of a single-layer attention network and give a provable guarantee for minimizing it. The objective is:

\[L(X, Y)=\sum_{j_0 = 1}^{n}\sum_{i_0 = 1}^{d}\left(\left\langle\langle\exp(\mathsf{A}_{j_0}x),\mathbf{1}_n\rangle^{-1}\exp(\mathsf{A}_{j_0}x),\,A_3Y_{*,i_0}\right\rangle - b_{j_0,i_0}\right)^2\]

where:
- \(\mathsf{A}\in\mathbb{R}^{n^{2}\times d^{2}}\) is the Kronecker product of \(A_1\in\mathbb{R}^{n\times d}\) and \(A_2\in\mathbb{R}^{n\times d}\).
- \(A_3\in\mathbb{R}^{n\times d}\).
- \(\mathsf{A}_{j_0}\in\mathbb{R}^{n\times d^{2}}\) is the \(j_0\)-th block of \(\mathsf{A}\).
- \(X,Y\in\mathbb{R}^{d\times d}\) are the variables to be learned.
- \(B\in\mathbb{R}^{n\times d}\), and \(b_{j_0,i_0}\in\mathbb{R}\) is the entry in the \(j_0\)-th row and \(i_0\)-th column of \(B\).
- \(Y_{*,i_0}\in\mathbb{R}^d\) is the \(i_0\)-th column vector of \(Y\).
- \(x\in\mathbb{R}^{d^{2}}\) is the vectorized form of \(X\).
- \(\mathbf{1}_n\in\mathbb{R}^{n}\) is the all-ones vector, so \(\langle\exp(\mathsf{A}_{j_0}x),\mathbf{1}_n\rangle^{-1}\exp(\mathsf{A}_{j_0}x)\) is the softmax of \(\mathsf{A}_{j_0}x\).

The main contribution is an iterative greedy algorithm that trains the loss function \(L(X, Y)\) to precision \(\epsilon\) in time \(\widetilde{O}((\mathcal{T}_{\mathrm{mat}}(n, n, d)+\mathcal{T}_{\mathrm{mat}}(n, d, d)+d^{2\omega})\log(1 /\epsilon))\). Here, \(\mathcal{T}_{\mathrm{mat}}(a, b, c)\) is the time to multiply an \(a\times b\) matrix by a \(b\times c\) matrix, and \(\omega\approx2.37\) is the exponent of matrix multiplication. The aim is a provably fast procedure for this attention regression problem, which the abstract describes as a fundamental task in optimizing LLMs.
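To make the notation concrete, the following NumPy sketch evaluates \(L(X, Y)\) in two ways: block by block through the Kronecker product, and in the equivalent matrix form \(\|D^{-1}\exp(A_1 X A_2^\top)A_3 Y - B\|_F^2\), where \(D\) is the diagonal matrix of row sums. This is only an illustration of the objective, not the paper's algorithm. The function names are hypothetical, and the code assumes a row-major vectorization of \(X\), which is the convention under which `np.kron(A1, A2)` makes block \(j_0\) of \(\mathsf{A}\) times \(x\) equal to row \(j_0\) of \(A_1 X A_2^\top\).

```python
import numpy as np

def loss_blockwise(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) term by term, following the summation in the text.

    Assumption: x is the row-major vectorization of X.
    """
    n, d = A1.shape
    A = np.kron(A1, A2)      # Kronecker product, shape (n^2, d^2)
    x = X.reshape(-1)        # vectorization of X, shape (d^2,)
    V = A3 @ Y               # column i0 of V is A3 @ Y[:, i0]
    total = 0.0
    for j0 in range(n):
        A_j0 = A[j0 * n:(j0 + 1) * n, :]   # j0-th block, shape (n, d^2)
        u = np.exp(A_j0 @ x)               # exp(A_{j0} x), shape (n,)
        f = u / u.sum()                    # <u, 1_n>^{-1} u, i.e. softmax
        total += np.sum((f @ V - B[j0]) ** 2)   # sum over i0 = 1..d
    return total

def loss_matrix(A1, A2, A3, B, X, Y):
    """Same objective in matrix form: || D^{-1} exp(A1 X A2^T) A3 Y - B ||_F^2."""
    S = np.exp(A1 @ X @ A2.T)                  # shape (n, n)
    P = S / S.sum(axis=1, keepdims=True)       # row-wise normalization D^{-1} S
    return np.sum((P @ (A3 @ Y) - B) ** 2)
```

The blockwise version materializes an \(n^2 \times d^2\) matrix and is meant only for checking small instances. The matrix form needs just a few \(n \times d\), \(d \times d\), and \(n \times n\) products, which is the kind of cost the \(\mathcal{T}_{\mathrm{mat}}\) terms in the running time refer to.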