Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad
2024-10-29
Abstract: Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labeled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a PEFT method, particularly with less data, and demonstrate its scalability.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to achieve knowledge composition and transfer by learning linear combinations of task vectors, especially when data is scarce. Specifically, the authors propose an algorithm named aTLAS, which scales the parameter blocks of each task vector independently (anisotropic scaling) to improve the performance of multi-task models and reduce interference between tasks.

### Core Problems of the Paper

1. **Characteristics of Task Vectors**: Investigate whether the components of task vectors, such as parameter blocks, exhibit characteristics similar to those of whole task vectors.
2. **Knowledge Combination and Transfer**: Explore how these parameter blocks can be used to enhance knowledge composition and transfer.

### Specific Application Scenarios

- **Task Arithmetic**: Edit pre-trained models by combining task vectors, for example through task negation and task addition.
- **Few-shot Recognition**: Learn new tasks from limited labeled data, especially when labels are very scarce.
- **Test-time Adaptation**: Fine-tune the model during the inference stage to adapt to new tasks or domains.
- **Parameter-efficient Fine-tuning (PEFT)**: Optimize model performance with a small number of learnable parameters when data is limited.

### Method Overview

The core idea of aTLAS is to scale each parameter block of a task vector with its own learned coefficient, so that the task vector as a whole is scaled anisotropically. The specific steps are as follows:

1. **Define Task Vectors**:
   \[ \tau_i = \theta_i - \theta_0 \]
   where \(\theta_0\) denotes the weights of the pre-trained model and \(\theta_i\) denotes the weights after fine-tuning on the \(i\)-th task.
2. **Introduce Block-Diagonal Matrix**:
   \[ \Lambda_i = \begin{bmatrix} \lambda_i^{(1)} I^{(1)} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_i^{(m)} I^{(m)} \end{bmatrix} \]
   where \(\lambda_i^{(j)}\) is the learned coefficient for the \(j\)-th parameter block of task vector \(\tau_i\), and \(I^{(j)}\) is the identity matrix matching the dimension of that block. Each task vector has its own matrix \(\Lambda_i\).
3. **Optimization Objective**:
   \[ \arg\min_{\Lambda_1,\ldots,\Lambda_n} \mathbb{E}_{(x,y)\in D_t}\left[ L\left( f\left(x;\, \theta_0 + \sum_{i=1}^{n} \Lambda_i \tau_i\right),\, y \right) \right] \]
   The optimal combination of task vectors is found by minimizing the loss \(L\) over the target dataset \(D_t\); only the coefficients in \(\Lambda_1,\ldots,\Lambda_n\) are learned.

### Experimental Results

- **Task Arithmetic**: aTLAS is significantly better than standard task vectors and linear task vectors in task negation and task addition.
- **Few-shot Recognition**: aTLAS is significantly better than other methods in the 1-shot case, and its performance improves further when combined with existing methods.
- **Test-time Adaptation**: aTLAS generalizes better on out-of-domain datasets and consistently improves the performance of pre-trained models.
- **Parameter-efficient Fine-tuning**: aTLAS performs well when data is limited, especially for large-scale pre-trained models such as CLIP.

### Summary

This paper demonstrates the potential of task vectors for knowledge composition and transfer through the aTLAS algorithm, especially when data is scarce. The experimental results show that aTLAS effectively reduces interference between different tasks.
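
The three steps in the method overview above (task vectors, one coefficient per parameter block, and learning those coefficients by minimizing a loss) can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' implementation: the model is a hypothetical two-block linear layer, the weights and data are random stand-ins, mean squared error replaces the paper's task losses, and plain gradient descent replaces whatever optimizer the authors use. The names `random_params`, `compose`, `forward`, and `mse` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tasks = 4, 3, 2

def random_params():
    # Two parameter blocks: a weight matrix "W" and a bias vector "b".
    return {"W": rng.normal(size=(d_in, d_out)), "b": rng.normal(size=d_out)}

theta_0 = random_params()                              # stand-in for pre-trained weights
theta_ft = [random_params() for _ in range(n_tasks)]   # stand-ins for fine-tuned weights

# Step 1: task vectors tau_i = theta_i - theta_0, kept block by block.
taus = [{k: th[k] - theta_0[k] for k in theta_0} for th in theta_ft]

def compose(lams):
    # Step 2: theta = theta_0 + sum_i Lambda_i tau_i.
    # Multiplying block k of tau_i by the scalar lams[i][k] is equivalent to
    # applying the block-diagonal matrix Lambda_i.
    return {
        k: theta_0[k] + sum(lams[i][k] * taus[i][k] for i in range(n_tasks))
        for k in theta_0
    }

def forward(x, theta):
    # Hypothetical toy model: a single linear layer.
    return x @ theta["W"] + theta["b"]

def mse(lams, x, y):
    return float(np.mean((forward(x, compose(lams)) - y) ** 2))

# Synthetic target data generated from a known set of coefficients (assumed).
true_lams = [{"W": 0.8, "b": -0.3}, {"W": 0.2, "b": 1.1}]
x = rng.normal(size=(256, d_in))
y = forward(x, compose(true_lams))

# Step 3: learn only the coefficients; theta_0 and the task vectors stay frozen.
lams = [{k: 0.0 for k in theta_0} for _ in range(n_tasks)]
initial_loss = mse(lams, x, y)
lr = 0.01
for _ in range(3000):
    err = forward(x, compose(lams)) - y
    # Gradient of the mean squared error w.r.t. each parameter block.
    grad = {"W": 2 * x.T @ err / err.size, "b": 2 * err.sum(axis=0) / err.size}
    for i in range(n_tasks):
        for k in theta_0:
            # Chain rule: dL/d(lambda_i^(k)) = <dL/dtheta^(k), tau_i^(k)>.
            lams[i][k] -= lr * float(np.sum(grad[k] * taus[i][k]))
final_loss = mse(lams, x, y)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

In this toy setting there are only four learnable coefficients (two tasks times two blocks), compared with fifteen model weights, which mirrors the abstract's point that only a few coefficients are learnable. Setting every coefficient of a task vector to the same value recovers ordinary isotropic task-vector scaling as a special case.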