LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao
2024-04-01
Abstract: Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, which introduce extra modules or additional input sequences to inject new skills or knowledge, may compromise the innate abilities of LLMs. In this paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information. Specifically, LLaMA-Excitor does not directly change the intermediate hidden state during the self-attention calculation of the transformer structure. We design the Excitor block as a bypass module for the similarity score computation in LLMs' self-attention to reconstruct keys and change the importance of values by learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets. Furthermore, we unify the modeling of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment. Our proposed approach is evaluated in language-only and multi-modal tuning experimental scenarios. Notably, LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement (+6%) on the MMLU benchmark. In visual instruction tuning, we achieve a new state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO, and a comparable performance (88.39%) on ScienceQA to cutting-edge models with more parameters and extensive vision-language pre-training.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses a problem that arises when fine-tuning large language models (LLMs): how to improve instruction-following ability without sacrificing abilities acquired during pre-training. Existing fine-tuning methods such as Adapter, Prefix-tuning, and LoRA can inject new skills or knowledge, but they may impair the inherent capabilities of LLMs, for example through catastrophic forgetting. To tackle this, the paper proposes LLaMA-Excitor, a lightweight approach that stimulates the potential of LLMs through indirect feature interaction so that they follow instructions better. LLaMA-Excitor introduces learnable prompts into the similarity score computation of the self-attention mechanism, gradually increasing the attention paid to valuable information without directly altering the intermediate hidden states. As a result, the pre-trained knowledge of LLMs is largely preserved even when fine-tuning on low-quality or non-target datasets. LLaMA-Excitor also unifies multi-modal and language-only fine-tuning, so a language model can be extended cost-effectively into a vision-language model without complex multi-modal alignment. The paper reports this benefit on image captioning and visual question answering.
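The idea of letting learnable prompts act only on the similarity scores can be illustrated with a toy sketch. This is not the paper's Excitor block: it assumes single-head attention without a causal mask, and the names `excited_attention`, `prompts`, and `gate` are hypothetical choices made for illustration. The prompts are used only to reconstruct the keys, so they change how attention is distributed while the output remains a weighted average of the original values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, k):
    # Standard scaled dot-product similarity followed by softmax.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def excited_attention(q, k, v, prompts, gate):
    # Hypothetical bypass: each key attends to the learnable prompts and is
    # shifted by a gated mixture of them. The values v are never modified,
    # and no prompt vector is appended to the value sequence.
    d = q.shape[-1]
    mix = softmax(k @ prompts.T / np.sqrt(d))   # (n_tokens, n_prompts)
    k_new = k + gate * (mix @ prompts)          # reconstructed keys
    w = attention_weights(q, k_new)             # only the scores change
    return w @ v, w

rng = np.random.default_rng(0)
n_tokens, n_prompts, dim = 5, 3, 8
q, k, v = (rng.normal(size=(n_tokens, dim)) for _ in range(3))
prompts = rng.normal(size=(n_prompts, dim))     # stand-in for learned prompts

baseline = attention_weights(q, k) @ v
out_zero, _ = excited_attention(q, k, v, prompts, gate=0.0)
out_on, w_on = excited_attention(q, k, v, prompts, gate=0.5)
```

With `gate=0.0` the sketch reduces exactly to ordinary attention, which mirrors the notion of starting from the pre-trained behaviour and adding attention gradually. With a non-zero gate, each output row is still a convex combination of the rows of `v`, so no new content is written into the hidden state directly.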