Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf
Abstract: Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?
The paper addresses the efficient fine-tuning of large foundation models, in particular how to reduce the number of trainable parameters while maintaining model performance. To this end, it proposes a new method called "Butterfly Orthogonal Fine-Tuning" (BOFT).
### Research Background
Large foundation models such as ChatGPT and Stable Diffusion demonstrate exceptional generalization capabilities, but their parameter counts have grown dramatically (for example, GPT-3 has about 175 billion parameters). Training such models from scratch is therefore prohibitively expensive, and efficiently adapting pre-trained models to downstream tasks becomes particularly important. Common approaches to efficient task adaptation include model fine-tuning, adapter tuning, and prompt tuning.
### Main Contributions
1. **Orthogonal Fine-Tuning from the Perspective of Information Transmission**: The authors first re-examine Orthogonal Fine-Tuning (OFT) from the perspective of information transmission and identify several key requirements to achieve better parameter efficiency. Inspired by the butterfly structure in the Cooley-Tukey Fast Fourier Transform algorithm, they propose an efficient orthogonal parameterization method based on the butterfly structure.
2. **Butterfly Orthogonal Fine-Tuning (BOFT)**: By applying the butterfly structure to OFT, a new parameter-efficient fine-tuning method called BOFT is created. This method not only significantly reduces the number of trainable parameters but also retains good expressive power and training stability.
3. **Theoretical Insights**: The paper provides several theoretical insights into why BOFT can maintain good expressiveness and training stability while significantly reducing the number of trainable parameters. Additionally, through matrix decomposition, BOFT also exhibits an interesting weight interpolation property.
4. **Extensive Application Demonstration**: The authors state that this is the first application of orthogonal fine-tuning to tasks beyond controllable text-to-image generation, which suggests its potential as a general model fine-tuning method. Specifically, BOFT is applied to downstream tasks in computer vision and natural language processing, where the paper reports improvements over existing state-of-the-art methods.
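The butterfly idea in contribution 1 can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names `butterfly_factor` and `butterfly_orthogonal` are hypothetical, and parameterizing each 2×2 block by a single rotation angle is a simplifying assumption made here for brevity. Each factor is sparse and orthogonal; composing log2(d) of them with doubling strides, in the pattern of the Cooley-Tukey FFT, should give an orthogonal matrix in which every output depends on every input.

```python
import numpy as np


def butterfly_factor(d, stride, angles):
    """Build one sparse orthogonal factor (illustrative, hypothetical helper).

    Applies a 2x2 rotation to each index pair (p, p + stride). The pairs are
    disjoint, so the factor is orthogonal and has only 2 nonzeros per row.
    """
    F = np.zeros((d, d))
    k = 0
    for start in range(0, d, 2 * stride):
        for offset in range(stride):
            p, q = start + offset, start + offset + stride
            c, s = np.cos(angles[k]), np.sin(angles[k])
            F[p, p], F[p, q] = c, -s
            F[q, p], F[q, q] = s, c
            k += 1
    return F


def butterfly_orthogonal(d, rng):
    """Compose log2(d) butterfly factors into one d x d orthogonal matrix.

    Assumes d is a power of two. Returns the matrix and the number of
    angles (trainable parameters in this toy parameterization).
    """
    m = d.bit_length() - 1
    if 2 ** m != d:
        raise ValueError("d must be a power of two")
    R = np.eye(d)
    n_params = 0
    for level in range(m):
        # Angles kept away from 0 and pi/2 so every rotation mixes both inputs.
        angles = rng.uniform(0.1, np.pi / 2 - 0.1, size=d // 2)
        R = butterfly_factor(d, 2 ** level, angles) @ R
        n_params += angles.size
    return R, n_params


R, n_params = butterfly_orthogonal(8, np.random.default_rng(0))
print("orthogonal:", np.allclose(R @ R.T, np.eye(8)))
print("dense:", bool(np.all(np.abs(R) > 1e-6)))
print("angles used:", n_params)
```

In this toy setting each factor alone is as sparse as a block-diagonal matrix, yet the product has no zero entries, which is the property the summary describes as efficient information transmission.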
### Core Innovations
- **Application of Butterfly Structure**: The butterfly structure improves the parameter efficiency of OFT. A product of sparse orthogonal factors yields a dense orthogonal matrix while using far fewer parameters than a direct dense parameterization.
- **Theoretical and Empirical Analysis**: The paper argues theoretically that BOFT is more expressive than OFT, and it validates the method experimentally on multiple downstream tasks, reporting better parameter efficiency and generalization.
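The parameter-efficiency claim can be made concrete with a rough count. The sketch below assumes that each b×b orthogonal block has b(b-1)/2 free parameters (the dimension of the rotation group), that a block-diagonal matrix stands in for OFT, and that a product of m such factors stands in for BOFT. The function name `orthogonal_param_counts` is hypothetical, and the numbers are per-matrix counts for illustration, not figures reported in the paper.

```python
def orthogonal_param_counts(d, b, m):
    """Count free parameters of three d x d orthogonal parameterizations.

    d: matrix dimension, b: block size (must divide d),
    m: number of butterfly factors in the product.
    """
    if d % b != 0:
        raise ValueError("block size b must divide d")
    per_block = b * (b - 1) // 2  # free parameters of one b x b rotation
    return {
        "dense": d * (d - 1) // 2,                # one unstructured orthogonal matrix
        "block_diagonal": (d // b) * per_block,   # d/b independent blocks
        "butterfly": m * (d // b) * per_block,    # m block-structured factors
    }


# Example: d = 1024 with 2x2 blocks and log2(1024) = 10 factors.
print(orthogonal_param_counts(1024, 2, 10))
```

Under these assumptions the butterfly product costs m times as much as a single block-diagonal matrix, which for m = log2(d) grows as d log d rather than d² for the dense case.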
In summary, the paper proposes BOFT, a fine-tuning method that substantially reduces the number of trainable parameters while preserving model performance, which makes adapting large-scale foundation models to downstream tasks more practical.