You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models

Eric Slyman, Anirudh Kanneganti, Sanghyun Hong, Stefan Lee
2024-10-27
Abstract: We study the impact of a standard practice in compressing foundation vision-language models - quantization - on the models' ability to produce socially-fair outputs. In contrast to prior findings with unimodal models that compression consistently amplifies social biases, our extensive evaluation of four quantization settings across three datasets and three CLIP variants yields a surprising result: while individual models demonstrate bias, we find no consistent change in bias magnitude or direction across a population of compressed models due to quantization.
Computer Vision and Pattern Recognition, Computers and Society, Machine Learning
What problem does this paper attempt to address?
This paper studies how quantization affects the ability of multimodal vision-language foundation models (ViL models) to produce socially fair outputs. Specifically, the authors explore the following issues:

1. **The impact of quantization on model fairness**:
   - Quantization is a common method for compressing deep learning models. Converting model parameters from 32-bit floating-point numbers to low-bit integers (such as 8-bit or 4-bit) significantly reduces the model's memory footprint and inference latency.
   - However, this conversion introduces small numerical perturbations, which can change model behavior, including the model's bias towards different social groups.
2. **Whether there is a consistent bias amplification phenomenon**:
   - Previous studies have shown that in unimodal models (pure vision or pure language models), compression usually amplifies social bias. For multimodal vision-language models, it was not clear whether the same effect holds.
   - After extensive evaluation of four quantization settings, three datasets, and three CLIP variants, the authors found that quantization does not consistently change the magnitude or direction of bias across the compressed models.

### Main contributions

- **Filling the knowledge gap**: This is the first work to systematically study the impact of quantization on the fairness of multimodal vision-language models, filling an important gap in existing research.
- **Complex and context-dependent results**: Unlike previous studies of unimodal models, the authors found that the impact of quantization on the bias of multimodal models is not consistent, indicating that the effect of compression techniques on fairness may be more complex and context-dependent.
- **Challenging existing assumptions**: These findings challenge the assumption that "quantization will consistently amplify bias", suggesting that we need a more precise understanding of how compression techniques affect fairness in different architectures and applications.

### Method overview

- **Quantization methods**: The authors used three common quantization methods: 8-bit and 4-bit quantization from HuggingFace and 8-bit dynamic quantization from PyTorch.
- **Evaluation metrics**: Model accuracy and fairness were evaluated on zero-shot image classification and text-image retrieval tasks, using the FACET and FairFace datasets.
- **Experimental settings**: Different variants of the CLIP model, trained on multiple data sources, were evaluated, covering a total of 32 different scenarios.

### Conclusion

The authors' research reveals that the impact of quantization on the bias of multimodal vision-language models is neither consistent nor uniform: its direction and magnitude vary depending on the model, method, and dataset. This indicates that the impact of quantization on fairness is complex and context-dependent, challenges the assumption that quantization will consistently affect bias, and emphasizes the need for further research.
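
To make the "small numerical perturbations" mentioned above concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is an illustration only, not the HuggingFace or PyTorch implementation evaluated in the paper; the function names `quantize_int8` and `dequantize` and the example weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Illustrative scheme only (an assumption, not the paper's exact method).
    """
    max_abs = max(abs(w) for w in weights)
    # One shared scale so the largest-magnitude weight maps to +/-127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical weights standing in for a slice of a model parameter tensor.
weights = [0.8132, -0.4417, 0.0093, 1.2700, -0.9981]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The per-weight perturbation introduced by rounding; it is bounded by
# half a quantization step (scale / 2) for this scheme.
errors = [abs(w - r) for w, r in zip(weights, restored)]
max_error = max(errors)

print("quantized integers:", q)
print("largest perturbation:", max_error)
```

Each weight moves by at most half a quantization step, so any single perturbation is small. Across millions of parameters, though, these shifts can move borderline predictions in either direction, which is consistent with the paper's finding that the resulting bias changes have no consistent sign or magnitude.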