LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Ali Asgarov, Samir Rustamov
2024-08-26
Abstract: This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the challenges that low-resource languages, specifically Azerbaijani, face in multimodal image retrieval. Existing vision-language models primarily support high-resource languages such as English, and fine-tuning them requires substantial computational resources. As a result, low-resource languages are poorly served in multimodal data retrieval. The paper targets the following key issues:

1. **Balancing computational efficiency and performance**: How to reduce the computational demands of the model while maintaining high performance, making it suitable for low-resource language environments.
2. **Data scarcity**: How to train an effective multimodal model with limited data resources, especially in low-resource languages.
3. **Model generalization ability**: How to improve the model's generalization across different datasets, so that it performs well not only on training data but also on unseen data.
4. **Cross-language adaptability**: How to extend the model's training methods and architecture to other similar low-resource languages, such as Kazakh or Uzbek.

### Main Contributions

1. **Developed and validated a multimodal vision-language retrieval model specifically for Azerbaijani**, creating a custom image retrieval model that operates effectively in low-resource language environments.
2. **Improved the computational efficiency of the model design**, making it easy to replicate for other low-resource languages. The lower operational demands make practical deployment feasible.
3. **Conducted a comparative analysis of different visual encoders and text encoders on in-domain and out-of-domain data**, evaluating their generalization and scalability in new environments.

Through these efforts, the paper hopes to extend powerful AI technologies to diverse and resource-limited linguistic environments, enabling more people to benefit from these technologies.
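
The CLIP-style setup described in the abstract pairs a text encoder with an image encoder and trains them so that matching image-caption pairs score higher than mismatched ones. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the embeddings are ordinary lists standing in for the outputs of Multilingual BERT and an image encoder, the function names are hypothetical, and the temperature of 0.07 is an assumed value rather than one taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def similarity_matrix(image_embs, text_embs, temperature):
    """Cosine similarity of every image with every text, divided by temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image-to-text and text-to-image directions averaged."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

def retrieve(text_emb, image_embs, k=3):
    """Return indices of the k images most similar to a text query."""
    query = l2_normalize(text_emb)
    scores = [sum(a * b for a, b in zip(l2_normalize(img), query))
              for img in image_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Toy embeddings: image i and caption i point in the same direction.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
captions = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.5]]
aligned_loss = clip_contrastive_loss(images, captions)
shuffled_loss = clip_contrastive_loss(images, captions[1:] + captions[:1])
top_match = retrieve(captions[1], images, k=1)
```

In this toy setup the aligned pairs give a loss near zero while the shuffled pairs give a large one, which is the signal that pulls matching images and captions together during training. At retrieval time only the similarity ranking in `retrieve` is needed.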