Abstract: This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.
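
The abstract describes a CLIP-style setup in which an image encoder and a text encoder (here Multilingual BERT) are trained so that matching image-caption pairs land close together in a shared embedding space. A minimal sketch of the symmetric contrastive objective commonly used for such models follows. It is an illustration only, not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions, and in a real pipeline the vectors would come from the encoders and their projection layers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    The temperature value is an assumption, not taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each image should pick out its own caption.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: each caption should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (image_to_text + text_to_image) / 2


if __name__ == "__main__":
    aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
    shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.6f}, shuffled pairs: {shuffled:.6f}")
```

The loss is small when each image is most similar to its own caption and large when the pairing is scrambled, which is the signal that pulls the two encoders into a common space.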

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the challenges of low-resource languages in multimodal image retrieval tasks, specifically focusing on Azerbaijani. Existing vision-language models primarily support high-resource languages (such as English), and fine-tuning these models requires substantial computational resources. This results in inadequate performance for low-resource languages in multimodal data retrieval. Specifically, the paper attempts to solve the following key issues:

1. **Balancing computational efficiency and performance**: How to reduce the computational demands of the model while maintaining high performance, making it suitable for low-resource language environments.
2. **Data scarcity**: How to train an effective multimodal model with limited data resources, especially in low-resource languages.
3. **Model generalization ability**: How to improve the model's generalization ability across different datasets, ensuring it performs well not only on training data but also on unseen data.
4. **Cross-language adaptability**: How to extend the model's training methods and architecture to other similar low-resource languages, such as Kazakh or Uzbek.

### Main Contributions

1. **Developed and validated a multimodal vision-language retrieval model specifically for Azerbaijani**, creating a custom image retrieval model that operates effectively in low-resource language environments.
2. **Improved the computational efficiency of the model design**, making it easily replicable for other low-resource languages. Its high computational efficiency reduces operational demands, allowing for practical deployment.
3. **Conducted comparative analysis of different visual encoders and text decoders and their performance on in-domain and out-of-domain data**, evaluating their generalization and scalability in new environments.

Through these efforts, the paper hopes to extend powerful AI technologies to diverse and resource-limited linguistic environments, enabling more people to benefit from these technologies.
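
The comparison of encoders on in-domain and out-of-domain data is reported in terms of MAP (mean average precision). The sketch below shows one common way to compute MAP for text-to-image retrieval. It is an assumption-laden illustration: the summary does not state the paper's exact MAP definition (for example, any rank cutoff or how relevance is decided), and the function names, query ids, and image ids are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by the number of relevant items penalizes relevant items never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of per-query average precision over all queries in ground_truth."""
    scores = [
        average_precision(rankings.get(query, []), relevant)
        for query, relevant in ground_truth.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical caption queries mapped to the ranked image ids a model returned.
    rankings = {
        "caption_1": ["img_a", "img_b", "img_c"],
        "caption_2": ["img_x", "img_y", "img_z"],
    }
    ground_truth = {
        "caption_1": ["img_a"],  # correct image ranked first
        "caption_2": ["img_y"],  # correct image ranked second
    }
    print(mean_average_precision(rankings, ground_truth))
```

Running the same function on rankings produced for a dataset the model was trained on and for a held-out dataset gives a like-for-like view of the generalization gap.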

LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Realizing Efficient On-Device Language-based Image Retrieval

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Vision-by-Language for Training-Free Compositional Image Retrieval

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment