Cosmos-LLaVA: Chatting with the Visual (Turkish title: Cosmos-LLaVA: Görselle Sohbet Etmek)

Ahmed Zeer, Eren Dogan, Yusuf Erdem, Elif Ince, Osama Shbib, M. Egemen Uzun, Atahan Uz, M. Kaan Yuce, H. Toprak Kesgin, M. Fatih Amasyali

DOI: https://doi.org/10.1109/IDAP64064.2024.10710874

2024-12-04

Abstract: In this study, a Turkish visual instruction model was developed, and various model architectures and dataset combinations were analysed in depth to improve its performance. The Cosmos-LLaVA model, built by combining different large language models and image encoders, is designed to address the shortcomings of existing models in Turkish. The experiments examine in detail how fine-tuning with various datasets affects model performance. The results show that model architecture and dataset selection have a significant impact on performance.

Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The paper aims to develop an efficient Turkish visual instruction model that overcomes the shortcomings of existing models in processing Turkish. Specifically, the research objectives are:

1. **Improve the performance of the Turkish visual instruction model**: By designing and evaluating different model architectures and dataset combinations, the study aims to enhance the model's ability to understand image content and generate corresponding text descriptions.
2. **Address the scarcity of Turkish resources**: Turkish has received less research attention than many other languages. The study aims to fill this gap and provide a solid foundation for subsequent research.
3. **Optimize model architecture and data selection**: The study analyzes the combined effects of different large language models (LLMs) and image encoders, and explores how fine-tuning on various datasets affects model performance.
4. **Evaluate the effectiveness of the model**: Multiple evaluation methods (GPT-4o as a judge, human annotator scores, binary classification tasks, and others) are used to assess the model comprehensively and determine the best configuration.

In summary, this research is committed to building a deep learning model that can efficiently handle Turkish visual tasks, while exploring the key factors that affect model performance, thereby providing a reference and starting point for future research.
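The evaluation step described above (comparing configurations by judge scores and by a binary classification task) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `binary_accuracy` and `rank_configurations`, the configuration names `config_a` and `config_b`, and every number below are hypothetical placeholders chosen for illustration, and the summary does not state how the paper actually aggregates its scores.

```python
from statistics import mean


def binary_accuracy(predictions, labels):
    """Fraction of yes/no answers that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def rank_configurations(judge_scores):
    """Sort configuration names by mean judge score, best first."""
    averages = {name: mean(scores) for name, scores in judge_scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only; NOT results from the paper.
toy_scores = {
    "config_a": [6, 7, 8],  # hypothetical per-question judge scores
    "config_b": [8, 9, 7],
}
ranking = rank_configurations(toy_scores)

# Hypothetical yes/no model answers compared against reference labels.
toy_accuracy = binary_accuracy(
    ["yes", "no", "yes", "no"],
    ["yes", "no", "no", "no"],
)
```

Averaging is only one possible way to combine per-question scores. A real comparison would also need to reconcile the judge-based, human, and binary-classification results, which this sketch keeps separate.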
