Cosmos-LLaVA: Chatting with the Visual (Turkish title: Cosmos-LLaVA: Görselle Sohbet Etmek)

Ahmed Zeer, Eren Dogan, Yusuf Erdem, Elif Ince, Osama Shbib, M. Egemen Uzun, Atahan Uz, M. Kaan Yuce, H. Toprak Kesgin, M. Fatih Amasyali
DOI: https://doi.org/10.1109/IDAP64064.2024.10710874
2024-12-04
Abstract: In this study, a Turkish visual instruction model was developed, and various model architectures and dataset combinations were analysed in depth to improve its performance. The Cosmos-LLaVA model, built by combining different large language models and image encoders, is designed to address the deficiencies in Turkish-language support. The experiments examine in detail how fine-tuning with various datasets affects model performance. The results show that model architecture and dataset selection have a significant impact on performance.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper sets out to develop an efficient Turkish visual instruction model that overcomes the deficiencies of existing models in processing Turkish. Its specific objectives are:

1. **Improve the performance of the Turkish visual instruction model**: By designing and evaluating different model architectures and dataset combinations, the study aims to strengthen the model's ability to understand image content and generate corresponding text descriptions.
2. **Address the scarcity of Turkish resources**: Turkish has received less research attention than many other languages. The study aims to fill this gap and provide a solid foundation for subsequent research.
3. **Optimize model architecture and data selection**: The study analyzes the combined effects of different large language models (LLMs) and image encoders, and explores how fine-tuning on various datasets can improve model performance.
4. **Evaluate the effectiveness of the model**: Several evaluation methods (such as GPT-4o as a judge, human annotator scores, and binary classification tasks) are used to assess the model comprehensively and determine the best configuration.

In summary, the research aims to build a deep learning model that can handle Turkish visual tasks efficiently, while identifying the key factors that affect model performance, and so to provide a reference and starting point for future research.
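
The evaluation signals named above (judge scores, human annotator scores, and a binary classification task) can be collected per model configuration and compared side by side. The following is a minimal Python sketch under stated assumptions, not the authors' evaluation code: the configuration names, the score scales, and all numbers are hypothetical placeholders chosen only to make the example runnable, and ranking by binary accuracy is an illustrative choice rather than the paper's selection rule.

```python
from statistics import mean


def binary_accuracy(predictions, labels):
    """Fraction of yes/no answers that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def summarize(config_name, judge_scores, human_scores, predictions, labels):
    """Collect the three evaluation signals for one model configuration."""
    return {
        "config": config_name,
        "judge_mean": mean(judge_scores),   # e.g. scores assigned by an LLM judge
        "human_mean": mean(human_scores),   # e.g. scores assigned by annotators
        "binary_acc": binary_accuracy(predictions, labels),
    }


# Hypothetical toy data: NOT results from the paper.
reference = ["yes", "no", "no", "no"]
rows = [
    summarize("llm_a+encoder_x", [7, 8, 6], [4, 5, 3],
              ["yes", "no", "yes", "no"], reference),
    summarize("llm_b+encoder_x", [6, 6, 7], [3, 4, 4],
              ["yes", "yes", "yes", "no"], reference),
]

# Illustrative ranking by one signal; a real study would weigh all of them.
best = max(rows, key=lambda r: r["binary_acc"])

for row in rows:
    print(row)
print("best by binary accuracy:", best["config"])
```

Keeping the three signals as separate columns, rather than collapsing them into a single score, makes it easier to see when automatic and human judgments disagree about which configuration is better.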