Abstract: In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows a model to learn both linguistic and visual information while preserving the backbone model's knowledge. Our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities compared to models of similar size. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets, including four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.

What problem does this paper attempt to address?

The main problems that this paper attempts to solve are:

1. **Lack of open-source multimodal large models supporting Korean**:
   - Many multimodal large models support mainstream languages (such as English and Chinese), but very few open-source or commercial ones support low-resource languages (such as Korean). This leaves users heavily dependent on proprietary model APIs and hinders the development of the research environment.
   - To address this, the authors developed a powerful open-source English-Korean bilingual vision-language model (VLM), named VARCO-VISION-14B, and hope its release will promote a more open AI community.
2. **Lack of high-quality Korean evaluation benchmarks**:
   - The currently available Korean datasets mainly target simple vision-text tasks (such as visual question answering, VQA, or optical character recognition, OCR), which makes it difficult to evaluate a model's overall performance comprehensively.
   - To address this, the authors released five new Korean evaluation datasets, comprising four closed-set benchmarks and one open-set benchmark, that evaluate a VLM's bilingual ability to process image-text information.
3. **Improving the model's performance on vision-text understanding and generation tasks**:
   - Through a step-by-step training strategy, the model learns and integrates visual and language understanding abilities while retaining the knowledge of the pre-trained backbone model. Specifically, VARCO-VISION goes through four training stages to gradually acquire visual and language abilities, and it performs well on a variety of benchmarks as a result.
4. **Expanding the practical application scenarios of the model**:
   - VARCO-VISION not only performs well on image-text understanding and generation tasks, but is also capable of grounding, referring, and OCR, which gives it considerable potential in real-world applications.

In summary, this paper aims to fill the gaps in the field of Korean multimodal models and to promote related research and technology by developing and releasing an open-source English-Korean bilingual vision-language model together with its evaluation benchmarks.

VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

CoLLaVO: Crayon Large Language and Vision mOdel

COGVLM: VISUAL EXPERT FOR LARGE LANGUAGE MODELS

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

OpenVLA: An Open-Source Vision-Language-Action Model

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

CogVLM: Visual Expert for Pretrained Language Models

Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Retrieval-Augmented Open-Vocabulary Object Detection

Vision-Language Models for Vision Tasks: A Survey

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Vision-Language Models under Cultural and Inclusive Considerations

Towards Better Vision-Inspired Vision-Language Models

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis