VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

Jeongho Ju, Daeyoung Kim, SunYoung Park, Youngjune Kim
2024-11-28
Abstract: In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows a model to learn both linguistic and visual information while preserving the backbone model's knowledge. Compared to models of similar size, our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets: four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses four main problems:

1. **Lack of open-source multimodal large models supporting Korean**:
   - Although many multimodal large models support mainstream languages (such as English and Chinese), open-source or commercial multimodal large models supporting low-resource languages (such as Korean) are very limited. This leaves users heavily dependent on proprietary model APIs and hinders the development of the research environment.
   - To address this, the authors developed an open-source English-Korean bilingual vision-language model (VLM), VARCO-VISION-14B, and hope that its release will promote a more open AI community.
2. **Lack of high-quality Korean evaluation benchmarks**:
   - The currently available Korean datasets mainly target simple vision-text tasks (such as visual question answering, VQA, or optical character recognition, OCR), which makes it difficult to evaluate a model's overall performance comprehensively.
   - To address this, the authors released five new Korean evaluation datasets, four closed-set benchmarks and one open-set benchmark, aimed at evaluating the bilingual ability of VLMs in processing image-text information.
3. **Improving model performance in vision-text understanding and generation tasks**:
   - Through a step-by-step training strategy, the model learns and integrates visual and language understanding abilities while retaining the knowledge of the pre-trained backbone model. Specifically, VARCO-VISION goes through four training stages to gradually acquire visual and language abilities, and it performs well on a range of benchmarks as a result.
4. **Expanding the practical application scenarios of the model**:
   - Beyond image-text understanding and generation, VARCO-VISION is capable of grounding, referring, and OCR, which gives it considerable potential in real-world applications.

In summary, this paper aims to fill the gaps in the field of Korean multimodal models and to promote related research and technology by developing and releasing an open-source English-Korean bilingual vision-language model together with its evaluation benchmarks.