Empowering MultiModal Models' In-Context Learning Ability through Large Language Models.

Wenjuan Han, Haozhe Zhao, Zefan Cai
DOI: https://doi.org/10.1145/3603165.3607368
2023-01-01
Abstract: Pretrained visual-language models (VLMs) have advanced multimodal modeling and improved performance on a variety of tasks. However, they lack reasoning and in-context learning (ICL) ability. Building on the success of large language models (LLMs) in general-purpose NLP tasks, researchers anticipate that VLMs can acquire similarly strong reasoning and ICL ability through specific techniques, for example by benefiting from LLMs. To enable VLMs to solve vision-language problems from few-shot exemplars, we propose a vision-language model called MIC.