Exploring the Capabilities of Large Multimodal Models on Dense Text

Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai
DOI: https://doi.org/10.1007/978-3-031-70552-6_17
2024-01-01
Abstract: While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.