TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
2024-04-19
Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the performance gap between open-source and closed-source multimodal large language models (MLLMs) in text-centric visual question answering (VQA). The authors attribute part of that gap to a shortage of large-scale, high-quality instruction-tuning data, which leaves open-source models behind leading models such as GPT4V and Gemini. To close it, the paper proposes Square, a data construction process with four steps (self-questioning, answering, reasoning, and evaluation) that uses closed-source MLLMs to generate Square-10M, a large-scale, high-quality instruction-tuning dataset. In the reported experiments, TextSquare, the model trained on Square-10M, surpasses previous open-source state-of-the-art text-centric MLLMs, reaches 62.2% on OCRBench, and outperforms GPT4V and Gemini on 6 of 10 text-centric benchmarks. The experiments also show that VQA reasoning data improves accuracy and mitigates hallucinations, with TextSquare averaging 75.1% across four general VQA and hallucination evaluation datasets. Finally, the study reports a scaling pattern: model performance improves in direct proportion to exponential increases in the volume of instruction-tuning data.
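The four Square steps can be pictured as a small generate-then-filter loop. The sketch below is only an illustration of that idea and not the authors' implementation: the abstract names the steps but gives no prompts or filtering rules, so the `Model` callable signature, the prompt strings, the yes/no evaluation rule, and the `fake_model` stub are all assumptions made here for the sake of a runnable example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: (image identifier, prompt) -> text reply from an MLLM.
Model = Callable[[str, str], str]

# Illustrative prompts only; the paper's actual prompts are not given in the abstract.
ASK = "Propose questions about the text in this image, one per line."
ANSWER = "Answer the question: {q}"
REASON = "Explain why the answer to '{q}' is '{a}'."
JUDGE = "Is '{a}' a correct answer to '{q}'? Reply yes or no."


@dataclass
class Sample:
    image: str
    question: str
    answer: str
    reasoning: str


def square(image: str, model: Model) -> List[Sample]:
    """Run the four steps on one image and keep only samples that pass evaluation."""
    # Step 1, Self-Questioning: the model proposes its own questions.
    reply = model(image, ASK)
    questions = [line.strip() for line in reply.splitlines() if line.strip()]

    kept: List[Sample] = []
    for q in questions:
        # Step 2, Answering.
        a = model(image, ANSWER.format(q=q)).strip()
        # Step 3, Reasoning: ask for the context that supports the answer.
        r = model(image, REASON.format(q=q, a=a)).strip()
        # Step 4, Evaluation: assumed here to be a simple yes/no self-check.
        verdict = model(image, JUDGE.format(q=q, a=a))
        if verdict.strip().lower().startswith("yes"):
            kept.append(Sample(image, q, a, r))
    return kept


def fake_model(image: str, prompt: str) -> str:
    """Stand-in for a real MLLM so the sketch runs without any API access."""
    if prompt == ASK:
        return "What is the shop name?\nWhat is the price?"
    if prompt.startswith("Answer the question:"):
        return "Cafe Luna" if "shop name" in prompt else "unknown"
    if prompt.startswith("Explain why"):
        return "The sign above the door shows this text."
    # Evaluation prompt: reject answers the stub could not read.
    return "no" if "'unknown'" in prompt else "yes"


samples = square("img_001.png", fake_model)
for s in samples:
    print(s.question, "->", s.answer)
```

With the stub, the second question is answered with "unknown" and is dropped at the evaluation step, which is the filtering behaviour the sketch is meant to show. A real pipeline would replace `fake_model` with calls to an actual multimodal model.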