TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

2024-04-19

Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction-tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses the performance gap between open-source and closed-source multimodal large language models (MLLMs) on text-centric Visual Question Answering (VQA). The authors attribute the gap partly to a lack of extensive, high-quality instruction-tuning data, which leaves open-source models behind leading models such as GPT4V and Gemini. The paper proposes Square, a data construction process with four steps (Self-Questioning, Answering, Reasoning, and Evaluation) that uses closed-source MLLMs to generate a large-scale, high-quality instruction-tuning dataset, Square-10M. The resulting model, TextSquare, surpasses previous open-source state-of-the-art text-centric MLLMs, scores 62.2% on OCRBench, and outperforms GPT4V and Gemini on 6 of 10 text-centric benchmarks. The experiments also show that VQA reasoning data improves accuracy and mitigates hallucinations: TextSquare averages 75.1% across four general VQA and hallucination evaluation datasets. Finally, the study finds that model performance improves in proportion to exponential increases in instruction-tuning data volume, that is, roughly linearly in the logarithm of the data size.
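The four Square steps can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `MLLM` callable, the `VQASample` record, the prompt wording, and the yes/no filtering rule are all hypothetical stand-ins for whatever closed-source model and prompts the paper actually used.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping (image reference, prompt) to text.
# The real pipeline queries a closed-source MLLM; its API is not given in the source.
MLLM = Callable[[str, str], str]


@dataclass
class VQASample:
    image: str
    question: str
    answer: str
    reasoning: str


def square_pipeline(image: str, mllm: MLLM) -> List[VQASample]:
    """Sketch of Self-Questioning, Answering, Reasoning, and Evaluation."""
    # 1) Self-Questioning: ask the model to propose questions about the image text.
    raw = mllm(image, "Ask questions about the text in this image, one per line.")
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    samples: List[VQASample] = []
    for q in questions:
        # 2) Answering: obtain an answer for each generated question.
        answer = mllm(image, f"Answer the question: {q}")
        # 3) Reasoning: request the rationale behind the answer.
        reasoning = mllm(image, f"Explain why the answer to '{q}' is '{answer}'.")
        # 4) Evaluation: keep the sample only if the model judges it correct.
        #    (Assumed filter; the paper's actual criteria may differ.)
        verdict = mllm(image, f"Is '{answer}' a correct answer to '{q}'? Reply yes or no.")
        if verdict.strip().lower().startswith("yes"):
            samples.append(VQASample(image, q, answer, reasoning))
    return samples
```

Passing the model in as a plain callable keeps the sketch independent of any particular vendor API, so a stub can stand in for the model when testing the control flow.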

SVIT: Scaling up Visual Instruction Tuning

3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

VQA$^2$: Visual Question Answering for Video Quality Assessment

Personalized Visual Instruction Tuning

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model

CITING: Large Language Models Create Curriculum for Instruction Tuning

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

MAmmoTH2: Scaling Instructions from the Web

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Structured Multimodal Attentions for TextVQA

SQATIN: Supervised Instruction Tuning Meets Question Answering for Improved Dialogue NLU

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch