Abstract: Understanding visually situated language requires interpreting complex layouts of textual and visual elements. Pre-processing tools, such as optical character recognition (OCR), can map document image inputs to textual tokens, then large language models (LLMs) can reason over text. However, such methods have high computational and engineering complexity. Can small pretrained image-to-text models accurately understand visual documents through similar recognition and reasoning steps instead? We propose Rationale Distillation (RD), which incorporates the outputs of OCR tools, LLMs, and larger multimodal models as intermediate "rationales", and trains a small student model to predict both rationales and answers. On three visual document understanding benchmarks representing infographics, scanned documents, and figures, our Pix2Struct (282M parameters) student model finetuned with RD outperforms the base model by 4-5% absolute accuracy with only 1% higher computational cost.

What problem does this paper attempt to address?

This paper asks how to train a small image-to-text model to understand visual documents accurately and efficiently without depending on external tools (such as OCR systems and large language models) at inference time, thereby reducing computational cost and engineering complexity.

### Problem Background

Existing approaches to Visual Document Understanding (VDU) usually rely on external tools, such as optical character recognition (OCR) systems and large language models (LLMs). These tools convert document images into text and then reason over that text. This pipeline has two drawbacks:

1. **High Computational Cost**: Every query must run the external tools in addition to the model, which adds computational overhead.
2. **Engineering Complexity**: Integrating several external tools makes the system harder to build and maintain.

### Paper Goals

The paper proposes a method named "Rationale Distillation (RD)", which pursues two goals:

- **Reduce Dependence on External Tools**: No external tools are used at inference time; only the trained small model is needed.
- **Improve Efficiency and Accuracy**: By predicting an intermediate "rationale" (a reason or explanation), the small model can understand visual documents more efficiently and give correct answers.

### Specific Methods

1. **Rationale Generation**: Use OCR tools, LLMs, and larger multimodal models to generate intermediate "rationales", which include textual evidence, tabular representations, and simple program code.
2. **Multi-task Training**: Train a small student model to predict these rationales as well as the final answer.
3. **Data Augmentation and Filtering**: Increase the quantity and quality of the training data by cropping images and keeping only useful rationales, which makes the student model more robust.
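The multi-task training and filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Sample`, `TrainingExample`, `is_useful`, and `build_examples`, the `[answer]` / `[rationale]` task prefixes, and the answer-containment filter are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    image_id: str              # stand-in for the document image
    question: str
    answer: str
    rationale: Optional[str]   # teacher output: text evidence, table, or program


@dataclass
class TrainingExample:
    image_id: str
    prompt: str
    target: str


def is_useful(sample: Sample) -> bool:
    """Hypothetical filter: keep a rationale only if it mentions the answer."""
    if not sample.rationale:
        return False
    return sample.answer.lower() in sample.rationale.lower()


def build_examples(samples: List[Sample]) -> List[TrainingExample]:
    """Turn annotated samples into a mixed set of student training examples."""
    examples: List[TrainingExample] = []
    for s in samples:
        # Task 1: direct question answering, always included.
        examples.append(
            TrainingExample(s.image_id, f"[answer] {s.question}", s.answer)
        )
        # Task 2: predict the rationale followed by the answer,
        # included only when the rationale passes the filter.
        if is_useful(s):
            examples.append(
                TrainingExample(
                    s.image_id,
                    f"[rationale] {s.question}",
                    f"{s.rationale} Answer: {s.answer}",
                )
            )
    return examples


if __name__ == "__main__":
    data = [
        Sample("doc1", "What is the total?", "42", "Row 'Total' reads 42."),
        Sample("doc2", "Which year?", "2019", "The chart shows sales."),
    ]
    for ex in build_examples(data):
        print(ex.prompt, "->", ex.target)
```

In this sketch the second sample contributes only an answer example, because its rationale never mentions the answer and is filtered out. The student sees both task types during training, but at inference time it needs only the image and the question.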
### Experimental Results

On three visual document understanding benchmarks (InfoVQA, DocVQA, and ChartQA), a small Pix2Struct student model (282M parameters) finetuned with Rationale Distillation outperforms the base model by 4-5% absolute accuracy, while the computational cost rises by only about 1%.

### Summary

The paper's main contribution is an efficient and accurate method for visual document understanding that removes the dependence on external tools at inference time, reducing both computational cost and engineering complexity.

Efficient End-to-End Visual Document Understanding with Rationale Distillation

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

HRVDA: High-Resolution Visual Document Assistant

DUBLIN -- Document Understanding By Language-Image Network

DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models

Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales

3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Visually Descriptive Language Model for Vector Graphics Reasoning

DOMINO: A Dual-System for Multi-step Visual Language Reasoning

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Convincing Rationales for Visual Question Answering Reasoning

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning