Efficient End-to-End Visual Document Understanding with Rationale Distillation

Wang Zhu, Alekh Agarwal, Mandar Joshi, Robin Jia, Jesse Thomason, Kristina Toutanova
2024-04-02
Abstract: Understanding visually situated language requires interpreting complex layouts of textual and visual elements. Pre-processing tools, such as optical character recognition (OCR), can map document image inputs to textual tokens, then large language models (LLMs) can reason over text. However, such methods have high computational and engineering complexity. Can small pretrained image-to-text models accurately understand visual documents through similar recognition and reasoning steps instead? We propose Rationale Distillation (RD), which incorporates the outputs of OCR tools, LLMs, and larger multimodal models as intermediate "rationales", and trains a small student model to predict both rationales and answers. On three visual document understanding benchmarks representing infographics, scanned documents, and figures, our Pix2Struct (282M parameters) student model finetuned with RD outperforms the base model by 4-5% absolute accuracy with only 1% higher computational cost.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper asks how to train small image-to-text models to understand visual documents efficiently and accurately without depending on external tools (such as OCR systems and large language models) at inference time, thereby reducing computational cost and engineering complexity.

### Problem Background

Existing approaches to Visual Document Understanding (VDU) usually rely on external tools, such as Optical Character Recognition (OCR) systems and Large Language Models (LLMs). These tools convert document images into text and then reason over that text. This pipeline has two drawbacks:

1. **High Computational Cost**: Running external tools adds computational overhead.
2. **Engineering Complexity**: Integrating several external tools makes the system more complex and harder to maintain.

### Paper Goals

The paper proposes a method named "Rationale Distillation (RD)", which addresses the problem in the following ways:

- **Reduce Dependence on External Tools**: No external tools are used at inference time; only the trained small model is needed.
- **Improve Efficiency and Accuracy**: By predicting an intermediate "rationale" (supporting evidence or explanation), the small model can understand visual documents more efficiently and answer more accurately.

### Specific Methods

1. **Rationale Generation**: Use OCR tools, large language models, and other tools to generate intermediate "rationales", which include textual evidence, tabular representations, and simple program code.
2. **Multi-task Training**: Train a small student model to predict these rationales as well as the final answer.
3. **Data Augmentation and Filtering**: Increase the quantity and quality of training data by cropping images and keeping only useful rationales, which makes the student model more robust.
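
As a rough illustration of how the steps above might fit together, the sketch below turns tool-generated rationales into multi-task training pairs for a student model. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Example` and `TrainingPair` types, the `[answer]` / `[rationale]` task prefixes, and the "rationale must mention the answer" filter are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    image_id: str   # stand-in for the document image
    question: str
    answer: str
    rationale: str  # text produced offline by an external tool (OCR, LLM, ...)


@dataclass
class TrainingPair:
    image_id: str
    prompt: str
    target: str


def is_useful(example: Example) -> bool:
    """Hypothetical filter: keep a rationale only if it mentions the answer."""
    return example.answer.lower() in example.rationale.lower()


def build_training_pairs(examples: List[Example]) -> List[TrainingPair]:
    """Create one answer task per example, plus a rationale task when the
    rationale passes the filter (task prefixes are assumptions)."""
    pairs: List[TrainingPair] = []
    for ex in examples:
        # The answer-prediction task is always kept.
        pairs.append(TrainingPair(ex.image_id, f"[answer] {ex.question}", ex.answer))
        # The rationale-prediction task is kept only for useful rationales.
        if is_useful(ex):
            pairs.append(
                TrainingPair(ex.image_id, f"[rationale] {ex.question}", ex.rationale)
            )
    return pairs


if __name__ == "__main__":
    data = [
        Example("doc1", "What is the total?", "42", "Total: 42 units"),
        Example("doc2", "Who signed?", "Alice", "Signed by Bob"),
    ]
    for pair in build_training_pairs(data):
        print(pair.prompt, "->", pair.target)
```

In this sketch the external tools are needed only to produce the `rationale` strings offline; a student trained on the resulting pairs would run on its own at inference time, which matches the goal described above.
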
### Experimental Results

On three visual document understanding benchmarks (InfoVQA, DocVQA, and ChartQA), the small Pix2Struct student model finetuned with Rationale Distillation outperforms the base model by 4-5% absolute accuracy, with only about 1% higher computational cost.

### Summary

The paper's main contribution is an efficient and accurate method for visual document understanding that removes the dependence on external tools at inference time, reducing both computational cost and engineering complexity.