PDF-to-Tree: Parsing PDF Text Blocks into a Tree

Yue Zhang, Zhihao Zhang, Wenbin Lai, Chong Zhang, Tao Gui, Qi Zhang, Xuanjing Huang
DOI: https://doi.org/10.18653/v1/2024.findings-emnlp.628
2024-01-01
Abstract: In many PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document’s content. Existing works try to extract one universal reading order for a PDF file. However, applications such as Retrieval Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far exceeding the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. Compared to parsers for plain text, we also use multi-modal features to encode the parser state. Experiments show that our approach achieves an accuracy of 93.93%, surpassing baseline methods by 6.72%.
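To make the idea of a greedy, transition-based tree builder concrete, the following is a minimal sketch and not the paper's actual parser. It assumes the text blocks already arrive in a candidate order and that each block carries a hypothetical integer `level` (1 for a section heading, 2 for a subsection heading, 99 for body text). The function `choose_action` and the action names `REDUCE` and `ATTACH` are illustrative stand-ins for the learned, multi-modal classifier over the parser state that the abstract describes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int  # hypothetical feature: 0 = root, 1 = section, 2 = subsection, 99 = body text
    children: list[Node] = field(default_factory=list)


def choose_action(stack_top: Node, block: Node) -> str:
    """Stand-in for a learned classifier (assumption, not the paper's model).

    REDUCE closes the node on top of the stack; ATTACH makes the incoming
    block a child of the node on top of the stack.
    """
    return "REDUCE" if stack_top.level >= block.level else "ATTACH"


def parse(blocks: list[tuple[str, int]]) -> Node:
    """Greedily build a tree, committing to one action at a time."""
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently attached node
    for text, level in blocks:
        node = Node(text, level)
        # Close open nodes until the classifier says the block belongs under the top.
        while len(stack) > 1 and choose_action(stack[-1], node) == "REDUCE":
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def show(node: Node, depth: int = 0) -> None:
    """Print the tree with two spaces of indentation per level."""
    print("  " * depth + node.text)
    for child in node.children:
        show(child, depth + 1)


blocks = [
    ("1 Introduction", 1),
    ("Paragraph A", 99),
    ("1.1 Background", 2),
    ("Paragraph B", 99),
    ("2 Method", 1),
    ("Paragraph C", 99),
]
root = parse(blocks)
show(root)
```

Because each block is pushed once and popped at most once, this greedy loop runs in time linear in the number of blocks, which is one plausible reason a greedy strategy suits documents with thousands of text blocks.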