ICDAR 2021 Competition on Scientific Literature Parsing

Antonio Jimeno Yepes, Xu Zhong, Douglas Burdick
DOI: https://doi.org/10.48550/arXiv.2106.14616
2021-06-08
Abstract: Scientific literature contains important information related to cutting-edge innovations in diverse domains. Advances in natural language processing have been driving the fast development of automated information extraction from scientific literature. However, scientific literature is often available only in unstructured PDF format. While PDF is great for preserving basic visual elements, such as characters, lines, and shapes, on a canvas for presentation to humans, automatic processing of the PDF format by machines presents many challenges. With over 2.5 trillion PDF documents in existence, these issues are prevalent in many other important application domains as well. Our ICDAR 2021 Scientific Literature Parsing Competition (ICDAR2021-SLP) aims to drive advances specifically in document understanding. ICDAR2021-SLP leverages the PubLayNet and PubTabNet datasets, which provide hundreds of thousands of training and evaluation examples. In Task A, Document Layout Recognition, the highest-performing submissions combine object detection with specialised solutions for the different categories. In Task B, Table Recognition, top submissions rely on methods that identify table components, followed by post-processing methods that generate the table structure and content. Results from both tasks show impressive performance and open the possibility of high-performance practical applications.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses the challenges of automatically extracting information from scientific literature, particularly content that is not natural language, such as tables and figures. Specifically, it describes the ICDAR 2021 Scientific Literature Parsing Competition (ICDAR2021-SLP), which aims to advance document understanding in two areas: document layout recognition (Task A) and table recognition (Task B). The competition used the PubLayNet and PubTabNet datasets, which provide hundreds of thousands of training and evaluation examples. Through the competition, the organizers hope to improve the ability of machines to process PDF documents automatically, in particular by recognizing document layout elements (text, titles, tables, figures, and lists) and by converting table images into machine-readable formats such as HTML. Advances in these technologies matter for efficiently extracting valuable information from large volumes of scientific literature.
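To make the Task B output concrete, the following is a minimal Python sketch of the final serialization step: turning already-recognized table cells into HTML. It is an illustration only, not the method of any submission and not the PubTabNet annotation format. The `Cell` class, its fields, and the `cells_to_html` function are hypothetical names introduced here, and the sketch assumes cell positions, spans, and text have already been produced by an upstream recognition model.

```python
from __future__ import annotations

from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    """A hypothetical recognized table cell (illustrative, not a dataset schema)."""
    row: int              # zero-based row of the cell's top-left corner
    col: int              # zero-based column of the cell's top-left corner
    text: str             # recognized cell content
    rowspan: int = 1
    colspan: int = 1
    header: bool = False  # True if the cell belongs to the table header


def cells_to_html(cells: list[Cell]) -> str:
    """Serialize recognized cells into a simple HTML table string."""
    # Group cells by the row in which they start.
    rows: dict[int, list[Cell]] = {}
    for cell in cells:
        rows.setdefault(cell.row, []).append(cell)

    parts = ["<table>"]
    for row_index in sorted(rows):
        parts.append("<tr>")
        # Emit cells left to right within each row.
        for cell in sorted(rows[row_index], key=lambda c: c.col):
            tag = "th" if cell.header else "td"
            attrs = ""
            if cell.rowspan > 1:
                attrs += f' rowspan="{cell.rowspan}"'
            if cell.colspan > 1:
                attrs += f' colspan="{cell.colspan}"'
            # Escape the text so characters such as & and < stay valid HTML.
            parts.append(f"<{tag}{attrs}>{escape(cell.text)}</{tag}>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


if __name__ == "__main__":
    example = [
        Cell(0, 0, "Method", header=True),
        Cell(0, 1, "Score", header=True),
        Cell(1, 0, "A & B"),
        Cell(1, 1, "0.9"),
    ]
    print(cells_to_html(example))
```

In a real system the hard part is the recognition that precedes this step; the sketch only shows why a structured cell representation makes the final HTML generation straightforward.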