Table of Contents Recognition in OCR Documents using Image-based Machine Learning

Sai Kosaraju, Nelson Zange Tsaku, Pritesh Patel, Tanju Bayramoglu, Girish Modgil, Mingon Kang
DOI: https://doi.org/10.1145/3299815.3314455
2019-04-18
Abstract: The importance of automatic analysis of Optical Character Recognition (OCR) documents has been increasingly recognized as a way to support efficient data management and accessibility. However, most OCR documents are unstructured, making the analysis extremely challenging. A document's Table Of Contents (TOC) provides the overall structure of the document, such as chapters and appendixes. Hence, TOC recognition enables more effective analysis of OCR documents. Most existing related works are based on textual features, such as keywords and font sizes. However, text-based TOC recognition in OCR often fails when OCR documents are complex. In this study, we develop a novel image-based machine learning approach for recognition of TOC, where one-dimensional horizontal projections of TOC are proposed for classifying TOC and non-TOC. To the best of our knowledge, this is the first work to recognize TOC by image-based analysis. We evaluated the proposed methods with PDF documents of theses and dissertations. The experimental results show that our proposed methods outperformed others.
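The abstract does not spell out how the one-dimensional horizontal projection is computed, so the following is only a minimal sketch of one common reading: count the dark pixels in each pixel row of a page image, then average those counts over a fixed number of bands so that every page yields a feature vector of the same length. The function name `horizontal_projection`, the parameters `ink_threshold` and `n_bins`, and the band-averaging step are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def horizontal_projection(page, ink_threshold=128, n_bins=64):
    """Return a fixed-length horizontal projection profile of a page image.

    page: 2D uint8 grayscale array (0 = black, 255 = white).
    ink_threshold, n_bins: hypothetical parameters chosen for this sketch.
    Assumes the page has at least n_bins pixel rows.
    """
    # Boolean mask of "ink" (dark) pixels.
    ink = page < ink_threshold
    # Sum across each row: one ink count per pixel row.
    row_counts = ink.sum(axis=1).astype(float)
    # Average the rows within n_bins horizontal bands so that pages of
    # different heights map to vectors of the same length.
    bands = np.array_split(row_counts, n_bins)
    profile = np.array([band.mean() for band in bands])
    # Normalize by page width: each value is the ink fraction in a band.
    return profile / page.shape[1]


# Synthetic 8x10 "page" that is white except for one fully black row.
page = np.full((8, 10), 255, dtype=np.uint8)
page[2, :] = 0
print(horizontal_projection(page, n_bins=4))
```

A profile like this could then be passed to any standard classifier to separate TOC pages from non-TOC pages; which classifier the authors used is not stated in the abstract.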