TOC Structure Extraction from OCR-ed Books.

Caihua Liu, Jiajun Chen, Xiaofeng Zhang, Jie Liu, Yalou Huang
DOI: https://doi.org/10.1007/978-3-642-35734-3_8
2012-01-01
Abstract: This paper addresses the task of extracting the table of contents (TOC) from OCR-ed books. Because the OCR process loses much of the layout and structural information, OCR output alone cannot support navigation. A TOC is needed to provide a convenient and quick way to locate the content of interest. In this paper, we propose a hybrid method to extract the TOC, composed of a rule-based method and an SVM-based method. The rule-based method mainly focuses on discovering the TOC from books with TOC pages, while the SVM-based method is employed to handle books without TOC pages. Experimental results indicate that the proposed methods achieve performance comparable to that of the other participants in the ICDAR 2011 Book Structure Extraction competition. © Springer-Verlag 2012.