Abstract: In recent years, the problem of scene text extraction from images has received extensive attention and seen significant progress. However, text extraction from scholarly figures such as plots and charts remains an open problem, in part due to the difficulty of locating irregularly placed text lines. To the best of our knowledge, the literature has not described the implementation of a text extraction system for scholarly figures that adapts deep convolutional neural networks used for scene text detection. In this paper, we propose a text extraction approach for scholarly figures that forgoes preprocessing in favor of using a deep convolutional neural network for text line localization. Our system uses a publicly available scene text detection approach whose network architecture is well suited to text extraction from scholarly figures. Training data are derived from charts in arXiv papers, which are extracted using the Allen Institute's pdffigures tool. Because this tool treats the PDF as a container format and recovers text locations from the same mechanisms that render the text, we were able to gather a large set of labeled training samples. We show significant improvement over methods in the literature and discuss the structural changes to the text extraction pipeline.

A Neural Approach for Text Extraction from Scholarly Figures