Automatic Document Metadata Extraction Based on Deep Networks.
Runtao Liu,Liangcai Gao,Dong An,Zhuoren Jiang,Zhi Tang
DOI: https://doi.org/10.1007/978-3-319-73618-1_26
2017-01-01
Abstract:Metadata information extraction from academic papers is of great value to many applications such as scholar search, digital library, and so on. This task has attracted much attention from researchers in the past decades, and many templates-based or statistical machine learning (e.g. SVM, CRF, etc.)-based extraction methods have been proposed, while this task is still a challenge because of the variety and complexity of page layout. To address this challenge, we try introducing the deep learning networks to this task in this paper, since deep learning has shown great power in many areas like computer vision (CV) and natural language processing (NLP). Firstly, we employ the deep learning networks to model the image information and the text information of paper headers respectively, which allow our approach to perform metadata extraction with little information loss. Then we formulate the problem, metadata extraction from a paper header, as two typical tasks of different areas: object detection in the area of CV, and sequence labeling in the area of NLP. Finally, the two deep networks generated from the above two tasks are combined together to give extraction results. The primary experiments show that our approach achieves state-of-the-art performance on several open datasets. At the same time, this approach can process both image data and text data, and does not need to design any classification feature.