Contextual Modeling for Logical Labeling of PDF Documents

X. Tao, Z. Tang, C. Xu
DOI: https://doi.org/10.1016/j.compeleceng.2014.01.005
IF: 4.152
2014-01-01
Computers & Electrical Engineering
Abstract: Documents in the widely used Portable Document Format (PDF) are layout-oriented and poorly suited to mobile applications. In this paper, a Conditional Random Fields (CRF) based model is proposed to learn the latent semantics of PDF page content. Local and contextual observations constructed from PDF attributes are incorporated to help determine semantic roles. The observations are designed to work across documents of different styles. A local classifier is first used to generate posterior probabilities. These local estimates are then fed to the CRF model for joint classification. The experimental results confirm the positive effect of contextual information on logical labeling. Our work shows the potential of existing born-digital, fixed-layout documents for use in mobile applications.
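The two-stage idea in the abstract (local posteriors first, then joint classification with context) can be illustrated with a minimal sketch. It assumes a linear chain over page blocks in reading order and uses Viterbi decoding; the abstract does not state the CRF topology, features, or label set, so the function name `viterbi_decode`, the labels, and all probabilities below are hypothetical and chosen only for illustration.

```python
import math

EPS = 1e-12  # floor to avoid log(0)


def _log(x):
    return math.log(max(x, EPS))


def viterbi_decode(local_posteriors, transition, labels):
    """Most likely label sequence for a linear chain (an assumed topology).

    local_posteriors: one dict per block, label -> P(label | local features),
                      standing in for the output of the local classifier.
    transition: dict (prev_label, label) -> positive compatibility weight,
                standing in for the contextual part of the model.
    """
    if not local_posteriors:
        return []
    # scores[l] = best log-score of any path ending in label l
    scores = {l: _log(local_posteriors[0].get(l, 0.0)) for l in labels}
    backpointers = []
    for post in local_posteriors[1:]:
        new_scores, pointers = {}, {}
        for cur in labels:
            best_prev = max(
                labels,
                key=lambda p: scores[p] + _log(transition.get((p, cur), 0.0)),
            )
            new_scores[cur] = (
                scores[best_prev]
                + _log(transition.get((best_prev, cur), 0.0))
                + _log(post.get(cur, 0.0))
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label
    path = [max(labels, key=lambda l: scores[l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical data: the local classifier slightly prefers "title" for block 2.
labels = ["title", "paragraph", "caption"]
posteriors = [
    {"title": 0.80, "paragraph": 0.15, "caption": 0.05},
    {"title": 0.45, "paragraph": 0.40, "caption": 0.15},
    {"title": 0.05, "paragraph": 0.90, "caption": 0.05},
]
transition = {
    ("title", "title"): 0.05, ("title", "paragraph"): 0.85, ("title", "caption"): 0.10,
    ("paragraph", "title"): 0.10, ("paragraph", "paragraph"): 0.80, ("paragraph", "caption"): 0.10,
    ("caption", "title"): 0.10, ("caption", "paragraph"): 0.80, ("caption", "caption"): 0.10,
}
print(viterbi_decode(posteriors, transition, labels))
# -> ['title', 'paragraph', 'paragraph']
```

In this toy case, taking the per-block maximum would label the second block "title", while joint decoding relabels it "paragraph" because two consecutive titles are given a low weight. That is the kind of contextual correction the abstract attributes to the CRF stage, though the paper's actual model may differ.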