Converting PDF to HTML Approach Based on Text Detection

Deliang Jiang, Xiaohu Yang
DOI: https://doi.org/10.1145/1655925.1656103
2009-01-01
Abstract: Converting a PDF document to an HTML document with the same layout is an important and interesting research problem. After conversion, the PDF document can easily be browsed online and its information extracted. Building on the text extraction results of the open-source tool PDFBox, the paper describes a method that detects the layout information of a PDF document and converts the document to an HTML page effectively.
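
The abstract does not spell out the detection algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a text extractor has already produced positioned text runs; the `Fragment` shape, the baseline `tolerance`, and the default page size are hypothetical choices made for this example. Runs that share a baseline are grouped into lines, and each line is emitted as an absolutely positioned HTML element.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Fragment:
    """A positioned text run (hypothetical shape, not PDFBox's own type)."""
    text: str
    x: float     # left edge, in points
    y: float     # baseline distance from the top of the page, in points
    size: float  # font size, in points


def group_into_lines(fragments, tolerance=2.0):
    """Group fragments whose baselines lie within `tolerance` points."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Compare against the first fragment of the most recent line.
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Order each line left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def to_html(fragments, width=612, height=792):
    """Emit one absolutely positioned <div> per detected text line."""
    parts = [
        f'<div style="position:relative;width:{width}pt;height:{height}pt">'
    ]
    for line in group_into_lines(fragments):
        first = line[0]
        text = " ".join(escape(f.text) for f in line)
        top = first.y - first.size  # rough top of the line box from the baseline
        parts.append(
            f'<div style="position:absolute;left:{first.x}pt;'
            f'top:{top}pt;font-size:{first.size}pt">{text}</div>'
        )
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = [
        Fragment("World", 120.0, 100.5, 12.0),
        Fragment("Hello", 72.0, 100.0, 12.0),
        Fragment("Next <line>", 72.0, 120.0, 12.0),
    ]
    print(to_html(sample))
```

Absolute positioning is the simplest way to preserve the original layout; a real converter would also need to handle fonts, columns, tables, and images, none of which this sketch attempts.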