Optimized Hierarchy Clustering Based Extraction for Logical Document Structures

ZHANG Kuo, XU Peng, LI Juanzi, WANG Kehong
DOI: https://doi.org/10.3321/j.issn:1000-0054.2005.04.013
2005-01-01
Abstract: Automatic identification of logical structures in semi-structured documents enables reading by browsing and the reuse of content components. CEDLS, a method developed for loosely structured documents, extracts their logical structures using an optimized hierarchical clustering algorithm. The method first identifies the characteristic information and selects features that reflect the logical structure, then applies an improved hierarchical clustering algorithm to extract the hierarchical logical structure. Tests on annual reports from the Shanghai Stock Exchange illustrate the precision and robustness of the method.
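The two-step idea in the abstract (select features, then cluster hierarchically) can be illustrated with a minimal sketch. This is not the CEDLS algorithm, whose features and optimizations are not given here. It is plain average-linkage agglomerative clustering over invented layout features (font size, bold flag, indent level). The sample blocks, the function names, and the rule that ranks clusters by mean font size are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical text blocks: (text, font_size, is_bold, indent_level).
# The data and the choice of features are invented for this example.
BLOCKS = [
    ("1. Company Profile", 16.0, 1, 0),
    ("The company was founded ...", 10.0, 0, 2),
    ("2. Financial Summary", 16.0, 1, 0),
    ("Revenue increased ...", 10.0, 0, 2),
    ("2.1 Key Indicators", 13.0, 1, 1),
]


def feature_vector(block):
    """Turn a block into a numeric vector (assumed features, unweighted)."""
    _, size, bold, indent = block
    return (float(size), float(bold), float(indent))


def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Returns a list of clusters, each a sorted list of block indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            # Mean pairwise Euclidean distance between the two clusters.
            d = sum(math.dist(vectors[a], vectors[b]) for a in ci for b in cj)
            d /= len(ci) * len(cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best  # i < j, so deleting index j leaves i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def assign_levels(vectors, clusters):
    """Rank clusters by mean font size (assumed heuristic): largest = level 1."""
    def mean_size(cluster):
        return sum(vectors[i][0] for i in cluster) / len(cluster)

    ranked = sorted(clusters, key=mean_size, reverse=True)
    levels = [0] * len(vectors)
    for level, cluster in enumerate(ranked, start=1):
        for i in cluster:
            levels[i] = level
    return levels


vectors = [feature_vector(b) for b in BLOCKS]
clusters = agglomerate(vectors, k=3)
levels = assign_levels(vectors, clusters)
for (text, *_), level in zip(BLOCKS, levels):
    print("  " * (level - 1) + text)
```

Blocks that share a cluster get the same logical level, so the printed indentation approximates an outline of the document. A real system would need to choose the number of levels from the data and to weight the features, which this sketch does not attempt.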