Functional Structure Identification of Scientific Documents in Computer Science.

Wei Lu, Yong Huang, Yi Bu, Qikai Cheng
DOI: https://doi.org/10.1007/s11192-018-2640-y
IF: 3.801
2018-01-01
Scientometrics
Abstract: The increasing number of open-access full-text scientific documents promotes a shift from metadata-based to content-based studies, which are more detailed and semantically rich. Along with the benefits of ample data, the confusing internal structure of these documents makes data organization and analysis difficult. Each unit in a scientific document has its own function in expressing the authors' research ideas, such as introducing motivations, describing methods, stating related work, and drawing conclusions; these functions can be used to identify the functional structure of scientific documents. This paper first proposes a clustering method to generate domain-specific structures based on high-frequency section headers in the scientific documents of a domain. To automatically identify the structure of scientific documents, we categorize them into three types: (1) strong-structure documents; (2) weak-structure documents; and (3) no-structure documents. We further divide the identification into three levels (section header-based, section content-based, and paragraph-based identification), corresponding to the three types of documents. Our experiments on documents in the field of computer science show that: (1) section header-based identification is the simplest and most direct method, but its accuracy is limited by unknown words in section headers; (2) section content-based identification is more stable and performs well; and (3) paragraph-based identification is promising for identifying functions in no-structure documents. Additionally, we apply our methods to two tasks: academic search and keyword extraction. Both tasks demonstrate the effectiveness of functional structure.
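As a rough illustration of the first level, section header-based identification, the sketch below maps a section header to a functional label by keyword lookup. It is not the authors' implementation: the function name `identify_by_header`, the label set, and the keyword lists are all hypothetical, and the paper's clustering of high-frequency section headers is not reproduced here. The `None` result for an unmatched header illustrates the limitation the abstract mentions, namely that accuracy suffers when headers contain unknown words.

```python
import re

# Hypothetical keyword lists chosen for illustration only; the paper derives
# its domain-specific structure by clustering high-frequency section headers.
HEADER_KEYWORDS = {
    "introduction": {"introduction", "motivation", "background"},
    "related work": {"related", "literature", "previous"},
    "method": {"method", "methods", "methodology", "approach", "model"},
    "conclusion": {"conclusion", "conclusions", "summary"},
}


def identify_by_header(header):
    """Return a functional label for a section header, or None if unknown."""
    # Lowercase and keep alphabetic tokens only, dropping numbering such as "3.1".
    tokens = set(re.findall(r"[a-z]+", header.lower()))
    # Labels are checked in insertion order; the first keyword overlap wins.
    for label, keywords in HEADER_KEYWORDS.items():
        if tokens & keywords:
            return label
    # No known word in the header: this level cannot decide.
    return None
```

A header such as "3.1 Proposed Approach" would be labeled "method" under these assumed keywords, while a header made up only of unfamiliar words returns `None` and would have to be passed to the content-based or paragraph-based levels.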