Automatic Labeling of Semantic Clauses in Research Articles

王越千,黄文彬,车尚锟,步一
DOI: https://doi.org/10.3772/j.issn.1000-0135.2021.06.007
2021-01-01
Abstract: Analyzing the semantic structure of research articles supports tasks such as information extraction and retrieval. This paper describes the semantic structure of research articles by applying machine learning techniques to recognize the semantic types of their discourse segments. We extracted the macro structure of research articles, including the syntactic and lexical information of each discourse segment, as input features, and trained five models: support vector machines (SVM), conditional random fields (CRF), random forests (RF), a gradient boosting classifier (GBC), and a stochastic gradient descent classifier (SGD). We integrated the three best-performing models (CRF, SVM, and GBC) into a bagging model for classifying all discourse segments in the full text. Experimental results showed that our bagging model outperformed the baseline model in classifying discourse segments from both the full text and the results sections, achieving higher accuracy and F-score. Furthermore, a topic-clustering experiment demonstrated the effectiveness of the model for topic detection, a common task in text mining.
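The abstract does not say how the three models' outputs are combined, so the following is only an illustrative sketch that assumes a simple per-segment majority vote. The function name `majority_vote`, the tie-breaking rule (fall back to the first model in a priority list), and the label names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions_per_model, priority):
    """Combine per-segment labels from several models by majority vote.

    predictions_per_model: dict mapping a model name to its list of labels,
        one label per discourse segment (all lists must have equal length).
    priority: model names in tie-breaking order (an assumption of this
        sketch, not something the abstract specifies).
    """
    lengths = {len(labels) for labels in predictions_per_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must label the same number of segments")
    n_segments = lengths.pop()

    combined = []
    for i in range(n_segments):
        votes = Counter(labels[i] for labels in predictions_per_model.values())
        top_label, top_count = votes.most_common(1)[0]
        tied = [label for label, count in votes.items() if count == top_count]
        if len(tied) > 1:
            # No single winner: use the label from the highest-priority model.
            top_label = predictions_per_model[priority[0]][i]
        combined.append(top_label)
    return combined


# Hypothetical labels for four discourse segments from each model.
predictions = {
    "CRF": ["background", "method", "result", "result"],
    "SVM": ["background", "method", "method", "conclusion"],
    "GBC": ["method", "method", "result", "background"],
}
combined = majority_vote(predictions, priority=["CRF", "SVM", "GBC"])
print(combined)  # ['background', 'method', 'result', 'result']
```

In the last segment all three models disagree, so the sketch falls back to the CRF label. A weighted vote or a probability average would be equally plausible readings of "integrated".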