Information Extraction from Chinese Research Papers Based on Conditional Random Fields

Jiang-de Yu, Xiao-zhong Fan, Ji-hao Yin
DOI: https://doi.org/10.3321/j.issn:1000-565x.2007.09.019
2007-01-01
Abstract: Header and citation information from research papers is needed by many applications, such as field-based paper search, paper statistics, and citation analysis. The hidden Markov model (HMM) greatly restricts the use of context features in information extraction. To make better use of such features, a method based on conditional random fields (CRFs) is proposed for extracting header and citation information from Chinese research papers. The key issues in the method are parameter estimation and feature selection: in the experiments, the L-BFGS algorithm is used to estimate the model parameters, and four categories of features (location, layout, lexicon, and state transition) are selected as the model's feature set. During extraction, format information such as list separators and special labels is first used to segment the text, and CRFs are then applied to extract the specific fields. Experimental results show that the proposed method outperforms the HMM-based method, and that the size of the improvement varies with the feature set used.
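The two-stage idea in the abstract (segment by format cues, then label the segments with a CRF) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the separator set, the function names `segment`, `chunk_features`, and `featurize`, and the individual features are hypothetical, since the abstract names only the feature categories (location, layout, lexicon, state transition) and not their definitions. The sketch covers segmentation and per-segment feature construction only; state-transition features and L-BFGS training would be handled by whichever CRF toolkit consumes these feature dictionaries.

```python
import re

# Hypothetical separator set: the abstract mentions list separators
# but does not enumerate them.
SEPARATORS = r"[,，;；.。]"


def segment(text):
    """Split a header or citation line into chunks at separator characters."""
    return [c.strip() for c in re.split(SEPARATORS, text) if c.strip()]


def chunk_features(chunks, i):
    """Build an illustrative feature dictionary for the i-th chunk."""
    chunk = chunks[i]
    return {
        # Location-style features: where the chunk sits in the line.
        "position": i / max(len(chunks) - 1, 1),
        "is_first": i == 0,
        "is_last": i == len(chunks) - 1,
        # Layout-style features: surface shape of the chunk.
        "length": len(chunk),
        "has_digit": any(ch.isdigit() for ch in chunk),
        # Lexicon-style feature: a simple pattern standing in for a word list.
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", chunk)),
    }


def featurize(text):
    """Return the chunks of a line and one feature dictionary per chunk."""
    chunks = segment(text)
    return chunks, [chunk_features(chunks, i) for i in range(len(chunks))]


if __name__ == "__main__":
    chunks, feats = featurize("Yu J, Fan X. Title of paper. Journal, 2007")
    for chunk, feat in zip(chunks, feats):
        print(chunk, feat)
```

A sequence of such dictionaries, one per chunk, is the kind of input a linear-chain CRF implementation typically accepts, with field labels such as author, title, journal, and year as the targets. Splitting on the ASCII period is a simplification that would break abbreviations and decimal numbers in real citations.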