A semi-structured information semantic annotation method for Web pages

Lu Zhang, Tiantian Wang, Yiran Liu, Qingling Duan
DOI: https://doi.org/10.1007/s00521-018-03999-5
2019-01-01
Neural Computing and Applications
Abstract: Web pages contain a large amount of semi-structured information. Annotating this information comprehensively and accurately with uniform semantics increases its value and supports the integration of information across Web sites. Based on the characteristics of semi-structured information on Web pages, a semantic annotation method that combines header recognition and data item classification is proposed. First, a description model is constructed for the domain to be annotated. Second, header recognition is used to annotate data items on the extracted pages. For data items that header recognition fails to annotate, feature vectors are constructed from the feature sets in the domain description model, and the semantics of those data items are assigned according to the classification results of a back-propagation neural network. The proposed method is tested on 19,657 data items in the domain of agricultural product prices and 8,089 data items in the domain of recruitment information. The annotation precision is 97.39% and 95.67%, respectively, and the annotation recall is 95.41% and 95.67%, respectively. These results show that the proposed method can annotate semi-structured information on Web pages accurately and completely.
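The two-stage flow described in the abstract (header recognition first, classification as a fallback) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `HEADER_SYNONYMS` table stands in for the domain description model, `feature_vector` uses three made-up features, and `stand_in_classifier` is a simple rule-based placeholder where the paper uses a trained back-propagation neural network.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical domain description model: semantic label -> known header synonyms.
HEADER_SYNONYMS: Dict[str, List[str]] = {
    "product_name": ["product", "name", "variety"],
    "price": ["price", "average price"],
    "date": ["date", "release date"],
}


def recognize_header(header: str) -> Optional[str]:
    """Return the label whose synonym list contains the header, else None."""
    key = header.strip().lower()
    for label, synonyms in HEADER_SYNONYMS.items():
        if key in synonyms:
            return label
    return None


def feature_vector(value: str) -> List[float]:
    """Hypothetical features: digit ratio, ISO-date pattern flag, text length."""
    text = value.strip()
    digits = sum(ch.isdigit() for ch in text)
    return [
        digits / len(text) if text else 0.0,
        1.0 if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) else 0.0,
        float(len(text)),
    ]


def stand_in_classifier(features: List[float]) -> str:
    """Rule-based placeholder for a trained classifier (not a neural network)."""
    digit_ratio, is_date, _length = features
    if is_date:
        return "date"
    if digit_ratio > 0.5:
        return "price"
    return "product_name"


def annotate(
    headers: List[str],
    rows: List[List[str]],
    classify: Callable[[List[float]], str] = stand_in_classifier,
) -> List[Dict[str, str]]:
    """Label each data item by its column header, falling back to the classifier."""
    labels = [recognize_header(h) for h in headers]
    annotated = []
    for row in rows:
        record: Dict[str, str] = {}
        for label, value in zip(labels, row):
            if label is None:
                # Header recognition failed: classify the item from its features.
                label = classify(feature_vector(value))
            record[label] = value
        annotated.append(record)
    return annotated
```

Calling `annotate(["Variety", "???", "Release Date"], [["Cabbage", "2.40", "2018-11-02"]])` labels the first and third columns from their headers and sends the middle column to the fallback classifier. Passing the classifier in as a callable keeps the sketch independent of any particular model, so a trained network could replace the placeholder without changing the annotation loop.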