Extract Core Toponyms from Web Page Text Based on Link Analysis
Xiang ZHONG, Yong GAO, Lun WU
DOI: https://doi.org/10.3724/SP.J.1047.2016.00435
2016-01-01
Abstract: Geographic information has grown explosively with the emergence of the Internet, which has also brought new ways of obtaining geospatial data alongside traditional GIS methods. To exploit the abundant geospatial information on the web, we propose a toponym co-occurrence network model: toponym entities are extracted from web page text using natural language processing methods and then normalized, so that the web pages can be analyzed comprehensively. The network built in this paper is a weighted directed graph in which every vertex represents a distinct toponym and the co-occurrence of two toponyms is represented by an edge. The frequencies of the geographic names are taken into account together to determine the weight of each edge, which reflects the co-occurrence relationships and transition characteristics among the toponyms. On this basis, we present a link-analysis method for extracting toponyms from web page text. It uses the PageRank algorithm to calculate the link weight of every toponym in the co-occurrence network and ranks each geographic name by its PageRank score. In this way, the importance of each toponym is quantified, and core geographic names with salient or navigational features can be found across large volumes of web resources. A case study on data extracted from People's Daily and Sina News Sport web pages verifies the approach and shows that it is feasible and effective in practice, and that it can also be applied to geographic information retrieval. The results show that the core toponyms of the co-occurrence network differ across web page themes and that, when time sequence is taken into account, the core toponyms may also differ within a single theme.
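The pipeline described in the abstract (a weighted directed co-occurrence graph ranked with PageRank) can be illustrated with a minimal Python sketch. The abstract does not state how edge directions and weights are derived, so the sketch makes an assumption: an edge a -> b is added whenever toponym b appears after toponym a on the same page, and its weight is the number of such ordered pairs. The function names `build_cooccurrence_graph` and `weighted_pagerank`, the damping factor of 0.85, and the sample pages are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Build {source: {target: weight}} from per-page toponym lists.

    Assumption (not from the paper): an edge a -> b is added when b appears
    after a on the same page; the weight counts such ordered pairs.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for toponyms in documents:
        # combinations() preserves the order of appearance within a page.
        for a, b in combinations(toponyms, 2):
            if a != b:
                graph[a][b] += 1.0
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=200, tol=1e-10):
    """Weighted PageRank by power iteration; returns {toponym: score}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {v: 1.0 / n for v in nodes}
    out_weight = {
        v: sum(graph[v].values()) if v in graph else 0.0 for v in nodes
    }

    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }
        for src in nodes:
            if out_weight[src] == 0.0:
                continue
            # Each vertex passes its rank along edges in proportion to weight.
            for dst, w in graph[src].items():
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical toponym lists, one per page, already extracted and normalized.
    pages = [
        ["Beijing", "Shanghai"],
        ["Guangzhou", "Beijing"],
        ["Shanghai", "Beijing"],
        ["Shenzhen", "Beijing"],
    ]
    scores = weighted_pagerank(build_cooccurrence_graph(pages))
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.4f}")
```

In this sketch the toponyms with the highest scores play the role of the core toponyms. Running it separately on pages grouped by theme or by time window would be one way to reproduce the kind of comparison the abstract reports, under the same assumptions.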