Parallel webpage obtaining method and parallel webpage obtaining device

Liu Qi, Liu Yang, Sun Maosong
2013-01-01
Abstract: The invention discloses a method and a device for obtaining parallel webpages, and belongs to the field of text information processing. The method performs synchronized recursive access to the parallel webpages of a parallel website by exploiting hypertext markup language (HTML) structure, optimizes the path taken to traverse the parallel website by means of uniform resource locator (URL) naming patterns, and uses a classifier to judge candidate parallel webpages. For each webpage pair judged to be parallel, the naming pattern corresponding to the URLs of the pair is learned, the bilingual text in the pair is extracted together with the lower-level candidate parallel webpage link pairs that the pair points to, and a priority queue of candidate link pairs is built using the learned URL patterns. The method then judges whether the search should end; once it ends, the search for the parallel webpages of the website and the mining of bilingual text are complete. The invention also provides a corresponding parallel webpage obtaining device. By combining URL naming patterns with HTML structure, the method and device search for and obtain parallel webpages efficiently and accurately, while improving processing speed and reducing bandwidth consumption.
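The loop described in the abstract (classify a candidate pair, learn its URL naming pattern, extract lower-level link pairs, prioritize them) can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name here (`url_pattern`, `CandidateQueue`, `crawl`, `is_parallel`, `extract_link_pairs`, `max_pairs`) is hypothetical; the pattern representation (the substring that differs between two URLs) and the stopping rule (empty queue or a pair limit) are assumptions, since the abstract does not specify either. The classifier and the HTML-based link extraction are passed in as plain callables.

```python
import heapq
from collections import Counter


def url_pattern(src_url, tgt_url):
    """Return the (src, tgt) substrings that differ between two URLs, or None.

    Assumption: a naming pattern is modeled as the differing middle part
    left after stripping the common prefix and common suffix.
    """
    if src_url == tgt_url:
        return None
    shortest = min(len(src_url), len(tgt_url))
    i = 0
    while i < shortest and src_url[i] == tgt_url[i]:
        i += 1
    j = 0
    while j < shortest - i and src_url[-1 - j] == tgt_url[-1 - j]:
        j += 1
    return src_url[i:len(src_url) - j], tgt_url[i:len(tgt_url) - j]


class CandidateQueue:
    """Priority queue of candidate link pairs, ranked by learned URL patterns."""

    def __init__(self):
        self.patterns = Counter()  # pattern -> number of confirmed pairs
        self._heap = []
        self._order = 0  # tie-breaker that keeps insertion order

    def __len__(self):
        return len(self._heap)

    def learn(self, src_url, tgt_url):
        pattern = url_pattern(src_url, tgt_url)
        if pattern is not None:
            self.patterns[pattern] += 1

    def push(self, src_url, tgt_url):
        # Score is fixed at push time: pairs matching a frequent pattern first.
        score = self.patterns[url_pattern(src_url, tgt_url)]
        heapq.heappush(self._heap, (-score, self._order, src_url, tgt_url))
        self._order += 1

    def pop(self):
        _, _, src_url, tgt_url = heapq.heappop(self._heap)
        return src_url, tgt_url


def crawl(seed_pair, is_parallel, extract_link_pairs, max_pairs=1000):
    """Visit candidate pairs in priority order and return the confirmed ones.

    is_parallel(src, tgt) stands in for the classifier; extract_link_pairs
    (src, tgt) stands in for the HTML-structure-based link pair extraction.
    """
    queue = CandidateQueue()
    queue.push(*seed_pair)
    seen = {seed_pair}
    found = []
    # Assumed stopping rule: queue exhausted or pair limit reached.
    while len(queue) and len(found) < max_pairs:
        src_url, tgt_url = queue.pop()
        if not is_parallel(src_url, tgt_url):
            continue
        found.append((src_url, tgt_url))
        queue.learn(src_url, tgt_url)
        for pair in extract_link_pairs(src_url, tgt_url):
            if pair not in seen:
                seen.add(pair)
                queue.push(*pair)
    return found
```

In this sketch, once a pair such as `.../en/index.html` and `.../zh/index.html` is confirmed, later candidates whose URLs differ by the same `("en", "zh")` fragment are popped before unrelated pairs, which is one plausible way that pattern-based ordering could reduce wasted page fetches.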