A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures

LIU Qi, LIU Yang, SUN Maosong
DOI: https://doi.org/10.3969/j.issn.1003-0077.2013.03.012
2013-01-01
Abstract: Parallel corpora are a fundamental resource for statistical machine translation, cross-lingual information retrieval, and other information processing technologies. Although the amount of parallel data on the web is continually increasing, the heterogeneity and complexity of parallel websites still make collecting such parallel texts a challenge. This paper presents a new approach to mining parallel web pages that combines URL patterns with HTML structure. First, HTML structure is used to recursively visit parallel pages. Then, URL patterns are used to optimize the traversal order over the topology of the parallel website. The result is an efficient and accurate parallel page mining system. Compared with the traditional approach, experiments on two parallel websites (www.un.org and www.gov.hk) show that this approach saves more than 50% of processing time and improves accuracy by 15%, resulting in a significant increase in the translation quality of the MT system.
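The abstract does not spell out the URL patterns used, so the following is only a minimal sketch of the general idea of URL-pattern matching, under stated assumptions: that a parallel page can sometimes be found by swapping a language marker in the URL path. The function name `candidate_parallel_urls`, the marker pairs in `LANGUAGE_MARKERS`, and the example path are all hypothetical and are not taken from the paper.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical language-marker pairs (source, target). The abstract does not
# list the paper's actual patterns; these are illustrative assumptions.
LANGUAGE_MARKERS = [("en", "zh"), ("english", "chinese")]


def candidate_parallel_urls(url, markers=LANGUAGE_MARKERS):
    """Return candidate URLs made by swapping one source-language path
    segment for its target-language counterpart."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    candidates = []
    for source, target in markers:
        for i, segment in enumerate(segments):
            # Compare whole path segments only, so "en" does not match "energy".
            if segment.lower() == source:
                swapped = segments[:i] + [target] + segments[i + 1:]
                new_path = "/".join(swapped)
                candidates.append(urlunsplit(parts._replace(path=new_path)))
    return candidates


if __name__ == "__main__":
    # Illustrative path only; not a claim about the real site layout.
    for candidate in candidate_parallel_urls("https://www.un.org/en/about/index.html"):
        print(candidate)
```

In a crawler of the kind the abstract describes, candidates like these could be tried first, before falling back to a comparison of the HTML structure of the two pages. That ordering is an assumption about how the two signals might be combined, not a description of the authors' system.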