Combining Topic Models and String Kernel for Deep Web Categorization

Guangyue Xu, Weimin Zheng, Haiping Wu, Yujiu Yang
DOI: https://doi.org/10.1109/fskd.2010.5569236
2010-01-01
Abstract: Online databases maintain collections of structured, domain-specific documents that are generated dynamically in response to user queries rather than accessed through static URLs. Categorizing such deep web sources by object domain is a critical step toward integrating them. While existing methods are supervised or post-query, we propose a more practical pre-query algorithm that operates in an unsupervised manner. Given the number of domains, our two-phase approach first uses topic models to infer the hidden domain distribution of each query form, which yields a preliminary object domain for each form. In this phase we also construct a training set from the query forms deemed to have been categorized correctly, and we select the deep web sources that need to be reclassified. In the second phase, we train a classifier with string kernel methods to reclassify the uncertain deep web sources and improve overall performance. The advantage of our algorithm over previous ones is that it captures the semantic structure of each query form. Thanks to the two-phase architecture, our framework works in an unsupervised manner and achieves satisfactory results. Experiments on the TEL-8 dataset from the UIUC Web integration repository show the effectiveness and efficiency of our algorithm.
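The two-phase idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the per-form domain distributions are taken as given (in practice they might come from a topic model such as LDA), the confidence threshold of 0.8 is arbitrary, the string kernel is a normalized p-spectrum kernel with p = 3, and a simple mean-similarity rule stands in for the kernel classifier trained in the paper. The function names and the toy query forms are hypothetical.

```python
from collections import Counter
from math import sqrt


def spectrum_kernel(s, t, p=3):
    """Normalized p-spectrum string kernel (an assumed choice of string kernel):
    cosine similarity between the counts of length-p substrings of s and t."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    dot = sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())
    norm = sqrt(sum(v * v for v in cs.values()) * sum(v * v for v in ct.values()))
    return dot / norm if norm else 0.0


def split_by_confidence(topic_dists, threshold=0.8):
    """Phase 1: label each form with its most probable domain and keep it as
    training data only if that probability reaches the (assumed) threshold."""
    confident, uncertain = {}, []
    for i, dist in enumerate(topic_dists):
        label = max(range(len(dist)), key=dist.__getitem__)
        if dist[label] >= threshold:
            confident[i] = label
        else:
            uncertain.append(i)
    return confident, uncertain


def reclassify(forms, confident, uncertain, p=3):
    """Phase 2 stand-in: assign each uncertain form to the domain whose
    confident forms have the highest mean kernel similarity to it."""
    if not confident:
        raise ValueError("no confidently labeled forms to learn from")
    result = dict(confident)
    for i in uncertain:
        scores = {}
        for j, label in confident.items():
            scores.setdefault(label, []).append(spectrum_kernel(forms[i], forms[j], p))
        result[i] = max(scores, key=lambda d: sum(scores[d]) / len(scores[d]))
    return result


# Toy query forms, flattened to their field labels (hypothetical data).
forms = [
    "title author isbn publisher",
    "make model year price",
    "author title keyword isbn",
    "model make mileage",
]
# Hypothetical per-form domain distributions, as a topic model might output.
topic_dists = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45], [0.40, 0.60]]

confident, uncertain = split_by_confidence(topic_dists)
labels = reclassify(forms, confident, uncertain)
```

In this toy run the first two forms pass the threshold and act as the training set, while the last two are reclassified by substring overlap with them. A kernel SVM trained on the confident set would be a closer match to "train a classifier with string kernel methods", but the abstract does not say which classifier or kernel variant is used.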