An Improved Method for Deep Web Sources Classification Based on the Theme and Form Attributes
ZHU Guan-wen, WANG Nian-bin, WANG Hong-bin
DOI: https://doi.org/10.3969/j.issn.0372-2112.2013.02.009
2013-01-01
Abstract: The Deep Web contains vast amounts of high-quality information and is growing rapidly. However, because its sources are distributed, heterogeneous, and autonomous, users face great difficulty in obtaining the information they are interested in efficiently and quickly. Organizing Deep Web data sources by their real-world domains is the foundation for addressing this challenge. In this paper, based on statistics and analysis of more than 200 data sources from four domains (Airfares, Books, Automobiles, and Real Estate), we propose a novel classification method and an improved similarity measure for query interfaces. Both make full use of theme information and form attributes to automatically classify large numbers of Deep Web sources. In addition, we present a strategy of tagging query interfaces to reduce the influence of randomly chosen initial centers. The experimental results indicate that the method is effective and achieves higher accuracy.
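The abstract does not give the similarity formula or the clustering procedure, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each query interface is reduced to a set of theme terms and a set of form attribute labels, combines two Jaccard similarities with a hypothetical weight `w`, and uses a few tagged interfaces per domain as fixed seeds in place of randomly chosen initial centers. The names `Interface`, `similarity`, and `classify`, as well as the sample data, are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """A query interface reduced to two term sets (an assumed representation)."""
    name: str
    theme: frozenset   # terms describing the page theme, e.g. from title or text
    attrs: frozenset   # labels of the form attributes, e.g. "author", "isbn"


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(x, y, w=0.5):
    """Weighted mix of theme and form-attribute similarity (w is an assumption)."""
    return w * jaccard(x.theme, y.theme) + (1 - w) * jaccard(x.attrs, y.attrs)


def classify(seeds, unlabeled, w=0.5, max_iter=10):
    """Assign each unlabeled interface to the domain whose current members
    are most similar to it on average. Tagged seeds stay fixed in their
    domain, so no initial center is chosen at random."""
    assignment = {}
    for _ in range(max_iter):
        # Rebuild clusters from the fixed seeds plus the previous assignment.
        clusters = {domain: list(members) for domain, members in seeds.items()}
        for item, domain in assignment.items():
            clusters[domain].append(item)

        new_assignment = {}
        for item in unlabeled:
            def avg_sim(domain):
                others = [m for m in clusters[domain] if m is not item]
                if not others:
                    return 0.0
                return sum(similarity(item, m, w) for m in others) / len(others)
            new_assignment[item] = max(clusters, key=avg_sim)

        if new_assignment == assignment:  # stable: stop iterating
            break
        assignment = new_assignment
    return {item.name: domain for item, domain in assignment.items()}


# Toy data, invented for illustration only.
seeds = {
    "Books": [Interface("b1", frozenset({"book", "store"}),
                        frozenset({"title", "author", "isbn"}))],
    "Airfares": [Interface("a1", frozenset({"flight", "travel"}),
                           frozenset({"from", "to", "departure date"}))],
}
unlabeled = [
    Interface("u1", frozenset({"book", "shop"}), frozenset({"title", "author"})),
    Interface("u2", frozenset({"cheap", "flight"}),
              frozenset({"from", "to", "return date"})),
]
result = classify(seeds, unlabeled)
```

In this toy run `u1` shares theme and attribute terms only with the Books seed and `u2` only with the Airfares seed, so each lands in the corresponding domain. A real system would need term normalization and a tuned weight, neither of which the abstract specifies.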