Illegal website identification method based on template detection

Hanlong Zhang, Beijun Shen, Yongjian Wang
DOI: https://doi.org/10.14177/j.cnki.32-1397n.2015.39.03.003
2015-01-01
Abstract: A new method is proposed to identify illegal websites efficiently. Essential information extracted from HTTP POST requests is hashed; the degree of similarity between websites is measured by matching hash values; unknown websites are classified using illegal-website templates extracted by clustering from a large uncategorized corpus. Identification efficiency is improved by filtering out legal websites using graph mining. The method is tested at scale on gambling websites in a real environment. The results show that the method identifies gambling websites with a precision of 1; compared with URL, HTML and semantic features, HTTP POST features give the best F-measure; legal websites can be filtered effectively using graph mining, and operational efficiency can be improved by 20%.
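The abstract does not say which POST fields are hashed or how similarity is computed, so the following is only a minimal sketch of the general idea under stated assumptions: each POST request is reduced to its path plus its sorted form-field names (values are discarded), that string is hashed with SHA-1, and two sites are compared by the Jaccard overlap of their hash sets. The function names, the choice of SHA-1, the Jaccard measure and the 0.8 threshold are all hypothetical and are not taken from the paper.

```python
import hashlib
from urllib.parse import parse_qsl


def post_signature(path, body):
    """Hash one POST request: its path plus sorted form-field names.

    Assumption: field values are dropped so that sites built from the
    same template produce the same signature regardless of user input.
    """
    keys = sorted({k for k, _ in parse_qsl(body, keep_blank_values=True)})
    raw = path + "?" + "&".join(keys)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def site_fingerprint(requests):
    """Set of signatures for all (path, body) POST requests seen on a site."""
    return {post_signature(path, body) for path, body in requests}


def similarity(a, b):
    """Jaccard overlap of two fingerprints (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(fingerprint, templates, threshold=0.8):
    """Return the best-matching template name, or None below the threshold."""
    best_name, best_score = None, 0.0
    for name, template in templates.items():
        score = similarity(fingerprint, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical traffic: two sites sharing a template, one unrelated site.
site_a = [("/login.do", "username=alice&password=x"),
          ("/bet.do", "gameId=3&amount=100")]
site_b = [("/login.do", "username=bob&password=y"),
          ("/bet.do", "gameId=9&amount=5")]
site_c = [("/search", "q=shoes")]

templates = {"gambling-template-1": site_fingerprint(site_a)}
match_b = classify(site_fingerprint(site_b), templates)
match_c = classify(site_fingerprint(site_c), templates)
```

In this sketch the templates dictionary is filled by hand; in the method described above the templates would instead come from clustering an uncategorized corpus, which is not shown here.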