Density-Based Clustering Algorithm for Hybrid Coding Detection in Search Engines

Zhang Cheng, Zhang Qifei, Pan Xuezeng, Zhu Xuhui
DOI: https://doi.org/10.16337/j.1004-9037.2011.01.019
2011-01-01
Abstract: Aimed at Chinese HTML documents with hybrid (mixed) character encodings on the Internet, this paper studies the character-encoding composition of Chinese HTML files and clusters the contents of hybrid-coded files. The HTML files are separated into several classes using the classical data mining algorithm DBSCAN. After clustering, the encoding of each class is detected based on its encoding features. Experimental results show that, with appropriately selected parameters, the proportion of classes consistent with Chinese character encoding features reaches 100%. The method can be applied in search engines.
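As a rough illustration of the two-stage idea in the abstract (cluster first, then detect the encoding of each class), the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the features, the distance measure, the DBSCAN parameters, or the detection rule. The segmentation into byte strings, the two byte-level features, the values eps=0.2 and min_pts=2, the candidate list ("utf-8", "gbk"), and the trial-decoding step are all assumptions made here, and every function name is hypothetical.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def utf8_valid_fraction(data: bytes) -> float:
    """Share of non-ASCII bytes that sit inside structurally valid UTF-8
    multi-byte sequences (assumed feature; overlong forms are not rejected)."""
    high = sum(1 for b in data if b >= 0x80)
    if high == 0:
        return 0.0
    valid, i = 0, 0
    while i < len(data):
        b = data[i]
        if 0xC2 <= b <= 0xDF:
            n = 1          # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            n = 2          # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            n = 3          # lead byte of a 4-byte sequence
        else:
            n = 0          # ASCII, stray continuation byte, or invalid lead
        tail = data[i + 1:i + 1 + n]
        if n and len(tail) == n and all(0x80 <= t <= 0xBF for t in tail):
            valid += n + 1
            i += n + 1
        else:
            i += 1
    return valid / high


def features(segment: bytes) -> tuple:
    """Assumed feature vector: (non-ASCII byte ratio, UTF-8 validity ratio)."""
    high_ratio = sum(1 for b in segment if b >= 0x80) / max(len(segment), 1)
    return (high_ratio, utf8_valid_fraction(segment))


def dbscan(points, eps, min_pts):
    """Plain DBSCAN with Euclidean distance; returns one label per point."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # only core points expand the cluster
                queue.extend(reachable)
    return labels


def detect_encoding(data: bytes, candidates=("utf-8", "gbk")):
    """Trial decoding (assumed rule): first candidate that decodes strictly."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def cluster_and_detect(segments, eps=0.2, min_pts=2):
    """Cluster byte segments, then detect one encoding per cluster."""
    labels = dbscan([features(s) for s in segments], eps, min_pts)
    encodings = {}
    for c in sorted(set(labels) - {NOISE}):
        joined = b"".join(s for s, l in zip(segments, labels) if l == c)
        encodings[c] = detect_encoding(joined)
    return labels, encodings


# Toy "hybrid" document: alternating UTF-8 and GBK segments.
texts = ["你好", "中文", "你好中文"]
demo = [t.encode(enc) for t in texts for enc in ("utf-8", "gbk")]
print(cluster_and_detect(demo))
```

Trial decoding is used here only because it needs nothing beyond the standard library. On short segments the byte features are noisy, so a realistic setting would need longer segments and parameters tuned on real data, which is consistent with the abstract's remark that the result depends on selecting appropriate parameters.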