A Lightweight Graph-based Method to Detect Pornographic and Gambling Websites with Imperfect Datasets

Xiaoqing Ma, Chao Zheng, Zhao Li, Jiangyi Yin, Qingyun Liu, Xunxun Chen
DOI: https://doi.org/10.1109/trustcom56396.2022.00048
2022-01-01
Abstract: With the widespread abuse of information technology, pornographic and gambling websites have proliferated rapidly. They harm the physical and mental health of children and endanger personal property, so detecting them is necessary. However, existing detection methods overlook the fact that imperfect datasets are common in this scenario and hinder detection. These imperfections include sparse samples, mismatched datasets, and imbalanced datasets. In addition, over-reliance on visual features incurs high overhead. To overcome these shortcomings, we propose a lightweight graph-based method that detects pornographic and gambling websites through semi-supervised learning on textual content. The semi-supervised learning addresses sparse samples and mismatched datasets, while the graph-based approach combines the semi-supervised part with community discovery to handle imbalanced datasets. Specifically, the detection process uses a modified TF-IDF and the Louvain algorithm, iterating and updating via the EM algorithm. Experimental results show that our method achieves the best Macro-Avg-F1, 92.01%, with the shortest CPU time, outperforming all baselines. We also show that each designed component of our model contributes to the detection.
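The abstract reports Macro-Avg-F1 rather than accuracy, which matters when datasets are imbalanced. The sketch below is an illustration of the standard macro-averaged F1 definition only, not the authors' evaluation code; the class labels and the toy predictions are hypothetical and chosen just to show how a classifier that ignores the minority classes can score well on accuracy but poorly on macro F1.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy data: 8 normal sites, 1 of each target class.
y_true = ["normal"] * 8 + ["porn", "gambling"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["normal"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because every class contributes equally to the average, the two missed minority classes pull the macro F1 far below the accuracy on this toy data.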