A systematic empirical study on word embedding based methods in discovering Chinese black keywords
Chenyang Wang, Yi Shen, Yuwei Li, Min Zhang, Miao Hu, Jinghua Zheng
DOI: https://doi.org/10.1016/j.engappai.2023.106775
IF: 8
2023-01-01
Engineering Applications of Artificial Intelligence
Abstract: With the development of online transactions, the Chinese cyber black market is proliferating and facilitates many cybercrimes. The cyber black market is difficult to understand because criminals use confusing jargon (called black keywords in this paper) to conceal underground transactions. To discover black keywords automatically, some natural language processing based methods have been proposed that compare the similarity of word vectors generated by word embedding models. The quality of the generated word vectors therefore has a significant impact on black keyword discovery, and it is necessary to evaluate different word embedding models for this task. To this end, we design a Chinese black keyword discovery framework and conduct a systematic empirical study of six existing word embedding models, covering both static and dynamic types, in discovering Chinese black keywords. Specifically, we classify Chinese black keywords into four types: domain specific words (DSWs), new meaning words (NMWs), similar pronunciation words (SPWs), and similar glyph words (SGWs). We find experimentally that word embedding models vary greatly in performance when discovering black keywords: for example, dynamic models perform well in discovering DSWs and NMWs, whereas static ones perform poorly in discovering NMWs. We improve the NMW discovery algorithm based on static word embedding models by additionally comparing the differences in cross-corpus nearest neighbors of a word before and after domain incremental training. To discover variant words such as SPWs and SGWs effectively, we additionally introduce Chinese pronunciation and glyph features. The experimental results demonstrate the effectiveness of the proposed Chinese black keyword discovery framework, with detection accuracies of over 90% for DSWs, 80% for NMWs, 90% for SPWs, and 61% for SGWs.
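The nearest-neighbor comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the toy two-dimensional vectors, the function names (`nearest_neighbors`, `neighbor_shift`), and the use of Jaccard distance as the difference measure are all assumptions made for illustration. The idea shown is only that a word whose nearest neighbors change substantially after domain incremental training is a candidate new meaning word.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, vectors, k):
    """Return the set of the k words most similar to `word` (hypothetical helper)."""
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}


def neighbor_shift(word, before, after, k=2):
    """Jaccard distance between the neighbor sets before and after training.

    Assumption: Jaccard distance stands in for whatever difference
    measure the paper actually uses. 0.0 means identical neighbors,
    1.0 means no neighbors in common.
    """
    nb = nearest_neighbors(word, before, k)
    na = nearest_neighbors(word, after, k)
    union = nb | na
    return 1.0 - len(nb & na) / len(union) if union else 0.0


# Toy embeddings (invented for illustration): "apple" sits near fruit
# words in the general corpus and moves toward trading jargon after
# incremental training on a domain corpus.
before = {
    "apple": [1.0, 0.1],
    "fruit": [0.9, 0.2],
    "pear": [0.95, 0.05],
    "phone": [0.1, 1.0],
    "account": [0.2, 0.9],
}
after = dict(before, apple=[0.15, 0.95])

print(neighbor_shift("apple", before, after, k=2))
```

In practice the two vector tables would come from a static embedding model trained on a general corpus and then incrementally trained on a black-market corpus, and the threshold on the shift score would need to be tuned on labeled data.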