Abstract: With the development of online transactions, the Chinese cyber black market is proliferating and facilitating many cybercrimes. The market is difficult to understand because criminals use confusing jargon (called black keywords in this paper) to conceal underground transactions. To discover black keywords automatically, several natural language processing methods have been proposed that compare the similarity of word vectors generated by word embedding models. The quality of these word vectors therefore has a significant impact on black keyword discovery, and it is necessary to evaluate how well different word embedding models discover black keywords. To this end, we design a Chinese black keyword discovery framework and conduct a systematic empirical study of six existing word embedding models, covering both static and dynamic types, in discovering Chinese black keywords. Specifically, we classify Chinese black keywords into four types: domain specific words (DSWs), new meaning words (NMWs), similar pronunciation words (SPWs), and similar glyph words (SGWs). We find experimentally that word embedding models vary greatly in performance when discovering black keywords: for example, dynamic models perform well in discovering DSWs and NMWs, whereas static models perform poorly in discovering NMWs. We improve the NMW discovery algorithm based on static word embedding models by additionally comparing the differences in cross-corpus nearest neighbors of words before and after domain incremental training. To discover variant words such as SPWs and SGWs effectively, we additionally introduce Chinese pronunciation and glyph features. The experimental results demonstrate the effectiveness of the proposed Chinese black keyword discovery framework, with detection accuracies of over 90% for DSWs, 80% for NMWs, 90% for SPWs, and 61% for SGWs.
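The idea of comparing a word's nearest neighbors before and after domain incremental training can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the Jaccard-based shift score, the toy vocabulary, and the two-dimensional vectors are all assumptions made for illustration. A word whose neighbor set changes strongly between the two embedding spaces would be a candidate new meaning word.

```python
import numpy as np


def nearest_neighbors(word, vocab, vectors, k=2):
    """Return the k words with the highest cosine similarity to `word`."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = vocab.index(word)
    sims = normed @ normed[idx]
    sims[idx] = -np.inf  # exclude the query word itself
    top = np.argsort(-sims)[:k]
    return [vocab[i] for i in top]


def neighbor_shift(word, vocab, before, after, k=2):
    """Hypothetical score: 1 minus the Jaccard overlap of the neighbor sets."""
    a = set(nearest_neighbors(word, vocab, before, k))
    b = set(nearest_neighbors(word, vocab, after, k))
    return 1.0 - len(a & b) / len(a | b)


# Toy data (assumed): "apple" moves from the fruit cluster to the device cluster.
vocab = ["apple", "fruit", "pear", "phone", "laptop"]
before = np.array([
    [1.0, 0.0],  # apple
    [0.9, 0.1],  # fruit
    [0.8, 0.2],  # pear
    [0.0, 1.0],  # phone
    [0.1, 0.9],  # laptop
])
after = before.copy()
after[0] = [0.0, 1.0]  # "apple" after incremental training on the domain corpus

shift_apple = neighbor_shift("apple", vocab, before, after)
shift_pear = neighbor_shift("pear", vocab, before, after)
print(shift_apple, shift_pear)
```

In this toy setup the neighbors of "apple" change completely, while "pear" keeps part of its neighborhood, so "apple" receives the larger shift score. A real pipeline would replace the hand-written arrays with vectors from a static embedding model trained on a general corpus and then incrementally trained on the domain corpus.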

Chinese Keyword Extraction Based on N-Gram and Word Co-occurrence

Chinese Keyword Extraction Based on Word Platform

New Word Identification in Social Network Text Based on Time Series Information

Chinese Keyword Extraction Algorithm Based on Neighbour Words

Novel Chinese Text Format Based on Word Encoding

Automatic Keywords Extraction Based on Co-Occurrence and Semantic Relationships Between Words

Study of Word-Based Chinese Document Experimental System and Chinese Free-Text Information Extraction Experiment Based on It

Keyword Extraction Based on Tf/idf for Chinese News Document

Automatic keyphrase extraction from Chinese news documents

Auto-Indexing Based on Chinese Characters Coding on Words Platform

New Word Extraction from Chinese Financial Documents

Research on Chinese Keywords Extraction Based on Characters Sequence Annotation

News-oriented Automatic Chinese Keyword Indexing

A Local Information Perception Enhancement–Based Method for Chinese NER

Chinese Documents Classification Based on N-Grams

A systematic empirical study on word embedding based methods in discovering Chinese black keywords

Chinese Documents Categorization Based on N-gram Information

Topic Detection Technology for Chinese Text Based on Statistics and Semantic Information

Extracting terminologically relevant collocations in the translation of Chinese monograph

Hierarchical Classification of Chinese Documents Based on N-grams

A Way to Improve Graph-Based Keyword Extraction