Abstract: With the growth of online transactions, the Chinese cyber black market is proliferating and facilitating many cybercrimes. The market is hard to understand because criminals use confusing jargon (called black keywords in this paper) to conceal underground transactions. To discover black keywords automatically, several natural language processing methods have been proposed that compare the similarity of word vectors generated by word embedding models. The quality of those word vectors therefore has a significant impact on black keyword discovery, and it is necessary to evaluate how different word embedding models perform at this task. To this end, we design a Chinese black keyword discovery framework and conduct a systematic empirical study of six existing word embedding models, covering both static and dynamic types, in discovering Chinese black keywords. Specifically, we classify Chinese black keywords into four types: domain specific words (DSWs), new meaning words (NMWs), similar pronunciation words (SPWs), and similar glyph words (SGWs). We find experimentally that word embedding models vary greatly in performance when discovering black keywords: for example, dynamic models perform well in discovering DSWs and NMWs, whereas static models perform poorly in discovering NMWs. We improve the NMW discovery algorithm based on static word embedding models by additionally comparing the differences in a word's cross-corpus nearest neighbors before and after domain incremental training. To discover variant words such as SPWs and SGWs effectively, we additionally introduce Chinese pronunciation and glyph features. The experimental results demonstrate the effectiveness of the proposed Chinese black keyword discovery framework, with detection accuracies of over 90% for DSWs, 80% for NMWs, 90% for SPWs, and 61% for SGWs.
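
The nearest-neighbor comparison mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy two-dimensional vectors, the English placeholder words, the function names, and the use of Jaccard distance between neighbor sets as the measure of change. The idea is only that a word whose nearest neighbors differ sharply between a general-corpus embedding and a domain-trained embedding is a candidate new meaning word.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, vectors, k):
    """Return the k words most similar to `word` in one embedding space."""
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by word
    return [other for _, other in scored[:k]]

def neighbor_shift(word, before, after, k=2):
    """Jaccard distance between the word's neighbor sets in two spaces
    (hypothetical scoring choice: 0.0 = same neighbors, 1.0 = disjoint)."""
    old = set(nearest_neighbors(word, before, k))
    new = set(nearest_neighbors(word, after, k))
    union = old | new
    return 1.0 - len(old & new) / len(union) if union else 0.0

# Made-up vectors standing in for a general-corpus embedding.
general = {
    "apple": [1.0, 0.0], "fruit": [0.9, 0.1], "pear": [0.8, 0.2],
    "phone": [0.1, 0.9], "account": [0.0, 1.0],
}
# Same vocabulary after (hypothetical) domain incremental training:
# only "apple" has moved toward the "phone"/"account" region.
domain = dict(general, apple=[0.1, 1.0])

shifts = {w: neighbor_shift(w, general, domain) for w in general}
candidate = max(shifts, key=shifts.get)
print(candidate, shifts[candidate])
```

In this toy setup, "apple" has entirely different neighbors in the two spaces, so it receives the largest shift score. With real models, the vectors would come from the trained embeddings rather than hand-written lists, and the threshold for flagging a word would need to be tuned.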


A systematic empirical study on word embedding based methods in discovering Chinese black keywords
