Abstract: Internet users worldwide can reach malicious websites by entering malicious long-tail keywords into search engines, even though such websites are prohibited by law in most countries. Building short text classification models that detect malicious long-tail keywords has therefore become a key research topic in Natural Language Processing (NLP). Most existing short text classification models target English; given the widespread use of Chinese, there is an urgent need for Chinese short text classification models that help Chinese search engines detect malicious long-tail keywords. Because malicious long-tail keywords often evade detection by substituting homophones that share the same pinyin, pinyin is added as a feature to address the homophonic typo problem in Chinese short text. To address the data sparsity of malicious long-tail keywords, both pinyin and radicals are added to obtain richer features. Since there are no publicly available pre-trained pinyin or radical embedding models, the embedding vectors of Chinese words, radicals, and pinyin are trained with Word2Vec, yielding 492,345 word embedding vectors, 207,995 radical embedding vectors, and 402,071 pinyin embedding vectors. In addition, positional encoding and a part-of-speech coefficient are added to these embeddings, and TF-IDF and MI values are added to reflect word frequency weight and the differing importance of the same word across documents. Two parallel co-attention networks then fuse the word, pinyin, and radical features, a BiGRU extracts temporal features, and a CNN extracts local features. Multi-group comparative experiments on two short text benchmark datasets and on pornographic and gambling long-tail keywords show that the proposed model outperforms state-of-the-art text classification models.
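The homophone argument in the abstract can be illustrated with a minimal sketch. It assumes a tiny hand-written character-to-pinyin table (`TOY_PINYIN`) and two hypothetical helpers (`to_pinyin`, `same_pronunciation`); none of these names come from the paper, and a real system would rely on a complete pinyin dictionary and on learned embeddings rather than exact matching.

```python
# Hypothetical toy lookup table (tone-numbered pinyin), for illustration only.
# A real system would need a full pinyin dictionary.
TOY_PINYIN = {
    "赌": "du3",
    "堵": "du3",   # homophone of 赌
    "博": "bo2",
    "搏": "bo2",   # homophone of 博
    "网": "wang3",
    "站": "zhan4",
}

UNKNOWN = "<unk>"


def to_pinyin(text, table=TOY_PINYIN):
    """Map each character to its pinyin, or UNKNOWN if it is not in the table."""
    return [table.get(ch, UNKNOWN) for ch in text]


def same_pronunciation(a, b):
    """True if both strings map to the same, fully known pinyin sequence."""
    pa, pb = to_pinyin(a), to_pinyin(b)
    return UNKNOWN not in pa and pa == pb


original = "赌博网站"   # "gambling website"
disguised = "堵搏网站"  # same sound, different characters

print(original == disguised)                    # character-level match fails
print(same_pronunciation(original, disguised))  # pinyin-level match succeeds
```

The two keywords differ as character strings but collapse to the same pinyin sequence, which is why a pinyin channel can recover signal that a character-only or word-only model loses to homophonic substitutions.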

Radical-attended and Pinyin-attended malicious long-tail keywords detection

A Radical-Aware Attention-Based Model for Chinese Text Classification

A systematic empirical study on word embedding based methods in discovering Chinese black keywords

RoBERTa-wwm-ext Fine-Tuning for Chinese Text Classification

A Novel Model Based on Big Data Environment for Text Content Security Recognition

TEXTSHIELD: Robust Text Classification Based on Multimodal Embedding and Neural Machine Translation

WordChange: Adversarial Examples Generation Approach for Chinese Text Classification

A Black-box NLP Classifier Attacker

Malicious URL Detection via Pretrained Language Model Guided Multi-Level Feature Attention Network

Feature-Enhanced Nonequilibrium Bidirectional Long Short-Term Memory Model for Chinese Text Classification

Bigram and Unigram Based Text Attack Via Adaptive Monotonic Heuristic Search

A Multi-oriented Chinese Keyword Spotter Guided by Text Line Detection

An Improved Double Channel Long Short-Term Memory Model for Medical Text Classification

Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model

Detecting Malicious Web Requests Using an Enhanced TextCNN

Exploiting Language Model for Efficient Linguistic Steganalysis

Multiscale Positive-Unlabeled Detection of AI-Generated Texts

Adaptive Topic Modeling for Detection Objectionable Text

New Word Detection Using BiLSTM+CRF Model with Features