Abstract: Netizens around the world can find malicious websites by entering malicious long-tail keywords into search engines. Because such websites are strictly prohibited by law in most countries, building short text classification models that detect malicious long-tail keywords has become a key research topic in Natural Language Processing (NLP). Most short text classification models target English; given the widespread use of Chinese, there is an urgent need for Chinese short text classification models that help Chinese search engines detect malicious long-tail keywords. Because malicious long-tail keywords often evade detection by using homophones that share the same pinyin, pinyin is added to address the homophonic typo problem in Chinese short text. To mitigate the data sparsity of malicious long-tail keywords, both pinyin and radicals are added to obtain richer features. Since no pre-trained pinyin or radical embedding models are publicly available, embedding vectors for Chinese words, radicals, and pinyin are trained with Word2Vec, yielding 492,345 word embedding vectors, 207,995 radical embedding vectors, and 402,071 pinyin embedding vectors. In addition, positional encoding and a part-of-speech coefficient are added to these embeddings, and TF-IDF and MI values are added to reflect word frequency weight and the differing importance of the same word in different documents. Two parallel co-attention networks then fuse the word, pinyin, and radical features, a BiGRU extracts temporal features, and a CNN extracts local features. Comparative experiments on two short text benchmark datasets and on pornographic and gambling long-tail keywords show that the proposed model outperforms state-of-the-art text classification models.
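The homophone argument can be made concrete with a minimal sketch. It is not the model described in the abstract; it only illustrates why a pinyin view helps, under stated assumptions: `TOY_PINYIN` and `to_pinyin` are hypothetical names, the four-entry lookup table stands in for a full pinyin dictionary, and tones are dropped for simplicity.

```python
# Hypothetical toy lookup: character -> toneless pinyin.
# Assumption: a real system would use a complete pinyin dictionary;
# these four entries exist only to illustrate the idea.
TOY_PINYIN = {
    "赌": "du",
    "堵": "du",
    "博": "bo",
    "搏": "bo",
}


def to_pinyin(text, table=TOY_PINYIN):
    """Map each character to its pinyin; unknown characters pass through unchanged."""
    return [table.get(ch, ch) for ch in text]


original = "赌博"   # "gambling"
disguised = "堵搏"  # homophonic variant: different characters, same sounds

# The character sequences differ, so a purely character-based match misses the variant.
same_chars = original == disguised

# The pinyin sequences agree, so a pinyin feature exposes the link between the two.
same_pinyin = to_pinyin(original) == to_pinyin(disguised)

print(same_chars, same_pinyin)
```

In this sketch the two keywords differ at the character level but collapse to the same pinyin sequence, which is the property the abstract relies on when it adds pinyin as an extra feature channel alongside words and radicals.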

Radical-vectors with Pre-Trained Models for Chinese Text Classification

Chinese Text Classification Using BERT and Flat-Lattice Transformer

A Radical-Aware Attention-Based Model for Chinese Text Classification

Radical features for Chinese text classification

A Chinese text classification model based on radicals and character distinctions

Reading Chinese in Natural Scenes with a Bag-of-Radicals Prior

RRecT: Chinese Text Recognition with Radical-Enhanced Recognition Transformer

Radical-attended and Pinyin-attended malicious long-tail keywords detection

Radical-level Ideograph Encoder for RNN-based Sentiment Analysis of Chinese and Japanese

A Radical Cascade Classifier for Handwritten Chinese Character Recognition

Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level

Radical-Enhanced Chinese Character Embedding

Radical Counter Network for Robust Chinese Character Recognition

Chinese Named Entity Recognition with Bert

Empirical Study on Character Level Neural Network Classifier for Chinese Text

Sentence Segmentation for Classical Chinese Based on LSTM with Radical Embedding

Trajectory-based Radical Analysis Network for Online Handwritten Chinese Character Recognition

Learning Radicals from Tangut Characters

A Transformer-based Radical Analysis Network for Chinese Character Recognition

Chinese character recognition with radical-structured stroke trees