Abstract: Internet users worldwide can reach malicious websites by searching for malicious long-tail keywords on search engines. Because such websites are strictly prohibited by law in most countries, building short text classification models that detect malicious long-tail keywords has become a key research topic in Natural Language Processing (NLP). Most short text classification models target English; given the widespread use of Chinese, there is an urgent need for Chinese short text classification models that help Chinese search engines detect malicious long-tail keywords. Because malicious long-tail keywords often evade detection by substituting homophones that share the same pinyin, pinyin features are added to address the homophonic typo problem in Chinese short text. To mitigate the data sparsity of malicious long-tail keywords, both pinyin and radicals are added to provide richer features. Since no pre-trained pinyin or radical embedding models are publicly available, embedding vectors for Chinese words, radicals, and pinyin are trained with Word2Vec, yielding 492,345 word embedding vectors, 207,995 radical embedding vectors, and 402,071 pinyin embedding vectors. Positional encoding and a part-of-speech coefficient are then added to these embeddings, and TF-IDF and MI values are added to reflect the word frequency weight and the differing importance of the same word across documents. Two parallel co-attention networks fuse the word, pinyin, and radical features; a BiGRU extracts temporal features, and a CNN extracts local features. Multiple groups of comparative experiments on two short text benchmark datasets and on pornographic and gambling long-tail keywords show that the proposed model outperforms state-of-the-art text classification models.
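
The homophone argument above can be illustrated with a minimal sketch. It is not the paper's pipeline: the lookup table `TOY_PINYIN`, the helper `to_pinyin`, and the two example keywords are hypothetical, chosen only to show why a pinyin view is robust to homophonic substitution. A real system would use a full pinyin lexicon and learned embeddings instead of string comparison.

```python
# Toy illustration (assumed data, not from the paper): homophone substitution
# changes the characters of a keyword but leaves its pinyin sequence intact.

# Hypothetical mini-lexicon mapping characters to tone-numbered pinyin.
TOY_PINYIN = {
    "赌": "du3",
    "堵": "du3",   # homophone of 赌
    "博": "bo2",
    "搏": "bo2",   # homophone of 博
    "网": "wang3",
    "站": "zhan4",
}


def to_pinyin(text):
    """Map each character to its pinyin; unknown characters pass through unchanged."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


original = "赌博网站"  # "gambling website"
evasive = "堵搏网站"   # same pronunciation, two characters swapped for homophones

# Count positions where the two keywords share the same character.
char_overlap = sum(a == b for a, b in zip(original, evasive))

# Compare the two keywords at the pinyin level.
pinyin_match = to_pinyin(original) == to_pinyin(evasive)

print("shared characters:", char_overlap, "of", len(original))
print("identical pinyin sequence:", pinyin_match)
```

A character-level model sees only half of the two keywords as identical, whereas the pinyin sequences coincide, which is the motivation the abstract gives for adding pinyin as a parallel feature.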

Radical-attended and Pinyin-attended malicious long-tail keywords detection
