Radical-attended and Pinyin-attended malicious long-tail keywords detection

DOI: https://doi.org/10.1007/s00521-024-09871-z
2024-05-11
Neural Computing and Applications
Abstract: Netizens all over the world can find malicious websites by entering malicious long-tail keywords into search engines, yet such websites are strictly prohibited by law in most countries. Building short text classification models that detect malicious long-tail keywords has therefore become a key research topic in Natural Language Processing (NLP). Most short text classification models target English; given the widespread use of Chinese, there is an urgent need for Chinese short text classification models that help Chinese search engines detect malicious long-tail keywords. Because malicious long-tail keywords often evade detection by substituting homophones that share the same pinyin, pinyin is added to address the homophonic typo problem in Chinese short text. To address the data sparsity of malicious long-tail keywords, both pinyin and radicals are added to obtain richer features. Since there are no publicly available pre-trained pinyin or radical embedding models, the embedding vectors of Chinese words, radicals, and pinyin are trained with Word2Vec, yielding 492,345 individual word embedding vectors, 207,995 individual radical embedding vectors, and 402,071 individual pinyin embedding vectors. In addition, positional encoding and a part-of-speech coefficient are added to these embeddings, and TF-IDF and MI values are added to reflect the word frequency weight and the different importance of the same word in different documents. Two parallel co-attention networks then fuse the word, pinyin, and radical features, a BiGRU extracts temporal features, and a CNN extracts local features. Results of multiple groups of comparative experiments on two short text benchmark datasets and on pornographic and gambling long-tail keywords show that the proposed model outperforms state-of-the-art text classification models.
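To illustrate why pinyin helps with homophonic evasion, the following minimal sketch maps two differently written keywords to the same pinyin sequence. It is not the paper's implementation: the table `TOY_PINYIN` and the helpers `to_pinyin` and `same_pronunciation` are hypothetical names introduced here, and the table covers only the six characters used in the example, whereas a real system would rely on a full pronunciation lexicon.

```python
from typing import List

# Hypothetical toy table: character -> pinyin with tone number.
# A real system would use a complete pronunciation lexicon instead.
TOY_PINYIN = {
    "赌": "du3",
    "堵": "du3",   # homophone of 赌
    "博": "bo2",
    "搏": "bo2",   # homophone of 博
    "网": "wang3",
    "站": "zhan4",
}


def to_pinyin(keyword: str) -> List[str]:
    """Map each character to its pinyin; unknown characters are kept as-is."""
    return [TOY_PINYIN.get(ch, ch) for ch in keyword]


def same_pronunciation(a: str, b: str) -> bool:
    """Return True when two keywords have identical pinyin sequences."""
    return to_pinyin(a) == to_pinyin(b)


original = "赌博网站"   # "gambling website"
evasive = "堵搏网站"    # same sound, different characters

print(to_pinyin(original))
print(to_pinyin(evasive))
print(same_pronunciation(original, evasive))
```

The two keywords differ at the character level, so a purely character-based or word-based model sees them as unrelated, while their pinyin sequences coincide. This is the intuition behind feeding pinyin features to the classifier alongside words and radicals.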
computer science, artificial intelligence