Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

Tim Jansen, Yangling Tong, Victoria Zevallos, Pedro Ortiz Suarez
DOI: https://doi.org/10.48550/arXiv.2212.10440
2022-12-21
Abstract: As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous practice. This, in turn, has introduced an important challenge for NLP practitioners, as they are now confronted with the task of developing highly optimized models and pipelines for pre-processing large quantities of textual data, which implies effectively classifying and filtering multilingual, heterogeneous and noisy data at web scale. One of the main components of this pre-processing step for the pre-training corpora of large language models is the removal of adult and harmful content. In this paper we explore different methods for detecting adult and harmful content in multilingual heterogeneous web data. We first show how traditional methods in harmful content detection, which seemingly perform quite well on small and specialized datasets, quickly break down when confronted with heterogeneous noisy web data. We then resort to using a perplexity-based approach, but with a twist: instead of using a so-called "clean" corpus to train a small language model and then using perplexity to select the documents with low perplexity, i.e., the documents that resemble this so-called "clean" corpus the most, we train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold. This approach will virtually cluster our documents into two distinct groups, which will greatly facilitate the choice of the threshold for the perplexity and will also allow us to obtain higher precision than with the traditional classification methods for detecting adult and harmful content.
Computation and Language
### What problem does this paper attempt to solve?

This paper addresses the detection of adult and harmful content in multilingual, heterogeneous web data. As state-of-the-art large language models (LLMs) demand ever larger corpora, using web data as the main part of the pre-training corpus has become common practice. This poses a challenge for natural language processing (NLP) practitioners: how to build efficient models and pipelines that pre-process large amounts of text, and in particular how to classify and filter multilingual, heterogeneous, and noisy data at web scale.

The paper explores different methods for detecting adult and harmful content in such data and proposes a new perplexity-based method. Traditional methods perform well on small, specialized datasets but break down on heterogeneous, noisy web data. The authors therefore invert the usual strategy: instead of training a small language model on a "clean" corpus and selecting documents with low perplexity, they train solely on adult and harmful text and select documents whose perplexity is above a given threshold. This virtually clusters the documents into two distinct groups, which makes the perplexity threshold easier to choose and yields higher precision than traditional classification methods.

### Main contributions of the paper

1. **A new perplexity-based method**: A language model is trained only on adult and harmful content, and documents are filtered according to a perplexity threshold, which improves the precision of harmful content detection.
2. **Addressing the limitations of traditional methods**: Traditional methods perform well on small, specialized datasets but are not effective on large-scale heterogeneous web data. The new method overcomes this problem through more targeted training data.
3. **Experimental verification**: Comparative experiments across multiple datasets and methods support the effectiveness of the new method.

### Formula representation

Perplexity measures how well a language model predicts a text. It is defined as:

\[ \text{Perplexity} = 2^{-\frac{1}{N}\sum_{i = 1}^{N}\log_2 P(w_i)} \]

where \( N \) is the number of words in the text and \( P(w_i) \) is the probability that the language model assigns to the \( i \)-th word. Because the model is trained on adult and harmful text, a document with low perplexity resembles that content, while a document whose perplexity exceeds the threshold is selected as non-harmful. Adjusting the threshold therefore controls how harmful and non-harmful content are separated.
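
The perplexity formula and the threshold-based selection described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the unigram model with add-one smoothing is a toy stand-in for the small language model, `HARMFUL_CORPUS` uses harmless placeholder tokens in place of real adult or harmful text, and the function names and the threshold value are hypothetical.

```python
import math
from collections import Counter

# Placeholder tokens standing in for the adult/harmful training text (assumption).
HARMFUL_CORPUS = ["spam offer click now", "click now spam spam offer"]


def train_unigram(corpus):
    """Count word frequencies over whitespace-tokenized documents."""
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(doc, counts, total):
    """Compute 2 ** (-(1/N) * sum(log2 P(w_i))) with add-one smoothing."""
    words = doc.lower().split()
    if not words:
        return float("inf")  # an empty document cannot be scored
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return 2 ** (-log_prob / len(words))


def keep_above_threshold(docs, counts, total, threshold):
    """Keep documents that do NOT resemble the harmful training text."""
    return [d for d in docs if perplexity(d, counts, total) > threshold]


counts, total = train_unigram(HARMFUL_CORPUS)
docs = ["spam click now", "weather report for tuesday"]
kept = keep_above_threshold(docs, counts, total, threshold=8.0)
```

In this toy setup, the document that reuses the training vocabulary scores a lower perplexity than the unrelated one, so only the unrelated document passes the filter. A realistic pipeline would use a stronger language model and choose the threshold from the observed perplexity distribution rather than a hand-picked constant.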