Abstract: As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous practice. This, in turn, has introduced an important challenge for NLP practitioners, as they are now confronted with the task of developing highly optimized models and pipelines for pre-processing large quantities of textual data, which implies effectively classifying and filtering multilingual, heterogeneous and noisy data at web scale. One of the main components of this pre-processing step for the pre-training corpora of large language models is the removal of adult and harmful content. In this paper we explore different methods for detecting adult and harmful content in multilingual heterogeneous web data. We first show how traditional methods in harmful content detection, which seemingly perform quite well on small and specialized datasets, quickly break down when confronted with heterogeneous noisy web data. We then resort to using a perplexity-based approach, but with a twist: instead of using a so-called "clean" corpus to train a small language model and then using perplexity to select the documents with low perplexity, i.e., the documents that resemble this so-called "clean" corpus the most, we train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold. This approach will virtually cluster our documents into two distinct groups, which will greatly facilitate the choice of the threshold for the perplexity and will also allow us to obtain higher precision than with the traditional classification methods for detecting adult and harmful content.

What problem does this paper attempt to address?

This paper aims to solve the problem of detecting adult and harmful content in multilingual, heterogeneous web data. As current state-of-the-art large language models (LLMs) demand ever larger corpora, it has become common practice to use web data as the main part of the pre-training corpus. This poses a challenge for natural language processing (NLP) practitioners: how to build efficient models and pipelines that pre-process large amounts of text, and in particular classify and filter multilingual, heterogeneous and noisy data.

Specifically, the paper explores different methods for detecting adult and harmful content in multilingual heterogeneous web data and proposes a new perplexity-based method. Traditional methods perform well on small and specialized datasets, but they break down on heterogeneous and noisy web data. The authors therefore propose a different strategy: instead of training a small language model on a "clean" corpus and selecting documents with low perplexity, they train the model solely on adult and harmful text and select documents whose perplexity is above a given threshold. This virtually clusters the documents into two distinct groups, which makes the perplexity threshold easier to choose and yields higher precision than traditional classification methods.

### Main contributions of the paper

1. **Proposed a new perplexity-based method**: A language model is trained only on adult and harmful content, and documents are filtered according to a perplexity threshold, which improves the precision of harmful content detection.
2. **Addressed the limitations of traditional methods**: Traditional methods perform well on small and specialized datasets, but they are not effective on large-scale heterogeneous web data. The new method overcomes this problem through more targeted training data.
3. **Provided experimental verification**: Comparative experiments across multiple datasets and methods are used to demonstrate the effectiveness of the new method.

### Formula representation

Perplexity is a standard measure of how well a language model predicts a text. It is defined as:

\[ \text{Perplexity} = 2^{-\frac{1}{N}\sum_{i = 1}^{N}\log_2 P(w_i)} \]

where \( N \) is the number of words in the text, and \( P(w_i) \) is the probability that the language model assigns to the \( i \)-th word. Because the model is trained on adult and harmful text, harmful documents receive low perplexity and other documents receive high perplexity, so a perplexity threshold can separate the two groups.
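The perplexity formula and the thresholding step can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the model is a toy unigram model with add-one smoothing (the paper only speaks of a small language model), the training text uses the placeholder tokens `spam` and `eggs` in place of real harmful data, and the threshold of `5.0` is arbitrary. The helper names `train_unigram`, `perplexity` and `keep_high_perplexity` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Count word frequencies over whitespace-tokenized documents."""
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(doc, counts, total):
    """Return 2 ** (-(1/N) * sum(log2 P(w_i))) using add-one smoothing."""
    words = doc.lower().split()
    if not words:
        raise ValueError("cannot score an empty document")
    # One extra vocabulary slot stands in for every unseen word.
    vocab_size = len(counts) + 1
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return 2 ** (-log_prob / len(words))


def keep_high_perplexity(docs, counts, total, threshold):
    """Keep documents that the harmful-content model finds surprising."""
    return [d for d in docs if perplexity(d, counts, total) > threshold]


# Placeholder "harmful" training text (assumption: stands in for real data).
harmful_corpus = ["spam spam eggs", "eggs spam spam spam"]
counts, total = train_unigram(harmful_corpus)

web_docs = ["spam eggs spam", "the weather is nice today"]
THRESHOLD = 5.0  # arbitrary value chosen for this toy example
kept = keep_high_perplexity(web_docs, counts, total, THRESHOLD)
print(kept)
```

In this toy setting the document that resembles the training text scores a low perplexity and is filtered out, while the unrelated document scores a high perplexity and is kept. A real pipeline would need a stronger language model and a threshold chosen from the observed perplexity distribution.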

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data
