What problem does this paper attempt to address?

This paper addresses how to use machine-learning methods, specifically a probabilistic classifier based on the bag-of-words model, to detect vandalism in Wikipedia. The author aims to train a regularized logistic regression model that automatically identifies and flags destructive edits made by anonymous users to Wikipedia articles, thereby reducing the manual review workload and improving the accuracy and efficiency of detection.

### Problem Background

Wikipedia is an open, collaboratively edited encyclopedia that anyone can edit and improve. Although this openness is one of the reasons for its success, it also gives ill-intentioned editors the opportunity to introduce destructive edits (such as maliciously tampering with content). Currently, these edits are mainly identified and reverted by human editors or by automated anti-vandalism bots (such as ClueBot and VoABot II). However, the existing methods have the following problems:

1. **Difficulty in Manual Creation and Maintenance of Rules**: Existing anti-vandalism bots rely on manually written regular expressions and user blacklists, which are difficult to maintain.
2. **Low Recall**: These bots detect only about 30% of destructive edits.
3. **Failure to Consider Cost-Sensitivity**: Existing methods do not fully account for the different costs of misclassification (for example, the cost of a false positive is usually higher than that of a false negative).

### Solutions

To solve the above problems, the author proposes the following:

- **Probabilistic Classifier Based on the Bag-of-Words Model**: Train a probabilistic classifier with regularized logistic regression, and predict whether an edit is destructive by analyzing how the article changed between the versions before and after the edit (for example, which words were added or deleted).
- **Feature Engineering**: Extract the content changes and the metadata of the edit (such as IP address and edit summary) as features to construct a high-dimensional sparse feature vector.
- **Calibrated Probability Output**: Use isotonic regression to calibrate the probabilities output by the classifier, making the predictions more reliable.
- **Cost-Sensitive Analysis**: Account for the different costs of the two types of misclassification (false positives and false negatives) and optimize the decision threshold accordingly.

### Goals

With these methods, the author hopes to develop an automated system that is efficient and accurate and has a low false-positive rate, helping Wikipedia better identify and handle destructive edits and thereby improving the overall quality and user experience of the platform.
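
Two of the steps listed under Solutions can be illustrated with a minimal sketch: building sparse bag-of-words features from the words an edit adds or removes, and choosing a cost-sensitive decision threshold for a calibrated probability. This is not the paper's implementation. The function names, the `added:`/`removed:` feature prefixes, the tokenizer, and the cost values are all hypothetical, and the threshold rule assumes that correct decisions cost nothing and that the probability is already calibrated.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return TOKEN_RE.findall(text.lower())


def diff_features(old_text, new_text):
    """Return sparse bag-of-words counts for words added or removed by an edit.

    Keys are prefixed with 'added:' or 'removed:' (a hypothetical naming
    scheme) so that the two directions of change stay distinct features.
    """
    old_counts = Counter(tokenize(old_text))
    new_counts = Counter(tokenize(new_text))
    features = {}
    # Counter subtraction keeps only positive differences.
    for word, n in (new_counts - old_counts).items():
        features["added:" + word] = n
    for word, n in (old_counts - new_counts).items():
        features["removed:" + word] = n
    return features


def decision_threshold(cost_fp, cost_fn):
    """Probability above which flagging an edit has lower expected cost.

    Flagging costs (1 - p) * cost_fp in expectation; not flagging costs
    p * cost_fn. Flagging is cheaper when p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def should_flag(p_vandalism, cost_fp, cost_fn):
    """Decide whether to flag an edit given a calibrated probability."""
    return p_vandalism > decision_threshold(cost_fp, cost_fn)


if __name__ == "__main__":
    feats = diff_features("The cat sat", "The cat sat stupid stupid")
    print(feats)
    # Illustrative costs only: a false positive assumed 9x as costly.
    print(decision_threshold(9.0, 1.0), should_flag(0.95, 9.0, 1.0))
```

In a full pipeline, feature dictionaries like these would be vectorized and passed to a regularized logistic regression, with the scores calibrated before the threshold is applied. Making false positives more expensive than false negatives pushes the threshold above 0.5, which matches the summary's emphasis on a low false-positive rate.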

Vandalism Detection in Wikipedia: a Bag-of-Words Classifier Approach
