Abstract: We propose a method to predict toxicity and other textual attributes through the use of natural language processing (NLP) techniques for two recent events: the Ukraine-Russia and Hamas-Israel conflicts. This article provides a basis for exploration in future conflicts, with the hope of mitigating risk through the analysis of social media before and after a conflict begins. Our work compiles several datasets from Twitter and Reddit for both conflicts, separated into before and after periods, with the aim of predicting a future state of social media for avoidance. More specifically, we show that: (1) there is a noticeable difference in social media discussion leading up to and following a conflict and (2) social media discourse on platforms like Twitter and Reddit is useful in identifying future conflicts before they arise. Our results show that through the use of advanced NLP techniques (both supervised and unsupervised), toxicity and other attributes of language before and after a conflict are predictable with a low error of nearly 1.2 percent for both conflicts.

What problem does this paper attempt to address?

This paper aims to predict how toxicity and other text attributes on social media change before and after two recent conflicts, the Ukraine-Russia conflict and the Hamas-Israel conflict, using natural language processing (NLP) techniques. Specifically, the authors aim to:

1. **Reveal significant differences in social media discussions before and after the conflict**: By analyzing social media data, identify how discussion content differs before and after the conflict, especially whether more negative or toxic remarks appear in these discussions.
2. **Use social media discourse to predict future conflicts**: Explore whether discussions on social media platforms (such as Twitter and Reddit) can be used as a tool to identify potential conflicts, so that measures can be taken in advance to avoid them.

To achieve these goals, the authors collected data from Twitter and Reddit for the periods before and after the conflicts and analyzed it with advanced NLP techniques (both supervised and unsupervised). The results show that these techniques can predict the toxicity changes on social media before and after a conflict with a relatively low error (about 1.2%).

### Main methods and techniques

1. **Data collection and processing**:
   - Collected four main data sets: the Ukraine-Russia data sets before and after the conflict (URB and URA) and the Hamas-Israel data sets before and after the conflict (HIB and HIA).
   - Data processing includes removing URLs, non-alphabetic characters, accents, and English stop words, using NLTK for tokenization and lemmatization, and using WordNinja to break down hashtags in tweets.
2. **LDA topic modeling**:
   - Use LDA (Latent Dirichlet Allocation) for unsupervised topic modeling to determine whether documents can be grouped according to their text data.
   - For the Ukraine-Russia conflict, 9 topics were determined; for the Hamas-Israel conflict, 7 topics were determined.
3. **Toxicity prediction**:
   - Define toxicity as content that promotes polarization between opposing sides, spreads distrust, and reinforces the "us-them" narrative.
   - Use the Detoxify library to assign a toxicity score to each lexical feature, ranging from 0.00 (completely non-toxic) to 1.00 (extremely toxic).
4. **Linear regression**:
   - Use a supervised linear regression (LR) model to establish a baseline toxicity prediction, where URB and HIB are used to predict the toxicity scores of URA and HIA.
   - Calculate the average toxicity score of each document and use these scores and the word frequency matrix to predict the toxicity score after the conflict.
5. **BERT model**:
   - In contrast to the linear regression model, use a BERT-based Transformer model for toxicity prediction.
   - Select a pre-trained BERT model that has been trained on Twitter and YouTube to distinguish phishing, attacks, and cyberbullying.

### Results

1. **LDA topic modeling results**:
   - After the conflict began, the average and total toxicity scores of most topics increased significantly.
   - In particular, in the Ukraine-Russia data set, the toxicity change of Topic 6 was the most significant.
2. **Linear regression and BERT model results**:
   - The two models showed similar behavior when predicting the toxicity scores after the conflict, but both performed slightly better on the Hamas-Israel data set.
   - The linear regression model tended to underestimate when predicting high toxicity scores, while the BERT model performed better in terms of central clustering.
3. **Accuracy comparison and threshold**:
   - By setting different thresholds to evaluate the model's accuracy, it was found that as the threshold increased, the model performance improved.
   - The optimal thresholds were 0.157 for the Ukraine-Russia data set and 0.099 for the Hamas-Israel data set.
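Two of the steps summarized above are simple enough to sketch in a few lines. The following is a minimal illustration, not the authors' code: `clean_text` approximates the cleaning step using only the Python standard library (the NLTK lemmatization and WordNinja hashtag splitting named in the summary are omitted, and `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list), and `accuracy_within` assumes that "accuracy at a threshold" means the share of predicted toxicity scores that fall within that tolerance of the true score. The summary does not define the metric, so this reading is an assumption.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list (assumption for illustration);
# the summary says the paper removes English stop words using NLTK.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "and", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Return lowercase tokens with URLs, accents, non-letters, and stop words removed."""
    text = URL_RE.sub(" ", text)
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase first so the a-z character class covers every letter.
    text = NON_ALPHA_RE.sub(" ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def accuracy_within(y_true, y_pred, threshold):
    """Fraction of predictions whose absolute error is at most `threshold`.

    Assumed definition of threshold-based accuracy; the source does not spell it out.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(t - p) <= threshold for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

Under this reading, `accuracy_within([0.10, 0.50, 0.90], [0.12, 0.40, 0.91], 0.05)` counts two of the three predictions as correct, and widening the tolerance to 0.15 counts all three. That matches the reported trend of accuracy rising with the threshold, though it does not by itself explain how the optimal thresholds of 0.157 and 0.099 were chosen.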
### Discussion

The authors believe that LDA topic modeling and toxicity prediction can detect changes in users' language during a crisis and thereby identify social media discussions that may trigger conflicts. Governments and non-government organizations can use these prediction tools to monitor potential tensions and take timely measures to avoid the escalation of conflicts. In addition, policy makers and social media platforms can use these tools to understand in real time.

NLP Case Study on Predicting the Before and After of the Ukraine-Russia and Hamas-Israel Conflicts
