NLP Case Study on Predicting the Before and After of the Ukraine-Russia and Hamas-Israel Conflicts

Jordan Miner, John E. Ortega
2024-10-09
Abstract: We propose a method to predict toxicity and other textual attributes through the use of natural language processing (NLP) techniques for two recent events: the Ukraine-Russia and Hamas-Israel conflicts. This article provides a basis for exploration in future conflicts with hopes to mitigate risk through the analysis of social media before and after a conflict begins. Our work compiles several datasets from Twitter and Reddit for both conflicts in a before and after separation with an aim of predicting a future state of social media for avoidance. More specifically, we show that: (1) there is a noticeable difference in social media discussion leading up to and following a conflict and (2) social media discourse on platforms like Twitter and Reddit is useful in identifying future conflicts before they arise. Our results show that, through the use of advanced NLP techniques (both supervised and unsupervised), toxicity and other attributes of language before and after a conflict are predictable with a low error of nearly 1.2 percent for both conflicts.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper attempts to predict changes in toxicity and other text attributes on social media before and after two recent conflicts, the Ukraine-Russia conflict and the Hamas-Israel conflict, using natural language processing (NLP) techniques. Specifically, the authors aim to:

1. **Reveal significant differences in social media discussions before and after the conflict**: By analyzing social media data, identify how discussion content differs before and after the conflict, especially whether more negative or toxic remarks appear.
2. **Use social media discourse to predict future conflicts**: Explore whether discussions on platforms such as Twitter and Reddit can serve as a tool for identifying potential conflicts, so that measures can be taken in advance to avoid them.

To achieve these goals, the authors collected data from Twitter and Reddit for the periods before and after each conflict and analyzed it with advanced NLP techniques (both supervised and unsupervised). The results show that these techniques can predict toxicity changes on social media before and after a conflict with a relatively low error (about 1.2%).

### Main methods and techniques

1. **Data collection and processing**:
   - Four main data sets were collected: Ukraine-Russia before and after the conflict (URB and URA) and Hamas-Israel before and after the conflict (HIB and HIA).
   - Processing includes removing URLs, non-alphabetic characters, accents, and English stop words; tokenizing and lemmatizing with NLTK; and splitting hashtags in tweets with WordNinja.
2. **LDA topic modeling**:
   - LDA (Latent Dirichlet Allocation) is used for unsupervised topic modeling to determine whether documents can be grouped according to their text.
   - For the Ukraine-Russia conflict, 9 topics were identified; for the Hamas-Israel conflict, 7 topics were identified.
3. **Toxicity prediction**:
   - Toxicity is defined as content that promotes polarization between opposing sides, spreads distrust, and reinforces the "us-them" narrative.
   - The Detoxify library assigns a toxicity score to each lexical feature, ranging from 0.00 (completely non-toxic) to 1.00 (extremely toxic).
4. **Linear regression**:
   - A supervised linear regression (LR) model establishes a baseline toxicity prediction, in which URB and HIB are used to predict the toxicity scores of URA and HIA.
   - The average toxicity score of each document is calculated, and these scores are used together with the word frequency matrix to predict post-conflict toxicity scores.
5. **BERT model**:
   - As a contrast to linear regression, a BERT-based Transformer model is used for toxicity prediction.
   - The chosen model is a pre-trained BERT that was trained on Twitter and YouTube data to distinguish phishing, attacks, and cyberbullying.

### Results

1. **LDA topic modeling results**:
   - After the conflict began, the average and total toxicity scores of most topics increased significantly.
   - In the Ukraine-Russia data set, Topic 6 showed the largest change in toxicity.
2. **Linear regression and BERT model results**:
   - The two models behaved similarly when predicting post-conflict toxicity scores, with slightly better performance on the Hamas-Israel data set.
   - The linear regression model tended to underestimate high toxicity scores, while the BERT model's predictions were better clustered around the center.
3. **Accuracy comparison and threshold**:
   - Model accuracy was evaluated at different thresholds; performance improved as the threshold increased.
   - The optimal thresholds were 0.157 for the Ukraine-Russia data set and 0.099 for the Hamas-Israel data set.
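
The summary does not say exactly how the threshold enters the accuracy calculation. One plausible reading, offered here only as an assumption, is a tolerance-based accuracy: a prediction counts as correct when it falls within the threshold of the observed toxicity score. The minimal Python sketch below illustrates that reading. The function name `tolerance_accuracy` and the score lists are hypothetical and made up for illustration; only the threshold values 0.099 and 0.157 come from the text above.

```python
from typing import Sequence


def tolerance_accuracy(predicted: Sequence[float],
                       observed: Sequence[float],
                       threshold: float) -> float:
    """Fraction of predictions within `threshold` of the observed score.

    Assumed interpretation of "accuracy at a threshold"; the paper may
    define it differently.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one score pair is required")
    hits = sum(1 for p, o in zip(predicted, observed)
               if abs(p - o) <= threshold)
    return hits / len(predicted)


# Made-up toxicity scores in [0, 1], for illustration only (not paper data).
predicted = [0.10, 0.25, 0.40, 0.10, 0.80]
observed = [0.12, 0.50, 0.47, 0.30, 0.68]

for threshold in (0.05, 0.099, 0.157, 0.30):
    acc = tolerance_accuracy(predicted, observed, threshold)
    print(f"threshold={threshold:.3f}  accuracy={acc:.2f}")
```

Under this reading, accuracy can only stay flat or rise as the threshold widens, which is consistent with the reported trend. It also means a reported "optimal" threshold must rest on some additional criterion that the summary does not state.
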

### Discussion

The authors argue that LDA topic modeling and toxicity prediction can detect changes in users' language during a crisis and thereby identify social media discussions that may trigger conflicts. Governments and non-governmental organizations can use these prediction tools to monitor potential tensions and take timely measures to avoid escalation. In addition, policy makers and social media platforms can use these tools to understand in real time.