Raj Jagtap, Abhinav Kumar, Rahul Goel, Shakshi Sharma, Rajesh Sharma, Clint P. George
Abstract: Millions of people use platforms such as YouTube, Facebook, Twitter, and other mass media. Due to the accessibility of these platforms, they are often used to establish a narrative, conduct propaganda, and disseminate misinformation. This work proposes an approach that uses state-of-the-art NLP techniques to extract features from video captions (subtitles). To evaluate our approach, we utilize a publicly accessible and labeled dataset for classifying videos as misinformation or not. The motivation behind exploring video captions stems from our analysis of video metadata. Attributes such as the number of views, likes, dislikes, and comments are ineffective, as videos are hard to differentiate using this information. Using the caption dataset, the proposed models can classify videos among three classes (Misinformation, Debunking Misinformation, and Neutral) with 0.85 to 0.90 F1-score. To emphasize the relevance of the misinformation class, we re-formulate our classification problem as a two-class classification: Misinformation vs. others (Debunking Misinformation and Neutral). In our experiments, the proposed models can classify videos with 0.92 to 0.95 F1-score and 0.78 to 0.90 AUC ROC.
What problem does this paper attempt to address?
This paper addresses how to detect videos containing misinformation on YouTube. Specifically, the authors propose a caption-based method: natural language processing (NLP) techniques extract features from video captions, and machine-learning models then classify each video according to whether it contains misinformation.
### Problem Background
Social media platforms such as YouTube, Facebook, and Twitter are widely used to share information, but they have also become important channels for misinformation, propaganda, and rumors. On YouTube in particular, the large user base and the volume of uploaded content make the spread of misinformation difficult to control. This affects public perception and may also cause wider social problems, so an effective method for detecting and filtering misinformation videos is needed.
### Solution
To solve this problem, the authors propose a method based on video captions. The specific steps are as follows:
1. **Data Collection and Pre-processing**: Use an existing, publicly accessible YouTube video dataset that covers five topics (vaccine controversy, 9/11 conspiracy theory, chemtrail conspiracy theory, moon landing conspiracy theory, and flat earth theory) and in which each video is labeled as one of three categories: misinformation, debunking misinformation, or neutral. In addition, the authors develop a script to scrape video captions from YouTube.
2. **Feature Extraction**: Convert the video caption text into a numerical vector representation. To this end, the authors use four pre-trained word embedding models (Stanford GloVe Wikipedia vectors in 100D and 300D, Word2Vec Google News in 300D, and Word2Vec Twitter in 200D), and vectorize each caption as a weighted average of its word vectors.
3. **Model Building**: Build a multi-class classification model that assigns videos to three categories (misinformation, debunking misinformation, neutral). To emphasize the importance of the misinformation category, the authors also reformulate the task as a binary classification problem (misinformation vs. the other two categories). To deal with class imbalance, the authors use SMOTE (Synthetic Minority Over-sampling Technique).
4. **Performance Evaluation**: Evaluate model performance with metrics such as F1-score, AUC ROC, precision, and recall. The experimental results show that the proposed models achieve an F1-score of 0.85 to 0.90 on the three-class task, and an F1-score of 0.92 to 0.95 and an AUC ROC of 0.78 to 0.90 on the binary task.
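
The feature extraction, binary relabeling, and F1 evaluation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional vectors stand in for the pre-trained GloVe or Word2Vec embeddings, the term-frequency weighting is an assumption (the summary says only "weighted average"), and the names `TOY_VECTORS`, `embed_caption`, `to_binary`, and `f1_score` are hypothetical.

```python
from collections import Counter

# Toy 3-dimensional "embeddings" with made-up values. A real run would load
# pre-trained GloVe or Word2Vec vectors (100 to 300 dimensions) instead.
TOY_VECTORS = {
    "vaccine": [1.0, 0.0, 0.0],
    "hoax": [0.0, 1.0, 0.0],
    "study": [0.0, 0.0, 1.0],
}


def embed_caption(caption, vectors):
    """Return the term-frequency-weighted average of the known word vectors.

    Assumption: weights are raw term counts; the source does not specify
    which weighting scheme the authors used.
    """
    counts = Counter(w for w in caption.lower().split() if w in vectors)
    total = sum(counts.values())
    dim = len(next(iter(vectors.values())))
    if total == 0:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [
        sum(n * vectors[w][i] for w, n in counts.items()) / total
        for i in range(dim)
    ]


def to_binary(label):
    """Collapse the three classes into misinformation (1) vs. others (0)."""
    return 1 if label == "misinformation" else 0


def f1_score(y_true, y_pred, positive=1):
    """Compute the F1-score of the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(embed_caption("vaccine hoax hoax study", TOY_VECTORS))
    labels = ["misinformation", "neutral", "debunking misinformation"]
    print([to_binary(label) for label in labels])
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a fuller pipeline, the caption vectors would be passed to a classifier after oversampling the minority class with SMOTE; those steps are omitted here to keep the sketch dependency-free.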
### Main Contributions
- Propose a misinformation detection method based on video captions, addressing the limitation that video metadata alone (such as the number of views or likes) cannot effectively distinguish the categories.
- Show experimentally that caption-based feature extraction and classification can identify misinformation videos, with F1-scores of 0.85 to 0.90 on the three-class task and 0.92 to 0.95 on the binary task.
### Conclusion
This research shows how natural language processing techniques can extract features from video captions and how machine-learning models can use those features to classify misinformation videos effectively. Future work could improve the caption embedding method, for example by developing an embedding model specifically for YouTube videos to improve classification performance.