Thematic context vector association based on event uncertainty for Twitter

Vaibhav Khatavkar, Swapnil Mane, Parag Kulkarni
2023-04-04
Abstract: Keyword extraction is a crucial process in text mining. Extracting keywords together with their contextual events from Twitter data is a major challenge, mainly because of the informality of the language used: misspelled words, acronyms, and ambiguous terms. Current systems extract keywords from informal language using pattern-based or event-based approaches. In this paper, contextual keywords are extracted using thematic events with the help of data association. In the proposed system, the thematic context for events is identified using the uncertainty principle. The thematic contexts are weighted with the help of vectors called thematic context vectors, which signify whether an event is certain or uncertain. The system is tested on the Twitter COVID-19 dataset and proves to be effective. It extracts event-specific thematic context vectors from the test dataset and ranks them. The extracted thematic context vectors are then used for clustering, which improves the silhouette coefficient by 0.5% over the state-of-the-art methods, namely TF and TF-IDF. The thematic context vector can also be used in other applications such as cyberbullying detection, sarcasm detection, and figurative language detection.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of extracting keywords tied to specific events from Twitter data, and of understanding and clustering those events through thematic context vectors. Specifically, it proposes a thematic context vector method based on event uncertainty to handle the informal language of tweets (misspellings, acronyms, and ambiguous terms). The method identifies event-specific contexts, weights each one with a vector that marks the event as certain or uncertain, and ranks the resulting vectors. In experiments on the Twitter COVID-19 dataset, clustering with these vectors is reported to improve the silhouette coefficient by 0.5% over the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) baselines. The authors also suggest the method could be applied to other tasks such as cyberbullying detection, sarcasm detection, and figurative language detection.
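
To make the evaluation metric concrete, the following is a minimal sketch of how a silhouette coefficient could be computed for the TF and TF-IDF baselines mentioned above. It does not implement the paper's thematic context vectors. The toy tweets, the hand-assigned cluster labels, the whitespace tokenizer, the smoothed IDF formula, and the use of cosine distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Raw term-frequency vectors over a shared, sorted vocabulary."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({t for toks in tokenized for t in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """TF-IDF vectors using idf = ln(N / df) + 1 (one common variant, assumed here)."""
    vocab, tf = tf_vectors(docs)
    n = len(docs)
    df = [sum(1 for v in tf if v[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / df[j]) + 1.0 for j in range(len(vocab))]
    return vocab, [[v[j] * idf[j] for j in range(len(vocab))] for v in tf]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def silhouette(vectors, labels):
    """Mean silhouette coefficient over all points, using cosine distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(vectors)
    scores = []
    for i in range(n):
        same = [cosine_distance(vectors[i], vectors[j])
                for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(cosine_distance(vectors[i], vectors[j])
                for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy tweets and hand-assigned cluster labels (not from the paper).
docs = [
    "covid vaccine rollout starts",
    "vaccine rollout covid update",
    "football match tonight",
    "tonight football final match",
]
labels = [0, 0, 1, 1]

_, tf = tf_vectors(docs)
_, tfidf = tfidf_vectors(docs)
tf_score = silhouette(tf, labels)
tfidf_score = silhouette(tfidf, labels)
print(f"TF silhouette:     {tf_score:.4f}")
print(f"TF-IDF silhouette: {tfidf_score:.4f}")
```

The silhouette coefficient ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. In a real comparison the labels would come from a clustering algorithm run on each representation rather than being fixed by hand.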