Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness

Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira
2024-04-19
Abstract: Knowledge sharing about emerging threats is crucial in the rapidly advancing field of cybersecurity and forms the foundation of Cyber Threat Intelligence (CTI). In this context, Large Language Models are becoming increasingly significant in the field of cybersecurity, presenting a wide range of opportunities. This study surveys the performance of ChatGPT, GPT4all, Dolly, Stanford Alpaca, Alpaca-LoRA, Falcon, and Vicuna chatbots in binary classification and Named Entity Recognition (NER) tasks performed using Open Source INTelligence (OSINT). We utilize well-established data collected in previous research from Twitter to assess the competitiveness of these chatbots when compared to specialized models trained for those tasks. In binary classification experiments, Chatbot GPT-4 as a commercial model achieved an acceptable F1 score of 0.94, and the open-source GPT4all model achieved an F1 score of 0.90. However, concerning cybersecurity entity recognition, all evaluated chatbots have limitations and are less effective. This study demonstrates the capability of chatbots for OSINT binary classification and shows that they require further improvement in NER to effectively replace specially trained models. Our results shed light on the limitations of the LLM chatbots when compared to specialized models, and can help researchers improve chatbots technology with the objective to reduce the required effort to integrate machine learning in OSINT-based CTI tools.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper evaluates how well large language model (LLM) chatbots perform on open-source intelligence (OSINT)-based cyber threat awareness tasks, and in particular whether they can compete with specially trained models. The research focuses on two tasks:

1. **Binary classification task**: Evaluate how well chatbots determine whether a tweet is related to cybersecurity.
2. **Named entity recognition (NER) task**: Evaluate how well chatbots identify cybersecurity-related entities (such as vulnerabilities and affected products).

### Research background and motivation

As the cybersecurity field develops rapidly, detecting and responding to emerging threats in a timely manner is crucial. To this end, researchers have explored using LLM chatbots to enhance cyber threat intelligence (CTI) capabilities. These chatbots can process and analyze intelligence data from public sources, helping organizations better cope with potential security risks.

### Research question

The core question of this research is: **Can LLM chatbots compete with specially trained models in OSINT-based CTI tasks?**

### Method

To answer this question, the researchers carried out the following work:

- **Dataset selection**: A labeled dataset of 31,281 tweets collected from Twitter was used. Each tweet was pre-processed and labeled as related or unrelated to cybersecurity.
- **Experimental design**: Two types of prompts were designed for each tweet:
  - **Binary classification prompt**: Ask whether the tweet is related to cybersecurity and require the chatbot to answer "yes" or "no".
  - **Named entity recognition prompt**: Require the chatbot to extract cybersecurity-related entity information from the tweet.
- **Evaluation metrics**: Standard evaluation metrics such as the F1 score were used to evaluate the chatbots and compare them with specially trained models.

### Results

In the binary classification task, the commercial GPT-4 chatbot achieves an F1 score of 0.94, and the open-source GPT4all model achieves an F1 score of 0.90. In the named entity recognition task, however, all the evaluated chatbots have limitations and are less effective than specially trained models.

### Conclusion

This research reveals both the potential and the limitations of LLM chatbots in OSINT-based CTI tasks. They perform well in binary classification but still need improvement in named entity recognition before they can replace specially trained models. The results point to directions for future work on making chatbots more suitable for the cybersecurity field.
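
To make the binary classification setup described under Method more concrete, the following is a minimal Python sketch of how a yes/no prompt could be built, how a free-text reply could be mapped to a label, and how the F1 score could be computed. It is an illustration under stated assumptions, not the authors' code: the prompt wording and the names `build_binary_prompt`, `parse_yes_no`, and `f1_score` are hypothetical, and no chatbot API is called.

```python
from typing import Iterable, Optional


def build_binary_prompt(tweet: str) -> str:
    """Build a yes/no prompt. The wording is an assumption, not the paper's prompt."""
    return (
        "Is the following tweet related to cybersecurity? "
        'Answer only "yes" or "no".\n'
        f"Tweet: {tweet}"
    )


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a chatbot reply to True (yes), False (no), or None if it is ambiguous."""
    # Look only at the first word, ignoring case and simple punctuation.
    words = reply.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def f1_score(y_true: Iterable[bool], y_pred: Iterable[bool]) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy replies standing in for chatbot output; these are not real results.
    replies = ["Yes, it is.", "No.", "yes", "No"]
    labels = [True, True, False, False]
    preds = [parse_yes_no(r) is True for r in replies]
    print(build_binary_prompt("New RCE vulnerability found in a popular router"))
    print(f1_score(labels, preds))
```

Replies that cannot be parsed (where `parse_yes_no` returns `None`) need an explicit policy. The usage example counts them as "no", which is a design choice of this sketch rather than something stated in the paper.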