A RAG-Based Question-Answering Solution for Cyber-Attack Investigation and Attribution

Sampath Rajapaksha, Ruby Rani, Erisa Karafili
2024-08-13
Abstract: In the constantly evolving field of cybersecurity, it is imperative for analysts to stay abreast of the latest attack trends and pertinent information that aids in the investigation and attribution of cyber-attacks. In this work, we introduce the first question-answering (QA) model and its application that provides information to cybersecurity experts about cyber-attack investigation and attribution. Our QA model is based on Retrieval Augmented Generation (RAG) techniques together with a Large Language Model (LLM) and provides answers to the users' queries based on either our knowledge base (KB), which contains curated information about cyber-attack investigation and attribution, or on outside resources provided by the users. We have tested and evaluated our QA model with various types of questions, including KB-based, metadata-based, specific documents from the KB, and external sources-based questions. We compared the answers for KB-based questions with those from OpenAI's GPT-3.5 and the latest GPT-4o LLMs. Our proposed QA model outperforms OpenAI's GPT models by providing the source of the answers and overcoming the hallucination limitations of the GPT models, which is critical for cyber-attack investigation and attribution. Additionally, our analysis showed that the RAG QA model generates better answers when given few-shot examples than when given zero-shot instructions, that is, when no examples are supplied in addition to the query.
Cryptography and Security
What problem does this paper attempt to address?
The paper addresses the following problem: in the ever-evolving field of cybersecurity, how can analysts efficiently and accurately obtain the latest information related to cyber-attack investigation and attribution? Specifically, it proposes a question-answering (QA) model based on Retrieval-Augmented Generation (RAG) to address the following challenges:

1. **Complexity of cyber-attack attribution**: Attribution involves identifying attackers and their tools, tactics, techniques, and procedures (TTPs), geolocating them, and determining the individuals or organizations behind the attacks. This process is complex and resource-intensive and usually requires manual effort.
2. **Large amount of unstructured data**: Information about cyber-attacks is scattered across texts in various formats (reports, PDFs, blogs, etc.) with no standardized structure, which makes it difficult to extract meaningful intelligence.
3. **Limitations of large language models (LLMs)**:
   - **Hallucination**: LLMs may generate misleading or entirely fictional answers.
   - **Stale knowledge**: An LLM's knowledge may be out of date, so it cannot provide the latest cybersecurity information.

To address these problems, the paper proposes a RAG-based QA model that combines an LLM with an external knowledge base and generates more accurate and reliable answers by retrieving relevant contexts. Its specific contributions include:

- **Specialized knowledge base and question-answer pair dataset**: Used to develop and evaluate the RAG model, specifically for cyber-attack investigation and attribution.
- **RAG-based QA model**: Provides accurate answers and improves reliability by citing sources, which reduces LLM hallucination.
- **Chat interface**: Supports question answering over the knowledge base, private repositories, and online resources.
- **Performance evaluation**: Evaluates the model's reliability and context retrieval ability using multiple metrics (such as faithfulness, answer relevance, and context precision).

With these contributions, the model aims to help cybersecurity experts investigate and attribute cyber-attacks more effectively, so that they can formulate effective countermeasures and mitigate future threats.
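The retrieve-then-cite idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy knowledge base, the source names, the bag-of-words cosine retriever, and the helper names `retrieve` and `build_prompt` are all assumptions made for illustration, and no LLM is called. The sketch only shows how retrieved passages can be tagged with their sources and how optional examples turn a zero-shot prompt into a few-shot one.

```python
import math
import re
from collections import Counter

# Hypothetical toy knowledge base: entries and source names are invented
# for illustration and do not come from the paper's curated KB.
KB = [
    {"source": "report_a.pdf",
     "text": "The intrusion used spear phishing emails to deliver a remote access trojan."},
    {"source": "blog_b.html",
     "text": "Ransomware operators encrypted file servers and demanded payment in cryptocurrency."},
    {"source": "report_c.pdf",
     "text": "Attribution relied on overlaps in command and control infrastructure and malware code reuse."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, kb, k=2):
    """Return up to k KB entries with non-zero similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc["text"]))), doc) for doc in kb]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, contexts, examples=()):
    """Assemble a prompt with numbered, source-tagged contexts.

    With examples=() the prompt is zero-shot; passing (question, answer)
    pairs makes it few-shot.
    """
    lines = [
        "Answer using only the numbered contexts and cite them like [1].",
        "If the contexts do not contain the answer, say so.",
        "",
    ]
    for i, (q, a) in enumerate(examples, start=1):
        lines += [f"Example {i}", f"Q: {q}", f"A: {a}", ""]
    lines.append("Contexts:")
    for i, doc in enumerate(contexts, start=1):
        lines.append(f"[{i}] ({doc['source']}) {doc['text']}")
    lines += ["", f"Q: {query}", "A:"]
    return "\n".join(lines)


if __name__ == "__main__":
    question = "How was attribution performed using malware code reuse?"
    hits = retrieve(question, KB)
    print(build_prompt(question, hits))
```

In a full system the assembled prompt would be sent to an LLM, and the `source` field of each retrieved entry would be shown alongside the answer so the analyst can verify it. A production retriever would more plausibly use dense embeddings and a vector store than the word-count similarity used here.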