Abstract: In the constantly evolving field of cybersecurity, it is imperative for analysts to stay abreast of the latest attack trends and of information that aids in the investigation and attribution of cyber-attacks. In this work, we introduce the first question-answering (QA) model and application that provides cybersecurity experts with information about cyber-attack investigation and attribution. Our QA model combines Retrieval Augmented Generation (RAG) techniques with a Large Language Model (LLM) and answers users' queries based either on our knowledge base (KB), which contains curated information about cyber-attack investigation and attribution, or on outside resources provided by the users. We have tested and evaluated our QA model with various types of questions, including KB-based questions, metadata-based questions, questions about specific documents in the KB, and questions based on external sources. We compared the answers to KB-based questions with those from OpenAI's GPT-3.5 and the latest GPT-4o LLMs. Our proposed QA model outperforms OpenAI's GPT models by providing the source of its answers and overcoming the hallucination limitations of the GPT models, which is critical for cyber-attack investigation and attribution. Additionally, our analysis showed that the RAG QA model generates better answers when it is given few-shot examples in addition to the query than when it is given zero-shot instructions with no examples.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to help analysts in the ever-evolving field of cybersecurity obtain the latest information on cyber-attack investigation and attribution efficiently and accurately. Specifically, the paper proposes a question-answering (QA) model based on Retrieval-Augmented Generation (RAG) to address the following challenges:

1. **Complexity of cyber-attack attribution**: Cyber-attack attribution involves identifying attackers and their tools, tactics, techniques, and procedures (TTPs), geolocating them, and determining the individuals or organizations behind the attacks. This process is complex and resource-intensive and usually requires manual effort.
2. **Large amount of unstructured data**: Information about cyber-attacks is scattered across texts of various formats (such as reports, PDFs, and blogs). This information lacks a standardized format, making it difficult to extract meaningful intelligence from it.
3. **Limitations of large language models (LLMs)**:
   - **Hallucination problem**: LLMs may generate misleading or completely fictional answers.
   - **Lag in knowledge update**: The knowledge of LLMs may be out of date, leaving them unable to provide the latest cybersecurity information.

To solve these problems, the paper proposes a RAG-based QA model that combines large language models with external knowledge bases and generates more accurate and reliable answers by retrieving relevant contexts. Specific contributions include:

- **Specialized knowledge base and question-answer pair data set**: Used for developing and evaluating the RAG model, specifically for cyber-attack investigation and attribution.
- **RAG-based QA model**: Capable of providing accurate answers and improving reliability by citing sources, which reduces the hallucination problem of LLMs.
- **Chat interface**: Supports question answering based on knowledge bases, private repositories, and network resources.
- **Performance evaluation**: Evaluates the reliability and context retrieval ability of the model through multiple indicators (such as fidelity, answer relevance, and context precision).

Through these improvements, the model aims to help cybersecurity experts conduct cyber-attack investigation and attribution more effectively, and thereby formulate effective countermeasures and mitigate future threats.
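The retrieve-then-answer idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the knowledge-base entries, the file names, the bag-of-words cosine scoring, and the helper names `retrieve` and `build_prompt` are all assumptions made for illustration, and a real system would more likely use dense embeddings and pass the prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical toy knowledge base: (source, text) pairs invented for this sketch.
KB = [
    ("report-a.pdf", "The phishing campaign delivered a loader through malicious email attachments."),
    ("blog-b.html", "Attribution relies on tactics, techniques, and procedures observed across incidents."),
    ("report-c.pdf", "The ransomware group encrypted file servers and demanded payment."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, kb, k=2):
    """Return up to k (source, text) pairs that share vocabulary with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), source, text) for source, text in kb]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, text) for score, source, text in scored[:k] if score > 0]


def build_prompt(query, passages):
    """Assemble a prompt that asks for an answer grounded in numbered, sourced context."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(passages, 1)
    )
    return (
        "Answer using only the context below and cite sources by number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


if __name__ == "__main__":
    question = "How did the phishing campaign deliver the loader?"
    print(build_prompt(question, retrieve(question, KB)))
```

Because each retrieved passage carries its source label into the prompt, the generated answer can point back to the document it came from, which is the property the summary credits with reducing hallucination.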

A RAG-Based Question-Answering Solution for Cyber-Attack Investigation and Attribution

RAG based Question-Answering for Contextual Response Prediction System

Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

QA-RAG: Exploring LLM Reliance on External Knowledge

A Multi-Source Retrieval Question Answering Framework Based on RAG

TRAQ: Trustworthy Retrieval Augmented Question Answering via Conformal Prediction

PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models

CRAG -- Comprehensive RAG Benchmark

Hierarchical Retrieval-Augmented Generation Model with Rethink for Multi-hop Question Answering

Rationale-Guided Retrieval Augmented Generation for Medical Question Answering

Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

Evaluating Retrieval-Augmented Generation Models for Financial Report Question and Answering

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

Beyond-RAG: Question Identification and Answer Generation in Real-Time Conversations

Advancing Question-Answering in Ophthalmology with Retrieval Augmented Generations (RAG): Benchmarking Open-source and Proprietary Large Language Models

From RAG to QA-RAG: Integrating Generative AI for Pharmaceutical Regulatory Compliance Process

Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference

Meta Knowledge for Retrieval Augmented Large Language Models

Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering

Enhancing classroom teaching with LLMs and RAG