A dataset for cyber threat intelligence modeling of connected autonomous vehicles

Yinghui Wang, Yilong Ren, Hongmao Qin, Zhiyong Cui, Yanan Zhao, Haiyang Yu
2024-10-19
Abstract: Cyber attacks have become a vital threat to connected autonomous vehicles in intelligent transportation systems. Cyber threat intelligence, as the collection of cyber threat information, provides an ideal approach for responding to emerging vehicle cyber threats and enabling proactive security defense. Obtaining valuable information from enormous cybersecurity data using knowledge extraction technologies to achieve cyber threat intelligence modeling is an effective means to ensure automotive cybersecurity. Unfortunately, there is no existing cybersecurity dataset available for cyber threat intelligence modeling research in the automotive field. This paper reports the creation of a cyber threat intelligence corpus focusing on vehicle cybersecurity knowledge mining. This dataset, annotated using a joint labeling strategy, comprises 908 real automotive cybersecurity reports, containing 3678 sentences, 8195 security entities and 4852 semantic relations. We further conduct a comprehensive analysis of cyber threat intelligence mining algorithms based on this corpus. The proposed dataset will serve as a valuable resource for evaluating the performance of existing algorithms and advancing research in cyber threat intelligence modeling within the automotive field.
Cryptography and Security
### What problems does this paper attempt to solve?

This paper addresses the cybersecurity threats faced by Connected Autonomous Vehicles (CAVs) in intelligent transportation systems. Specifically, it supports cybersecurity knowledge mining and modeling research by constructing a Cyber Threat Intelligence (CTI) dataset dedicated to the automotive field.

#### Background and Challenges

Connected autonomous vehicles show great potential for improving traffic efficiency, reducing congestion, and lowering accident rates. However, they also introduce new cybersecurity risks and vulnerabilities. Attackers can exploit these attack surfaces and even gain control of a vehicle. In recent years, the frequency, scale, and complexity of cyber attacks against connected autonomous vehicles have increased exponentially, which may lead to privacy leakage, economic losses, and personal injury, and may even endanger national security.

Existing security measures such as access control, firewalls, Intrusion Detection and Prevention Systems (IDPS), and Security Operations Centers (SOC) are effective but limited: they protect passively and can identify only a limited range of threats. There is therefore an urgent need for methods that enable active defense and timely response to unknown or emerging threats.

#### The Role of Cyber Threat Intelligence (CTI)

Cyber threat intelligence, the collection of cyber threat information, provides a way to respond to emerging vehicle cybersecurity threats. Extracting valuable information from large volumes of cybersecurity data to build CTI models is an effective means of strengthening automotive cybersecurity. However, no CTI dataset currently exists specifically for the automotive field, which limits progress in related research.

#### Main Contributions of the Paper

To address this gap, the authors created an automotive CTI dataset named **Acti**, which focuses on mining automotive cybersecurity entities and the relationships between them. The dataset contains 908 real-world automotive cybersecurity reports, comprising 3,678 sentences, 8,195 security entities, and 4,852 semantic relationships. The data are annotated using a joint annotation strategy covering 10 entity concepts and 10 semantic relationship categories, based on a defined automotive CTI ontology model. The authors also conducted a comprehensive analysis of CTI mining algorithms on this dataset and trained two CTI mining models to verify its reliability. The dataset fills the gap in automotive CTI data and provides a valuable resource for evaluating existing algorithms and advancing CTI modeling research.

### Key Formulas and Technical Details

- **Joint Annotation Format**: A label scheme for annotating entity boundaries, entity types, relationship types, and entity roles.
  - Entity boundaries use the "BIOES" (Begin, Inside, Outside, End, Single) format.
  - Entity types are divided into 10 categories: Component, Consequence, Identity, Vehicle, Location, Attack Vector, Attack Pattern, Tool, Vulnerability, Course of Action.
  - Relationship types include: "hasVulnerability", "hasInterface", "hasImpact", "targets", "uses", "mitigates", "related-to", "located-at", "based-on", "consists-of".
- **Deep Learning Model**:
  - **BERT-BiLSTM-att-CRF** model structure:
    - **Embedding Layer**: Converts the input text into word vector representations.
    - **Encoder Layer**: Uses a Bidirectional Long Short-Term Memory (BiLSTM) network to capture semantic information in the sequence.
    - **Attention Mechanism Layer**: Introduces a self-attention mechanism to focus on key information.
    - **Decode