Abstract: A concern about cutting-edge or "frontier" AI foundation models is that an adversary may use the models for preparing chemical, biological, radiological, nuclear (CBRN), cyber, or other attacks. At least two methods can identify foundation models with potential dual-use capability, each with advantages and disadvantages: (A) open benchmarks (based on openly available questions and answers), which are low-cost but limited in accuracy by the need to omit security-sensitive details; and (B) closed red team evaluations (based on private evaluation by CBRN and cyber experts), which are higher-cost but can achieve higher accuracy by incorporating sensitive details. We propose a research and risk-management approach using a combination of methods, including both open benchmarks and closed red team evaluations, in a way that leverages the advantages of both. We recommend that one or more groups of researchers with sufficient resources and access to a range of near-frontier and frontier foundation models run a set of foundation models through dual-use capability benchmarks and red team evaluations, then analyze how strongly the models' benchmark scores correlate with their red team evaluation scores. If, as we expect, there is substantial correlation between the dual-use potential benchmark scores and the red team evaluation scores, then the implications include the following: open benchmarks should be used frequently during foundation model development as a quick, low-cost measure of a model's dual-use potential; and if a particular model gets a high score on the dual-use potential benchmark, then more in-depth red team assessments of that model's dual-use capability should be performed. We also discuss limitations of and mitigations for our approach, e.g., the possibility that model developers try to game benchmarks by including a version of benchmark test data in a model's training data.
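The correlation analysis and triage rule described in the abstract can be illustrated with a minimal sketch. The abstract does not say which correlation statistic would be used; Spearman rank correlation is an assumption here, chosen because benchmark and red team scores need not share a scale. The scores, the threshold value, and all function names are hypothetical and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 scores")
    return pearson(average_ranks(xs), average_ranks(ys))


def needs_red_team(benchmark_score, threshold):
    """Triage rule: flag a model for in-depth red teaming if its open
    benchmark score meets or exceeds a (hypothetical) threshold."""
    return benchmark_score >= threshold


# Hypothetical scores for five models (illustrative only, not real data).
benchmark_scores = [0.21, 0.35, 0.48, 0.62, 0.80]
red_team_scores = [1.0, 2.5, 2.0, 4.0, 4.5]

rho = spearman(benchmark_scores, red_team_scores)
print(f"Spearman rho = {rho:.2f}")  # Spearman rho = 0.90

THRESHOLD = 0.5  # hypothetical cut-off, not a value from the paper
flagged = [s for s in benchmark_scores if needs_red_team(s, THRESHOLD)]
print(flagged)  # [0.62, 0.8]
```

A high rank correlation on a real set of models would support using the open benchmark as a cheap early screen. The threshold itself would need to be calibrated against red team results rather than fixed in advance.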

Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models
