Abstract: It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly relevant to develop capabilities to certify their performance in the presence of the most effective adversarial attacks. Our paper offers a new approach to certify the performance of machine learning models in the presence of adversarial attacks with population level risk guarantees. In particular, we introduce the notion of $(\alpha,\zeta)$ machine learning model safety. We propose a hypothesis testing procedure, based on the availability of a calibration set, to derive statistical guarantees ensuring that the probability of declaring that the adversarial (population) risk of a machine learning model is less than $\alpha$ (i.e. the model is safe), while the model is in fact unsafe (i.e. the model adversarial population risk is higher than $\alpha$), is less than $\zeta$. We also propose Bayesian optimization algorithms to determine efficiently whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack, along with statistical guarantees. We apply our framework to a range of machine learning models including various sizes of vision Transformer (ViT) and ResNet models impaired by a variety of adversarial attacks, such as AutoAttack, SquareAttack and natural evolution strategy attack, to illustrate the operation of our approach. Importantly, we show that ViTs are generally more robust to adversarial attacks than ResNets, and ViT-large is more robust than smaller models. Our approach goes beyond existing empirical adversarial risk-based certification guarantees. It formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.
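
One way to write down the guarantee described in the abstract is sketched below. The symbols $R_{\mathrm{adv}}$ (adversarial population risk) and $\hat{p}$ (p-value computed from the calibration set) are notation assumed here for illustration, not taken from the paper:

$$
H_0:\ R_{\mathrm{adv}} > \alpha, \qquad \text{declare the model safe} \iff \hat{p} < \zeta, \qquad \Pr_{H_0}\big(\hat{p} < \zeta\big) \le \zeta .
$$

Read this way, $\zeta$ bounds the chance of certifying a model whose true adversarial risk exceeds $\alpha$.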

What problem does this paper attempt to address?

The paper asks how to certify the performance of machine learning models under adversarial attacks with population-level risk guarantees. It proposes a method called PROSAC that evaluates a model's robustness to adversarial attacks in a way that can be used to satisfy regulatory requirements.

### Main Contributions of the Paper:

1. **Proposing the PROSAC Framework**: A new certification framework for deciding whether a machine learning model is robust to a specific adversarial attack. It is built on the notion of (α, ζ)-safety: the probability of declaring the model safe (adversarial population risk below a preset threshold α) when it is in fact unsafe (risk above α) is less than ζ.
2. **Bayesian Optimization Algorithm**: The paper proposes an improved Gaussian Process Upper Confidence Bound (GP-UCB) algorithm to efficiently approximate the p-value of the hypothesis testing problem. The algorithm needs far fewer queries than there are hyperparameter configurations available to the attacker.
3. **Strict Certification**: Because the testing procedure comes with statistical guarantees, the framework can rigorously certify the (α, ζ)-safety of a machine learning model under a specific adversarial attack.
4. **Experimental Validation**: The paper evaluates the (α, ζ)-safety of different machine learning models (such as Vision Transformer and ResNet) under different adversarial attacks (such as AutoAttack, SquareAttack, and Natural Evolution Strategies attack). The results show that Vision Transformers, especially the large variant, are generally more robust than ResNets.
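
To make the Bayesian optimization contribution more concrete, here is a minimal sketch of plain GP-UCB over a one-dimensional grid of attack hyperparameters. It is not the paper's improved variant: the kernel, `beta`, the noise level, the grid, and the `toy_p_value` objective are all assumptions chosen for illustration, and every function name is hypothetical. The point it illustrates is that the maximum of an expensive function can be approximated with far fewer queries than there are candidate configurations.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel on scalar inputs (assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_ucb_maximize(objective, grid, budget, beta=2.0, noise=1e-3):
    """Query `objective` at most `budget` times, picking each new point by
    maximizing posterior mean + beta * posterior std under a zero-mean GP."""
    xs = [grid[len(grid) // 2]]          # start in the middle of the grid
    ys = [objective(xs[0])]
    while len(xs) < budget:
        K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
        L = cholesky(K)
        weights = chol_solve(L, ys)      # K^{-1} y, reused for every candidate
        best_x, best_ucb = None, -math.inf
        for x in grid:
            if x in xs:                  # skip configurations already queried
                continue
            k_star = [rbf(a, x) for a in xs]
            mean = sum(k * w for k, w in zip(k_star, weights))
            var = 1.0 - sum(k * v for k, v in zip(k_star, chol_solve(L, k_star)))
            ucb = mean + beta * math.sqrt(max(var, 0.0))
            if ucb > best_ucb:
                best_x, best_ucb = x, ucb
        if best_x is None:               # grid exhausted
            break
        xs.append(best_x)
        ys.append(objective(best_x))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], len(xs)

def toy_p_value(lam):
    """Hypothetical stand-in for an expensive p-value evaluation at hyperparameter lam."""
    return 0.05 + 0.3 * math.exp(-((lam - 0.7) / 0.15) ** 2)

grid = [i / 100 for i in range(101)]     # 101 candidate configurations
best_lam, best_val, n_queries = gp_ucb_maximize(toy_p_value, grid, budget=15)
print(best_lam, best_val, n_queries)
```

In this toy setup only 15 of the 101 candidate configurations are evaluated. In a real pipeline each evaluation would involve running the attack on the calibration set, which is what makes a small query budget valuable.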
### Problem Background:

Autonomous machine learning systems are now widely deployed in fields such as healthcare, banking and finance, education, and e-commerce, and policymakers are drawing up detailed regulatory requirements to ensure that these systems are safe and reliable. The European Union is at the forefront, with regulations such as the EU Artificial Intelligence Act. These regulations require precise evaluation of an AI system's performance, including its accuracy and its resilience when subjected to interference or unauthorized use.

### Method Overview:

The paper uses hypothesis testing to determine whether a machine learning model is robust under adversarial attacks. The steps are:

1. **Hypothesis Testing Problem**: Set the null hypothesis H0 to be that the maximum adversarial risk is greater than α.
2. **Calculate p-value**: Compute a finite-sample p-value from the calibration set, taking the randomness of the attack into account.
3. **Reject the Null Hypothesis**: If the p-value is less than ζ, reject the null hypothesis and declare the model (α, ζ)-safe.

### Experimental Results:

- **Vision Transformer (ViT)**: ViTs are more robust than ResNets under adversarial attacks.
- **Large Models**: ViT-Large is more robust than smaller models.
- **Different Attacks**: The experiments confirm some known trends and reveal new ones in how different models hold up under different adversarial attacks.

### Conclusion:

The PROSAC framework provides a rigorous certification method for the robustness of machine learning models under adversarial attacks. Its guarantees can be used to satisfy regulatory requirements, and they support model selection and evaluation in practical applications.
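
The three steps in the method overview can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact test: it assumes a 0-1 loss, a calibration set of `n` i.i.d. examples independent of the model, a finite set of attack hyperparameter configurations, and a binomial-tail p-value. The function names and configuration labels are hypothetical.

```python
import math

def binomial_tail_p_value(num_errors, n, alpha):
    """P(Binomial(n, alpha) <= num_errors).

    Under the assumptions above, this is a valid finite-sample p-value for
    the null hypothesis 'adversarial risk >= alpha': observing few errors
    is unlikely if the true risk is that high.
    """
    return sum(
        math.comb(n, i) * alpha**i * (1 - alpha) ** (n - i)
        for i in range(num_errors + 1)
    )

def certify(errors_per_config, n, alpha, zeta):
    """Test H0: 'maximum adversarial risk over configurations > alpha'.

    The p-value for the maximum risk is taken as the largest per-configuration
    p-value, so the worst configuration decides the outcome.
    Returns (p_value, is_safe), where is_safe means H0 was rejected.
    """
    p_value = max(
        binomial_tail_p_value(k, n, alpha) for k in errors_per_config.values()
    )
    return p_value, p_value < zeta

# Hypothetical error counts on a calibration set of 200 adversarial examples.
errors = {"config_a": 5, "config_b": 8}
p_value, is_safe = certify(errors, n=200, alpha=0.1, zeta=0.05)
print(p_value, is_safe)
```

Taking the maximum over configurations is conservative: if even one configuration has risk above α, its own p-value already controls the error probability, and the maximum can only be larger.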

PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks
