Abstract: It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly important to be able to certify their performance under the most effective adversarial attacks. This paper offers a new approach to certifying the performance of machine learning models under adversarial attacks, with population-level risk guarantees. In particular, we introduce the notion of $(\alpha,\zeta)$-safety for machine learning models. We propose a hypothesis testing procedure, based on the availability of a calibration set, that yields the following statistical guarantee: the probability of declaring the model safe (i.e. that its adversarial population risk is less than $\alpha$) when the model is in fact unsafe (i.e. its adversarial population risk is higher than $\alpha$) is less than $\zeta$. We also propose Bayesian optimization algorithms to determine efficiently whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack, along with statistical guarantees. To illustrate the operation of our approach, we apply the framework to a range of machine learning models, including Vision Transformer (ViT) and ResNet models of various sizes, under a variety of adversarial attacks such as AutoAttack, SquareAttack and the natural evolution strategy attack. Importantly, we show that ViTs are generally more robust to adversarial attacks than ResNets, and that ViT-large is more robust than smaller models. Our approach goes beyond existing empirical adversarial risk-based certification guarantees: it formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.
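The hypothesis test described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the loss is 0/1 (each calibration example is either broken by the attack or not), the attack and its hyperparameters are fixed in advance (so no Bayesian optimization is involved), and the p-value is the exact binomial lower tail, which may differ from the p-value the authors use. The names `binomial_cdf` and `is_alpha_zeta_safe` are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def is_alpha_zeta_safe(adv_errors, alpha, zeta):
    """Test H0: adversarial risk >= alpha (the model is unsafe).

    adv_errors: list of 0/1 losses on an i.i.d. calibration set, where 1 means
                the attacked example was misclassified (assumed 0/1 loss).
    Returns (declared_safe, p_value). Under H0 the chance of declaring the
    model safe is at most zeta, because the binomial lower tail evaluated at
    alpha is a valid p-value for this one-sided null.
    """
    n = len(adv_errors)
    k = sum(adv_errors)
    p_value = binomial_cdf(k, n, alpha)
    return p_value <= zeta, p_value


if __name__ == "__main__":
    # Hypothetical calibration outcome: 10 adversarial errors out of 200.
    losses = [1] * 10 + [0] * 190
    safe, p = is_alpha_zeta_safe(losses, alpha=0.10, zeta=0.05)
    print(f"declared safe: {safe}, p-value: {p:.4f}")
```

Declaring safety only when the p-value falls below $\zeta$ is what bounds the probability of certifying an unsafe model; a small empirical error rate on its own would carry no such guarantee.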

Safe machine learning model release from Trusted Research Environments: The SACRO-ML package

Machine learning models in trusted research environments - understanding operational risks

Disclosure control of machine learning models from trusted research environments (TRE): New challenges and opportunities

secml: A Python Library for Secure and Explainable Machine Learning

Concrete Safety for ML Problems: System Safety for ML Development and Assessment

PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks

SecurityNet: Assessing Machine Learning Vulnerabilities on Public Models

The Explabox: Model-Agnostic Machine Learning Transparency & Analysis

Systematic Attack Surface Reduction For Deployed Sentiment Analysis Models

Security and Machine Learning in the Real World

Towards the Science of Security and Privacy in Machine Learning

SoK: Explainable Machine Learning for Computer Security Applications

SecureML: A System for Scalable Privacy-Preserving Machine Learning

SoK: Machine Learning Governance

Machine Learning Models Disclosure from Trusted Research Environments (TRE), Challenges and Opportunities

SecureMLDebugger: A Privacy-Preserving Machine Learning Debugging Tool

Unsolved Problems in ML Safety

Beyond the ML Model: Applying Safety Engineering Frameworks to Text-to-Image Development

GRAIMATTER Green Paper: Recommendations for disclosure control of trained Machine Learning (ML) models from Trusted Research Environments (TREs)

Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models

ML-On-Rails: Safeguarding Machine Learning Models in Software Systems: A Case Study