Abstract: Deep neural networks (DNNs) have achieved unprecedented success across many scientific and engineering fields over the last few decades. Despite their empirical success, recent studies have shown that DNN models have various failure modes and blind spots, such as vulnerability to adversarial examples and small perturbations, which may result in unexpected serious failures and potential harms. This is unacceptable, especially for safety-critical and high-stakes real-world applications, including healthcare, self-driving cars, aircraft control systems, hiring, and malware detection protocols. Moreover, it has been challenging to understand why and when DNNs will fail due to their complicated structures and black-box behaviors. Lack of interpretability is a critical issue that may seriously hinder the deployment of DNNs in high-stakes applications, which need interpretability to trust predictions, understand potential failures, and mitigate harms and eliminate biases in the model. To make DNNs trustworthy and reliable for deployment, it is necessary and urgent to develop methods and tools that can (i) quantify and improve their robustness against adversarial and natural perturbations, and (ii) understand their underlying behaviors and correct errors to prevent injuries and damages. These are important first steps toward enabling Trustworthy AI and Trustworthy Machine Learning. In this talk, I will survey a series of research efforts in my lab that tackle the grand challenges in (i) and (ii). In the first part of my talk, I will give an overview of our research in Robust Machine Learning since 2017, in which we have proposed the first attack-agnostic robustness evaluation metric, the first efficient robustness certification algorithms for various types of perturbations, and efficient robust learning algorithms spanning supervised learning to deep reinforcement learning.
In the second part of my talk, I will survey a series of exciting results from my lab on accelerating interpretable machine learning and explainable AI. Specifically, I will show how we can bring interpretability into deep learning by leveraging recent advances in multi-modal models. I will present recent work from our group on automatically dissecting neural networks with open-vocabulary concepts and designing interpretable neural networks without concept labels, and briefly review our recent efforts on demystifying the black-box DNN training process, automated neuron explanations for Large Language Models, and the first robustness evaluation of a family of neuron-level interpretation techniques.

Fidelity - A Property of Deep Neural Networks to Measure the Trustworthiness of Prediction Results.

Fidelity: Towards Measuring the Trustworthiness of Neural Network Classification

F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI

Towards Robust Fidelity for Evaluating Explainability of Graph Neural Networks

Fed-Credit: Robust Federated Learning with Credibility Management

Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning

Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors

A Trustworthiness Score to Evaluate DNN Predictions

Fidelity of Interpretability Methods and Perturbation Artifacts in Neural Networks

Backdoor Watermarking Deep Learning Classification Models With Deep Fidelity

Deep fidelity in DNN watermarking: A study of backdoor watermarking for classification models

Discovering Differential Features: Adversarial Learning for Information Credibility Evaluation

Fault Tolerance of Neural Networks in Adversarial Settings

Towards Trustworthy Deep Learning

Trustworthy machine learning in the context of security and privacy

Enhancing trustworthy deep learning for image classification against evasion attacks: a systematic literature review

Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

FRAUDability: Estimating Users' Susceptibility to Financial Fraud Using Adversarial Machine Learning

Trust but Verify: An Information-Theoretic Explanation for the Adversarial Fragility of Machine Learning Systems, and a General Defense against Adversarial Attacks