Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, Joaquin Vanschoren, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Gong, Philip Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, ZHITAO YING, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, Yue Zhao

Abstract: Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. First, our findings show that, in general, trustworthiness and capability (i.e., functional effectiveness) are positively related. Second, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones, suggesting that open-source models can achieve high levels of trustworthiness without additional mechanisms such as a moderator, offering valuable insights for developers in this field. Third, some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Besides these observations, we have uncovered key insights into the multifaceted trustworthiness of LLMs. We emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. We advocate that establishing an AI alliance between industry, academia, and the open-source community to foster collaboration is imperative to advance the trustworthiness of LLMs.

Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

The Real, the Better: Aligning Large Language Models with Online Human Behaviors

Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Position: TrustLLM: Trustworthiness in Large Language Models

Aligning Large Multimodal Models with Factually Augmented RLHF

Enhancing Large Language Models' Situated Faithfulness to External Contexts

TrustLLM: Trustworthiness in Large Language Models

When to Trust LLMs: Aligning Confidence with Response Quality

Progressively Label Enhancement for Large Language Model Alignment

Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

Unsupervised Large Language Model Alignment for Information Retrieval Via Contrastive Feedback

XTRUST: On the Multilingual Trustworthiness of Large Language Models

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

Improving Retrieval Augmented Language Model with Self-Reasoning

Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models

Large Language Model Alignment: A Survey

Aligning Large Language Models via Fine-grained Supervision

On the Calibration of Large Language Models and Alignment