Abstract: Recent breakthroughs in natural language processing (NLP) have enabled the open-ended synthesis and comprehension of coherent text, translating theoretical algorithms into practical applications. Large language models (LLMs) have significantly impacted businesses such as report summarization software and copywriting. Observations indicate, however, that LLMs may exhibit social prejudice and toxicity, posing ethical and societal dangers when they are deployed irresponsibly. Large-scale benchmarks for accountable LLMs should consequently be developed. Although several empirical investigations reveal ethical difficulties in advanced LLMs, there is little systematic examination or user study of the risks and harmful behaviors of current LLM usage. To better inform future efforts to construct ethical LLMs responsibly, we apply a qualitative research method called ``red teaming'' to OpenAI's ChatGPT\footnote{In this paper, ChatGPT refers to the version released on Dec 15th.} to better understand the practical features of ethical dangers in recent LLMs. We analyze ChatGPT comprehensively from four perspectives: 1) \textit{Bias}, 2) \textit{Reliability}, 3) \textit{Robustness}, and 4) \textit{Toxicity}. In accordance with these perspectives, we empirically benchmark ChatGPT on multiple sample datasets. We find that a significant number of ethical risks cannot be addressed by existing benchmarks, and hence illustrate them via additional case studies. In addition, we examine the implications of our findings for AI ethics and the harmful behaviors of ChatGPT, as well as future problems and practical design considerations for responsible LLMs. We believe that our findings may shed light on future efforts to determine and mitigate the ethical hazards posed by machines in LLM applications.

Safeguarding Crowdsourcing Surveys from ChatGPT with Prompt Injection

Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries

SoK: Prompt Hacking of Large Language Models

Automatic and Universal Prompt Injection Attacks against Large Language Models

Assessing Prompt Injection Risks in 200+ Custom GPTs

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

An Early Categorization of Prompt Injection Attacks on Large Language Models

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity

ChatGPT: The End of Online Exam Integrity?

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Prompt Injection attack against LLM-integrated Applications

Prompt Stealing Attacks Against Large Language Models

Bot or Human? Detecting ChatGPT Imposters with A Single Question

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Efficient Detection of Toxic Prompts in Large Language Models

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

A Survey on Detection of LLMs-Generated Content

GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Safety Assessment of Chinese Large Language Models