Abstract: Recent breakthroughs in natural language processing (NLP) have enabled the open-ended synthesis and comprehension of coherent text, translating theoretical algorithms into practical applications. Large language models (LLMs) have significantly impacted businesses such as report summarization software and copywriting. Observations indicate, however, that LLMs may exhibit social prejudice and toxicity, posing ethical and societal dangers that stem from irresponsible use. Large-scale benchmarks for accountable LLMs should therefore be developed. Although several empirical investigations reveal the existence of a few ethical difficulties in advanced LLMs, there has been little systematic examination or user study of the risks and harmful behaviors of current LLM usage. To inform future efforts to construct ethical LLMs responsibly, we apply a qualitative research method called ``red teaming'' to OpenAI's ChatGPT\footnote{In this paper, ChatGPT refers to the version released on Dec 15th.} to better understand the practical features of ethical dangers in recent LLMs. We analyze ChatGPT comprehensively from four perspectives: 1) \textit{Bias}, 2) \textit{Reliability}, 3) \textit{Robustness}, and 4) \textit{Toxicity}. In line with these perspectives, we empirically benchmark ChatGPT on multiple sample datasets. We find that a significant number of ethical risks cannot be addressed by existing benchmarks, and hence illustrate them via additional case studies. In addition, we examine the implications of our findings for AI ethics and the harmful behaviors of ChatGPT, as well as future problems and practical design considerations for responsible LLMs. We believe that our findings may shed light on future efforts to identify and mitigate the ethical hazards posed by machines in LLM applications.

Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias

You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

Mitigating Social Bias in Large Language Models: A Multi-Objective Approach within a Multi-Agent Framework

Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework

Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

Protecting marginalized communities by mitigating discrimination in toxic language detection

Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

Unveiling the Implicit Toxicity in Large Language Models

Self-Detoxifying Language Models via Toxification Reversal

Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings

Towards Understanding and Mitigating Social Biases in Language Models

Evaluating Psychological Safety of Large Language Models

Eliminating Position Bias of Language Models: A Mechanistic Approach

Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models

Watch Your Language: Investigating Content Moderation with Large Language Models

PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models