Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen
2024-11-01
Abstract: We present Multi-expert Prompting, a novel enhancement of ExpertPrompting (Xu et al., 2023), designed to improve the large language model (LLM) generation. Specifically, it guides an LLM to fulfill an input instruction by simulating multiple experts, aggregating their responses, and selecting the best among individual and aggregated responses. This process is performed in a single chain of thoughts through our seven carefully designed subtasks derived from the Nominal Group Technique (Ven and Delbecq, 1974), a well-established decision-making framework. Our evaluations demonstrate that Multi-expert Prompting significantly outperforms ExpertPrompting and comparable baselines in enhancing the truthfulness, factuality, informativeness, and usefulness of responses while reducing toxicity and hurtfulness. It further achieves state-of-the-art truthfulness by outperforming the best baseline by 8.69% with ChatGPT. Multi-expert Prompting is efficient, explainable, and highly adaptable to diverse scenarios, eliminating the need for manual prompt construction.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses reliability and safety issues in the responses generated by large language models (LLMs). Specifically, the authors propose a new method called **Multi-expert Prompting** to improve the following aspects of LLM-generated responses:

1. **Truthfulness**: Ensuring that generated responses are consistent with facts and do not mislead.
2. **Factuality**: Ensuring that generated content is based on real data and facts.
3. **Toxicity**: Reducing harmful or offensive language in generated content.
4. **Hurtfulness**: Avoiding content that may cause emotional harm to users.
5. **Informativeness**: Providing more detail and more in-depth insight in generated content.
6. **Usefulness**: Ensuring that generated content has practical value for users and conveys information effectively.

### Method Overview

**Multi-expert Prompting** simulates multiple experts, aggregates their responses, and selects the best among the individual and aggregated responses. The specific steps are as follows:

1. **Expert and Response Generation**:
   - Given an input instruction, the model first generates the identities and brief descriptions of multiple experts.
   - Each expert independently responds to the input instruction, which yields a set of long-form expert responses.
2. **Expert Response Aggregation**:
   - Through 7 carefully designed sub-tasks, the expert responses are aggregated into a final response.
   - These sub-tasks include identifying consensus views, conflicting views, and unique perspectives, and ultimately selecting the best response.

### Main Contributions

1. **Performance Improvement**: Experimental results show that Multi-expert Prompting significantly outperforms existing baseline methods in truthfulness, factuality, non-toxicity, and non-hurtfulness.
2. **High Adaptability**: The method applies to a variety of scenarios without the need for manually constructed prompts.
3. **Strong Interpretability**: The 7 sub-tasks make the contribution of each step visible, which makes the method's output easier to interpret.

### Experimental Validation

The authors validated Multi-expert Prompting on multiple benchmarks, including TruthfulQA, FactualityPrompt, BOLD, and HONEST. The results show that Multi-expert Prompting significantly outperforms other methods on all metrics and reaches a new state of the art in truthfulness on TruthfulQA, outperforming the best baseline by 8.69% with ChatGPT.

### Conclusion

By integrating the perspectives of multiple experts, Multi-expert Prompting improves the quality of generated content and also makes the model's responses more reliable and safer, offering a new approach to improving LLM generation.
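
The two stages described in the Method Overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the function names, and all prompt wording are hypothetical. The aggregation prompt lists only the sub-tasks named in this summary, not the paper's full set of seven. It does keep the aggregation in a single model call, following the abstract's description of a single chain of thought.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]
Expert = Tuple[str, str]  # (identity, short description)


def generate_experts(llm: LLM, instruction: str, n_experts: int = 3) -> List[Expert]:
    """Stage 1a: ask the model to propose expert identities for the instruction."""
    prompt = (
        f"List {n_experts} experts best suited to answer the instruction below, "
        "one per line as 'Name: short description'.\n"
        f"Instruction: {instruction}"
    )
    experts: List[Expert] = []
    for line in llm(prompt).splitlines():
        if ":" in line:  # skip lines that do not follow the requested format
            name, description = line.split(":", 1)
            experts.append((name.strip(), description.strip()))
    return experts[:n_experts]


def expert_responses(llm: LLM, instruction: str, experts: List[Expert]) -> List[str]:
    """Stage 1b: each simulated expert answers the instruction independently."""
    return [
        llm(
            f"You are {name}, {description}. "
            f"Answer the instruction as this expert.\nInstruction: {instruction}"
        )
        for name, description in experts
    ]


def build_aggregation_prompt(
    instruction: str, experts: List[Expert], responses: List[str]
) -> str:
    """Stage 2: one prompt that walks through the aggregation steps in order.

    Assumption: these steps paraphrase the sub-tasks named in the summary;
    they are not the paper's exact seven sub-tasks.
    """
    answers = "\n\n".join(
        f"[{name}] {response}" for (name, _), response in zip(experts, responses)
    )
    return (
        f"Instruction: {instruction}\n\n"
        f"Expert answers:\n{answers}\n\n"
        "In one pass, work through these steps:\n"
        "1. State the viewpoints the experts agree on.\n"
        "2. State the viewpoints on which they conflict.\n"
        "3. State the viewpoints unique to a single expert.\n"
        "4. Write an aggregated answer from the above.\n"
        "5. Pick the best answer among the expert answers and the aggregated one."
    )


def multi_expert_answer(llm: LLM, instruction: str, n_experts: int = 3) -> str:
    """Run both stages and return the model's final output."""
    experts = generate_experts(llm, instruction, n_experts)
    responses = expert_responses(llm, instruction, experts)
    return llm(build_aggregation_prompt(instruction, experts, responses))
```

Any function that takes a prompt and returns text can be passed as `llm`, so the sketch can be exercised with a stub before wiring in a real model client. Keeping the aggregation in one prompt means the intermediate steps appear in the model's output, which is what makes each step's contribution inspectable.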