Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized the AI field but also pose potential safety and ethical risks. Deciphering LLMs' embedded values becomes crucial for assessing and mitigating their risks. Despite extensive investigation into LLMs' values, previous studies heavily rely on human-oriented value systems in social sciences. Then, a natural question arises: Do LLMs possess unique values beyond those of humans? Delving into it, this work proposes a novel framework, ValueLex, to reconstruct LLMs' unique value system from scratch, leveraging psychological methodologies from human personality/value research. Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs, synthesizing a taxonomy that culminates in a comprehensive value framework via factor analysis and semantic clustering. We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system. Based on this system, we further develop tailored projective tests to evaluate and analyze the value inclinations of LLMs across different model sizes, training methods, and data sources. Our framework fosters an interdisciplinary paradigm of understanding LLMs, paving the way for future AI alignment and regulation.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper asks three questions: Do large language models (LLMs) possess unique values beyond those described by human value systems? If so, what do those values look like? And how can they be evaluated?
### Background and motivation
Recent progress in large language models has advanced the field of artificial intelligence, but it has also introduced potential safety and ethical risks. Understanding the values embedded in these models is crucial for assessing and mitigating those risks. Most existing research, however, evaluates LLMs' values against human-oriented value systems borrowed from the social sciences. This raises a natural question: Do LLMs have unique, non-human values?
### Research methods
To answer these questions, the paper proposes a new framework, **ValueLex**, which reconstructs the value system of LLMs from scratch and then evaluates their value inclinations. It proceeds in two stages:
1. **Generative value construction**:
- **Lexical Hypothesis**: The framework assumes that an LLM's important values are encoded as single words in its internal parameter space.
- **Value extraction**: Using designed abductive reasoning and summarization, value-descriptive words are collected from more than 30 LLMs under different settings.
- **Factor analysis and semantic clustering**: Factor analysis and semantic clustering identify the most representative value-descriptive words, yielding a comprehensive value framework with three core dimensions (Competence, Character, and Integrity) and their subdimensions.
2. **Projective test evaluation**:
- **Test creation**: A series of sentence-completion tests is designed; each LLM generates continuations for the given sentence stems, and these continuations reflect its values.
- **Result analysis**: A classifier maps the generated responses into the quantified value space, and a score is computed for each model on each value dimension.
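
The result-analysis step can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword lists below are a hypothetical stand-in for the trained classifier, the names `KEYWORDS`, `classify`, and `dimension_scores` are invented for illustration, and scoring a dimension as its share of classified completions is an assumption about how scores might be aggregated.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional

# The three core dimensions named in the abstract.
DIMENSIONS = ("Competence", "Character", "Integrity")

# Hypothetical keyword lists standing in for a trained value classifier.
KEYWORDS = {
    "Competence": {"accurate", "reliable", "efficient", "helpful"},
    "Character": {"kind", "empathetic", "patient", "friendly"},
    "Integrity": {"honest", "fair", "transparent", "ethical"},
}


def classify(response: str) -> Optional[str]:
    """Toy classifier: return the dimension with the most keyword hits, or None."""
    words = {w.strip(".,!?;:").lower() for w in response.split()}
    hits = {dim: len(words & kws) for dim, kws in KEYWORDS.items()}
    best = max(DIMENSIONS, key=lambda d: hits[d])  # ties go to the earlier dimension
    return best if hits[best] > 0 else None


def dimension_scores(responses: list[str]) -> dict[str, float]:
    """Score each dimension as its share of the completions that were classified."""
    labels = [classify(r) for r in responses]
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DIMENSIONS}


if __name__ == "__main__":
    # Made-up completions of sentence stems, for illustration only.
    completions = [
        "My goal is to be accurate and efficient.",
        "I believe in being honest and fair.",
        "I always try to give reliable answers.",
        "The weather is nice.",
    ]
    print(dimension_scores(completions))
```

In this toy run, the last completion matches no keyword and is left out of the denominator, so the scores are shares of the three classified completions. A real evaluation would replace `classify` with the learned classifier and use the model's actual continuations.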
### Main findings
1. **Competence dimension**: LLMs generally value Competence, but models differ. For example, Mistral and Tulu place more emphasis on Competence, while Baichuan leans more towards Integrity.
2. **Influence of training methods**: Pretrained models show no significant value inclinations; instruction-tuning makes the dimensions more consistent, and alignment further diversifies the values.
3. **Effect of model size**: Larger models show a stronger preference for Competence, but the other value dimensions may be neglected.
### Contributions
- **Value system**: Reveals, for the first time, the three core value dimensions of LLMs, together with their subdimensions and structure.
- **Evaluation tool**: Develops tailored projective tests for evaluating the underlying value inclinations of LLMs.
- **Impact analysis**: Examines how factors such as model size and training method affect the value inclinations of LLMs, and discusses how LLM values differ from human values.
### Conclusion
Although the value system of LLMs shares certain similarities with that of humans, LLM values are more specialized and reflect clear human expectations. This suggests that intentional guidance, such as alignment, can effectively change an AI system's underlying value system. The system still lacks the dynamic and motivational aspects of human values, however, which points to a direction for further work on AI values.