What problem does this paper attempt to address?

This paper asks whether language models (LMs) performing next-word prediction reproduce human production behavior in a large-scale cloze task. The researchers compared the next-word predictions of multiple language models with human responses to the same sentence frames. They found that although larger language models are generally better predictors of human productions, they still show significant problems: they underestimate the probability of human responses, overestimate the probability of rare responses, underestimate the probability of top-ranked responses, and produce highly distinct semantic spaces. These problems indicate that LM generations cannot serve as a substitute for, or a model of, human cloze performance.

### Main Problems

1. **Consistency between Language Models and Human Generative Behavior**:
   - The researchers evaluate whether language models can accurately simulate human language production by comparing model and human performance on cloze tasks.
2. **Accuracy of Probability Estimation**:
   - Language models are biased when estimating the probability of human responses: they underestimate the probability of common responses and overestimate the probability of rare responses.
3. **Differences in Semantic Space**:
   - Text generated by language models and text generated by humans occupy significantly different regions of semantic space, which suggests that language models may not fully capture the complexity and diversity of human language production.

### Research Methods

- **Data Set**: The cloze norming data set of Peelle et al. (2020), which contains 3,085 English sentences, each with at least 100 manually verified human responses.
- **Models**: Multiple neural language models, including GPT-2, RoBERTa, and the Pythia series.
- **Experimental Design**:
  - **Experiment 1**: Compare the next-word probability distributions produced by language models and by humans.
  - **Experiment 2**: Evaluate how accurately language models rank human responses.
  - **Experiment 3**: Analyze how model size and training progress affect the correlation between model and human response rankings.
  - **Experiment 4**: Use cluster analysis to compare how human and model productions are distributed in semantic space.

### Key Findings

- **Probability Estimation Problems**: Language models generally underestimate the probability of human responses, and perform especially poorly on high-probability responses.
- **Ranking Problems**: Language models rank rare responses too high and common responses too low.
- **Differences in Semantic Space**: Human and model productions are clearly separated in semantic space, indicating that language models fall short in capturing the semantic features of human language production.

### Conclusion

Through a series of experiments, this paper identifies key shortcomings of language models on cloze tasks, particularly in probability estimation and semantic space. The findings indicate that although modern language models perform excellently on some tasks, they still cannot replace or accurately simulate human language production. Future research needs to explore how to bring language models closer to human language production.
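The comparison behind Experiments 1 and 2 can be illustrated with a small sketch. This is not the paper's code: the function names, the toy sentence frame, and the model probabilities below are all hypothetical, and Spearman rank correlation is assumed here as one plausible way to compare rankings. The sketch turns raw human cloze responses into relative frequencies, then checks how a model's probabilities for the same words compare in rank and in magnitude.

```python
from collections import Counter


def cloze_probabilities(responses):
    """Relative frequency of each human response for one sentence frame."""
    counts = Counter(r.strip().lower() for r in responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (var_x * var_y) ** 0.5


def compare(human_responses, model_probs):
    """Compare human cloze frequencies with model probabilities for one frame."""
    human = cloze_probabilities(human_responses)
    words = sorted(human)
    h = [human[w] for w in words]
    # Words the model never scored are treated as probability 0 (an assumption).
    m = [model_probs.get(w, 0.0) for w in words]
    return {
        "rank_correlation": spearman(h, m),
        "model_mass_on_human_responses": sum(m),
        "underestimated": [w for w, hp, mp in zip(words, h, m) if mp < hp],
    }


# Toy data for a hypothetical frame such as "He spread butter on the ___".
human_responses = ["bread"] * 6 + ["toast"] * 3 + ["roll"]
# Made-up model probabilities, not taken from any real model.
model_probs = {"bread": 0.30, "toast": 0.20, "roll": 0.01, "table": 0.05}

result = compare(human_responses, model_probs)
print(result)
```

In this toy case the model orders the human responses correctly yet assigns each of them less probability than humans do, which is the kind of pattern the summary describes as underestimating human responses.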

Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

Language models are better than humans at next-token prediction

Language Models Outperform Cloze Predictability in a Cognitive Model of Reading

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

Humans and language models diverge when predicting repeating text

Large Language Models aren't all that you need

Can large language models understand uncommon meanings of common words?

Predict the Next Word: Humans exhibit uncertainty in this task and language models _____

Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures

Are Some Words Worth More than Others?

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Tokenization Falling Short: On Subword Robustness in Large Language Models

Identifying and Analyzing Task-Encoding Tokens in Large Language Models

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

Better & Faster Large Language Models via Multi-token Prediction

Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons