Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

Cassandra L. Jacobs, Loïc Grobol, Alvin Tsang
2024-10-29
Abstract: In this work we compare the generative behavior at the next token prediction level in several language models by comparing them to human productions in the cloze task. We find that while large models trained for longer are typically better estimators of human productions, they reliably under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements of or models of the cloze task.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether the next-word predictions of language models (LMs) match what humans produce in a large-scale cloze task. The researchers compare the next-token generative behavior of several language models with human responses to the same sentence contexts. They find that larger models trained for longer are generally better estimators of human productions, but that all models still show systematic problems: they under-estimate the probabilities of human responses, rank rare responses too high, rank top responses too low, and produce semantic spaces that are highly distinct from the human one. These problems indicate that LM generations cannot serve as a substitute for, or a model of, the human cloze task.

### Main Problems

1. **Consistency between Language Models and Human Generative Behavior**:
   - The researchers evaluate whether language models accurately simulate human language production by comparing model and human behavior in the cloze task.
2. **Accuracy of Probability Estimation**:
   - Language models are biased estimators of human responses: they under-estimate the probabilities of human responses, and they rank top responses too low and rare responses too high.
3. **Differences in Semantic Space**:
   - The responses generated by language models and by humans occupy clearly different regions of semantic space, which suggests that language models do not fully capture the complexity and diversity of human language production.

### Research Methods

- **Data Set**: The cloze norms data set of Peelle et al. (2020), which contains 3,085 English sentences, each with at least 100 manually verified human responses.
- **Models**: Several neural language models are compared, including GPT-2, RoBERTa, and the Pythia series.
- **Experimental Design**:
  - **Experiment 1**: Compare the next-word probability distributions produced by language models and by humans.
  - **Experiment 2**: Evaluate how accurately language models rank human responses.
  - **Experiment 3**: Analyze how model size and training progress affect the correlation between model rankings and human response rankings.
  - **Experiment 4**: Use cluster analysis to compare how human and model responses are distributed in semantic space.

### Key Findings

- **Probability Estimation Problems**: Language models generally under-estimate the probabilities of human responses, and the gap is largest for high-probability responses.
- **Ranking Problems**: Language models rank rare responses too high and common responses too low.
- **Differences in Semantic Space**: Human and model responses separate clearly in semantic space, indicating that language models do not adequately capture the semantic properties of human responses.

### Conclusion

Through a series of experiments, this paper identifies key shortcomings of language models in the cloze task, most notably in probability estimation, response ranking, and semantic space. The findings indicate that although modern language models perform very well on some tasks, they cannot replace or accurately simulate human cloze behavior. Future research should explore how to bring language model predictions closer to human language production.
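
To make the comparisons in Experiments 1 and 2 more concrete, here is a minimal Python sketch of how human cloze probabilities can be estimated from raw responses and then compared with model probabilities. It is an illustration under stated assumptions, not the authors' code: the function names (`cloze_probabilities`, `average_ranks`, `spearman`), the example responses, and the model probabilities are all invented, and Spearman rank correlation is assumed here because the summary does not say which ranking statistic the paper uses.

```python
from collections import Counter


def cloze_probabilities(responses):
    """Relative frequency of each human response to one sentence context."""
    counts = Counter(r.strip().lower() for r in responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def average_ranks(scores):
    """Rank words by descending score (1 = highest); ties share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of words tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(human, model):
    """Spearman correlation over the words present in both distributions."""
    words = [w for w in human if w in model]
    n = len(words)
    if n < 2:
        return float("nan")  # not enough shared words to correlate
    h = average_ranks({w: human[w] for w in words})
    m = average_ranks({w: model[w] for w in words})
    mean_h = sum(h.values()) / n
    mean_m = sum(m.values()) / n
    cov = sum((h[w] - mean_h) * (m[w] - mean_m) for w in words)
    var_h = sum((h[w] - mean_h) ** 2 for w in words)
    var_m = sum((m[w] - mean_m) ** 2 for w in words)
    if var_h == 0 or var_m == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (var_h * var_m) ** 0.5


# Invented data: ten human completions of one hypothetical sentence context.
human_responses = ["knife"] * 6 + ["spoon"] * 3 + ["fork"]
human = cloze_probabilities(human_responses)

# Invented model next-token probabilities for the same three words.
model = {"knife": 0.20, "fork": 0.08, "spoon": 0.05}

# Experiment 1 analogue: human probability minus model probability per response.
gaps = {w: human[w] - model[w] for w in human}

# Experiment 2 analogue: agreement between human and model rankings.
rho = spearman(human, model)
```

In this invented case the model assigns less probability than humans do to all three responses, and it swaps the order of "spoon" and "fork", which gives a rank correlation of 0.5. A real analysis would take the model probabilities from an actual LM and aggregate over all sentence contexts in the data set.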