Can Large Language Models Understand Context?

Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site Li, Yuan Zhang, Hong Yu, Bo-Hsiang Tseng
2024-02-02
Abstract: Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. This benchmark comprises four distinct tasks and nine datasets, all featuring prompts designed to assess the models' ability to understand context. First, we evaluate the performance of LLMs under the in-context learning pretraining scenario. Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.
Computation and Language
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) understand context. The authors point out that although LLMs perform well across many natural language processing tasks, relatively little work has probed their ability to understand and process contextual features. To address this, the paper introduces a context understanding benchmark, built by adapting existing datasets so that they can be used to evaluate generative models. The benchmark comprises four tasks and nine datasets, with prompts designed to assess a model's ability to understand context. The specific questions addressed are:

1. **Evaluation of context understanding**: Using a benchmark of four tasks (coreference resolution, dialogue state tracking, implicit discourse relation classification, and query rewriting), measure how LLMs perform on different context understanding tasks.
2. **Comparison between pre-trained and fine-tuned models**: Examine how pre-trained dense models differ from state-of-the-art fine-tuned models in understanding more nuanced contextual features.
3. **Impact of model compression**: Evaluate the context understanding of quantized models in the in-context learning setting, in particular the performance reduction caused by 3-bit post-training quantization.

Together, these studies aim to provide a comprehensive evaluation framework that clarifies the strengths and limitations of LLMs in context understanding and shows how model compression affects that ability.
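To make the idea of 3-bit post-training quantization more concrete, the following is a minimal sketch of uniform round-to-nearest quantization applied to a handful of weights. This summary does not say which quantization algorithm the paper uses, so the round-to-nearest scheme, the function names `quantize_rtn` and `dequantize`, and the sample weights are all assumptions chosen for illustration only.

```python
def quantize_rtn(weights, bits=3):
    """Map floats to integer codes with uniform round-to-nearest quantization.

    Illustrative assumption: one scale and offset shared by all weights.
    """
    levels = 2 ** bits - 1  # 3 bits give 8 codes (0..7), i.e. 7 steps
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights; no retraining happens, hence "post-training".
weights = [-0.42, -0.10, 0.03, 0.27, 0.55]
codes, scale, lo = quantize_rtn(weights, bits=3)
restored = dequantize(codes, scale, lo)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

With only eight representable values per weight, some precision is necessarily lost, which gives an intuition for why a quantized model may score lower on context-sensitive tasks than its full-precision counterpart.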