Unmasking the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal

Fuka Matsuzaki, Haru-Tada Sato
2024-11-09
Abstract: This paper sheds light on the limitations of Large Language Models (LLMs) by rigorously evaluating their ability to process masked text. We introduce two novel tasks: MskQA, measuring reasoning on masked question-answering datasets like RealtimeQA, and MskCal, assessing numerical reasoning on masked arithmetic problems. Testing GPT-4o and 4o-mini reveals that while LLMs exhibit some resilience to masked text, their performance is highly contingent on masking rates and semantic cues. Specifically, "solid masking," where semantic clues are entirely absent, leads to a significant performance drop compared to "partial lifting," where some semantic information is retained, indicating LLMs' reliance on surface-level patterns. Interestingly, GPT-4o consistently outperforms 4o-mini, particularly in MskCal, demonstrating a greater ability to handle numerical reasoning with masked text. This underscores the crucial role of semantic cues in the reasoning process of LLMs. Our study illuminates the interplay between background knowledge and reasoning ability in masked text processing, paving the way for a deeper understanding of LLM capabilities and limitations, and highlighting the need for more robust evaluation methods to accurately assess their true comprehension abilities.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to reveal the limitations of large language models (LLMs) in handling masked text. The authors systematically evaluate how well LLMs reason over masked text using two new tasks, MskQA (Masked Question Answering) and MskCal (Masked Calculation):

1. **MskQA**: Measures the reasoning ability of LLMs on masked question-answering datasets, such as RealtimeQA.
2. **MskCal**: Evaluates the numerical reasoning ability of LLMs on masked arithmetic problems.

Testing GPT-4o and 4o-mini, the study found:

- LLMs show some resilience to masked text, but their performance depends strongly on the masking rate and on the semantic cues available.
- "Solid masking," where no semantic cues remain, causes a significant performance drop compared with "partial lifting," where some semantic information is retained.
- GPT-4o consistently outperforms 4o-mini, most notably on MskCal, indicating a stronger ability to reason numerically over masked text.

These findings emphasize the critical role of semantic cues in the reasoning process of LLMs and reveal the interplay between background knowledge and reasoning ability. The results offer a new perspective on the capabilities and limitations of LLMs and highlight the need for more robust evaluation methods that accurately assess their true comprehension.
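
To make the notions of masking rate and semantic cues concrete, the following is a minimal sketch of how a text could be masked at a given rate. It is not the authors' procedure: the function name `mask_words`, the `[MASK]` placeholder, word-level masking, and the choice to model retained semantic information by keeping each masked word's first character are all assumptions made for illustration. The summary above does not specify how solid masking or partial lifting are actually implemented.

```python
import random


def mask_words(text, rate, mode="solid", seed=0):
    """Mask roughly `rate` of the whitespace-separated words in `text`.

    mode="solid":   each masked word becomes "[MASK]", leaving no cue.
    mode="partial": each masked word keeps its first character as a cue.
    Both modes are illustrative assumptions, not the paper's definitions.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if mode not in ("solid", "partial"):
        raise ValueError("mode must be 'solid' or 'partial'")

    words = text.split()
    n_mask = round(rate * len(words))
    # A seeded generator keeps the choice of masked positions reproducible.
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(len(words)), n_mask))

    out = []
    for i, word in enumerate(words):
        if i not in masked_positions:
            out.append(word)
        elif mode == "solid":
            out.append("[MASK]")
        else:
            # Keep the first character, hide the rest with underscores.
            out.append(word[0] + "_" * (len(word) - 1))
    return " ".join(out)


if __name__ == "__main__":
    question = "Tom has 3 apples and buys 4 more"
    for r in (0.25, 0.5, 1.0):
        print(r, "|", mask_words(question, r, "solid"))
        print(r, "|", mask_words(question, r, "partial"))
```

Sweeping `rate` while switching `mode` mirrors the comparison the summary describes: the same prompt is presented with more or fewer hidden words, and with or without residual surface cues, before being passed to a model.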