Unmasking the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal

Fuka Matsuzaki, Haru-Tada Sato

2024-11-09

Abstract: This paper sheds light on the limitations of Large Language Models (LLMs) by rigorously evaluating their ability to process masked text. We introduce two novel tasks: MskQA, measuring reasoning on masked question-answering datasets like RealtimeQA, and MskCal, assessing numerical reasoning on masked arithmetic problems. Testing GPT-4o and 4o-mini reveals that while LLMs exhibit some resilience to masked text, their performance is highly contingent on masking rates and semantic cues. Specifically, "solid masking," where semantic clues are entirely absent, leads to a significant performance drop compared to "partial lifting," where some semantic information is retained, indicating LLMs' reliance on surface-level patterns. Interestingly, GPT-4o consistently outperforms 4o-mini, particularly in MskCal, demonstrating a greater ability to handle numerical reasoning with masked text. This underscores the crucial role of semantic cues in the reasoning process of LLMs. Our study illuminates the interplay between background knowledge and reasoning ability in masked text processing, paving the way for a deeper understanding of LLM capabilities and limitations, and highlighting the need for more robust evaluation methods to accurately assess their true comprehension abilities.

Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to reveal the limitations of large language models (LLMs) in handling masked text. The authors systematically evaluate how well LLMs reason over masked text through two new tasks:

1. **MskQA** (Masked Question Answering): Measures the reasoning ability of LLMs on masked question-answering datasets, such as RealtimeQA.
2. **MskCal** (Masked Calculation): Evaluates the numerical reasoning ability of LLMs on masked arithmetic problems.

Testing GPT-4o and 4o-mini, the study finds:

- LLMs show some resilience to masked text, but their performance depends strongly on the masking rate and on the semantic cues that remain.
- "Solid masking," where no semantic cues remain, leads to a significant performance drop compared with "partial lifting," where some semantic information is retained.
- GPT-4o consistently outperforms 4o-mini, particularly on MskCal, which indicates a greater ability to handle numerical reasoning with masked text.

These findings emphasize the critical role of semantic cues in the reasoning process of LLMs and reveal the interplay between background knowledge and reasoning ability. The results offer a new perspective on the capabilities and limitations of LLMs and highlight the need for more robust evaluation methods to assess their true comprehension abilities.
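The idea of masking text at a chosen rate, with or without leftover cues, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not describe the paper's actual masking procedure, so the function name `mask_words`, the `[MASK]` placeholder, word-level random selection, and the `hint` mode (which keeps the first character of each masked word as a stand-in for "some semantic information is retained") are all hypothetical and may differ from the authors' definitions of solid masking and partial lifting.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_words(text, rate, mode="solid", seed=0):
    """Mask round(rate * n_words) randomly chosen words of `text`.

    mode="solid": replace each chosen word with MASK (no cue left).
    mode="hint":  keep the first character and star out the rest
                  (an assumed stand-in for retaining a partial cue).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if mode not in ("solid", "hint"):
        raise ValueError("mode must be 'solid' or 'hint'")

    words = text.split()
    rng = random.Random(seed)  # seeded so runs are reproducible
    k = round(rate * len(words))
    chosen = set(rng.sample(range(len(words)), k))

    masked = []
    for i, word in enumerate(words):
        if i not in chosen:
            masked.append(word)
        elif mode == "solid":
            masked.append(MASK)
        else:
            masked.append(word[0] + "*" * (len(word) - 1))
    return " ".join(masked)


if __name__ == "__main__":
    question = "What is 12 plus 30 minus 7"
    for rate in (0.0, 0.5, 1.0):
        print(rate, "|", mask_words(question, rate, "solid"))
        print(rate, "|", mask_words(question, rate, "hint"))
```

Sweeping `rate` and comparing model accuracy on the two modes would mirror, in spirit, the comparison the summary describes; the selection strategy and placeholder format would need to be replaced with the paper's own.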

