Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti, Tommaso Caselli, Malvina Nissim, Arianna Bisazza
2024-08-01
Abstract: Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates how well large language models (LLMs) solve Italian rebus puzzles. Specifically:

1. **Dataset Construction**: The paper proposes a method for deriving text-only (verbalized) versions of rebus puzzles from transcriptions of their intermediate solving steps, and uses it to build a dataset of over 80,000 puzzles.
2. **Model Evaluation**: Using this dataset, the authors evaluate state-of-the-art LLMs, both open-source and proprietary, on Italian rebus solving. Even general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on the task.
3. **Fine-Tuning Experiments**: The authors fine-tune a small but powerful LLM on the task, which significantly improves its performance. Further analysis, however, indicates that the gain comes mainly from memorization rather than from a genuine improvement in reasoning ability.
4. **Multi-Step Reasoning Challenge**: The study also exposes the limitations of current LLMs in multi-step reasoning, particularly in respecting constraints and following complex sequences of instructions.

In summary, by introducing a verbalized rebus-solving task, the paper highlights the shortcomings of current LLMs on such puzzles and provides a benchmark for future research.
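To make the "constrained multi-step reasoning" concrete, here is a minimal Python sketch of the final, mechanical step of solving a verbalized rebus: the letters and the words obtained from the clues are joined into one string and then re-split into the hidden phrase according to a list of word lengths. The function name `resegment`, the input format, and the toy puzzle are all assumptions made for illustration; none of them are taken from the paper or its dataset.

```python
def resegment(first_pass, key):
    """Join the solved pieces and re-split them by the given word lengths.

    first_pass: letters and clue answers in reading order (hypothetical format).
    key: length of each word in the hidden phrase (hypothetical format).
    """
    letters = "".join(first_pass)
    # The key acts as a hard constraint: it must account for every letter.
    if sum(key) != len(letters):
        raise ValueError("key lengths do not match the number of letters")
    words, pos = [], 0
    for length in key:
        words.append(letters[pos:pos + length])
        pos += length
    return " ".join(words)


# Toy example (invented, not from the dataset): the letter "P", then the
# clue answers "ARCO" and "GIOCHI", re-split into words of 5 and 6 letters.
print(resegment(["P", "ARCO", "GIOCHI"], [5, 6]))
```

The hard part for a model is upstream of this step: every clue has to be answered correctly, in order, so that the length constraint can be satisfied at all.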