Abstract: The Artificial Intelligence (AI) research community has used ad-hoc benchmarks to measure the "intelligence" level of Large Language Models (LLMs). In humans, intelligence is closely linked to the functional integrity of the prefrontal lobes, which are essential for higher-order cognitive processes. Previous research has found that LLMs struggle with cognitive tasks that rely on these prefrontal functions, highlighting a significant challenge in replicating human-like intelligence. In November 2022, OpenAI released ChatGPT, a chatbot based on the GPT-3.5 model that quickly gained popularity for its impressive ability to understand and respond to human instructions, suggesting a significant step towards intelligent behaviour in AI. To rigorously investigate LLMs' level of "intelligence", we therefore evaluated GPT-3.5 and GPT-4 through a neuropsychological assessment, using Italian-language tests routinely employed to assess prefrontal functioning in humans. The same tests were also administered to Claude2 and Llama2 to verify whether comparable language models perform similarly on prefrontal tests. With human performance as the reference, GPT-3.5 showed inhomogeneous results: performance was well above average on some tests, in the lower range on others, and frankly impaired on the rest. Specifically, we identified poor planning abilities and difficulty in recognising semantic absurdities and in understanding others' intentions and mental states. Claude2 exhibited a pattern similar to that of GPT-3.5, while Llama2 performed poorly on almost all tests. These inconsistent profiles highlight that LLMs' emergent abilities do not yet mimic human cognitive functioning. The sole exception was GPT-4, which performed within the normative range on all tasks except planning. Furthermore, we showed that standardised neuropsychological batteries developed to assess human cognitive functions may be suitable for challenging LLMs' performance.

GePpeTto Carves Italian into a Language Model

Unipa-GPT: Large Language Models for university-oriented QA in Italian

IT5: Text-to-text Pretraining for Italian Language Understanding and Generation

LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language

Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations

The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian

Challenging large language models' "intelligence" with human tools: A neuropsychological investigation in Italian language on prefrontal functioning

Neural Generative Models and the Parallel Architecture of Language: A Critical Review and Outlook

RoGPT2: Romanian GPT2 for Text Generation

A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages

Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian

Neural Poetry: Learning to Generate Poems using Syllables

Large Language Models Are State-of-the-Art Evaluators of Translation Quality

On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML

mGPT: Few-Shot Learners Go Multilingual

How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation

Camoscio: an Italian Instruction-tuned LLaMA

Language Models are Few-Shot Learners

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies