Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?

Bonan Kou, Shengmai Chen, Zhijie Wang, Lei Ma, Tianyi Zhang
DOI: https://doi.org/10.1145/3660807
2024-05-24
Abstract: Large Language Models (LLMs) have recently been widely used for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. We made the first attempt to bridge this knowledge gap by investigating whether LLMs attend to the same parts of a task description as human programmers during code generation. An analysis of six LLMs, including GPT-4, on two popular code generation benchmarks revealed a consistent misalignment between LLMs' and programmers' attention. We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors. Finally, a user study showed that model attention computed by a perturbation-based method is often favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.
Software Engineering, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) attend to the same parts of a task description as human programmers do when generating code. Specifically, the authors explore the following research questions:

1. **Degree of Consistency between Model Attention and Human Attention** (RQ1): To what extent is the attention of LLMs consistent with that of human programmers?
2. **Can Attention Explain Errors in Code-Generation Models?** (RQ2): Can the model's attention patterns be used to explain errors in code generation?
3. **The Impact of Different Attention-Calculation Methods on Attention Consistency** (RQ3): How do different attention-calculation methods affect the consistency between model and human attention?
4. **Which Attention-Calculation Method is Most Popular among Programmers?** (RQ4): Which attention-calculation method do programmers prefer in practice?

To answer these questions, the authors conducted a large-scale study: they analyzed six LLMs of different scales on two popular code-generation benchmarks and created a programmer-attention dataset containing 1,138 programming tasks. Through this work, the authors aim to reveal how LLMs process natural-language input during code generation and to find ways of improving LLMs so that they align better with how human programmers read a task.
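To make the idea of comparing model attention with human attention more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a leave-one-out perturbation (remove one token, measure how much a score drops) and a simple top-k overlap ratio as the agreement measure. The function names, the toy scoring function, and the example task are all hypothetical; the paper's actual perturbation method and alignment metrics may differ.

```python
from typing import Callable, Dict, List, Sequence, Set


def perturbation_scores(
    tokens: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> List[float]:
    """Leave-one-out importance: drop in score_fn when a token is removed."""
    base = score_fn(tokens)
    scores = []
    for i in range(len(tokens)):
        perturbed = list(tokens[:i]) + list(tokens[i + 1:])
        scores.append(base - score_fn(perturbed))
    return scores


def top_k_indices(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def overlap_ratio(model_idx: Set[int], human_idx: Set[int]) -> float:
    """Fraction of human-marked tokens that the model also ranks in its top k."""
    if not human_idx:
        return 0.0
    return len(model_idx & human_idx) / len(human_idx)


# Hypothetical stand-in for a model's confidence in its generated code.
# A real study would query the LLM here; this toy just sums keyword weights.
TOY_WEIGHTS: Dict[str, float] = {"sum": 0.5, "even": 0.3, "numbers": 0.1}


def toy_score(tokens: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


task = ["return", "the", "sum", "of", "even", "numbers"]
human_marked = {2, 4, 5}  # assumed human labels: "sum", "even", "numbers"

scores = perturbation_scores(task, toy_score)
model_top = top_k_indices(scores, k=2)
agreement = overlap_ratio(model_top, human_marked)
```

In this toy setup the model's two most important tokens are "sum" and "even", so it covers two of the three tokens the hypothetical programmer marked. A low ratio on real tasks would indicate the kind of misalignment the abstract describes.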