Abstract:Code comments play a crucial role in software development, as they provide programmers with practical information, allowing them to understand better the intent and semantics of the underpinning code. Nevertheless, developers tend to leave comments unchanged after updating the code, resulting in a discrepancy between the two artifacts. Such a discrepancy may trigger misunderstanding and confusion among developers, impeding various activities, including code comprehension and maintenance. Thus, it is crucial to identify if, given a code snippet, its corresponding comment is coherent and reflects well the intent behind the code. Unfortunately, existing approaches to this problem, while obtaining an encouraging performance, either rely on heavily pre-trained models, or treat input data as text, neglecting the intrinsic features contained in comments and code, including word order and synonyms. This work presents Co3D as a practical approach to the detection of code comment coherence. We pay attention to internal meaning of words and sequential order of words in text while predicting coherence in code-comment pairs. We deployed a combination of Gensim word2vec encoding and a simple recurrent neural network, a combination of Gensim word2vec encoding and an LSTM model, and CodeBERT. The experimental results show that Co3D obtains a promising prediction performance, thus outperforming well-established baselines. We conclude that depending on the context, using a simple architecture can introduce a satisfying prediction.

What problem does this paper attempt to address?

This paper attempts to address the issue of consistency between code comments and their corresponding code snippets. Specifically, the author points out that developers often fail to update comments synchronously after updating the code, resulting in inconsistency between comments and code. This inconsistency may lead to misunderstandings and confusion, which in turn affects the understanding and maintenance of the code. To solve this problem, the author proposes a method named Co3D, which aims to detect the consistency of code comments. Co3D combines word embeddings and long - short - term memory network (LSTM) to capture the internal meaning of words and their order in the code and comments, and uses three different combinations of techniques for prediction: 1. **C1**: Gensim word2vec + Simple Recurrent Neural Network (Simple RNN) 2. **C2**: Gensim word2vec + LSTM 3. **C3**: Tokenization + CodeBERT Experimental verification shows that Co3D outperforms existing baseline models, such as SVM and BERT, on multiple datasets. This indicates that even with a relatively simple architecture, satisfactory results can be achieved in detecting the consistency of code comments. ### Formula Presentation When describing the experimental results, the paper does not involve complex formula derivations. However, to ensure understanding, we can use Markdown format to show some basic concepts: - **Word Embedding**: Map each word in the text to a vector space, usually represented as: \[ \mathbf{w} \in \mathbb{R}^{d} \] where \(d\) is the embedding dimension. - **LSTM Unit**: LSTM is a special type of Recurrent Neural Network (RNN) used to process sequence data. Its core formulas include the forget gate, input gate, and output gate: \[ f_t=\sigma\left(W_f \cdot\left[h_{t - 1}, x_t\right]+b_f\right) \] \[ i_t=\sigma\left(W_i \cdot\left[h_{t - 1}, x_t\right]+b_i\right) \] \[ o_t=\sigma\left(W_o \cdot\left[h_{t - 1}, x_t\right]+b_o\right) \] \[ \tilde{C}_t = \tanh\left(W_C \cdot\left[h_{t - 1}, x_t\right]+b_C\right) \] \[ C_t=f_t\odot C_{t - 1}+i_t\odot\tilde{C}_t \] \[ h_t=o_t\odot\tanh\left(C_t\right) \] These formulas show how LSTM effectively captures long - term dependencies in sequence data through the gating mechanism. ### Summary The main contribution of this paper is to propose a simple and effective solution (Co3D) for detecting the consistency of code comments. By combining traditional machine learning methods (such as Gensim word2vec and LSTM) and pre - trained models (such as CodeBERT), Co3D not only outperforms existing baseline models in performance but also proves the effectiveness of simple architectures in specific tasks. This finding challenges the current widespread assumption in the field that pre - trained models are "silver bullets".

When simplicity meets effectiveness: Detecting code comments coherence with word embeddings and LSTM

Just-In-Time Obsolete Comment Detection and Update.

Text Coherence Analysis Based on Deep Neural Network.

DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies

Why My Code Summarization Model Does Not Work: Code Comment Improvement with Category Prediction

Code Comments: A Way of Identifying Similarities in the Source Code

From Code to Natural Language: Type-Aware Sketch-Based Seq2Seq Learning

Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

Code to Comment "Translation": Data, Metrics, Baselining & Evaluation

LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition

ComFormer: Code Comment Generation via Transformer and Fusion Method-based Hybrid Code Representation

Towards Usable Neural Comment Generation Via Code-Comment Linkage Interpretation: Method and Empirical Study

Integrating Extractive and Abstractive Models for Code Comment Generation

CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings

Deep Code Comment Generation with Hybrid Lexical and Syntactical Information

A ML-LLM pairing for better code comment classification

Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs

Neural Comment Generation for Source Code with Auxiliary Code Classification Task

Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

An Intra-Class Relation Guided Approach for Code Comment Generation.

DeepCommenter: a Deep Code Comment Generation Tool with Hybrid Lexical and Syntactical Information