Commit2Vec: Learning Distributed Representations of Code Changes

Rocío Cabrera Lozoya, Arnaud Baumann, Antonino Sabetta, Michele Bezzi
DOI: https://doi.org/10.1007/s42979-021-00566-z
2021-03-19
SN Computer Science
Abstract: Deep learning methods have found successful applications in fields like image classification and natural language processing. They have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach for source code representation, which uses information about its syntactic structure, and we extend it to represent source code changes (i.e., commits). We use this representation to tackle an industrially relevant task: the classification of security-relevant commits. We leverage transfer learning, a machine learning technique which reuses, or transfers, information learned from previous tasks (commonly called pretext tasks) to tackle a new target task. We assess the impact of using two different pretext tasks, for which abundant labeled data is available, to tackle the classification of security-relevant commits. Our results indicate that representations that exploit the structural information in code syntax outperform token-based representations. Furthermore, we show that pre-training on a small dataset (>10^4 samples), but for a pretext task that is closely related to the target task, results in better performance metrics than pre-training on a loosely related pretext task with a very large dataset (>10^6 samples).