CodeCSE: A Simple Multilingual Model for Code and Comment Sentence Embeddings

Anthony Varkey, Siyuan Jiang, Weijing Huang
2024-07-09
Abstract: Pretrained language models for code token embeddings are used in code search, code clone detection, and other code-related tasks. Code function embeddings are similarly useful in these tasks, but the current literature offers no out-of-the-box models for function embeddings. This paper therefore proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in one shared space. We evaluated CodeCSE using code search. CodeCSE's multilingual zero-shot approach is as efficient as the models finetuned from GraphCodeBERT for specific languages. CodeCSE is open source at https://github.com/emu-se/codecse and the pretrained model is available on the HuggingFace public hub at https://huggingface.co/sjiang1/codecse
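Because functions and their descriptions are embedded in one space, code search can be treated as nearest-neighbor retrieval: embed the query, then rank candidate functions by similarity. The following is a minimal sketch of that ranking step only, not the authors' implementation. It assumes cosine similarity as the scoring function, and the three-dimensional vectors and function names are hypothetical stand-ins for real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_functions(query_embedding, function_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, vec))
        for name, vec in function_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real model would produce
# much higher-dimensional vectors for each function.
function_embeddings = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.1, 0.9, 0.1],
    "send_request": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a natural-language query such as
# "open a file and return its contents".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_functions(query_embedding, function_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a zero-shot setting, the same ranking step would apply to any supported programming language without per-language finetuning, since every function and description lands in the same embedding space.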
Software Engineering