Abstract: Code idioms are commonly used patterns, techniques, or practices that help solve particular problems or tasks across multiple software projects. They can improve code quality, performance, and maintainability, and they promote standardization and reuse across projects. However, identifying code idioms is challenging, and existing studies suffer from three main limitations. First, they struggle to recognize idioms that span non-contiguous code lines. Second, they have difficulty identifying idioms with intricate data flow and code structures. Third, they extract only dataset-specific idioms, so common idioms or well-established code and design patterns that rarely appear in the datasets cannot be identified. To overcome these limitations, we propose a novel approach, named IdioMine, to automatically extract generic and specific idioms from both Java projects and libraries. We perform program analysis on Java functions to transform them into concise program dependence graphs (PDGs) that integrate the data flow and control flow of code fragments. We then develop a novel chain structure, the Data-driven Control Chain (DCC), to extract sub-idioms that possess contiguous semantic meanings from the PDGs. After that, we use GraphCodeBERT to generate code embeddings of these sub-idioms and perform density-based clustering to obtain frequent sub-idioms. We use heuristic rules to identify interrelated sub-idioms among the frequent ones. Finally, we employ ChatGPT to synthesize interrelated sub-idioms into potential code idioms and infer real idioms from them. We conduct experiments and a user study to evaluate IdioMine's correctness and the practical value of the extracted idioms. Our experimental results show that IdioMine extracts more idioms and performs better on most metrics. Compared with Haggis and ChatGPT, IdioMine improves Idiom Set Precision (ISP) by 22.8% and 35.5% and Idiom Coverage (IC) by 9.7% and 22.9%, respectively, when extracting idioms from libraries. The idioms IdioMine extracts are also almost twice the size of those extracted by the baselines, indicating its ability to identify complete idioms. Our user study indicates that idioms extracted by IdioMine are well-formed and semantically clear. Moreover, we conduct a qualitative and quantitative analysis to investigate the primary functionalities of IdioMine's extracted idioms from various projects and libraries.
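
The embedding-and-clustering step described in the abstract can be illustrated with a small sketch. The abstract says only that density-based clustering is applied to GraphCodeBERT embeddings; the choice of DBSCAN, the cosine distance, the `eps` and `min_pts` values, and the toy three-dimensional vectors below are all assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (includes i itself).
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point joins as border
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical stand-ins for sub-idiom embeddings (real ones would be
# high-dimensional GraphCodeBERT vectors).
embeddings = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1],  # similar sub-idioms, group A
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0],  # similar sub-idioms, group B
    [0.0, 0.0, 1.0],                                    # an isolated sub-idiom
]
labels = dbscan(embeddings, eps=0.05, min_pts=3)
print(labels)
```

In this sketch, each cluster would correspond to a frequent sub-idiom, and points labeled -1 would be treated as infrequent and discarded; how the paper actually sets the frequency threshold is not stated in the abstract.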

Streamlining Java Programming: Uncovering Well-Formed Idioms with IdioMine
