Accelerating similarity-based model matching using dual hashing

DOI: https://doi.org/10.1007/s10270-024-01173-1
2024-04-30
Software & Systems Modeling
Abstract: Similarity-based model matching is the cornerstone of model versioning. It pairs model elements based on a distance metric (e.g., edit distance). However, calculating the distances between elements is computationally expensive, so a similarity-based matcher typically suffers from performance issues as model size increases. We observe two main causes of the high computation cost: (1) when matching an element p, the matcher calculates the distance between p and every candidate element q, even when p and q are obviously dissimilar; (2) the matcher always calculates the distance between p and q′, even though q and q′ are very similar and the distance between p and q is already known. This paper proposes a dual-hash-based approach that employs two entirely different hashing techniques, similarity-preserving hashing and integrity-based hashing, to accelerate similarity-based model matching. With similarity-preserving hashing, our approach quickly filters out dissimilar candidate elements by comparing their similarity hashes, which are computed by our similarity-preserving hash function that maps an element to a 64-bit binary hash. With integrity-based hashing, our approach caches and reuses computed distance values by associating them with the checksums of model elements. We also propose an index structure to facilitate hash-based model matching. Our approach has been implemented and integrated into EMF Compare. We evaluate it on open-source Ecore and UML models. The results show that our hash function effectively preserves the similarity between model elements and that our matching approach reduces time costs by 20–88% while keeping the matching results consistent with those of EMF Compare.
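The two hashing roles described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the hash function, so a SimHash-style hash over character trigrams stands in for the similarity-preserving hash, SHA-256 stands in for the integrity checksum, and model elements are reduced to plain strings. The names `DualHashMatcher`, `similarity_hash`, and the `max_hamming` threshold are hypothetical.

```python
import hashlib


def similarity_hash(text: str, bits: int = 64) -> int:
    """SimHash-style stand-in (an assumption, not the paper's function):
    similar strings tend to receive hashes with a small Hamming distance."""
    counts = [0] * bits
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3].encode()
        h = int.from_bytes(hashlib.blake2b(gram, digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def checksum(text: str) -> str:
    """Integrity hash: equal content yields an equal checksum."""
    return hashlib.sha256(text.encode()).hexdigest()


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, standing in for the expensive metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class DualHashMatcher:
    """Hypothetical matcher combining a hash filter with a distance cache."""

    def __init__(self, max_hamming: int = 24):
        self.max_hamming = max_hamming  # assumed filter threshold
        self.cache = {}                 # (checksum, checksum) -> distance
        self.distance_calls = 0         # counts real distance computations

    def distance(self, p: str, q: str) -> int:
        key = (checksum(p), checksum(q))
        if key not in self.cache:
            self.distance_calls += 1
            self.cache[key] = edit_distance(p, q)
        return self.cache[key]

    def match(self, p: str, candidates: list):
        hp = similarity_hash(p)
        # Step 1: cheap filter on similarity hashes.
        survivors = [q for q in candidates
                     if hamming(hp, similarity_hash(q)) <= self.max_hamming]
        if not survivors:
            return None
        # Step 2: expensive distance only on survivors, reusing cached values.
        return min(survivors, key=lambda q: self.distance(p, q))
```

Calling `DualHashMatcher().match("getName", ["setValue", "getName"])` returns `"getName"`, and asking for the same distance twice computes it only once. In this sketch the filter trades a small risk of discarding a true match for fewer distance computations; how the paper sets its threshold and indexes the hashes is not stated in the abstract.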
computer science, software engineering