Deep Learning Approaches for Similarity Computation: A Survey

Peilun Yang, Hanchen Wang, Jianye Yang, Zhengping Qian, Ying Zhang, Xuemin Lin
DOI: https://doi.org/10.1109/tkde.2024.3422484
IF: 9.235
2024-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Measuring the similarity between data objects appropriately is a common but vital task in various domains, such as data mining and machine learning. Driven by abundant real-world applications, many well-known similarity (distance) metrics have been proposed to measure the pairwise similarity of data objects, e.g., graph edit distance for graphs and dynamic time warping for time series. However, many similarity metrics suffer from high time complexity. More specifically, most of the well-known similarity metrics need quadratic time or more to compute the ground-truth similarity, and some of them are proven to be NP-hard. With the development of deep learning techniques, there is an emerging research trend in the database (DB) and data mining fields on learning for similarity computation over various data types, which is quite different from the metric learning studies in the machine learning (ML) literature. Specifically, the ML studies focus on learning semantic similarity for specific tasks, which is implicitly indicated by the training data, over data in the feature space. In contrast, the studies in the DB literature usually consider learning well-defined similarity metrics (e.g., graph edit distance) on the data objects themselves (e.g., graphs), so that the learning can benefit similarity computation in multiple aspects, such as computation time, metric quality and search heuristics, and the learned representation of the data can also be naturally fed to downstream tasks. This survey provides a comprehensive review of similarity computation learning on several data types, including sets, sequences and graphs. We first classify the learning-based approaches by their learning target into three categories, i.e., similarity learning, cost matrix learning and search heuristic learning. Then we detail some representative approaches for each category on every data type, and analyze some key features that are utilized by these approaches. Finally, we discuss some challenges and future directions for learning for similarity computation on these data types.
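
To make the quadratic-time claim concrete, the following is a minimal sketch of the textbook dynamic-programming formulation of dynamic time warping, one of the metrics named in the abstract. It is not taken from the survey; the function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for illustration only. The two nested loops fill an (n+1) x (m+1) table, which is where the O(n*m) cost comes from.

```python
def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (illustrative sketch).

    Assumes absolute difference as the local cost; other costs are possible.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # dp[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):          # n iterations
        for j in range(1, m + 1):      # m iterations each: O(n*m) overall
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]
```

A learned model that approximates this value from fixed-size embeddings of the two sequences would replace the table fill with a single forward pass, which is the kind of speed-up the abstract refers to under "computation time".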