Interactive Ontology Matching with Cost-Efficient Learning

Bin Cheng, Jonathan Fürst, Tobias Jacobs, Celia Garrido-Hidalgo
2024-04-11
Abstract: The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, specifically in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundamental issue of flexibly finding additional matches outside the scope of the implemented heuristics, even though this is highly demanded in industrial settings. Active machine learning methods appear to be a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem where high human effort is required to identify the remaining matches.
Databases, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the "last-mile" problem in **ontology matching**: automatic ontology matchers leave many potential matches unidentified. Existing interactive ontology matching systems bring in human experts, but they remain limited by their built-in heuristic rules and struggle to discover matches outside the scope of those rules. Existing active learning methods are more flexible, but they are inefficient on highly imbalanced data, so a large amount of manual annotation is needed to find the remaining matches. The paper therefore proposes a new method, **DualLoop**, that improves query efficiency and reduces the manual annotation workload needed to discover more matches.

### Summary of main problems:

1. **Limitations of automatic ontology matching**: Existing methods rely on fixed heuristic rules and cannot flexibly identify all possible matches.
2. **Limitations of interactive ontology matching**: Although human experts are involved, these systems are still limited by preset heuristic rules and have difficulty exploring new matches.
3. **Inefficiency of active learning**: Existing active learning methods perform poorly on highly imbalanced data, resulting in low query efficiency and high labor costs.

### DualLoop's solutions:

- **Short-term Learner**: Uses a novel query strategy that exploits the existing heuristic rules to quickly find high-confidence matches.
- **Long-term Learners**: Create and adjust new heuristic rules to explore potential matches and improve recall.
- **Weak Supervision**: Uses a set of adjustable heuristic rules for preliminary annotation, refined step by step with feedback from human experts.

Together, these techniques let DualLoop discover additional "last-mile" matches more efficiently while significantly reducing the need for manual annotation.

### Experimental results:

Experiments on three datasets from different domains and of different scales show that DualLoop achieves higher F1 scores and recall than existing active learning methods, and reduces the expected query cost required to find 90% of the matches by more than 50%. DualLoop has also been applied in actual products, which demonstrates its practicality and efficiency in industrial applications.

### Formula representation:

- Class matching definition: \(\{(e_S, e_T) \mid e_S \in E_S \text{ equivalent to } e_T \in E_T\}\)
- Classification function in the query strategy: \(c_{mt}: U \rightarrow \{0, 1\} \times [0, 1]: u \mapsto (\hat{y}_{c_{mt}u}, p_{c_{mt}u})\)
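
To make the classification function \(U \rightarrow \{0, 1\} \times [0, 1]\) and the idea of adjustable heuristic rules more concrete, here is a minimal Python sketch. It is not DualLoop's implementation. The toy class names, the two rules, the similarity thresholds, the equal rule weights, the `PRIOR` value, and the weighted-vote combination are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy ontologies (not from the paper): source classes E_S, target classes E_T.
E_S = ["Temperature Sensor", "Room", "Parking Spot"]
E_T = ["TemperatureSensor", "Building", "ParkingSpot", "Street Light"]

# Candidate set U: every (e_S, e_T) pair. Only a few pairs are true matches,
# which is the class imbalance the summary refers to.
U = list(product(E_S, E_T))


def normalize(name: str) -> str:
    """Lowercase a class name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Heuristic rules: each votes 1 (match), 0 (non-match), or None (abstain).
def rule_exact_name(e_s: str, e_t: str):
    return 1 if normalize(e_s) == normalize(e_t) else None


def rule_string_similarity(e_s: str, e_t: str, high: float = 0.9, low: float = 0.3):
    ratio = SequenceMatcher(None, normalize(e_s), normalize(e_t)).ratio()
    if ratio >= high:
        return 1
    if ratio <= low:
        return 0
    return None


# Adjustable rule weights; equal weights are an arbitrary starting point (assumption).
RULES = {rule_exact_name: 1.0, rule_string_similarity: 1.0}
PRIOR = 0.1  # assumed score when every rule abstains


def classify(u):
    """Map a candidate pair u to (predicted label, score in [0, 1])."""
    e_s, e_t = u
    match_weight = 0.0
    total_weight = 0.0
    for rule, weight in RULES.items():
        vote = rule(e_s, e_t)
        if vote is None:
            continue  # abstaining rules do not influence the score
        total_weight += weight
        match_weight += weight * vote
    p = match_weight / total_weight if total_weight > 0 else PRIOR
    return (1 if p >= 0.5 else 0), p


predicted = [u for u in U if classify(u)[0] == 1]
print(f"{len(predicted)} predicted matches out of {len(U)} candidate pairs")
```

In a sketch like this, expert feedback could be folded in by changing the weights in `RULES` or by adding new rule functions. Pairs on which every rule abstains are the ones the fixed heuristics cannot decide, so they are natural candidates for a human query.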