Text-Based Product Matching -- Semi-Supervised Clustering Approach

Alicja Martinek, Szymon Łukasik, Amir H. Gandomi
2024-02-02
Abstract: Matching identical products present in multiple product feeds constitutes a crucial element of many tasks of e-commerce, such as comparing product offerings, dynamic price optimization, and selecting the assortment personalized for the client. It corresponds to the well-known machine learning task of entity matching, with its own specificity, like omnipresent unstructured data or inaccurate and inconsistent product descriptions. This paper aims to present a new philosophy to product matching utilizing a semi-supervised clustering approach. We study the properties of this method by experimenting with the IDEC algorithm on the real-world dataset using predominantly textual features and fuzzy string matching, with more standard approaches as a point of reference. Encouraging results show that unsupervised matching, enriched with a small annotated sample of product links, could be a possible alternative to the dominant supervised strategy, requiring extensive manual data labeling.
Databases, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of matching identical products across multiple product feeds in e-commerce. It proposes a semi-supervised clustering approach intended to reduce the need for manually labeled data. The dominant supervised strategy relies on extensive manual labeling, which is time-consuming and costly. The paper therefore explores combining a small annotated sample of product links with a large amount of unlabeled data. It studies the IDEC algorithm experimentally on a real-world dataset, using predominantly textual features and fuzzy string matching, with more standard approaches as a point of reference. The results are described as encouraging: unsupervised matching enriched with a small number of annotated links could be a possible alternative to fully supervised matching, particularly when labeling resources are limited.
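To make the idea of combining a few annotated links with fuzzy string matching concrete, here is a minimal sketch in Python. It is not the paper's method: IDEC is a deep embedded clustering algorithm, whereas this toy stand-in simply merges products with a union-find structure. The function names (`similarity`, `cluster_products`), the sample titles, and the 0.8 threshold are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] on lowercased, whitespace-normalized titles."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def cluster_products(titles, known_links, threshold=0.8):
    """Return one cluster label per title.

    known_links: pairs of indices annotated as the same product
                 (the small supervised sample).
    threshold:   hypothetical similarity cut-off for unlabeled pairs.
    """
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Supervised part: annotated links always end up in the same cluster.
    for i, j in known_links:
        union(i, j)

    # Unsupervised part: link any pair whose titles are similar enough.
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if similarity(titles[i], titles[j]) >= threshold:
                union(i, j)

    return [find(i) for i in range(len(titles))]


titles = [
    "Apple iPhone 13 128GB Black",
    "apple  iphone 13 128gb black",
    "iPhone 13 (128 GB) - Midnight",
    "Samsung Galaxy S21 5G",
]
# One annotated link: offers 0 and 2 are known to be the same product.
labels = cluster_products(titles, known_links=[(0, 2)])
print(labels)
```

Titles sharing a label are treated as the same product. The annotated link lets offers 0 and 2 share a cluster regardless of how dissimilar their titles are, which is the role the small labeled sample plays in a semi-supervised setup. The pairwise loop is quadratic, so this sketch suits only small examples.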