Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik
DOI: https://doi.org/10.1109/fuzz-ieee55066.2022.9882843
2022-06-01
Abstract: Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features both in English and Polish languages. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases, the results were even better. Additionally, we prepared a new dataset entirely in Polish and based on offers in selected categories obtained from several online stores for the research purpose. It is the first open dataset for product matching tasks in Polish, which allows comparing the effectiveness of the pre-trained models. Thus, we also showed the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses product matching in a Polish-language setting. Specifically, the authors explore how pre-trained multilingual Transformer models (mBERT and XLM-RoBERTa) can be fine-tuned for this task, and they report experiments and benchmarks on a Polish dataset.

### Main Contributions:

1. **Cross-lingual Transfer Learning**: The paper verifies that, through transfer learning, multilingual Transformer models can be used for product matching in non-English (here, Polish) settings.
2. **State-of-the-art Results on English Datasets**: On the English Web Data Commons (WDC) dataset, fine-tuned mBERT and XLM-RoBERTa achieve results comparable to the latest published solutions, and in some cases better.
3. **Preparation and Baseline Results of Polish Dataset**: A new Polish product matching dataset (ProductMatch.pl) is created, and baseline results for the fine-tuned models on it are provided.

### Research Background:

- **Product Matching Task**: Matching identical products across different data sources. The available product features are often multimodal, and may be heterogeneous and incomplete.
- **Existing Research**: Recent studies show that deep neural networks perform best in product matching tasks. In particular, Transformer models can be fine-tuned for this problem through transfer learning, but most previous studies have focused on English datasets.

### Methodology:

- **Datasets**:
  - **English Dataset**: The Web Data Commons (WDC) dataset, one of the largest publicly available product matching datasets.
  - **Polish Dataset**: A new dataset (ProductMatch.pl) containing product information collected from multiple online stores.
- **Models**: Pre-trained multilingual mBERT and XLM-RoBERTa models, fine-tuned for the product matching task.
- **Experimental Setup**: Experiments are conducted on datasets of different sizes using the AdamW optimizer. The learning rate increases linearly from about 1e-7 to about 5e-5 and then decreases linearly back to the initial value. Mixed precision training (fp16) is used to speed up training.

### Results:

- **Polish Dataset**: mBERT performs better on the small and medium-sized datasets, while XLM-RoBERTa is slightly better on the large datasets.
- **English WDC Dataset**: mBERT and XLM-RoBERTa achieve better results than other studies in many cases, especially on the small and medium-sized datasets.

### Conclusion:

- **Effectiveness of Multilingual Models**: Multilingual Transformer models can be used effectively for product matching, with performance comparable to, and sometimes better than, the latest solutions based on English-only models.
- **Future Research Directions**: Evaluating multilingual models on more languages and larger datasets, and improving how well the models adapt to specific domains.

Through these contributions, the paper provides a new benchmark and baseline for product matching in Polish, as well as evidence that multilingual Transformer models are effective in this practical application.
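The learning-rate schedule described in the experimental setup (a linear rise from about 1e-7 to about 5e-5, followed by a linear fall back to the starting value) can be illustrated with a minimal sketch. The function name `triangular_lr` and the parameters `total_steps` and `warmup_steps` are hypothetical: the summary does not state how many steps the rise takes or how long training runs, so the values used below are assumptions chosen only for illustration.

```python
def triangular_lr(
    step: int,
    total_steps: int,
    warmup_steps: int,
    base_lr: float = 1e-7,   # starting and final rate, from the summary
    peak_lr: float = 5e-5,   # peak rate, from the summary
) -> float:
    """Return the learning rate at `step` for a linear rise then linear fall.

    `total_steps` and `warmup_steps` are assumed values; the summary
    does not report them.
    """
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")

    # Clamp so that out-of-range steps map to the schedule's end points.
    step = min(max(step, 0), total_steps)

    if step <= warmup_steps:
        # Rising phase: fraction goes 0 -> 1.
        fraction = step / warmup_steps
    else:
        # Falling phase: fraction goes 1 -> 0.
        fraction = (total_steps - step) / (total_steps - warmup_steps)

    return base_lr + fraction * (peak_lr - base_lr)


if __name__ == "__main__":
    # Hypothetical run: 1000 optimizer steps, 100 of them rising.
    for s in (0, 50, 100, 550, 1000):
        print(s, triangular_lr(s, total_steps=1000, warmup_steps=100))
```

In a PyTorch training loop, a function like this could be passed to `torch.optim.lr_scheduler.LambdaLR` after dividing by the optimizer's initial learning rate, since that scheduler expects a multiplicative factor rather than an absolute rate.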