Abstract: Product matching is the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on the Web Data Commons training dataset and gold standard for large-scale product matching. The results show that these models perform similarly to the latest solutions tested on this set, and in some cases better. Additionally, for research purposes, we prepared a new dataset entirely in Polish, based on offers in selected categories obtained from several online stores. It is the first open dataset for product matching in Polish, and it allows the effectiveness of pre-trained models to be compared. We therefore also report baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.

What problem does this paper attempt to address?

This paper addresses the problem of product matching in a Polish-language setting. Specifically, the authors explore how pre-trained multilingual Transformer models (such as mBERT and XLM-RoBERTa) can be used to solve this problem, and they run experiments and benchmarks on a Polish dataset.

### Main Contributions:

1. **Cross-lingual Transfer Learning**: The paper verifies that, through transfer learning, multilingual Transformer models can be used for product matching in non-English (particularly Polish) settings.
2. **State-of-the-art Results on English Datasets**: Fine-tuned multilingual BERT and XLM-RoBERTa models achieve results on par with the latest solutions on the English Web Data Commons (WDC) dataset.
3. **Preparation and Baseline Results of Polish Dataset**: A new Polish product matching dataset (ProductMatch.pl) is created, and baseline results are provided.

### Research Background:

- **Product Matching Task**: Matching identical products from different data sources, which often involves multimodal data that may be heterogeneous and incomplete.
- **Existing Research**: Recent studies show that deep neural networks perform best on product matching tasks. In particular, Transformer models can be fine-tuned for this problem through transfer learning, but most previous studies have focused on English datasets.

### Methodology:

- **Datasets**:
  - **English Dataset**: The Web Data Commons (WDC) dataset, one of the largest publicly available product matching datasets.
  - **Polish Dataset**: A new Polish dataset (ProductMatch.pl), containing product information collected from multiple online stores.
- **Models**: Pre-trained multilingual mBERT and XLM-RoBERTa models, fine-tuned for the product matching task.
- **Experimental Setup**: Experiments are conducted on datasets of different sizes using the AdamW optimizer. The learning rate increases linearly from about 1e-7 to about 5e-5 and then decreases linearly back to the initial value. Mixed precision training (fp16) is used to accelerate training.

### Results:

- **Polish Dataset**: mBERT performs better on small and medium-sized datasets, while XLM-RoBERTa slightly outperforms it on large datasets.
- **English WDC Dataset**: mBERT and XLM-RoBERTa achieve better results than other studies in many cases, especially on small and medium-sized datasets.

### Conclusion:

- **Effectiveness of Multilingual Models**: Multilingual Transformer models can be used effectively for product matching, with performance comparable to, and sometimes better than, the latest solutions based on English-only models.
- **Future Research Directions**: Evaluating multilingual models on more languages and larger datasets, and improving how well the models adapt to specific domains.

Through these contributions, the paper provides new solutions for product matching in a Polish-language setting and offers evidence that multilingual Transformer models are effective in practical applications.
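The learning-rate schedule described under Experimental Setup (linear increase from about 1e-7 to about 5e-5, then linear decrease back to the initial value) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `triangular_lr`, the step counts, and the position of the peak (`warmup_steps`) are assumptions, since the summary does not state them.

```python
def triangular_lr(step, total_steps, warmup_steps, base_lr=1e-7, peak_lr=5e-5):
    """Linear warm-up from base_lr to peak_lr, then linear decay back to base_lr.

    base_lr and peak_lr follow the approximate values quoted in the summary;
    total_steps and warmup_steps are hypothetical and must be chosen by the user.
    """
    if not 0 < warmup_steps < total_steps:
        raise ValueError("require 0 < warmup_steps < total_steps")
    # Clamp the step so out-of-range values map to the schedule endpoints.
    step = min(max(step, 0), total_steps)
    if step <= warmup_steps:
        # Rising edge: fraction goes 0 -> 1 over the warm-up phase.
        frac = step / warmup_steps
    else:
        # Falling edge: fraction goes 1 -> 0 over the remaining steps.
        frac = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr + frac * (peak_lr - base_lr)


if __name__ == "__main__":
    # Assumed example: 1000 optimizer steps with the peak at step 100.
    for s in (0, 50, 100, 550, 1000):
        print(s, triangular_lr(s, total_steps=1000, warmup_steps=100))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each step; where the peak falls is a tuning choice that the summary leaves open.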

Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
