Multi-Lingual Malaysian Embedding: Leveraging Large Language Models for Semantic Representations

Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
2024-02-05
Abstract: In this work, we present a comprehensive exploration of finetuning Malaysian language models, specifically Llama2 and Mistral, on embedding tasks involving negative and positive pairs. We release two distinct models tailored for Semantic Similarity and Retrieval-Augmented Generation (RAG).
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The main goal of this paper is to address the weak performance of existing models on the Malay language in semantic embedding and Retrieval-Augmented Generation (RAG) tasks. Specifically, the research team developed an open-source embedding model for Malay as an alternative to closed-source solutions such as OpenAI's text-embedding-ada-002 model. The model targets semantic similarity and RAG tasks in particular.

The specific methods used to address the problem in the paper are as follows:

1. **Hard Mining Dataset**:
   - Utilize OpenAI's text-embedding-ada-002 model and the Beijing Academy of Artificial Intelligence's bge-large-en model to convert Malay text into embedding representations.
   - Apply hard mining techniques to these embedding representations to improve the quality and relevance of the training data.
2. **Synthetic RAG Dataset**:
   - Use synthetic question-answer pairs to strengthen semantic retrieval, so that the model learns to match questions to the contexts that answer them.
3. **Fine-Tuning Large-Scale Language Models**:
   - Fine-tune Llama2 models at different parameter scales (600 million, 1 billion, and 2 billion parameters), extracting the initial N layers and continuing pre-training to adapt to different embedding needs.

Through these methods, the research team aims to improve natural language processing for the Malay language. According to the reported experiments, the newly developed models perform strongly on semantic similarity and RAG tasks across multiple test sets and outperform the closed-source models they were compared against.
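
The hard-mining step described above can be illustrated with a minimal Python sketch. It assumes the embeddings have already been computed by some embedding model; the function names, the toy vectors, and the `pos_threshold` and `neg_threshold` values are hypothetical and are not taken from the paper. For each anchor text, neighbours above the positive threshold become positives, and the most similar neighbour below the negative threshold is kept as the hard negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(embeddings, pos_threshold=0.9, neg_threshold=0.5):
    """Select positives and one hard negative per anchor.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    mined = []
    for i, anchor in enumerate(embeddings):
        # Score every other text against the anchor.
        scored = [
            (cosine(anchor, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        positives = [j for score, j in scored if score >= pos_threshold]
        negatives = [(score, j) for score, j in scored if score < neg_threshold]
        # The hardest negative is the most similar text that still counts
        # as a negative.
        hard_negative = max(negatives)[1] if negatives else None
        mined.append(
            {"anchor": i, "positives": positives, "hard_negative": hard_negative}
        )
    return mined


# Toy 2-dimensional "embeddings" standing in for real model output.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.1],
    [0.3, 1.0],
    [-1.0, 0.0],
]

mined = mine_pairs(toy_embeddings)
for row in mined:
    print(row)
```

In this sketch, texts 0 and 1 end up as each other's positives, while text 2 is chosen as their hard negative because it is the closest text that still falls below the negative threshold. Pairs mined this way could then feed a contrastive fine-tuning objective over positive and negative pairs.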