Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications

Sujit Khanna, Shishir Subedi
2024-04-28
Abstract: In recent times, Large Language Models have exhibited tremendous capabilities, especially in the areas of mathematics, code generation, and general-purpose reasoning. However, for specialized domains, especially in applications that require parsing and analyzing large chunks of numeric or tabular data, even state-of-the-art (SOTA) models struggle. In this paper, we introduce a new approach to solving domain-specific tabular data analysis tasks by presenting a unique RAG workflow that mitigates the scalability issues of existing tabular LLM solutions. Specifically, we present Tabular Embedding Model (TEM), a novel approach to fine-tune embedding models for tabular Retrieval-Augmented Generation (RAG) applications. Embedding models form a crucial component in the RAG workflow, and even current SOTA embedding models struggle, as they are predominantly trained on textual datasets and thus underperform in scenarios involving complex tabular data. The evaluation results show that our approach not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model structure.
Artificial Intelligence, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
This paper addresses the poor performance of existing large language models (LLMs) on domain-specific tabular data. Although these models excel at mathematical reasoning, code generation, and general-purpose reasoning, even current state-of-the-art (SOTA) models struggle with applications that require parsing and analyzing large amounts of numeric or tabular data. A key reason lies in the retrieval step: existing embedding models are trained predominantly on text datasets, so they underperform when the content to be retrieved is complex tabular data. To address this, the authors propose the Tabular Embedding Model (TEM), an approach that fine-tunes an embedding model for retrieval-augmented generation (RAG) applications involving tabular data. The authors evaluate the model in the financial markets domain to demonstrate TEM's advantages on high-dimensional, complex datasets. The experimental results show that TEM not only outperforms existing SOTA embedding models in this domain but also does so with a smaller and more efficient model structure.
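To make the role of the embedding model concrete, the following is a minimal sketch of the retrieval step in a generic RAG workflow: a query vector is compared against stored vectors by cosine similarity, and the closest items are returned. It is not the authors' implementation. The table names, the three-dimensional vectors, and the `retrieve` helper are all hypothetical, and a real system would obtain the vectors from an embedding model rather than hard-code them.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, table_vecs, top_k=2):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in table_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings standing in for embedding-model output.
table_vecs = {
    "daily_equity_prices": [0.9, 0.1, 0.0],
    "quarterly_earnings": [0.1, 0.9, 0.1],
    "fx_rates": [0.2, 0.1, 0.9],
}

# Hypothetical embedding of a user query about equity prices.
query_vec = [0.8, 0.2, 0.1]

top = retrieve(query_vec, table_vecs)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

In a setup like this, retrieval quality depends entirely on how well the vectors place a query near the tabular data it refers to, which is the part of the pipeline that fine-tuning an embedding model targets.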