Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications

Sujit Khanna, Shishir Subedi

2024-04-28

Abstract: In recent times, Large Language Models have exhibited tremendous capabilities, especially in the areas of mathematics, code generation and general-purpose reasoning. However, for specialized domains, especially in applications that require parsing and analyzing large chunks of numeric or tabular data, even state-of-the-art (SOTA) models struggle. In this paper, we introduce a new approach to solving domain-specific tabular data analysis tasks by presenting a unique RAG workflow that mitigates the scalability issues of existing tabular LLM solutions. Specifically, we present Tabular Embedding Model (TEM), a novel approach to fine-tune embedding models for tabular Retrieval-Augmented Generation (RAG) applications. Embedding models form a crucial component in the RAG workflow, and even current SOTA embedding models struggle because they are predominantly trained on textual datasets and thus underperform in scenarios involving complex tabular data. The evaluation results show that our approach not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model structure.

Artificial Intelligence, Computation and Language, Information Retrieval

What problem does this paper attempt to address?

This paper addresses the poor performance of existing large language models (LLMs) on domain-specific tabular data. Although these models excel at mathematical reasoning, code generation, and general-purpose reasoning, even current state-of-the-art (SOTA) models struggle with applications that require parsing and analyzing large amounts of numerical or tabular data. One cause lies in the retrieval step: existing embedding models are trained primarily on text datasets, so they underperform on complex tabular data. To address this, the authors propose the Tabular Embedding Model (TEM), an approach that fine-tunes an embedding model so that it is better suited to retrieval-augmented generation (RAG) applications involving tabular data. The authors chose the domain of financial markets for model evaluation to demonstrate TEM's advantages on high-dimensional and complex datasets. Experimental results show that TEM not only outperforms existing SOTA embedding models in this domain but also does so with a smaller and more efficient model structure.
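To make the role of the embedding model concrete, here is a minimal sketch of the generic retrieval step in a RAG workflow: a query and a set of candidate tables are embedded, and the closest table is returned by cosine similarity. This is not the authors' method. The `toy_embed` function is a bag-of-words stand-in for a real (or fine-tuned) embedding model, and the file names and descriptions in `TABLES` are hypothetical; the idea that each table is represented by a short text description is likewise an assumption made for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words token counts (NOT the TEM model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, table_descriptions, top_k=1):
    """Return the names of the top_k tables most similar to the query."""
    q = toy_embed(query)
    scored = sorted(
        table_descriptions.items(),
        key=lambda kv: cosine(q, toy_embed(kv[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Hypothetical table catalogue: file name -> short text description.
TABLES = {
    "daily_prices.csv": "daily open high low close prices for equities",
    "fx_rates.csv": "hourly foreign exchange rates for currency pairs",
    "earnings.csv": "quarterly earnings per share and revenue by company",
}

if __name__ == "__main__":
    print(retrieve("close prices for equities", TABLES))
```

In a real pipeline the stand-in would be replaced by a learned embedding model; fine-tuning that model on tabular queries is, per the abstract, what TEM targets.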

Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models

TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis

MambaTab: A Plug-and-Play Model for Learning Tabular Data

Embeddings for Tabular Data: A Survey

Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

TableRAG: Million-Token Table Understanding with Language Models

Towards Foundation Models for Learning on Tabular Data

Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance

Enriching Tabular Data with Contextual LLM Embeddings: A Comprehensive Ablation Study for Ensemble Classifiers

Enhancing Temporal Understanding in LLMs for Semi-structured Tables

PTab: Using the Pre-trained Language Model for Modeling Tabular Data

Tree-Regularized Tabular Embeddings

Tabular Transformers for Modeling Multivariate Time Series

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

TabText: A Flexible and Contextual Approach to Tabular Data Representation

TabuLa: Harnessing Language Models for Tabular Data Synthesis

Enhancing Tabular Reasoning with Pattern Exploiting Training

Retrieval & Fine-Tuning for In-Context Tabular Models

TabSAL: Synthesizing Tabular Data with Small Agent Assisted Language Models