Abstract: Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques including widely adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional pre-training/pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. The GitHub page associated with this survey is available at: https://github.com/lanxiang1017/Language-Modeling-on-Tabular-Data-Survey.git

What problem does this paper attempt to address?

The problem this paper attempts to address is how to achieve efficient and robust predictive performance in language modeling of tabular data. Specifically, the paper focuses on the unique challenges posed by tabular data due to its heterogeneous nature and complex structural relationships, and explores the impact of recent advances in natural language processing (especially the Transformer architecture) on tabular data modeling methods. The paper aims to fill the current gap in comprehensive reviews of language modeling techniques for tabular data by systematically reviewing existing language modeling techniques, datasets, evaluation tasks, and modeling methods.

The main contributions of the paper include:

1. **Classification of Tabular Data**: For the first time, tabular data is divided into one-dimensional (1D) and two-dimensional (2D) data formats, discussing the data structures, data types, downstream tasks, and commonly used datasets for these two types respectively.
2. **Review of Technological Advances**: A detailed review of the latest advances in language modeling of tabular data, providing an exhaustive technical classification.
3. **Challenges and Future Directions**: Identifying ongoing challenges in current research and proposing future research directions.

Through these contributions, the paper provides a comprehensive framework for understanding and applying language modeling techniques to tabular data, which helps to advance further development in this field.

Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

Large language models on tabular data--a survey

Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey

Large language models (LLMs) on tabular data: Prediction, generation, and understanding-a survey

Natural Language Interfaces for Tabular Data Querying and Visualization: A Survey

PTab: Using the Pre-trained Language Model for Modeling Tabular Data

From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

A Survey on Deep Tabular Learning

Large Language Model for Table Processing: A Survey

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

TabuLa: Harnessing Language Models for Tabular Data Synthesis

Large Scale Transfer Learning for Tabular Data via Language Modeling

Data Management For Training Large Language Models: A Survey

Towards Foundation Models for Learning on Tabular Data

Making Pre-trained Language Models Great on Tabular Prediction

UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science

Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks

Generating Realistic Tabular Data with Large Language Models