Abstract: Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques, including widely adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional pre-training/pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. The GitHub page associated with this survey is available at: https://github.com/lanxiang1017/Language-Modeling-on-Tabular-Data-Survey.git

Data Management For Training Large Language Models: A Survey

Datasets for Large Language Models: A Comprehensive Survey

Data Management for Machine Learning: A Survey

Applications and Challenges for Large Language Models: from Data Management Perspective

Demystifying Data Management for Large Language Models

Large Language Models for Data Annotation: A Survey

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

A Survey on Data Selection for Language Models

Multilingual Large Language Models: A Systematic Survey

Aligning Large Language Models with Human: A Survey

Large Language Models for Data Annotation and Synthesis: A Survey

A Survey on Data Synthesis and Augmentation for Large Language Models

Efficient Large Language Models: A Survey

Continual Learning of Large Language Models: A Comprehensive Survey

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Data Proportion Detection for Optimized Data Management for Large Language Models

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey

How to Train Data-Efficient LLMs

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution