Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

Yucheng Ruan, Xiang Lan, Jingying Ma, Yizhi Dong, Kai He, Mengling Feng
2024-08-20
Abstract: Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques, including widely adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional pre-training/pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. The GitHub page associated with this survey is available at: https://github.com/lanxiang1017/Language-Modeling-on-Tabular-Data-Survey.git
Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of a comprehensive survey of language modeling techniques for tabular data. It focuses on the challenges that tabular data poses because of its heterogeneous nature and complex structural relationships, and on how recent advances in natural language processing, especially the Transformer architecture, have shaped tabular data modeling methods. To fill the gap, it systematically reviews existing language modeling techniques, datasets, evaluation tasks, and modeling methods.

The main contributions of the paper are:

1. **Classification of Tabular Data**: For the first time, tabular data is divided into one-dimensional (1D) and two-dimensional (2D) formats. For each format, the paper discusses the data structures, data types, downstream tasks, and commonly used datasets.
2. **Review of Technological Advances**: A detailed review of the latest advances in language modeling of tabular data, organized into an exhaustive technical classification.
3. **Challenges and Future Directions**: An identification of ongoing challenges in current research, together with proposed directions for future work.

Together, these contributions provide a comprehensive framework for understanding and applying language modeling techniques to tabular data, which should help advance the field.