Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification

Quangao Liu, Wei Yang, Chen Liang, Longlong Pang, Zhuozhang Zou

2024-06-11

Abstract: Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, a novel approach called Prior-Data Fitted Networks (TabPFN) has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance on categorical features. To overcome this limitation, we propose FT-TabPFN, an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle categorical features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses a limitation of the TabPFN model in tabular classification tasks that include categorical features. TabPFN is built on a pre-trained Transformer and can make quick and accurate predictions on new tasks without additional training. However, it performs relatively weakly on categorical features. To address this issue, the paper proposes the FT-TabPFN (Feature Tokenization TabPFN) model. The main contributions of FT-TabPFN include:

1. **Proposing a new feature tokenization layer**: This layer can better handle categorical features in tabular data. By treating each feature as a "word" and each sample as a "sentence," the model improves its ability to handle feature diversity.
2. **Introducing a regularization mechanism for feature identifiers**: This helps maintain the independence and uniqueness of different features, thereby enhancing the model's performance and robustness.
3. **Applying fine-tuning to downstream tasks**: Experiments validate the effectiveness of FT-TabPFN and show that it performs better than the original TabPFN and other baseline models on datasets containing categorical features.

In summary, the paper aims to improve TabPFN's performance on tabular classification tasks with categorical features by introducing new feature processing methods and regularization techniques.
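The first two contributions can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the class `FeatureTokenizer`, the function `identifier_penalty`, the random initialisation, and the choice of a squared-cosine penalty are all assumptions made for illustration. The summary above only says that each feature becomes a "word" and that feature identifiers are regularized to stay distinct.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learnable parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class FeatureTokenizer:
    """Hypothetical sketch: turns one table row into one token per feature."""

    def __init__(self, cardinalities, dim, seed=0):
        # cardinalities[j] is the number of categories of feature j,
        # or None when feature j is numerical (an assumed convention).
        rng = random.Random(seed)
        self.cardinalities = cardinalities
        # Categorical feature: one embedding row per category.
        # Numerical feature: a single direction vector scaled by the value.
        self.value_tables = [
            make_matrix(c if c is not None else 1, dim, rng)
            for c in cardinalities
        ]
        # One identifier vector per feature (the "which column is this" part).
        self.feature_ids = make_matrix(len(cardinalities), dim, rng)

    def tokenize(self, row):
        """Return a list of tokens, one per feature: the sample's 'sentence'."""
        tokens = []
        for j, (value, card) in enumerate(zip(row, self.cardinalities)):
            if card is None:
                value_part = [value * w for w in self.value_tables[j][0]]
            else:
                value_part = self.value_tables[j][int(value)]
            # Token = value embedding + feature identifier.
            tokens.append(
                [v + f for v, f in zip(value_part, self.feature_ids[j])]
            )
        return tokens


def identifier_penalty(feature_ids):
    """Sum of squared cosine similarities between distinct identifiers.

    Assumed form of the regularizer: it is zero when all identifiers are
    mutually orthogonal and grows as they become more alike.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    penalty = 0.0
    for i in range(len(feature_ids)):
        for j in range(i + 1, len(feature_ids)):
            a, b = feature_ids[i], feature_ids[j]
            cos = dot(a, b) / ((dot(a, a) * dot(b, b)) ** 0.5)
            penalty += cos * cos
    return penalty


# Example: a row with a 3-way categorical, a numerical, and a binary feature.
tokenizer = FeatureTokenizer(cardinalities=[3, None, 2], dim=4)
sentence = tokenizer.tokenize([2, 0.5, 1])
reg_term = identifier_penalty(tokenizer.feature_ids)
```

In a fine-tuning setup, a term like `reg_term` would typically be added to the classification loss with a small weight, so that identifiers stay distinguishable while the rest of the model adapts to the downstream task. The weighting and the training loop are omitted here because the summary does not specify them.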


TFEformer: Temporal Feature Enhanced Transformer for Multivariate Time Series Forecasting

Unlocking the Transferability of Tokens in Deep Models for Tabular Data

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks

In-Context Data Distillation with TabPFN

PTab: Using the Pre-trained Language Model for Modeling Tabular Data

TabDPT: Scaling Tabular Foundation Models

TabPFGen -- Tabular Data Generation with TabPFN

TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks

PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization

Making Pre-trained Language Models Great on Tabular Prediction

Interpretable Machine Learning for TabPFN

TFWT: Tabular Feature Weighting with Transformer

Towards Foundation Models for Learning on Tabular Data

Why In-Context Learning Transformers are Tabular Data Classifiers

XTab: Cross-table Pretraining for Tabular Transformers

UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science

Untrained and Unmatched: Fast and Accurate Zero-Training Classification for Tabular Engineering Data

Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data

TabularFM: An Open Framework For Tabular Foundational Models