Abstract: Generative modeling for tabular data has recently gained significant attention in the Deep Learning domain. Its objective is to estimate the underlying distribution of the data. However, estimating the underlying distribution of tabular data has its own unique challenges. Specifically, this data modality is composed of mixed types of features, making it a non-trivial task for a model to learn intra-relationships between them. One approach to handling mixed feature types is to embed each feature into a continuous matrix via tokenization, while one way to capture intra-relationships between variables is the transformer architecture. In this work, we empirically investigate the potential of using embedding representations for tabular data generation, utilizing tensor contraction layers and transformers to model the underlying distribution of tabular data within Variational Autoencoders. Specifically, we compare four architectural approaches: a baseline VAE model, two variants that focus on tensor contraction layers and transformers respectively, and a hybrid model that integrates both techniques. Our empirical study, conducted across multiple datasets from the OpenML CC18 suite, compares models on density estimation and Machine Learning efficiency metrics. The main takeaway from our results is that leveraging embedding representations with the help of tensor contraction layers improves density estimation metrics while maintaining competitive performance in terms of machine learning efficiency.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to generate tabular data effectively, in particular how to handle the mixed-type features of tabular data (continuous and discrete variables appearing together). The main challenges in tabular data generation include:

1. **Diversity of feature types**: Tabular data usually contains mixed-type features, such as continuous variables and discrete variables. This diversity makes it difficult for the model to learn the relationships between these different types of features.
2. **Intrinsic relationships between features**: There may be complex inter-relationships between features in tabular data. These relationships become harder to model when the feature types differ.
3. **Complexity of distribution estimation**: The distribution of tabular data is usually multimodal. In particular, continuous variables may have multiple modes, and discrete variables may have a large number of categories, which makes estimating their underlying distribution very challenging.

To address these problems, the authors propose using embedding representations, tensor contraction layers and transformers to improve the Variational Autoencoder (VAE) architecture. The paper compares the following four models experimentally:

- **Baseline VAE model**: The traditional VAE model.
- **TensorContracted**: A VAE that introduces tensor contraction layers to handle embedding representations.
- **Transformed**: A VAE-based transformer model.
- **TensorConFormer**: A hybrid model that combines tensor contraction layers and transformers.

With these models, the paper aims to explore how to better capture the feature relationships in tabular data and improve the quality and diversity of the generated data.
The experimental results show that TensorConFormer performs best in density estimation and machine-learning efficiency and can generate synthetic data with higher diversity and fidelity.

### Formula summary

- **Embedding representation**:

  \[ e^{\text{num}}_i = x^{\text{num}}_i w^{\text{num}}_i + b^{\text{num}}_i \]

  \[ e^{\text{cat}}_i = x^{\text{ohe}}_i W^{\text{cat}}_i + b^{\text{cat}}_i \]

- **Tensor contraction layer**:

  \[ TCL(E)\triangleq E\otimes W + B \]

- **Variational lower-bound optimization objective**:

  \[ L(\phi,\theta;x,c)=\mathbb{E}_{q_\phi(z|x,c)}[\log p_\theta(x|z,c)]-\text{KL}[q_\phi(z|x,c)\|p(z)] \]

These formulas show how to convert tabular data into embedding representations and capture the complex relationships between features through tensor contraction layers and transformers, thereby generating high-quality synthetic data.
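The three formulas above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. Several things here are assumptions made for illustration: the helper names (`init_params`, `tokenize`, `tcl`, `gaussian_kl`, `elbo`), the random initialisation, the reading of \(E\otimes W\) as a contraction over both the feature axis and the embedding axis with a 4-way weight tensor, and the choice of a diagonal Gaussian posterior with a standard normal prior for the KL term. The conditioning variable \(c\) is omitted for brevity.

```python
import numpy as np

def init_params(n_num, cardinalities, d, rng):
    """Random embedding parameters (hypothetical helper, for illustration only)."""
    return {
        "w_num": rng.normal(size=(n_num, d)),            # one weight vector per numerical feature
        "b_num": rng.normal(size=(n_num, d)),            # one bias vector per numerical feature
        "W_cat": [rng.normal(size=(k, d)) for k in cardinalities],
        "b_cat": [rng.normal(size=(d,)) for _ in cardinalities],
    }

def tokenize(x_num, x_cat, params):
    """Embed every feature into d dimensions: returns (batch, n_features, d).

    Assumes at least one numerical and one categorical feature.
    """
    # Numerical features: e_i = x_i * w_i + b_i
    e_num = x_num[:, :, None] * params["w_num"] + params["b_num"]
    # Categorical features: e_i = onehot(x_i) @ W_i + b_i
    e_cat = []
    for j, (W, b) in enumerate(zip(params["W_cat"], params["b_cat"])):
        ohe = np.eye(W.shape[0])[x_cat[:, j]]            # (batch, cardinality)
        e_cat.append(ohe @ W + b)                        # (batch, d)
    return np.concatenate([e_num, np.stack(e_cat, axis=1)], axis=1)

def tcl(E, W, B):
    """Tensor contraction layer, TCL(E) = E (x) W + B (assumed contraction pattern).

    E: (batch, M, d), W: (M, d, M_out, d_out), B: (M_out, d_out).
    """
    return np.einsum("bmd,mdnk->bnk", E, W) + B

def gaussian_kl(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] per sample (assumed Gaussian forms)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def elbo(log_lik, mu, log_var):
    """Lower bound: reconstruction log-likelihood minus the KL term."""
    return log_lik - gaussian_kl(mu, log_var)

# Toy data: 4 rows, 3 numerical features, 2 categorical features.
rng = np.random.default_rng(0)
cardinalities = [2, 5]
x_num = rng.normal(size=(4, 3))
x_cat = np.stack([rng.integers(0, k, size=4) for k in cardinalities], axis=1)

params = init_params(3, cardinalities, d=8, rng=rng)
E = tokenize(x_num, x_cat, params)            # one 8-dimensional token per feature
W = rng.normal(size=(5, 8, 3, 6))             # maps 5 tokens of size 8 to 3 tokens of size 6
H = tcl(E, W, np.zeros((3, 6)))
print(E.shape, H.shape)
```

In this sketch, tokenization gives every column, numerical or categorical, a representation of the same width, which is what lets a single tensor contraction (or a transformer) operate over all features at once.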

Tabular data generation with tensor contraction layers and transformers

An improved tabular data generator with VAE-GMM integration

Transformers with Stochastic Competition for Tabular Data Modelling

Tabular Transformers for Modeling Multivariate Time Series

Conditional out-of-sample generation for unpaired data using trVAE

Language Models are Realistic Tabular Data Generators

TAEGAN: Generating Synthetic Tabular Data For Data Augmentation

Conditional Out-of-distribution Generation for Unpaired Data Using Transfer VAE.

A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis

Data Augmentation with Variational Autoencoder for Imbalanced Dataset

High-Quality Tabular Data Generation using Post-Selected VAE

Convex space learning for tabular synthetic data generation

TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting

TABCF: Counterfactual Explanations for Tabular Data Using a Transformer-Based VAE

Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

Diffusion Models for Tabular Data Imputation and Synthetic Data Generation

TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling

A Survey on Deep Tabular Learning

TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

TabPFGen -- Tabular Data Generation with TabPFN

Tree-Regularized Tabular Embeddings