Abstract: We investigate the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. We present a comprehensive analysis for "in-context" conditional DiTs under four common data assumptions. We show that both conditional DiTs and their latent variants lead to the minimax optimality of unconditional DiTs under identified settings. Specifically, we discretize the input domains into infinitesimal grids and then perform a term-by-term Taylor expansion on the conditional diffusion score function under Hölder smooth data assumption. This enables fine-grained use of transformers' universal approximation through a more detailed piecewise constant approximation and hence obtains tighter bounds. Additionally, we extend our analysis to the latent setting under the linear latent subspace assumption. We not only show that latent conditional DiTs achieve lower bounds than conditional DiTs both in approximation and estimation, but also show the minimax optimality of latent unconditional DiTs. Our findings establish statistical limits for conditional and unconditional DiTs, and offer practical guidance toward developing more efficient and accurate DiT models.

What problem does this paper attempt to address?

This paper addresses the approximation and estimation rates of conditional diffusion transformers (DiTs) under classifier-free guidance. Specifically, the authors aim to:

1. **Analyze the statistical limits of conditional DiTs**: cover score approximation, score estimation, and theoretical guarantees for distribution estimation.
2. **Establish minimax optimality**: prove that conditional DiTs and their latent variants achieve the minimax optimality of unconditional DiTs under specific settings.
3. **Provide practical guidance**: offer theoretical support and practical suggestions for developing more efficient and accurate DiT models.

### Main research contents

1. **Score Approximation**:
   - Under the Hölder smooth data assumption, the authors build a fine-grained piecewise approximation of the conditional diffusion score function. They discretize the input domain into infinitesimal grids and perform a term-by-term Taylor expansion of the score function on each grid cell. This lets the transformer realize a more detailed piecewise-constant approximation and yields a tighter error bound.
   - Under the general Hölder smooth data assumption, the approximation error is \(O\left(\left(\log \frac{1}{\epsilon}\right)^{\frac{d_x}{\sigma_t^4}}\right)\), and under the stronger Hölder smooth data assumption, the error is \(O\left(\left(\log \frac{1}{\epsilon}\right)^{\frac{1}{\sigma_t^2}}\right)\).
2. **Score and Distribution Estimation**:
   - The paper studies score and distribution estimation for conditional DiTs in practical training scenarios. The sample complexity bound for score estimation is derived from a norm-based covering number bound for the transformer architecture.
   - It proves that the learned score estimator can recover the initial data distribution, both for conditional DiTs and in their latent settings.
3. **Minimax Optimal Estimator**:
   - The analysis is extended to unconditional DiTs to study whether the generated data distribution achieves minimax optimality in total variation distance.
   - It proves that, under the stronger Hölder smooth data assumption, the upper and lower bounds of the distribution estimation error match.

### Technical contributions

- **Discretization of the input domain**: the input domain is discretized into infinitesimal grids to better exploit local smoothness.
- **Term-by-term Taylor expansion**: the conditional diffusion score function is Taylor-expanded term by term on each grid cell to obtain a more refined approximation.
- **Universal approximation of transformers**: the universal approximation property of transformers is applied to this more detailed piecewise-constant approximation to obtain a tighter error bound.

### Summary

This paper fills gaps in existing theoretical work with an in-depth analysis of the approximation and estimation rates of conditional DiTs, and it provides new perspectives and methods for research on conditional diffusion models.
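The grid discretization and per-cell Taylor expansion described under the technical contributions can be illustrated with a one-dimensional toy example. The sketch below is not the paper's construction: it assumes a stand-in smooth target (`sin` on [0, 1]) in place of the conditional score function, and the helper names `taylor_piecewise` and `sup_error` are hypothetical. It only shows how a finer grid tightens the approximation error of a local Taylor polynomial.

```python
import math


def taylor_piecewise(x, n_cells, order):
    """Approximate sin(x) on [0, 1] with a Taylor polynomial centred
    on the midpoint of the uniform grid cell that contains x.

    sin is an assumed stand-in target, not the paper's score function.
    """
    # Index of the grid cell containing x (the right endpoint goes to the last cell).
    idx = min(int(x * n_cells), n_cells - 1)
    centre = (idx + 0.5) / n_cells
    # The k-th derivative of sin at c equals sin(c + k * pi / 2).
    return sum(
        math.sin(centre + k * math.pi / 2) * (x - centre) ** k / math.factorial(k)
        for k in range(order + 1)
    )


def sup_error(n_cells, order, n_probe=2001):
    """Largest absolute error over evenly spaced probe points in [0, 1]."""
    xs = [i / (n_probe - 1) for i in range(n_probe)]
    return max(abs(math.sin(x) - taylor_piecewise(x, n_cells, order)) for x in xs)


# Refine the grid and record the worst-case error for a degree-2 expansion.
errors = {n: sup_error(n, order=2) for n in (4, 8, 16, 32)}
for n, err in errors.items():
    print(f"cells={n:3d}  sup error={err:.3e}")
```

By Taylor's theorem, the degree-2 error in a cell of width `1/n` is at most `(1/(2n))**3 / 6` for this target, so doubling the number of cells shrinks the bound by a factor of eight. The paper's analysis works in higher dimensions and with transformer approximators, where the rates differ.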

On Statistical Rates of Conditional Diffusion Transformers: Approximation, Estimation and Minimax Optimality
