On Statistical Rates of Conditional Diffusion Transformers: Approximation, Estimation and Minimax Optimality

Jerry Yao-Chieh Hu, Weimin Wu, Yi-Chen Lee, Yu-Chao Huang, Minshuo Chen, Han Liu
2024-11-26
Abstract: We investigate the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. We present a comprehensive analysis for "in-context" conditional DiTs under four common data assumptions. We show that both conditional DiTs and their latent variants lead to the minimax optimality of unconditional DiTs under identified settings. Specifically, we discretize the input domains into infinitesimal grids and then perform a term-by-term Taylor expansion on the conditional diffusion score function under Hölder smooth data assumption. This enables fine-grained use of transformers' universal approximation through a more detailed piecewise constant approximation and hence obtains tighter bounds. Additionally, we extend our analysis to the latent setting under the linear latent subspace assumption. We not only show that latent conditional DiTs achieve lower bounds than conditional DiTs both in approximation and estimation, but also show the minimax optimality of latent unconditional DiTs. Our findings establish statistical limits for conditional and unconditional DiTs, and offer practical guidance toward developing more efficient and accurate DiT models.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. Specifically, the authors aim to:

1. **Analyze the statistical limits of conditional DiTs**: covering score-function approximation, score estimation, and theoretical guarantees for distribution estimation.
2. **Establish minimax optimality**: prove that conditional DiTs and their latent variants lead to the minimax optimality of unconditional DiTs under specific settings.
3. **Provide practical guidance**: offer theoretical support and practical suggestions for developing more efficient and accurate DiT models.

### Main research contents

1. **Score Approximation**:
   - Under the Hölder smooth data assumption, the authors construct a fine-grained piecewise approximation of the conditional diffusion score function. They discretize the input domain into infinitesimal grids and perform a term-by-term Taylor expansion of the score function on each grid cell. This lets the transformer realize a more detailed piecewise-constant approximation and thereby yields a tighter error bound.
   - Under the general Hölder smooth data assumption, the approximation error is \(O\left(\left(\log \frac{1}{\epsilon}\right)^{\frac{d_x}{\sigma_t^4}}\right)\), and under the stronger Hölder smooth data assumption, the error is \(O\left(\left(\log \frac{1}{\epsilon}\right)^{\frac{1}{\sigma_t^2}}\right)\).
2. **Score and Distribution Estimation**:
   - The paper studies score and distribution estimation for conditional DiTs in practical training scenarios. It derives a sample complexity bound for score estimation from a norm-based covering number bound for the transformer architecture.
   - It proves that the learned score estimator can recover the initial data distribution, both for conditional DiTs and in their latent setting.
3. **Minimax Optimal Estimator**:
   - The analysis is extended to unconditional DiTs to study whether the generated data distribution achieves minimax optimality in total variation distance.
   - It proves that, under the stronger Hölder smooth data assumption, the upper and lower bounds on the distribution estimation error match.

### Technical contributions

- **Discretizing the input domain**: the input domain is discretized into infinitesimal grids to better exploit local smoothness.
- **Term-by-term Taylor expansion**: the conditional diffusion score function is Taylor-expanded term by term on each grid cell to obtain a more refined approximation.
- **Universal approximation of transformers**: the universal approximation property of transformers is applied to this more detailed piecewise-constant approximation to obtain a tighter error bound.

### Summary

This paper fills gaps in existing theoretical work through an in-depth analysis of the approximation and estimation rates of conditional DiTs, and provides new perspectives and methods for research on conditional diffusion models.
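
The discretize-then-expand idea listed under the technical contributions can be illustrated with a toy sketch. The example below is not the paper's construction: it replaces the conditional score function with `sin(x)` on a one-dimensional interval, and the function names (`piecewise_taylor`, `max_error`, `remainder_bound`) are hypothetical, chosen only for illustration. It splits the interval into equal cells, expands the target around each cell centre, and compares the measured worst-case error with the standard Taylor remainder bound.

```python
import math


def piecewise_taylor(x, n_cells, order, lo=0.0, hi=2 * math.pi):
    """Approximate sin(x) by a Taylor polynomial around the centre of
    the grid cell that contains x (toy stand-in for a score function)."""
    h = (hi - lo) / n_cells
    # Index of the cell containing x; clamp so x == hi falls in the last cell.
    idx = min(int((x - lo) / h), n_cells - 1)
    c = lo + (idx + 0.5) * h
    # The k-th derivative of sin at c is sin(c + k*pi/2).
    return sum(
        math.sin(c + k * math.pi / 2) * (x - c) ** k / math.factorial(k)
        for k in range(order + 1)
    )


def max_error(n_cells, order, n_probe=2001, lo=0.0, hi=2 * math.pi):
    """Largest absolute error over a set of evenly spaced probe points."""
    xs = [lo + (hi - lo) * i / (n_probe - 1) for i in range(n_probe)]
    return max(
        abs(math.sin(x) - piecewise_taylor(x, n_cells, order, lo, hi))
        for x in xs
    )


def remainder_bound(n_cells, order, lo=0.0, hi=2 * math.pi):
    """Taylor remainder bound (h/2)^(order+1) / (order+1)!, valid here
    because every derivative of sin is bounded by 1."""
    h = (hi - lo) / n_cells
    return (h / 2) ** (order + 1) / math.factorial(order + 1)


if __name__ == "__main__":
    for n_cells in (8, 64):
        for order in (1, 3):
            print(n_cells, order,
                  max_error(n_cells, order),
                  remainder_bound(n_cells, order))
```

In this toy setting, refining the grid or raising the expansion order both shrink the error, which is the mechanism the summary credits for the tighter bounds. The paper's setting is multivariate and the local expansions are realized by a transformer, neither of which this sketch attempts.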