An Empirical Analysis of Vision Transformer and CNN in Resource-Constrained Federated Learning

Xiaojiang Zuo,Qinglong Zhang,Rui Han
DOI: https://doi.org/10.1145/3568199.3568201
2023-01-01
Abstract:Federated learning (FL) is an emerging distributed machine learning method that collaboratively trains a universal model among clients while maintaining their data privacy. Recently, several efforts attempt to introduce vision transformer (ViT) models into FL training. However, deploying and training such ViT models from scratch in practice is not trivial, existing works overlook the existence of the clients with low resources (e.g., mobile phones), which is a common and practical FL setting. In this paper, we use low-resolution images as model input to satisfy the resource constraints and investigate several ViT models to explore whether ViT models still outperform CNN models in this setting. Our experiment was performed on CIFAR10 and Fashion MNIST with their IID and non-IID versions, and the results demonstrate that ViT models can achieve a better global test accuracy than CNN models while using a comparable training cost, suggesting that they are ideally suitable for FL training with resource-constrained devices.
What problem does this paper attempt to address?