Non-Determinism in TensorFlow ResNets

Miguel Morin, Matthew Willetts
DOI: https://doi.org/10.48550/arXiv.2001.11396
2020-01-30
Abstract: We show that the stochasticity in training ResNets for image classification on GPUs in TensorFlow is dominated by the non-determinism from GPUs, rather than by the initialisation of the weights and biases of the network or by the sequence of minibatches given. The standard deviation of test set accuracy is 0.02 with fixed seeds, compared to 0.027 with different seeds; that is, nearly 74% of the standard deviation of a ResNet model is non-deterministic. For test set loss the ratio of standard deviations is more than 80%. These results call for more robust evaluation strategies of deep learning models, as a significant amount of the variation in results across runs can arise simply from GPU randomness.
Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper examines how GPU non-determinism affects the evaluation of ResNet models trained for image classification in TensorFlow. Specifically, the authors measure its effect on the standard deviation of test-set accuracy and loss, and ask whether it is the main source of run-to-run variation in results.

### Research Background and Problem Description

1. **Importance of Repeated Experiments**
   - In deep-learning research, a model is usually trained several times to understand how much its performance varies.
   - Repeated runs typically use different random seeds for weight initialisation and minibatch generation.
2. **Sources of GPU Non-determinism**
   - GPU non-determinism stems from floating-point operations being executed in different orders, which can produce different results even on the same system, in the same software environment, and in the same operation mode.
   - For example, floating-point addition is not associative, so when different compilers and architectures add the same numbers in different orders, the results can differ.
3. **Research Objectives**
   - The authors aim to isolate and quantify the effect of GPU non-determinism on model performance by fixing the other random sources (such as the initial weights and the order of minibatches).
   - Specifically, they ask:
     - Is GPU non-determinism the main factor behind variation in model results?
     - How large is its effect on model performance evaluation?

### Experiment Design and Results

1. **Experimental Setup**
   - Train a ResNet-50 model on the CIFAR-10 dataset.
   - Train for 200 epochs with a batch size of 32.
   - Fix the random seeds so that every source of randomness other than GPU non-determinism is held constant.
2. **Experimental Results**
   - With the same random seed in every run, the standard deviation of test-set accuracy is \( \sigma(\text{accuracy}) = 1.995\times 10^{-2} \), and the standard deviation of loss is \( \sigma(\text{loss}) = 3.020\times 10^{-3} \).
   - With different random seeds, the standard deviation of test-set accuracy is \( \sigma(\text{accuracy}) = 2.699\times 10^{-2} \), and the standard deviation of loss is \( \sigma(\text{loss}) = 3.464\times 10^{-3} \).
   - The ratio of the fixed-seed to the varying-seed standard deviation is therefore about 74% for accuracy and 87% for loss.

### Conclusions and Recommendations

1. **Conclusions**
   - Approximately 74% of the standard deviation of test-set accuracy, and about 87% of the standard deviation of test-set loss, is attributable to GPU non-determinism. Its contribution therefore outweighs that of the other random sources (such as the initial weights and the order of minibatches).
   - Comparing a single accuracy value is consequently not enough when evaluating deep-learning models, because most of the run-to-run variation comes from non-deterministic factors.
2. **Recommendations**
   - When evaluating a new model, the authors recommend comparing the distribution of results for the new model against the distribution for the benchmark model, rather than comparing single accuracy values.
   - This recommendation is in line with the "Machine Learning Reproducibility Checklist" used at conferences such as NeurIPS, which asks researchers to report error ranges and measures of variation.

Through this study, the authors hope to draw attention to the impact of GPU non-determinism and to encourage more robust methods of model evaluation.
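
The order-of-operations effect described under "Sources of GPU Non-determinism" can be reproduced on a CPU in plain Python. The sketch below is only an illustration of floating-point non-associativity; it does not model how a GPU schedules its reductions, and the helper name `sum_in_order` is hypothetical.

```python
def sum_in_order(values):
    """Add the values strictly left to right, as a sequential loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Classic three-term case: the grouping changes the last bit of the result.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)

# Same three numbers, two summation orders. In the first order the 1.0 is
# absorbed by the large value before the cancellation; in the second it survives.
absorbed = sum_in_order([1e16, 1.0, -1e16])
preserved = sum_in_order([1e16, -1e16, 1.0])

print(grouped_left == grouped_right)
print(absorbed, preserved)
```

Each individual discrepancy is tiny, but a training run performs a very large number of such accumulations, and the differences feed into every subsequent weight update.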
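
The percentages quoted under "Experimental Results" follow directly from the four standard deviations reported there. A short check, using only those figures (the dictionary names are chosen here for illustration):

```python
# Standard deviations quoted in the summary above.
sigma_fixed_seed = {"accuracy": 1.995e-2, "loss": 3.020e-3}
sigma_varied_seed = {"accuracy": 2.699e-2, "loss": 3.464e-3}

# Ratio of fixed-seed to varying-seed standard deviation for each metric.
ratios = {
    metric: sigma_fixed_seed[metric] / sigma_varied_seed[metric]
    for metric in sigma_fixed_seed
}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1%}")
```

Note that these are ratios of standard deviations, which is the quantity the abstract reports, not shares of variance.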
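
One way to act on the recommendation to compare distributions rather than single values is to report the mean and standard deviation over several runs and to compute a two-sample statistic. The sketch below uses Welch's t statistic as one possible choice; the paper summary does not prescribe a specific test. The accuracy lists are made-up placeholder numbers, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) of per-run metric values."""
    return statistics.mean(runs), statistics.stdev(runs)


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error


# Hypothetical test accuracies from five runs each (placeholder data).
baseline_runs = [0.70, 0.72, 0.68, 0.71, 0.69]
candidate_runs = [0.71, 0.73, 0.70, 0.72, 0.69]

for name, runs in [("baseline", baseline_runs), ("candidate", candidate_runs)]:
    mean, std = summarize(runs)
    print(f"{name}: {mean:.3f} +/- {std:.3f} over {len(runs)} runs")

print(f"Welch t = {welch_t(candidate_runs, baseline_runs):.2f}")
```

Reporting the spread alongside the mean makes it visible when a difference between two models is smaller than the variation that GPU non-determinism alone can produce.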