An Automated Monitoring and Repairing System for DNN Training

Xiaoyu Zhang, Chao Shen, Shiqing Ma, Juan Zhai, Chenhao Lin
DOI: https://doi.org/10.1109/tdsc.2024.3450951
2024-01-01
IEEE Transactions on Dependable and Secure Computing
Abstract: With the widespread adoption of machine learning models, especially deep neural networks (DNNs), as an integral part of new intelligent software, new tools that effectively support the model engineering and debugging process have received extensive attention. However, existing tools provide only limited support for the training process. They are either post-training tools that fail to detect problems in time, wasting time and resources on training buggy models, or tools that merely collect training data and still require manual analysis. In this paper, we propose AutoTrainer, an automated monitoring and repairing system for DNN training, which provides real-time monitoring of the model training process and automatically repairs eight commonly seen training problems. AutoTrainer monitors the training process and detects potential training problems. For any detected problem, AutoTrainer tries to fix it with built-in state-of-the-art solutions. Our experiments on six datasets and 701 models show that the problem detection accuracy of AutoTrainer reaches 100% without false positives. Moreover, it fixes 98.42% of all detected problems and improves model accuracy by 36.42% on average.
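The monitor, detect, and repair loop described in the abstract can be illustrated with a minimal sketch. It is not AutoTrainer's implementation: the abstract does not name the eight problems, so the problem names, the thresholds, the `TrainingMonitor` class, the `REPAIRS` table, and the `monitor_run` helper are all hypothetical and chosen only to show the general shape of checking training statistics at each step and mapping a detected problem to a candidate fix.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TrainingMonitor:
    """Hypothetical monitor; thresholds are illustrative assumptions, not from the paper."""
    vanish_threshold: float = 1e-6   # assumed lower bound on the gradient norm
    explode_threshold: float = 1e3   # assumed upper bound on the gradient norm
    problems: list = field(default_factory=list)

    def check(self, step, loss, grad_norm):
        """Return the name of a problem detected at this step, or None."""
        if math.isnan(loss) or math.isinf(loss):
            problem = "non_finite_loss"
        elif grad_norm > self.explode_threshold:
            problem = "exploding_gradient"
        elif grad_norm < self.vanish_threshold:
            problem = "vanishing_gradient"
        else:
            return None
        self.problems.append((step, problem))
        return problem


# Hypothetical repair table: each problem name maps to a candidate fix to try.
REPAIRS = {
    "non_finite_loss": "lower the learning rate",
    "exploding_gradient": "enable gradient clipping",
    "vanishing_gradient": "change activation or initialization",
}


def monitor_run(records, monitor):
    """Scan (step, loss, grad_norm) records and stop at the first detected problem."""
    for step, loss, grad_norm in records:
        problem = monitor.check(step, loss, grad_norm)
        if problem is not None:
            return step, problem, REPAIRS[problem]
    return None
```

In a real training loop the records would come from the framework at the end of each batch or epoch, and the suggested repair would be applied before training resumes. Checking during training rather than afterwards is what lets a faulty run be stopped early instead of consuming its full budget.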