Abstract: Most research on novel techniques for 3D Medical Image Segmentation (MIS) is currently done using Deep Learning with GPU accelerators. The principal challenge of such techniques is that a single input can easily exhaust computing resources and require prohibitive amounts of time to process. Distributing deep learning and scaling it across computing devices is therefore a real need for progress in this research field. Conventional distribution of neural networks consists of data parallelism, where data is scattered over resources (e.g., GPUs) to parallelize the training of a model. However, experiment parallelism is also an option, where different training processes are parallelized across resources. While the first option is much more common in 3D image segmentation, the second provides a pipeline design with less dependence among parallelized processes, allowing overhead reduction and greater potential scalability. In this work we present a design for distributed deep learning training pipelines, focusing on multi-node and multi-GPU environments, where the two distribution approaches are deployed and benchmarked. As a proof of concept we take the 3D U-Net architecture and the MSD Brain Tumor Segmentation dataset, a state-of-the-art problem in medical image segmentation with high computing and space requirements. Using the BSC MareNostrum supercomputer as the benchmarking environment, we use TensorFlow and Ray as the neural network training and experiment distribution platforms. We evaluate the experiment speed-up, showing the potential for scaling out on GPUs and nodes, and compare the two parallelism techniques, showing how experiment distribution makes better use of such resources when scaling. Finally, we provide the implementation of the design openly to the community, along with the non-trivial steps and methodology for adapting and deploying a MIS case such as the one presented here.
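
The difference between the two distribution approaches named in the abstract can be illustrated with a minimal, library-free sketch. It is not the authors' TensorFlow/Ray implementation; the function names `data_parallel_plan` and `experiment_parallel_plan` and the round-robin assignment are assumptions made purely for illustration. Under data parallelism, the samples of one training batch are split across all GPUs, which must then synchronize after every step; under experiment parallelism, whole training runs are assigned to GPUs and proceed independently.

```python
from typing import Dict, List, Sequence


def data_parallel_plan(batch: Sequence[int], num_gpus: int) -> Dict[int, List[int]]:
    """Split the samples of ONE batch of ONE training run across all GPUs.

    Hypothetical helper: every GPU works on the same model, so the GPUs
    depend on each other (gradients are combined after each step).
    """
    return {gpu: list(batch[gpu::num_gpus]) for gpu in range(num_gpus)}


def experiment_parallel_plan(experiments: Sequence[str], num_gpus: int) -> Dict[int, List[str]]:
    """Assign WHOLE training runs (experiments) to GPUs, round-robin.

    Hypothetical helper: each GPU trains its own model on its own
    configuration, so no communication between GPUs is required.
    """
    return {gpu: list(experiments[gpu::num_gpus]) for gpu in range(num_gpus)}


if __name__ == "__main__":
    # One batch of 8 sample indices shared by 4 GPUs.
    print(data_parallel_plan(range(8), num_gpus=4))
    # Five independent training configurations spread over 2 GPUs.
    print(experiment_parallel_plan(["cfg_a", "cfg_b", "cfg_c", "cfg_d", "cfg_e"], num_gpus=2))
```

In the first plan every GPU holds a slice of the same batch, which is where the inter-process dependence comes from; in the second, each GPU owns complete runs, which is the property the abstract credits with lower overhead.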

Benchmarking Performance of Deep Learning Model for Material Segmentation on Two HPC Systems

Benchmarking Resource Usage for Efficient Distributed Deep Learning

Benchmarking State-of-the-Art Deep Learning Software Tools

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Performance, Energy, and Scalability Analysis and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Benchmarking Machine Learning Applications on Heterogeneous Architecture using Reframe

A Deep Learning Method for Material Performance Recognition in Laser Additive Manufacturing

Benchmarking network fabrics for data distributed training of deep neural networks

Performance Modeling of Distributed Deep Neural Networks

TBD: Benchmarking and Analyzing Deep Neural Network Training

Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks

Scientific Machine Learning Benchmarks

Benchmarking Contemporary Deep Learning Hardware and Frameworks: A Survey of Qualitative Metrics

Benchmarking of Deep Architectures for Segmentation of Medical Images

Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Benchmarking Edge AI Platforms for High-Performance ML Inference

Benchmarking Learning Efficiency in Deep Reservoir Computing

MMBench: Benchmarking End-to-End Multi-modal DNNs and Understanding Their Hardware-Software Implications

Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors