Comparing Workflow Application Designs for High Resolution Satellite Image Analysis

Aymen Al-Saadi, Ioannis Paraskevakos, Bento Collares Gonçalves, Heather J. Lynch, Shantenu Jha, Matteo Turilli
DOI: https://doi.org/10.48550/arXiv.2010.14756
2020-10-27
Abstract: Very High Resolution satellite and aerial imagery are used to monitor and conduct large scale surveys of ecological systems. Convolutional Neural Networks have successfully been employed to analyze such imagery to detect large animals and salient features. As the datasets increase in volume and number of images, utilizing High Performance Computing resources becomes necessary. In this paper, we investigate three task-parallel, data-driven workflow designs to support imagery analysis pipelines with heterogeneous tasks on HPC. We analyze the capabilities of each design when processing datasets from two use cases for a total of 4,672 satellite and aerial images, and 8.35 TB of data. We experimentally model the execution time of the tasks of the image processing pipelines. We perform experiments to characterize the resource utilization, total time to completion, and overheads of each design. Based on the model, overhead and utilization analysis, we show which design is best suited to scientific pipelines with similar characteristics.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses how to design and implement an efficient task-parallel, data-driven workflow for high-resolution satellite image analysis that supports the execution of heterogeneous tasks on high-performance computing (HPC) resources. As datasets grow in volume and number of images, using HPC resources becomes necessary. By studying three task-parallel workflow designs, the paper targets the following specific problems:

1. **Support for heterogeneous tasks**: Tasks may require different numbers of CPUs and GPUs, implement different functions, and have different running times. The tasks also have data dependencies among them, so effective scheduling, correct resource binding, and efficient data management are required.
2. **Challenges in large-scale data processing**: As datasets grow, the number of images to process makes parallel processing necessary. However, existing HPC infrastructure is geared toward executing single long-running tasks, which makes large-scale, many-task data processing challenging.
3. **Performance optimization**: For large image datasets, the paper focuses on optimizing resource utilization, reducing total time to completion, and lowering overheads. The authors experimentally model task execution time and measure the resource utilization, total time to completion, and overheads of each design.
4. **Guidance on design choices**: Without architectural and performance analysis, it is difficult to choose among functionally equivalent implementations. By comparing the performance of the three designs, the paper provides a basis for choosing the most suitable one, especially for scientific workflows with similar characteristics.
In summary, the paper's main objective is to guide performance optimization and design choices for large-scale, high-resolution satellite image analysis by comparing three workflow designs.
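To make the idea of heterogeneous, data-dependent tasks more concrete, the following is a minimal Python sketch. It is not one of the three designs compared in the paper. The `Task` class, the `schedule_waves` function, the task names (`tile_0`, `predict_0`, and so on), and the node size of 4 CPUs and 1 GPU are all hypothetical and chosen only for illustration. The sketch greedily groups tasks into "waves": a task joins a wave only if all of its dependencies finished in an earlier wave and its CPU and GPU requirements still fit in the remaining resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description: a name, resource needs, and dependencies."""
    name: str
    cpus: int
    gpus: int
    depends_on: tuple = ()


def schedule_waves(tasks, total_cpus, total_gpus):
    """Greedily group tasks into waves of concurrently runnable tasks.

    A task is placed in a wave only if every dependency completed in an
    earlier wave and its CPU/GPU request fits in what is still free.
    """
    done = set()
    waves = []
    while len(done) < len(tasks):
        free_cpus, free_gpus = total_cpus, total_gpus
        wave = []
        for task in tasks:
            if task.name in done:
                continue
            # Data dependency: inputs must come from an earlier wave.
            if not all(dep in done for dep in task.depends_on):
                continue
            # Resource binding: the task must fit in the remaining resources.
            if task.cpus <= free_cpus and task.gpus <= free_gpus:
                wave.append(task.name)
                free_cpus -= task.cpus
                free_gpus -= task.gpus
        if not wave:
            raise ValueError("no runnable task: dependency cycle, unknown "
                             "dependency, or a task larger than the node")
        done.update(wave)
        waves.append(wave)
    return waves


# Hypothetical two-image pipeline: a CPU-only step feeds a GPU step.
tasks = [
    Task("tile_0", cpus=1, gpus=0),
    Task("tile_1", cpus=1, gpus=0),
    Task("predict_0", cpus=1, gpus=1, depends_on=("tile_0",)),
    Task("predict_1", cpus=1, gpus=1, depends_on=("tile_1",)),
]

waves = schedule_waves(tasks, total_cpus=4, total_gpus=1)
for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

With a single GPU, the two GPU tasks end up in separate waves even though CPUs sit idle, which illustrates in a small way why resource utilization and scheduling overheads matter when comparing designs. A real workflow system would schedule continuously rather than in synchronized waves and would also manage data movement.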