Workflow Design Analysis for High Resolution Satellite Image Analysis

Ioannis Paraskevakos, Matteo Turilli, Bento Collares Gonçalves, Heather J. Lynch, Shantenu Jha
DOI: https://doi.org/10.48550/arXiv.1905.09766
2020-01-29
Abstract: Ecological sciences are using imagery from a variety of sources to monitor and survey populations and ecosystems. Very High Resolution (VHR) satellite imagery provides an effective dataset for large-scale surveys. Convolutional Neural Networks have been successfully employed to analyze such imagery and detect large animals. As the datasets increase in volume, O(TB), and number of images, O(1k), utilizing High Performance Computing (HPC) resources becomes necessary. In this paper, we investigate task-parallel, data-driven workflow designs to support imagery analysis pipelines with heterogeneous tasks on HPC. We analyze the capabilities of each design when processing a dataset of 3,000 VHR satellite images for a total of 4 TB. We experimentally model the execution time of the tasks of the image processing pipeline. We perform experiments to characterize the resource utilization, total time to completion, and overheads of each design. Based on the model, overhead, and utilization analysis, we show which design approach is best suited to scientific pipelines with similar characteristics.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses how to design and implement an efficient computing framework that supports the execution of heterogeneous tasks on high-performance computing (HPC) resources when processing large-scale satellite image datasets. Specifically, it focuses on the design and optimization of the high-resolution satellite image analysis pipelines used in ecological research. As datasets grow in volume (to the terabyte level) and in number of images (to the thousands), HPC resources become necessary. The paper aims to determine which design is best suited to scientific pipelines with similar characteristics by experimentally modeling task execution times and analyzing the resource utilization, total completion time, and overhead of each design.

### Main contributions of the paper

1. **Improvement indications for the workflow engine**: Specific suggestions on how to further develop the workflow engine so as to maximize resource utilization while minimizing workflow completion time.
2. **Design guidelines**: Specific design guidelines for task-based computing frameworks that support data-driven, computationally intensive workflows on HPC resources.
3. **Experimental comparison method**: An experiment-based method for comparing the performance of different designs that does not depend on a specific use case or computing framework.

### Use cases

The paper uses an Antarctic seal survey as its example and analyzes 3,097 satellite images with a total data volume of approximately 4 TB. This use case requires repeatedly processing these images, running both CPU and GPU code, and exchanging several gigabytes of data.
### Workflow design

The paper explores two main workflow designs:

- **Design 1: One pipeline per image**: Each pipeline consists of two stages, and each stage contains one type of task. Tasks in the first stage receive an image as input and generate slices of that image; tasks in the second stage receive the generated slices as input, count the seals in each slice, and output the result for the entire image.
- **Design 2: One pipeline for multiple images**: A queuing mechanism is introduced. Once started, tasks continue to execute until resources are exhausted. Data and control signals are passed between tasks through queues.

### Experimental results

The paper evaluates the designs through three experiments:

1. **Task execution time**: The relationship between task execution time and image size is analyzed; execution time is found to scale linearly with image size.
2. **Resource utilization**: The total resource utilization of each design is measured.
3. **Middleware overhead**: The overhead of the middleware used to implement each design is characterized.

Together, these experiments yield a performance comparison of the designs and provide a basis for choosing the design best suited to a specific scientific pipeline.
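
The queue-based structure of Design 2 can be illustrated with a minimal Python sketch. This is not the paper's implementation or its middleware: the function names (`slice_image`, `count_seals`, `run_pipeline`), the use of threads, and the sentinel-based shutdown are all assumptions chosen for illustration, and the slicing and counting functions are placeholders for the real CPU and GPU tasks.

```python
import queue
import threading

SENTINEL = None  # control signal (assumed convention): no more work on this queue


def slice_image(image_id, n_slices=4):
    """Hypothetical stand-in for the CPU task that cuts an image into slices."""
    return [f"{image_id}-slice{i}" for i in range(n_slices)]


def count_seals(image_slice):
    """Hypothetical stand-in for the GPU task; returns a fake, deterministic count."""
    return len(image_slice) % 3


def slicer(images_q, slices_q):
    # Long-running task: keeps pulling images until it receives the sentinel.
    while True:
        image_id = images_q.get()
        if image_id is SENTINEL:
            break
        slices_q.put((image_id, slice_image(image_id)))


def counter(slices_q, results, lock):
    # Long-running task: consumes slices and records one total per image.
    while True:
        item = slices_q.get()
        if item is SENTINEL:
            break
        image_id, slices = item
        total = sum(count_seals(s) for s in slices)
        with lock:
            results[image_id] = total


def run_pipeline(image_ids, n_slicers=2, n_counters=2):
    images_q, slices_q = queue.Queue(), queue.Queue()
    results, lock = {}, threading.Lock()
    slicers = [threading.Thread(target=slicer, args=(images_q, slices_q))
               for _ in range(n_slicers)]
    counters = [threading.Thread(target=counter, args=(slices_q, results, lock))
                for _ in range(n_counters)]
    for t in slicers + counters:
        t.start()
    for image_id in image_ids:
        images_q.put(image_id)
    for _ in slicers:            # one sentinel per slicer task
        images_q.put(SENTINEL)
    for t in slicers:
        t.join()
    for _ in counters:           # stop counters only after all slices are queued
        slices_q.put(SENTINEL)
    for t in counters:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["img0", "img1", "img2"]))
```

In contrast to Design 1, where each image gets its own two-stage pipeline, the same small set of tasks here stays alive and processes every image, with both data and the shutdown signal travelling through the queues.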
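
The linear relationship between task execution time and image size reported in the first experiment can be modeled with an ordinary least-squares fit. The sketch below is only an illustration of that idea: the helper names (`fit_line`, `predict_time`) are hypothetical, and the measurements are made-up numbers, not data from the paper.

```python
def fit_line(sizes_mb, times_s):
    """Ordinary least-squares fit of time = slope * size + intercept."""
    n = len(sizes_mb)
    mean_x = sum(sizes_mb) / n
    mean_y = sum(times_s) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_mb)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_mb, times_s))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_time(size_mb, slope, intercept):
    """Estimate the execution time of a task for an image of the given size."""
    return slope * size_mb + intercept


# Illustrative, made-up measurements (NOT taken from the paper).
sizes_mb = [100.0, 500.0, 1000.0, 2000.0]
times_s = [12.0, 52.0, 102.0, 202.0]

slope, intercept = fit_line(sizes_mb, times_s)

if __name__ == "__main__":
    print(f"slope={slope:.3f} s/MB, intercept={intercept:.3f} s")
    print(f"estimated time for a 1500 MB image: "
          f"{predict_time(1500.0, slope, intercept):.1f} s")
```

A fitted model of this kind lets per-image execution times be estimated from image sizes alone, which is useful when comparing how different designs distribute work across resources.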