Dynamic partitioning-based JPEG decompression on heterogeneous multicore architectures
Wasuwee Sodsong, Jingun Hong, Seongwook Chung, Yeongkyu Lim, Shin‐Dug Kim, Bernd Burgstaller
DOI: https://doi.org/10.1002/cpe.3620
2015-08-14
Concurrency and Computation: Practice and Experience
Abstract: With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets, and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable of joining the forces of a system's CPU and graphics processing unit (GPU) for JPEG decoding. In this paper, we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and a general‐purpose GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses (1) the CPU and GPU performance characteristics, (2) the image entropy, and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our run‐time partitioning and scheduling scheme exploits task, data, and pipeline parallelism by scheduling the non‐parallelizable entropy‐decoding task on the CPU, whereas inverse discrete cosine transformations, color conversions, and upsampling are conducted on both the CPU and the GPU. We have implemented the proposed method in the context of the libjpeg‐turbo library, which is an industrial‐strength JPEG encoding and decoding engine. Libjpeg‐turbo's hand‐optimized SIMD routines for ARM and x86 architectures constitute a competitive yardstick for the comparison with the proposed approach. We have evaluated our approach for a total of 7194 JPEG images across four high‐end and middle‐end CPU–GPU combinations including a mobile GPU. We achieve speedups of up to 5.2× over the SIMD version of libjpeg‐turbo, and speedups of up to 10.5× over its sequential code.
Taking into account the non‐parallelizable JPEG entropy‐decoding part, our approach achieves up to 97% of the theoretically attainable maximal speedup, with an average of 94%. Copyright © 2015 John Wiley & Sons, Ltd.
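The workload balancing and the speedup ceiling described in the abstract can be illustrated with a minimal sketch. It is not the paper's performance model: it assumes a simple linear cost per image row on each device and a fixed GPU overhead, and the names `split_rows`, `speedup_bound`, and all parameters are hypothetical. The first function picks a row split so that CPU and GPU finish the parallelizable phase (IDCT, upsampling, color conversion) at roughly the same time; the second gives an Amdahl-style upper bound when entropy decoding stays sequential.

```python
def split_rows(height, cpu_cost_per_row, gpu_cost_per_row, gpu_fixed_overhead=0.0):
    """Return (cpu_rows, gpu_rows) for the parallelizable phase.

    Assumption (not from the paper): each device's time is linear in the
    number of rows, and the GPU pays a fixed overhead (e.g. transfers).
    """
    if height <= 0:
        return 0, 0
    if cpu_cost_per_row <= 0 or gpu_cost_per_row <= 0:
        raise ValueError("per-row costs must be positive")
    # Balance condition: overhead + g * gpu_cost == (height - g) * cpu_cost.
    g = (height * cpu_cost_per_row - gpu_fixed_overhead) / (
        cpu_cost_per_row + gpu_cost_per_row
    )
    # Clamp to a valid row count; a large overhead keeps all rows on the CPU.
    gpu_rows = min(height, max(0, round(g)))
    return height - gpu_rows, gpu_rows


def speedup_bound(entropy_time, parallel_time):
    """Amdahl-style ceiling: sequential time divided by the entropy-decoding
    time that cannot be parallelized (assumes the rest is fully hidden)."""
    if entropy_time <= 0:
        raise ValueError("entropy_time must be positive")
    return (entropy_time + parallel_time) / entropy_time
```

For example, with a hypothetical GPU that is three times faster per row than the CPU and no fixed overhead, `split_rows(1000, 3.0, 1.0)` assigns 750 rows to the GPU and 250 to the CPU. In practice the per-row costs would come from an offline profiling step such as the one the abstract mentions.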