Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Abstract

As many-core accelerators keep integrating more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because the number of possible solutions is large and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partitioning and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model, which requires expert insight into low-level hardware details, we employ machine learning techniques to learn it automatically. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
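
The model-guided search described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `search_best_config` and `toy_model`, the encoding of a configuration as a (partitions, tasks) pair, the exhaustive enumeration of candidates, and the convention that a higher predicted score is better. The toy scoring function merely stands in for a model learnt offline and is not the article's model.

```python
from itertools import product
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical encoding: (number of resource partitions, number of tasks).
Config = Tuple[int, int]


def search_best_config(
    predict_performance: Callable[[Sequence[float], Config], float],
    program_features: Sequence[float],
    partitions: Iterable[int],
    task_counts: Iterable[int],
) -> Tuple[Config, float]:
    """Return the candidate with the highest predicted performance.

    The predictor is queried once per candidate, so the program itself
    never has to be run under each configuration.
    """
    best_config, best_score = None, float("-inf")
    for config in product(partitions, task_counts):
        score = predict_performance(program_features, config)
        if score > best_score:
            best_config, best_score = config, score
    if best_config is None:
        raise ValueError("no candidate configurations supplied")
    return best_config, best_score


def toy_model(features: Sequence[float], config: Config) -> float:
    """Stand-in for a model trained offline (assumption, not the real model).

    It pretends the two features are the ideal partition and task counts
    and scores a configuration by its closeness to them.
    """
    ideal_partitions, ideal_tasks = features
    n_partitions, n_tasks = config
    distance = abs(n_partitions - ideal_partitions) + abs(n_tasks - ideal_tasks)
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    config, score = search_best_config(
        toy_model, (4, 16), partitions=[1, 2, 4, 8], task_counts=[4, 8, 16, 32]
    )
    print(config, score)
```

Because the expensive step (learning the predictor) happens offline, the runtime cost of such a search is one cheap prediction per candidate configuration.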

Metadata

Item Type:	Article
Authors/Creators:	Zhang, P; Fang, J; Yang, C; Huang, C; Tang, T; Wang, Z (https://orcid.org/0000-0001-6157-0662)
Copyright, Publisher and Additional Information:	(c) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Keywords:	Heterogeneous computing; Parallelism; Performance tuning; Machine learning
Dates:	Published: August 2020; Published (online): 3 March 2020; Accepted: 26 February 2020
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	10 Mar 2020 12:25
Last Modified:	28 Apr 2021 13:32
Status:	Published
Publisher:	Institute of Electrical and Electronics Engineers
Identification Number:	10.1109/TPDS.2020.2978045
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:158101
