Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy
FernUniversität in Hagen, Germany.
Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology. ORCID iD: 0000-0001-5241-0026
Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
2012 (English). In: Journal of Universal Computer Science, ISSN 0948-695X, Vol. 18, no. 14, pp. 1987-2023. Article in journal (Refereed). Published.
Abstract [en]

Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic as the number of cores increases. Especially for streaming computations, where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important optimization.

On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between cores over the high-bandwidth internal network. This reduces the volume of main-memory accesses and thereby improves throughput for memory-intensive computations. At the same time, throughput is constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balance, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be used, resulting in less DMA communication and scheduling overhead.
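The core idea of forwarding partial results between merger tasks without round-trips to main memory can be illustrated with a small, language-agnostic sketch. The generator below is an illustration only, not the article's Cell BE implementation: each merger yields output elements as soon as they are determined, so a consumer (the parent node in a merge tree) can process them immediately, which is the behavior that on-chip forwarding buffers make possible in hardware. The function name `pipelined_merge` is hypothetical.

```python
def pipelined_merge(left, right):
    """Merge two sorted streams element by element, forwarding each
    result as soon as it is produced. The generator models on-chip
    forwarding: no intermediate array is written back to main memory."""
    it_l, it_r = iter(left), iter(right)
    a = next(it_l, None)
    b = next(it_r, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a <= b):
            yield a
            a = next(it_l, None)
        else:
            yield b
            b = next(it_r, None)

# A four-leaf merge tree: leaf runs are merged pairwise, and the root
# merger consumes the inner mergers' outputs as they appear.
runs = [[1, 5, 9], [2, 6], [0, 7], [3, 4, 8]]
root = pipelined_merge(pipelined_merge(runs[0], runs[1]),
                       pipelined_merge(runs[2], runs[3]))
print(list(root))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In the actual on-chip setting, each merger would run on its own core and the element-by-element handoff would be a bounded buffer filled by DMA transfers; a larger buffer amortizes the per-transfer overhead, which is exactly the trade-off the mapping optimization exploits.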

In this article, we consider parallel mergesort in detail as a representative memory-intensive application, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping other streaming task graphs.
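To give a feel for the mapping problem, here is a minimal sketch of one side of the trade-off only: greedy load balancing of merge-tree nodes across cores, with node weight taken as the data volume it processes. This is not one of the article's algorithms (which additionally account for buffer memory consumption and on-chip communication load); the function name `greedy_map` and the weights are illustrative assumptions.

```python
import heapq

def greedy_map(node_loads, num_cores):
    """Greedy load-balancing sketch: assign merge-tree nodes, heaviest
    first, to the currently least-loaded core. Illustrates only the
    load-balance dimension of the mapping trade-off."""
    # Heap entries: (current load, core id, assigned node list).
    cores = [(0.0, c, []) for c in range(num_cores)]
    heapq.heapify(cores)
    for node, load in sorted(node_loads.items(), key=lambda kv: -kv[1]):
        total, c, assigned = heapq.heappop(cores)
        assigned.append(node)
        heapq.heappush(cores, (total + load, c, assigned))
    return {c: assigned for _, c, assigned in cores}

# Node loads for a binary merge tree over 8 sorted runs: the root
# processes all data (weight 8), its children half each, and so on.
loads = {"root": 8, "l": 4, "r": 4,
         "ll": 2, "lr": 2, "rl": 2, "rr": 2}
print(greedy_map(loads, 4))
```

A pure load-balance objective like this one tends to spread a subtree across many cores, which inflates on-chip communication and fragments buffer memory; the article's mapping algorithms are designed precisely to balance those competing costs.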

We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed speeds up merging, for realistic problem sizes, by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, previously the fastest mergesort implementation on Cell.

Place, publisher, year, edition, pages
Graz University of Technology, Institut für Informationssysteme und Computer Medien (IICM) / Graz University of Technology and Know-Center, 2012. Vol. 18, no. 14, pp. 1987-2023.
Keyword [en]
parallel merge sort, on-chip pipelining, multicore computing, task mapping, streaming computations
National Category
Engineering and Technology
URN: urn:nbn:se:liu:diva-85861
DOI: 10.3217/jucs-018-14-1987
ISI: 000310430600006
OAI: diva2:573245

Funding Agencies: EU FP7 (project PEPPHER, 248481); VR (Integrated Software Pipelining); SSF (ePUMA); SeRC; CUGS

Available from: 2012-11-30 Created: 2012-11-30 Last updated: 2014-10-27

Open Access in DiVA

No full text

Other links

Publisher's full text (link to article)

Search in DiVA

By author/editor
Kessler, Christoph W.
By organisation
Software and Systems; The Institute of Technology; Department of Computer and Information Science
Engineering and Technology
