Publications (10 of 100)
Li, L., Dastgeer, U. & Kessler, C. (2016). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. Parallel Computing, 51, 37-45.
Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems
2016 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, pp. 37-45. Article in journal (Refereed), Published
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are especially important for heterogeneous computing systems but require good performance models. Empirical performance models, which require little or no human effort, become practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variant for given run-time call context property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques that better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy.
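The convexity-based pruning idea can be sketched as follows (a hypothetical illustration, not the authors' implementation; the 2D property space, the recursive bisection, and all names are assumptions): if one variant is fastest at all corners of a subrange of the call-context property space, the heuristic closes that subrange as a decision-tree leaf without sampling its interior.

```cpp
#include <functional>
#include <utility>
#include <vector>

// One axis-aligned rectangle in an assumed 2D run-time context-property
// space, e.g. (problem size, batch count). Names are illustrative only.
struct Range { int x0, x1, y0, y1; };

// Recursively sample a range: if the same variant wins at all four
// corners, assume (by the convexity heuristic) that it wins everywhere
// inside and prune; otherwise split the longer axis and recurse.
// 'fastest' measures and returns the winning variant id at a point.
inline void prune_sample(const Range& r,
                         const std::function<int(int,int)>& fastest,
                         std::vector<std::pair<Range,int>>& leaves) {
    int c1 = fastest(r.x0, r.y0), c2 = fastest(r.x1, r.y0);
    int c3 = fastest(r.x0, r.y1), c4 = fastest(r.x1, r.y1);
    bool closed = (c1 == c2 && c2 == c3 && c3 == c4);
    bool atomic = (r.x1 - r.x0 <= 1 && r.y1 - r.y0 <= 1);
    if (closed || atomic) {            // becomes a decision-tree leaf
        leaves.push_back({r, c1});
        return;
    }
    if (r.x1 - r.x0 >= r.y1 - r.y0) {  // split the longer axis
        int xm = (r.x0 + r.x1) / 2;
        prune_sample({r.x0, xm, r.y0, r.y1}, fastest, leaves);
        prune_sample({xm, r.x1, r.y0, r.y1}, fastest, leaves);
    } else {
        int ym = (r.y0 + r.y1) / 2;
        prune_sample({r.x0, r.x1, r.y0, ym}, fastest, leaves);
        prune_sample({r.x0, r.x1, ym, r.y1}, fastest, leaves);
    }
}
```

When the winning variant changes only along a boundary in the space, large uniform subranges close after four corner samples, which is where the training-time savings come from.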

Place, publisher, year, edition, pages
Elsevier Science BV, 2016
Keyword
Smart sampling; Heterogeneous systems; Component selection
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-125830 (URN)
10.1016/j.parco.2015.09.003 (DOI)
000370093800004 ()
Note

Funding Agencies: EU; SeRC project OpCoReS

Available from: 2016-03-08 Created: 2016-03-04 Last updated: 2018-01-10
Dastgeer, U. & Kessler, C. (2016). Smart Containers and Skeleton Programming for GPU-Based Systems. International journal of parallel programming, 44(3), 506-530.
Smart Containers and Skeleton Programming for GPU-Based Systems
2016 (English). In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 44, no. 3, pp. 506-530. Article in journal (Refereed), Published
Abstract [en]

In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers help achieve asynchronous execution of skeleton calls while providing implicit synchronization capabilities in a data-consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.
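The data-transfer bookkeeping behind such smart containers can be sketched roughly as follows (an illustrative model only, not the actual SkePU container interface; the class and member names are invented): the container tracks which copy of its operand data is valid and copies lazily, so consecutive device-side skeleton calls reuse device-resident data instead of transferring on every call.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of a "smart container": element access looks like an
// STL container, but the container records where the valid copy of the
// data lives and performs host<->device transfers only when needed.
class SmartVector {
    std::vector<double> host_;
    bool host_valid_ = true;    // is the host copy up to date?
    bool dev_valid_  = false;   // is the (simulated) device copy up to date?
public:
    std::size_t transfers = 0;  // counts copies actually performed

    explicit SmartVector(std::size_t n) : host_(n, 0.0) {}

    // Host-side access: make the host copy valid first. The access may
    // write, so the device copy is conservatively marked stale.
    double& operator[](std::size_t i) {
        if (!host_valid_) { ++transfers; host_valid_ = true; }
        dev_valid_ = false;
        return host_[i];
    }

    // Called before a device-side skeleton call: copy only if needed.
    void ensure_on_device() {
        if (!dev_valid_) { ++transfers; dev_valid_ = true; }
    }

    // Called when a device-side skeleton call wrote the operand.
    void device_wrote() { dev_valid_ = true; host_valid_ = false; }
};
```

Two back-to-back device-side calls then cost one transfer in, not two, and the read-back to the host is deferred until the host actually touches the data.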

Place, publisher, year, edition, pages
Springer/Plenum Publishers, 2016
Keyword
SkePU; Smart containers; Skeleton programming; Memory management; Runtime optimizations; GPU-based systems
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-128719 (URN)
10.1007/s10766-015-0357-6 (DOI)
000374897200008 ()
Note

Funding Agencies: EU; SeRC

Available from: 2016-06-07 Created: 2016-05-30 Last updated: 2018-01-10
Melot, N., Kessler, C., Keller, J. & Eitschberger, P. (2015). Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Many-Core Systems. Paper presented at The 10th HiPEAC conference, January 19-21, Amsterdam, The Netherlands. ACM Transactions on Architecture and Code Optimization (TACO), 11(4), 62.
Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Many-Core Systems
2015 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, Vol. 11, no. 4, article 62. Article in journal (Refereed), Published
Abstract [en]

Effectively exploiting massively parallel architectures is a major challenge that stream programming can help address. We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or moldable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks run concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data-driven way. A stream of data flows through the tasks, and intermediate results may be forwarded to other tasks, as in a pipelined task graph. In this article, we consider crown scheduling, a novel technique for the combined optimization of resource allocation, mapping, and discrete voltage/frequency scaling for moldable streaming task collections, in order to optimize energy efficiency given a throughput constraint. We first present optimal offline algorithms for separate and integrated crown scheduling based on integer linear programming (ILP), making no restricting assumption about speedup behavior. We introduce the fast heuristic Longest Task, Lowest Group (LTLG) as a generalization of the Longest Processing Time (LPT) algorithm to achieve a load-balanced mapping of parallel tasks, and the Height heuristic for crown frequency scaling. We use them in feedback-loop heuristics based on binary search and simulated annealing to optimize crown allocation.

Our experimental evaluation of the ILP models for a generic manycore architecture shows that, at least for small and medium-sized streaming task collections, even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds. Our heuristics produce makespan and energy consumption close to optimality within the limits of the phase-separated crown scheduling technique and the crown structure. Their optimization time is longer than that of the other algorithms we test, but our heuristics consistently produce better solutions.
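The LPT algorithm that LTLG generalizes can be sketched in its classic sequential-task form (a textbook illustration under assumed task weights; LTLG itself additionally handles moldable tasks and the crown's group structure, which this sketch omits): sort tasks by decreasing work and always give the next task to the currently least-loaded core.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Longest Processing Time (LPT) list scheduling for sequential tasks:
// greedily map each task, in decreasing order of work, onto the core
// with the smallest accumulated load. Returns the per-core loads.
inline std::vector<double> lpt_loads(std::vector<double> work, int cores) {
    std::sort(work.begin(), work.end(), std::greater<double>());
    using Slot = std::pair<double,int>;            // (load, core id)
    std::priority_queue<Slot, std::vector<Slot>,
                        std::greater<Slot>> pq;    // min-heap by load
    for (int c = 0; c < cores; ++c) pq.push({0.0, c});
    std::vector<double> load(cores, 0.0);
    for (double w : work) {
        auto [l, c] = pq.top(); pq.pop();
        load[c] = l + w;
        pq.push({load[c], c});
    }
    return load;
}
```

For tasks of work 4, 3, 3 and 2 on two cores this yields a perfectly balanced load of 6 per core; the paper's LTLG extends this greedy idea to parallel tasks whose width can be chosen by the scheduler.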

Place, publisher, year, edition, pages
New York, NY, USA: ACM Digital Library, 2015
Keyword
Multicore, frequency scaling, manycore, mapping, parallel energy, scheduling, streaming
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-114280 (URN)
10.1145/2687653 (DOI)
000348232000028 ()
Conference
The 10th HiPEAC conference, January 19-21, Amsterdam, The Netherlands
Funder
EU, FP7, Seventh Framework Programme, EXCESS; Swedish e-Science Research Center, OpCoReS
Available from: 2015-02-16 Created: 2015-02-16 Last updated: 2015-03-02. Bibliographically approved
Li, L. & Kessler, C. (2015). MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection. In: Trustcom/BigDataSE/ISPA, 2015 IEEE. Paper presented at The 1st IEEE International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms, held in conjunction with IEEE ISPA-15, Helsinki, Finland, August 20-22, 2015 (pp. 154-159). IEEE Press, Vol. 3.
MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection
2015 (English). In: Trustcom/BigDataSE/ISPA, 2015 IEEE, IEEE Press, 2015, Vol. 3, pp. 154-159. Conference paper, Published paper (Refereed)
Abstract [en]

We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g. CPU, DRAM, GPU), using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only can legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, be easily retargeted for energy optimization, but switching between different optimization goals for arbitrary code sections also becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.
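The kind of trait-based measurement abstraction described can be sketched as follows (an assumed design for illustration, not the actual MeterPU API; the `CpuTime` metric and the `Meter` interface names are invented): the metric is a type parameter, so switching the optimization goal of a measured code section means changing only the template argument, not the measured region.

```cpp
#include <chrono>

// One pluggable metric implementation: wall-clock time in seconds.
// An energy metric would expose the same start()/stop() shape but
// read a platform-specific counter instead.
struct CpuTime {
    using Value = double;
    std::chrono::steady_clock::time_point t0;
    void start() { t0 = std::chrono::steady_clock::now(); }
    Value stop() {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    }
};

// Common interface over all metrics: client code is written once
// against Meter<Metric> and retargeted by swapping the type parameter.
template <class Metric>
class Meter {
    Metric m_;
    typename Metric::Value last_{};
public:
    void start() { m_.start(); }
    void stop()  { last_ = m_.stop(); }
    typename Metric::Value get_value() const { return last_; }
};
```

An autotuner that picks the back end minimizing `Meter<CpuTime>` readings can then be re-instantiated with an energy metric to become energy-tuned, which is the retargeting the abstract describes.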

Place, publisher, year, edition, pages
IEEE Press, 2015
Keyword
MeterPU, measurement abstraction, SkePU, skeleton programming, heterogeneous, GPU, C++ Trait, Energy measurement, Time measurement
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-129103 (URN)
10.1109/Trustcom.2015.625 (DOI)
000380431400020 ()
978-1-4673-7952-6 (ISBN)
Conference
The 1st IEEE International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms held in conjunction with IEEE ISPA-15, Helsinki, Finland, August 20-22, 2015
Projects
EU FP7 EXCESS 611183 (EU Seventh Framework Programme); SeRC OpCoReS (Swedish e-Science Research Centre)
Available from: 2016-06-12 Created: 2016-06-12 Last updated: 2016-10-13. Bibliographically approved
Melot, N., Janzen, J. & Kessler, C. (2015). Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures. In: 2015 44th International Conference on Parallel Processing Workshops. Paper presented at 44th Annual International Conference on Parallel Processing Workshops (ICPPW) (pp. 146-155). IEEE.
Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures
2015 (English). In: 2015 44th International Conference on Parallel Processing Workshops, IEEE, 2015, pp. 146-155. Conference paper, Published paper (Refereed)
Abstract [en]

Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation sections, the code that transforms the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Moreover, many published scheduling algorithms are not tested against a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools employed to elaborate scheduling papers. We propose Schedeval, a tool that integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud Computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications running on manycore architectures.

Place, publisher, year, edition, pages
IEEE, 2015
Series
International Conference on Parallel Processing Workshops, ISSN 1530-2016
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-130085 (URN)
10.1109/ICPPW.2015.24 (DOI)
000377378800021 ()
978-1-4673-7589-4 (ISBN)
Conference
44th Annual International Conference on Parallel Processing Workshops (ICPPW)
Available from: 2016-07-06 Created: 2016-07-06 Last updated: 2018-01-10
Dastgeer, U. & Kessler, C. (2015). Performance-aware Composition Framework for GPU-based Systems. Journal of Supercomputing, 71(12), 4646-4662.
Performance-aware Composition Framework for GPU-based Systems
2015 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no. 12, pp. 4646-4662. Article in journal (Refereed), Published
Abstract [en]

User-level components of applications can be made performance-aware by annotating them with a performance model and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection, both with performance characteristics known (or learned) beforehand and with characteristics learned at runtime. We also demonstrate the hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data-flow information extracted from the source program by static analysis. Bulk composition improves over the traditional greedy performance-aware policy that considers only the current call for optimization.

Place, publisher, year, edition, pages
Springer, 2015
Keyword
Global composition, Multivariant software components, Implementation selection, Hybrid parallel execution, GPU-based systems, Performance portability, Autotuning, Optimizing compiler
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109661 (URN)
10.1007/s11227-014-1105-1 (DOI)
000365185400015 ()
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e-Science Research Center, OpCoReS
Available from: 2014-08-22 Created: 2014-08-22 Last updated: 2018-01-11
Kessler, C., Li, L., Atalar, A. & Dobre, A. (2015). XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization. In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 Sep. 2015. Paper presented at 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, Beijing, 1-4 Sep. 2015 (pp. 51-60). Institute of Electrical and Electronics Engineers (IEEE).
XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization
2015 (English). In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 Sep. 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 51-60. Conference paper, Published paper (Refereed)
Abstract [en]

We present XPDL, a modular, extensible platform description language for heterogeneous multicore systems and clusters. XPDL specifications provide platform metadata about hardware and installed system software that are relevant for the adaptive static and dynamic optimization of application programs and system settings for improved performance and energy efficiency. XPDL is based on XML and uses hyperlinks to create distributed libraries of platform metadata specifications. We also provide the first components of a retargetable tool chain that browses and processes XPDL specifications and generates driver code for microbenchmarking to bootstrap empirical performance and energy models at deployment time. A C++-based API enables convenient introspection of platform models, even at run time, which allows for adaptive dynamic program optimizations such as tuned selection of implementation variants.
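Run-time introspection of a hierarchical platform model might look roughly like this (a hypothetical sketch; the `Node` type, the property names, and the inheritance-by-parent lookup are assumptions, not the actual XPDL tool-chain API): a component inherits properties from its enclosing component, e.g. a core from its socket, so a query walks up the hierarchy until the property is found.

```cpp
#include <map>
#include <string>

// One node of an assumed hierarchical platform model (cluster ->
// node -> socket -> core, say). Properties not set locally are
// resolved by walking up through the parent chain.
struct Node {
    const Node* parent = nullptr;
    std::map<std::string, long> props;

    // Return the nearest enclosing value of 'key', or 'fallback'
    // if no ancestor defines it.
    long query(const std::string& key, long fallback) const {
        for (const Node* n = this; n != nullptr; n = n->parent) {
            auto it = n->props.find(key);
            if (it != n->props.end()) return it->second;
        }
        return fallback;
    }
};
```

An adaptive optimizer could query, say, the maximum core frequency or available device memory through such an interface at run time and pick an implementation variant accordingly.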

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
Series
International Conference on Parallel Processing Workshops, ISSN 1530-2016
Keyword
architecture description language, computer architecture modeling, system modeling, energy optimization toolchain, retargetability, heterogeneous parallel system, platform description language, XML
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-129097 (URN)
10.1109/ICPPW.2015.17 (DOI)
000377378800008 ()
Conference
44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, Beijing, 1-4 Sep. 2015
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e-Science Research Center, OpCoReS
Available from: 2016-06-11 Created: 2016-06-11 Last updated: 2018-01-10
Hansson, E., Alnervik, E., Kessler, C. & Forsell, M. (2014). A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs. In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany. Paper presented at 27th International Conference on Architecture of Computing Systems (ARCS) 2014, PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Feb. 2014 (pp. 27-33). Lübeck, Germany: VDE Verlag GmbH.
A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs
2014 (English). In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Lübeck, Germany: VDE Verlag GmbH, 2014, pp. 27-33. Conference paper, Published paper (Refereed)
Abstract [en]

The performance of current multicore CPUs and GPUs is limited in computations that make frequent use of communication/synchronization between the subtasks executed in parallel. This is because directory-based cache systems scale weakly and/or the cost of synchronization is high. Emulated Shared Memory (ESM) architectures, relying on multithreading and efficient synchronization mechanisms, have been developed to solve these problems, which affect both the performance and the programmability of current machines. In this paper, we present a preliminary performance comparison of three hardware-implemented ESM architectures with state-of-the-art multicore CPUs and GPUs. The benchmarks are selected to cover different patterns of parallel computation and therefore reveal the performance potential of ESM architectures with respect to current multicores.

Place, publisher, year, edition, pages
Lübeck, Germany: VDE Verlag GmbH, 2014
Series
PARS-Mitteilungen, ISSN 0177-0454 ; 31
Keyword
Parallel computing, performance analysis, GPU, chip multiprocessor, shared memory
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114341 (URN)
978-3-8007-3579-2 (ISBN)
Conference
27th International Conference on Architecture of Computing Systems (ARCS) 2014, PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Feb. 2014
Projects
REPLICA; SeRC OpCoReS
Funder
Swedish e‐Science Research Center, OpCoReS
Available from: 2015-02-18 Created: 2015-02-18 Last updated: 2018-01-11
Dastgeer, U. & Kessler, C. (2014). Conditional component composition for GPU-based systems. In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014. Paper presented at the Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at the HiPEAC-2014 Conference, Vienna, Austria, Jan. 2014. Vienna, Austria: HiPEAC NoE.
Conditional component composition for GPU-based systems
2014 (English). In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014, Vienna, Austria: HiPEAC NoE, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

User-level components can expose multiple functionally equivalent implementations with different resource requirements and performance characteristics. A composition framework can then choose a suitable implementation for each component invocation, guided by an objective function (execution time, energy, etc.). In this paper, we describe the idea of conditional composition, which enables the component writer to specify constraints on the selectability of a given component implementation based on information about the target system and component call properties. By incorporating such information, more informed and user-guided composition decisions can be made, and thus more efficient code can be generated, as shown with an example scenario for a GPU-based system.
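The selection mechanism of conditional composition can be sketched as follows (an illustrative sketch, not the authors' framework; the call properties, the predicate form, and the cost model are invented): each implementation variant carries a selectability predicate over call properties, and the composer filters by constraint before making the cost-based choice.

```cpp
#include <functional>
#include <string>
#include <vector>

// Example call properties: problem size and operand placement.
struct Call { long n; bool on_gpu_data; };

// An implementation variant: a constraint on when it may be chosen,
// plus a predicted cost used to rank the remaining candidates.
struct Variant {
    std::string name;
    std::function<bool(const Call&)>   selectable;
    std::function<double(const Call&)> predicted_cost;
};

// Consider only variants whose constraint holds for this call, then
// pick the one with the lowest predicted cost ("none" if all filtered).
inline std::string compose(const std::vector<Variant>& vs, const Call& c) {
    const Variant* best = nullptr;
    for (const auto& v : vs) {
        if (!v.selectable(c)) continue;      // constraint filters first
        if (!best || v.predicted_cost(c) < best->predicted_cost(c))
            best = &v;
    }
    return best ? best->name : "none";
}
```

A component writer could thus rule out a GPU variant for calls too small to amortize the kernel-launch and transfer overhead, without touching the cost model itself.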

Place, publisher, year, edition, pages
Vienna, Austria: HiPEAC NoE, 2014
Series
MULTIPROG workshop series
Keyword
heterogeneous multicore system, parallel programming, platform description language, constrained optimization, code generation, software composition, parallel computing
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114340 (URN)
Conference
Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014 Conference, Vienna, Austria, Jan. 2014
Projects
EU FP7 EXCESS; EU FP7 PEPPHER; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183 (EXCESS); Swedish e-Science Research Center, OpCoReS
Available from: 2015-02-18 Created: 2015-02-18 Last updated: 2018-01-11. Bibliographically approved
Forsell, M., Hansson, E., Kessler, C., Mäkelä, J.-M. & Leppänen, V. (2014). NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures. International Journal of Networking and Computing, 4(1), 189-206.
NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures
2014 (English). In: International Journal of Networking and Computing, ISSN 2185-2839, E-ISSN 2185-2847, Vol. 4, no. 1, pp. 189-206. Article in journal (Refereed), Published
Abstract [en]

Emulated shared memory (ESM) architectures are good candidates for future general-purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel, synchronous model of computation to programmers, as well as to avoid most performance bottlenecks present in current multicore architectures. To achieve full performance, however, applications must have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with legacy sequential code via the NUMA mechanism. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes of the architecture well: the memories of the different modes are isolated, and the programming interface is therefore non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms, and the software techniques provide a mechanism to integrate and optimize NUMA computation within the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

Keyword
Parallel Random Access Machine; NUMA; Shared Memory; Multicore Architecture; Reconfigurable Processor; Multithreaded Processor; PRAM Emulation; Parallel Computing; Parallel Computer Architecture
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109660 (URN)
Projects
REPLICA
Available from: 2014-08-22 Created: 2014-08-22 Last updated: 2018-01-11. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-5241-0026
