Publications (10 of 11)
Li, L. & Kessler, C. (2018). Lazy Allocation and Transfer Fusion Optimization for GPU-based Heterogeneous Systems. In: 2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018): . Paper presented at 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) (pp. 311-315). IEEE
Lazy Allocation and Transfer Fusion Optimization for GPU-based Heterogeneous Systems
2018 (English). In: 2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), IEEE, 2018, p. 311-315. Conference paper, Published paper (Refereed)
Abstract [en]

We present two memory optimization techniques that improve the efficiency of data transfers over the PCIe bus for GPU-based heterogeneous systems, namely lazy allocation and transfer fusion optimization. Both are based on merging data transfers so that less overhead is incurred, thereby increasing transfer throughput and making accelerator usage profitable also for smaller operand sizes. We provide the design and a prototype implementation of the two techniques in CUDA. Microbenchmarking results show that significant speedups can be achieved, especially for small and medium-sized operands. We also prove that our transfer fusion optimization algorithm is optimal.
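The abstract gives no code; as a rough illustration of the fusion idea only, the following minimal C++ sketch (illustrative names, not the paper's CUDA prototype) packs several small operands into one staging buffer so that a single transfer replaces many, paying the fixed per-transfer overhead once. The `device_transfer` stub stands in for one PCIe copy, e.g. a single cudaMemcpy in a CUDA build.

```cpp
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>

// Stand-in for one PCIe transfer (e.g. a single cudaMemcpy in a CUDA build).
// Each call pays a fixed latency, so fewer, larger calls are cheaper overall.
static void device_transfer(const void* /*src*/, std::size_t bytes) {
    std::cout << "transfer of " << bytes << " bytes\n";
}

struct PendingCopy { const void* src; std::size_t bytes; };

// Collect small operands and ship them in one fused transfer.
class TransferFuser {
    std::vector<PendingCopy> pending_;
public:
    void enqueue(const void* src, std::size_t bytes) {
        pending_.push_back({src, bytes});
    }
    // Pack all pending operands into one staging buffer, issue a single transfer.
    void flush() {
        std::size_t total = 0;
        for (const auto& p : pending_) total += p.bytes;
        std::vector<unsigned char> staging(total);
        std::size_t offset = 0;
        for (const auto& p : pending_) {
            std::memcpy(staging.data() + offset, p.src, p.bytes);
            offset += p.bytes;
        }
        device_transfer(staging.data(), total);   // one transfer instead of N
        pending_.clear();
    }
};

int main() {
    int a[4] = {1, 2, 3, 4};
    double b[2] = {0.5, 1.5};
    TransferFuser fuser;
    fuser.enqueue(a, sizeof a);   // would otherwise be two separate small copies
    fuser.enqueue(b, sizeof b);
    fuser.flush();                // fused into a single transfer
}
```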

Place, publisher, year, edition, pages
IEEE, 2018
Series
Euromicro Conference on Parallel Distributed and Network-Based Processing, ISSN 1066-6192
Keywords
adaptive message fusion; GPU; CUDA; lazy memory allocation; memory transfer optimization
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-151524 (URN); 10.1109/PDP2018.2018.00054 (DOI); 000443807600045 (); 978-1-5386-4975-6 (ISBN)
Conference
26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
Note

Funding Agencies|SeRC; NSC / SNIC [SNIC 2016/5-6]

Available from: 2018-09-24 Created: 2018-09-24 Last updated: 2018-10-19
Li, L. (2018). Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems. (Doctoral dissertation). Linköping: Linköping University Electronic Press
Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems
2018 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption compared to homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high performance computing hardware. However, as a double-edged sword, the heterogeneity also brings significant programming complexities that prevent the easy and efficient usage of such heterogeneous systems. In this thesis, we are interested in four kinds of fundamental complexities associated with these systems: measurement complexity (the effort required to measure a metric, e.g., energy), CPU-GPU selection complexity, platform complexity and data management complexity. We explore new low-cost programming abstractions to hide these complexities, and propose new optimization techniques that can be performed under the hood.

For the measurement complexity, although measuring time is trivial with native library support, measuring energy consumption, especially for systems with GPUs, is complex because of the low-level details involved, such as choosing the right measurement method, handling the trade-off between sampling rate and accuracy, and switching between different measurement metrics. We propose a clean interface, with an implementation, that not only hides the complexity of energy measurement but also unifies different kinds of measurements. The unification bridges the gap between time measurement and energy measurement: if a time optimization technique makes no metric-specific assumptions, energy optimization can be performed by blindly reusing that technique.
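As a rough illustration of why such unification enables reuse (a sketch under assumed names, not the thesis' actual interface): if the tuning loop only sees a generic metric's start/stop/value calls, retargeting it from time to energy is a matter of swapping the metric type parameter.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical time metric. An energy metric would expose the same three
// members but read hardware counters (e.g. RAPL or NVML) instead of a clock.
struct CpuTime {
    using value_type = double;
    std::chrono::steady_clock::time_point t0;
    value_type elapsed = 0.0;
    void start() { t0 = std::chrono::steady_clock::now(); }
    void stop()  { elapsed = std::chrono::duration<value_type>(
                       std::chrono::steady_clock::now() - t0).count(); }
    value_type value() const { return elapsed; }
};

// The optimization loop is written once against the metric interface only,
// so reusing it for energy means changing nothing but the Metric parameter.
template <class Metric, class Variants>
std::size_t pick_best(const Variants& variants) {
    std::size_t best = 0;
    typename Metric::value_type best_cost{};
    for (std::size_t i = 0; i < variants.size(); ++i) {
        Metric m;
        m.start();
        variants[i]();                       // trial execution of variant i
        m.stop();
        if (i == 0 || m.value() < best_cost) { best_cost = m.value(); best = i; }
    }
    return best;
}

int main() {
    std::vector<void (*)()> variants = {
        [] { volatile long s = 0; for (long i = 0; i < 1000000; ++i) s += i; },
        [] { volatile long s = 0; for (long i = 0; i < 10000;   ++i) s += i; },
    };
    std::cout << "cheapest variant under CpuTime: "
              << pick_best<CpuTime>(variants) << "\n";   // expected: 1
}
```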

For the CPU-GPU selection complexity, which relates to efficient utilization of the heterogeneous hardware, we propose a new adaptive-sampling based mechanism for constructing selection predictors, which can adapt to different hardware platforms automatically and shows non-trivial advantages over random sampling.

For the platform complexity, we propose a new modular platform modeling language and its implementation to formally and systematically describe a computer system, enabling zero-overhead platform information queries for high-level software tool chains and for programmers, as a basis for making software adaptive.

For the data management complexity, we propose a new mechanism that provides a unified memory view on heterogeneous systems with separate memory spaces. This mechanism enables programmers to write significantly less code, which runs as fast as expert-written code and outperforms the current commercially available solution, Nvidia's Unified Memory. We further propose two data movement optimization techniques, lazy allocation and transfer fusion optimization, both based on adaptively merging messages to reduce data transfer latency. We show that these techniques are potentially beneficial and we prove that our greedy fusion algorithm is optimal.
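A minimal sketch of the unified-memory-view idea, with hypothetical names (this is not VectorPU's or CUDA Unified Memory's API): the container tracks which copy is valid and issues a transfer only when the side being accessed holds stale data, so redundant copies disappear without explicit programmer management. The device buffer is simulated with host memory here; a CUDA build would use device allocations and cudaMemcpy instead.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical coherent array: the "device" buffer is simulated with host
// memory; in a real CUDA build it would live in GPU memory and the copies
// below would be PCIe transfers.
template <class T>
class CoherentArray {
    std::vector<T> host_, device_;        // device_ stands in for GPU memory
    bool host_valid_ = true, device_valid_ = false;
    std::size_t transfers_ = 0;
public:
    explicit CoherentArray(std::size_t n) : host_(n), device_(n) {}

    // Host-side write access: make host copy valid, mark device copy stale.
    T* host_write() {
        if (!host_valid_) { host_ = device_; ++transfers_; }   // device -> host
        host_valid_ = true; device_valid_ = false;
        return host_.data();
    }
    // Device-side read access: transfer only if the device copy is stale.
    const T* device_read() {
        if (!device_valid_) { device_ = host_; ++transfers_; } // host -> device
        device_valid_ = true;
        return device_.data();
    }
    std::size_t transfers() const { return transfers_; }
};

int main() {
    CoherentArray<float> a(1024);
    a.host_write()[0] = 1.0f;   // host writes
    a.device_read();            // first device access triggers one transfer
    a.device_read();            // data already valid on device: no transfer
    std::cout << "transfers issued: " << a.transfers() << "\n";  // prints 1
}
```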

Finally, we show that our approaches to handle different complexities can be combined so that programmers could use them simultaneously.

This research was partly funded by two EU FP7 projects (PEPPHER and EXCESS) and SeRC.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 177
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1903
Keywords
CPU, GPU, GPGPU, heterogeneous systems, programming abstraction, performance optimization, energy optimization, adaptive sampling, MeterPU, TunerPU, XPDL, VectorPU
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-145304 (URN); 10.3384/diss.diva-145304 (DOI); 9789176853702 (ISBN)
Public defence
2018-04-04, Ada Lovelace, B-huset, Campus Valla, Linköping, 10:15 (English)
Funder
EU, FP7, Seventh Framework Programme, PEPPHER and EXCESS
Available from: 2018-02-28 Created: 2018-02-22 Last updated: 2018-02-28. Bibliographically approved
Li, L., Dastgeer, U. & Kessler, C. (2016). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. Parallel Computing, 51, 37-45
Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems
2016 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, p. 37-45. Article in journal (Refereed), Published
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models, which require little or no human effort, become more practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variant for given run-time call context property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques that better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy.
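The following much-simplified, one-dimensional sketch (hypothetical code, not the paper's algorithm, which handles multi-dimensional call contexts) illustrates the flavor of convexity-based pruning: a region whose sampled endpoints agree on the fastest variant is not sampled further, while disagreement triggers subdivision, yielding a tree of regions usable as a dispatch structure.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Which implementation variant is fastest for problem size n?
// Here a synthetic ground truth stands in for a timed trial execution.
static int fastest_variant(std::size_t n) { return n < 5000 ? 0 : 1; }  // 0 = CPU, 1 = GPU

struct Region { std::size_t lo, hi; int variant; };

// Simplified 1-D convexity-style pruning: sample the two endpoints of a size
// range; if they agree, assume the whole range uses that variant (prune),
// otherwise split the range and recurse.
static void explore(std::size_t lo, std::size_t hi,
                    std::vector<Region>& tree, std::size_t& samples) {
    int vlo = fastest_variant(lo);
    int vhi = fastest_variant(hi);
    samples += 2;
    if (vlo == vhi || hi - lo <= 1) {        // prune: no further sampling inside
        tree.push_back({lo, hi, vlo});
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;    // disagreement: subdivide
    explore(lo, mid, tree, samples);
    explore(mid + 1, hi, tree, samples);
}

int main() {
    std::vector<Region> tree;
    std::size_t samples = 0;
    explore(1, 1u << 20, tree, samples);
    std::cout << "leaf regions: " << tree.size()
              << ", trial executions: " << samples << "\n";
    // The leaves act as a dispatch structure: look up the region containing
    // the runtime problem size and call that variant.
}
```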

Place, publisher, year, edition, pages
ELSEVIER SCIENCE BV, 2016
Keywords
Smart sampling; Heterogeneous systems; Component selection
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-125830 (URN); 10.1016/j.parco.2015.09.003 (DOI); 000370093800004 ()
Note

Funding Agencies|EU; SeRC project OpCoReS

Available from: 2016-03-08 Created: 2016-03-04 Last updated: 2018-01-10
Li, L. & Kessler, C. (2015). MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection. In: Trustcom/BigDataSE/ISPA, 2015 IEEE: . Paper presented at The 1st IEEE International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms held in conjunction with IEEE ISPA-15, Helsinki, Finland, August 20-22, 2015 (pp. 154-159). IEEE Press, 3
MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection
2015 (English). In: Trustcom/BigDataSE/ISPA, 2015 IEEE, IEEE Press, 2015, Vol. 3, p. 154-159. Conference paper, Published paper (Refereed)
Abstract [en]

We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g. CPU, DRAM, GPU), using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only can legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, be easily retargeted for energy optimization, but switching between different optimization goals for arbitrary code sections also becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.
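A hedged sketch of the kind of trait-based plugging the abstract describes, with made-up type names rather than MeterPU's actual API: each metric/component pair specializes a meter template behind the same start/stop/get_value interface, so client code such as an autotuner is written once against that interface.

```cpp
#include <chrono>
#include <iostream>

// Tag types naming what to measure and on which component (illustrative only).
struct CPU_Time {};
struct GPU_Energy {};

// Primary trait template: each metric/component plugs in its own implementation
// behind the same start()/stop()/get_value() interface.
template <class Tag> struct MeterImpl;

template <> struct MeterImpl<CPU_Time> {
    std::chrono::steady_clock::time_point t0;
    double v = 0.0;
    void start() { t0 = std::chrono::steady_clock::now(); }
    void stop()  { v = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count(); }
    double get_value() const { return v; }   // seconds
};

template <> struct MeterImpl<GPU_Energy> {
    // A real implementation would sample GPU power (e.g. via NVML) and
    // integrate it over the measured interval; this placeholder returns 0.
    void start() {}
    void stop()  {}
    double get_value() const { return 0.0; } // joules (placeholder)
};

// Client code is written once against the common interface.
template <class Tag, class F>
double measure(F&& work) {
    MeterImpl<Tag> m;
    m.start();
    work();
    m.stop();
    return m.get_value();
}

int main() {
    auto kernel = [] { volatile long s = 0; for (long i = 0; i < 1000000; ++i) s += i; };
    std::cout << "time:   " << measure<CPU_Time>(kernel)   << " s\n";
    std::cout << "energy: " << measure<GPU_Energy>(kernel) << " J (placeholder)\n";
}
```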

Place, publisher, year, edition, pages
IEEE Press, 2015
Keywords
MeterPU, measurement abstraction, SkePU, skeleton programming, heterogeneous, GPU, C++ Trait, Energy measurement, Time measurement
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-129103 (URN); 10.1109/Trustcom.2015.625 (DOI); 000380431400020 (); 978-1-4673-7952-6 (ISBN)
Conference
The 1st IEEE International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms held in conjunction with IEEE ISPA-15, Helsinki, Finland, August 20-22, 2015
Projects
EU FP7 EXCESS 611183 (EU Seventh Framework Programme); SeRC OpCoReS (Swedish e-Science Research Centre)
Available from: 2016-06-12 Created: 2016-06-12 Last updated: 2016-10-13. Bibliographically approved
Kessler, C., Li, L., Atalar, A. & Dobre, A. (2015). XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization. In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 sep. 2015: . Paper presented at 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, Beijing, 1-4 sep. 2015 (pp. 51-60). Institute of Electrical and Electronics Engineers (IEEE)
XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization
2015 (English). In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 Sep. 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 51-60. Conference paper, Published paper (Refereed)
Abstract [en]

We present XPDL, a modular, extensible platform description language for heterogeneous multicore systems and clusters. XPDL specifications provide platform metadata about hardware and installed system software that are relevant for the adaptive static and dynamic optimization of application programs and system settings for improved performance and energy efficiency. XPDL is based on XML and uses hyperlinks to create distributed libraries of platform metadata specifications. We also provide the first components of a retargetable tool chain that browses and processes XPDL specifications and generates driver code for microbenchmarking to bootstrap empirical performance and energy models at deployment time. A C++ based API enables convenient introspection of platform models, even at run-time, which allows for adaptive dynamic program optimizations such as tuned selection of implementation variants.
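The abstract mentions a C++ API for introspection of platform models; the sketch below is a hypothetical, in-memory stand-in for such queries (it does not reproduce XPDL's XML schema or its real API), showing how an adaptive program might branch on platform metadata.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical in-memory platform model in the spirit of the abstract's C++
// introspection API; XPDL's real models are parsed from distributed XML
// specifications, which this sketch does not attempt to reproduce.
struct Device {
    std::string kind;          // e.g. "cpu", "gpu"
    int cores;
    std::size_t memory_bytes;
};

struct PlatformModel {
    std::string name;
    std::vector<Device> devices;

    // Simple query used by an adaptive program: is a GPU with enough memory present?
    bool has_gpu_with_memory(std::size_t required) const {
        for (const auto& d : devices)
            if (d.kind == "gpu" && d.memory_bytes >= required) return true;
        return false;
    }
};

int main() {
    PlatformModel m{"example-node",
                    {{"cpu", 16, 64ull << 30},     // 64 GiB host RAM
                     {"gpu", 2560, 8ull << 30}}};  // 8 GiB device memory
    // Implementation-variant selection could branch on such queries at run-time.
    std::cout << (m.has_gpu_with_memory(4ull << 30)
                      ? "use GPU variant\n" : "fall back to CPU variant\n");
}
```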

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
Series
International Conference on Parallel Processing Workshops, ISSN 1530-2016
Keywords
architecture description language, computer architecture modeling, system modeling, energy optimization toolchain, retargetability, heterogeneous parallel system, platform description language, XML
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-129097 (URN); 10.1109/ICPPW.2015.17 (DOI); 000377378800008 ()
Conference
44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, Beijing, 1-4 sep. 2015
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e‐Science Research Center, OpCoReS
Available from: 2016-06-11 Created: 2016-06-11 Last updated: 2018-01-10
Kessler, C., Dastgeer, U. & Li, L. (2014). Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers. In: F. Hannig and J. Teich (Ed.), Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany: . Paper presented at First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany (pp. 43-48).
Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers
2014 (English). In: Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany / [ed] F. Hannig and J. Teich, 2014, p. 43-48. Conference paper, Published paper (Refereed)
Abstract [en]

In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially, GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among multiple implementation variants, generating code and providing runtime support for efficient execution on such systems.

Series
arXiv.org ; rn:Racing/2014/09
Keywords
heterogeneous multicore system, GPU, skeleton programming, parallel programming, parallel computing, program optimization, software composition
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114338 (URN)
Conference
First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
Swedish e‐Science Research Center, OpCoReS; EU, FP7, Seventh Framework Programme, 611183 (EXCESS)
Available from: 2015-02-18 Created: 2015-02-18 Last updated: 2018-01-11
Li, L., Dastgeer, U. & Kessler, C. (2014). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW): . Paper presented at Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at the 43rd International Conference on Parallel Processing (ICPP), Minneapolis, USA, 9-12 Sep. 2014 (pp. 255-264). IEEE conference proceedings
Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems
2014 (English). In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW), IEEE conference proceedings, 2014, p. 255-264. Conference paper, Published paper (Refereed)
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models based on trial executions, which require little or no human effort, become more practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of an adaptive pruning algorithm for efficient selection of training samples, a decision-tree based method for representing, predicting and selecting the fastest implementation variant for given run-time call context properties, and a composition tool for building the overall composed application from its components. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques that better support the convexity assumption and better control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy. Furthermore, we evaluate the effect of combinations of pruning strategies and compare our adaptive sampling method with random sampling. We also use our smart-sampling method as a preprocessor to a state-of-the-art decision-tree learning algorithm and compare the result to the predictor directly computed by our method.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Series
International Conference on Parallel Processing Workshops, ISSN 1530-2016
Keywords
empirical performance modeling, automated performance tuning, heterogeneous multicore system, GPU computing, adaptive program optimization, machine learning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-118514 (URN); 10.1109/ICPPW.2014.42 (DOI)
Conference
Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at the 43rd International Conference on Parallel Processing (ICPP), Minneapolis, USA, 9-12 Sep. 2014
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e‐Science Research Center, OpCoReS
Available from: 2015-05-29 Created: 2015-05-29 Last updated: 2018-01-11
Dastgeer, U., Li, L. & Kessler, C. (2014). The PEPPHER composition tool: performance-aware composition for GPU-based systems. Computing, 96(12), 1195-1211
The PEPPHER composition tool: performance-aware composition for GPU-based systems
2014 (English). In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no 12, p. 1195-1211. Article in journal (Refereed), Published
Abstract [en]

The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.

Place, publisher, year, edition, pages
Springer, 2014
Keywords
PEPPHER project; Annotated multi-variant software components; GPU-based systems; Performance portability; Autotuning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109659 (URN); 10.1007/s00607-013-0371-8 (DOI); 000344169400007 ()
Projects
EU FP7 PEPPHER; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 248481; Swedish e‐Science Research Center, OpCoReS
Available from: 2014-08-22 Created: 2014-08-22 Last updated: 2018-01-11
Dastgeer, U., Li, L. & Kessler, C. (2013). Adaptive Implementation Selection in the SkePU Skeleton Programming Library. In: Chengyung Wu and Albert Cohen (Eds.), Advanced Parallel Processing Technologies (APPT-2013), Proceedings: . Paper presented at Advanced Parallel Processing Technologies, Stockholm, August 2013 (pp. 170-183).
Adaptive Implementation Selection in the SkePU Skeleton Programming Library
2013 (English). In: Advanced Parallel Processing Technologies (APPT-2013), Proceedings / [ed] Chengyung Wu and Albert Cohen, 2013, p. 170-183. Conference paper, Published paper (Refereed)
Abstract [en]

In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four types of implementations (implementation variants) for each skeleton: serial C++, OpenMP, CUDA and OpenCL targeting either CPU or GPU execution respectively. Deciding which implementation would run faster for a given skeleton call depends upon the computation, problem size(s), system architecture and data locality.

In this paper, we present our work on automatic selection between these implementation variants by an offline machine learning method which generates a compact decision tree with low training overhead. The proposed selection mechanism is flexible yet high-level allowing a skeleton programmer to control different training choices at a higher abstraction level. We have evaluated our optimization strategy with 9 applications/kernels ported to our skeleton library and achieve on average more than 94% (90%) accuracy with just 0.53% (0.58%) training space exploration on two systems. Moreover, we discuss one application scenario where local optimization considering a single skeleton call can prove sub-optimal, and propose a heuristic for bulk implementation selection considering more than one skeleton call to address such application scenarios.
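As an illustration of decision-tree based dispatch (hypothetical thresholds and structure; SkePU's actual tree is generated by the offline training described above), a lookup at a skeleton call might walk a tree of call-context splits down to a leaf that names the backend to invoke:

```cpp
#include <cstddef>
#include <iostream>
#include <memory>

// Hedged sketch of decision-tree dispatch: each leaf names the backend that an
// offline training phase found fastest for that region of call-context values.
// Thresholds and backends below are made up for illustration.
enum class Backend { SerialCpp, OpenMP, Cuda };

struct Node {
    bool leaf = false;
    Backend backend{};                 // valid if leaf
    std::size_t threshold = 0;         // split on problem size if inner node
    std::unique_ptr<Node> below, above;
};

static Backend select(const Node& n, std::size_t problem_size) {
    if (n.leaf) return n.backend;
    return problem_size < n.threshold ? select(*n.below, problem_size)
                                      : select(*n.above, problem_size);
}

int main() {
    // size < 1e4 -> serial C++, size < 1e6 -> OpenMP, else CUDA (made-up values).
    auto root = std::make_unique<Node>();
    root->threshold = 10000;
    root->below = std::make_unique<Node>();
    root->below->leaf = true;
    root->below->backend = Backend::SerialCpp;
    root->above = std::make_unique<Node>();
    root->above->threshold = 1000000;
    root->above->below = std::make_unique<Node>();
    root->above->below->leaf = true;
    root->above->below->backend = Backend::OpenMP;
    root->above->above = std::make_unique<Node>();
    root->above->above->leaf = true;
    root->above->above->backend = Backend::Cuda;

    std::cout << static_cast<int>(select(*root, 500000)) << "\n"; // 1 = OpenMP
}
```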

Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 8299
Keywords
Skeleton programming, SkePU, adaptivity, autotuning, performance optimization, GPU, multicore, parallel computing
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-102579 (URN); 10.1007/978-3-642-45293-2_13 (DOI); 978-3-642-45292-5 (ISBN)
Conference
Advanced Parallel Processing Technologies, Stockholm, August 2013
Projects
EU FP7 PEPPHER; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 248481; Swedish e‐Science Research Center, OpCoReS
Available from: 2013-12-15 Created: 2013-12-15 Last updated: 2018-02-20
Li, L., Dastgeer, U. & Kessler, C. (2013). Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems. In: Dayde, Michel, Marques, Osni, Nakajima, Kengo (Ed.), High Performance Computing for Computational Science - VECPAR 2012: . Paper presented at 10th International Conference on High Performance Computing for Computational Science, VECPAR 2012; Kobe; Japan (pp. 329-345). Springer
Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems
2013 (English). In: High Performance Computing for Computational Science - VECPAR 2012 / [ed] Dayde, Michel, Marques, Osni, Nakajima, Kengo, Springer, 2013, p. 329-345. Conference paper, Published paper (Refereed)
Abstract [en]

In recent years, heterogeneous multi-core systems have received much attention. However, performance optimization on these platforms remains a big challenge. Optimizations performed by compilers are often limited by the lack of dynamic information about the run-time environment, which often makes applications not performance portable. One current approach is to provide multiple implementations of the same interface that can be used interchangeably depending on the call context, and to expose the composition choices to a compiler, a deployment-time composition tool and/or a run-time system. Using off-line machine-learning techniques makes it possible to improve the precision and reduce the run-time overhead of run-time composition, and leads to improved performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool by off-line composition and present an adaptive machine-learning algorithm for generating compact and efficient dispatch data structures with low training time. As dispatch data structure we propose an adaptive decision-tree structure, which implies an adaptive training algorithm that allows us to control the trade-off between training time, dispatch precision and run-time dispatch overhead.

We have evaluated our optimization strategy with simple kernels (matrix multiplication and sorting) as well as applications from the Rodinia benchmark suite on two GPU-based heterogeneous systems. On average, the precision of the composition choices reaches 83.6 percent with approximately 34 minutes of off-line training time.

Place, publisher, year, edition, pages
Springer, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 7851
Keywords
parallel programming, parallel computing, automated performance tuning, machine learning, adaptive sampling, GPU, multicore processor, software composition, program optimization, autotuning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-93471 (URN); 10.1007/978-3-642-38718-0_32 (DOI); 000342997100032 (); 978-3-642-38717-3 (ISBN); 978-3-642-38718-0 (ISBN)
Conference
10th International Conference on High Performance Computing for Computational Science, VECPAR 2012; Kobe; Japan
Projects
EU FP7 PEPPHER (2010-2012), #248481, www.peppher.eu; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 248481; Swedish e‐Science Research Center, OpCoReS
Available from: 2013-06-04 Created: 2013-06-04 Last updated: 2018-02-19
Identifiers
ORCID iD: orcid.org/0000-0001-8976-0484
