Publications (10 of 13)
Henrio, L., Kessler, C. & Li, L. (2018). Ensuring Memory Consistency in Heterogeneous Systems Based on Access Mode Declarations. In: PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING and SIMULATION (HPCS). Paper presented at International Conference on High Performance Computing & Simulation (HPCS) (pp. 716-723). IEEE
2018 (English). In: PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING and SIMULATION (HPCS), IEEE, 2018, p. 716-723. Conference paper, Published paper (Refereed)
Abstract [en]

Running a program on disjoint memory spaces requires addressing memory consistency issues and performing transfers so that the program always accesses the right data. Several approaches exist to ensure the consistency of the memory accessed; we are interested here in the verification of a declarative approach where each component of a computation is annotated with an access mode declaring which part of the memory is read or written by the component. The programming framework uses the component annotations to guarantee the validity of the memory accesses. This is the mechanism used in VectorPU, a C++ library for programming CPU-GPU heterogeneous systems, and this article proves the correctness of the software cache-coherence mechanism used in the library. Beyond the scope of VectorPU, this article can be considered a simple and effective formalisation of memory consistency mechanisms based on the explicit declaration of the effect of each component on each memory space.
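
The declarative idea in this abstract can be illustrated with a small sketch. It is not VectorPU's implementation or API: the names `Mode`, `TrackedBuffer` and `acquire` are hypothetical, the two memory spaces are simulated with Python lists, and a write-only access is assumed to overwrite the whole buffer. Each buffer keeps one copy per memory space plus a validity flag; a declared read triggers a transfer only when the local copy is stale, and a declared write invalidates the other copy.

```python
from enum import Enum


class Mode(Enum):
    R = "read"
    W = "write"
    RW = "read-write"


class TrackedBuffer:
    """Hypothetical buffer with one copy per memory space and validity flags."""

    def __init__(self, data):
        self.copies = {"cpu": list(data), "gpu": [None] * len(data)}
        self.valid = {"cpu": True, "gpu": False}
        self.transfers = 0  # number of simulated inter-space copies

    def acquire(self, space, mode):
        """Return the copy in `space`, made consistent for the declared mode."""
        other = "gpu" if space == "cpu" else "cpu"
        # A declared read needs an up-to-date copy in the accessing space.
        if mode in (Mode.R, Mode.RW) and not self.valid[space]:
            self.copies[space] = list(self.copies[other])
            self.valid[space] = True
            self.transfers += 1
        # A declared write makes the copy in the other space stale.
        # Assumption: a write-only access overwrites the whole buffer.
        if mode in (Mode.W, Mode.RW):
            self.valid[space] = True
            self.valid[other] = False
        return self.copies[space]


buf = TrackedBuffer([1, 2, 3])
gpu_view = buf.acquire("gpu", Mode.RW)   # stale GPU copy: one transfer
for i in range(len(gpu_view)):
    gpu_view[i] *= 2                     # the "GPU component" writes in place
cpu_view = buf.acquire("cpu", Mode.R)    # CPU copy was invalidated: one transfer
cpu_again = buf.acquire("cpu", Mode.R)   # already valid: no transfer
```

In this sketch the caller never issues a copy explicitly; the declared mode alone decides whether data moves, which is the property the article sets out to verify.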

Place, publisher, year, edition, pages
IEEE, 2018
Keywords
Memory consistency; CPU-GPU heterogeneous systems; data transfer; software caching; cache coherence
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-153418 (URN); 10.1109/HPCS.2018.00117 (DOI); 000450677700098 (); 978-1-5386-7879-4 (ISBN)
Conference
International Conference on High Performance Computing & Simulation (HPCS)
Available from: 2018-12-17 Created: 2018-12-17 Last updated: 2018-12-17
Li, L. & Kessler, C. (2018). Lazy Allocation and Transfer Fusion Optimization for GPU-based Heterogeneous Systems. In: 2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018). Paper presented at 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) (pp. 311-315). IEEE
2018 (English). In: 2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), IEEE, 2018, p. 311-315. Conference paper, Published paper (Refereed)
Abstract [en]

We present two memory optimization techniques that improve the efficiency of data transfer over the PCIe bus for GPU-based heterogeneous systems, namely lazy allocation and transfer fusion optimization. Both are based on merging data transfers so that less overhead is incurred, thereby increasing transfer throughput and making accelerator usage profitable also for smaller operand sizes. We provide the design and prototype implementation of the two techniques in CUDA. Microbenchmarking results show that significant speedups can be achieved, especially for small and medium-sized operands. We also prove that our transfer fusion optimization algorithm is optimal.
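
Why merging transfers helps small operands can be seen from a simple cost model. The sketch below is not the paper's algorithm or its measured numbers: it assumes each transfer costs a fixed latency plus bytes divided by bandwidth, it ignores the cost of packing operands into a contiguous staging buffer, and the latency and bandwidth constants are made-up illustrative values.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Time for one transfer under a latency-plus-bandwidth cost model."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def separate_time(sizes, latency_s, bandwidth_bytes_per_s):
    """Total time when every operand is sent in its own transfer."""
    return sum(transfer_time(s, latency_s, bandwidth_bytes_per_s) for s in sizes)


def fused_time(sizes, latency_s, bandwidth_bytes_per_s):
    """Total time when all operands are merged into a single transfer."""
    return transfer_time(sum(sizes), latency_s, bandwidth_bytes_per_s)


# Illustrative assumptions, not measurements.
LATENCY_S = 10e-6        # fixed per-transfer overhead
BANDWIDTH = 10e9         # bytes per second
small_operands = [4096] * 8

t_separate = separate_time(small_operands, LATENCY_S, BANDWIDTH)
t_fused = fused_time(small_operands, LATENCY_S, BANDWIDTH)
saved = t_separate - t_fused
```

Under this model, fusing k transfers saves k - 1 latencies regardless of operand size, so the relative gain is largest when the operands are small and the fixed overhead dominates.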

Place, publisher, year, edition, pages
IEEE, 2018
Series
Euromicro Conference on Parallel Distributed and Network-Based Processing, ISSN 1066-6192
Keywords
adaptive message fusion; GPU; CUDA; lazy memory allocation; memory transfer optimization
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-151524 (URN); 10.1109/PDP2018.2018.00054 (DOI); 000443807600045 (); 978-1-5386-4975-6 (ISBN)
Conference
26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
Note

Funding Agencies|SeRC; NSC / SNIC [SNIC 2016/5-6]

Available from: 2018-09-24 Created: 2018-09-24 Last updated: 2018-10-19
Li, L. & Kessler, C. (2018). MeterPU: a generic measurement abstraction API: Enabling energy-tuned skeleton backend selection. Journal of Supercomputing, 74(11), 5643-5658
2018 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 11, p. 5643-5658. Article in journal (Refereed), Published
Abstract [en]

We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g., CPU, DRAM, GPU) in a heterogeneous computer system, using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, can be easily retargeted for energy optimization, but also switching between measurement metrics or techniques for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.
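
The "pluggable implementations behind a common interface" idea can be sketched as follows. This is not MeterPU's C++ API: the class and function names are hypothetical, and the energy meter is a stand-in that reads a counter through a caller-supplied function rather than real hardware.

```python
import time
from abc import ABC, abstractmethod


class Meter(ABC):
    """Common interface: code under measurement never sees the metric."""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def stop(self): ...

    @abstractmethod
    def value(self): ...


class WallTimeMeter(Meter):
    """Measures elapsed wall-clock time in seconds."""

    def start(self):
        self._t0 = time.perf_counter()

    def stop(self):
        self._t1 = time.perf_counter()

    def value(self):
        return self._t1 - self._t0


class CounterMeter(Meter):
    """Measures the increase of a monotonic counter, e.g. an energy counter.

    `read_counter` is a hypothetical platform-specific callable.
    """

    def __init__(self, read_counter):
        self._read = read_counter

    def start(self):
        self._c0 = self._read()

    def stop(self):
        self._c1 = self._read()

    def value(self):
        return self._c1 - self._c0


def measure(meter, fn, *args):
    """Run fn(*args) and return (result, measured value) for any Meter."""
    meter.start()
    result = fn(*args)
    meter.stop()
    return result, meter.value()


# Fake energy readings in joules, standing in for a hardware counter.
readings = iter([100.0, 142.5])
total, joules = measure(CounterMeter(lambda: next(readings)), sum, [1, 2, 3])
total_again, seconds = measure(WallTimeMeter(), sum, [1, 2, 3])
```

Because `measure` depends only on the `Meter` interface, a tuner written against it can switch from time to energy by swapping the meter object, which mirrors the retargeting argument made in the abstract.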

Place, publisher, year, edition, pages
SPRINGER, 2018
Keywords
MeterPU; Measurement abstraction API; GPU; Performance measurement; Energy measurement; Auto-tuning; Skeleton programming
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-153385 (URN); 10.1007/s11227-016-1792-x (DOI); 000450623900003 ()
Note

Funding Agencies|EU FP7 project EXCESS; SeRC project OpCoReS

Available from: 2018-12-18 Created: 2018-12-18 Last updated: 2019-03-06
Li, L. (2018). Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems. (Doctoral dissertation). Linköping: Linköping University Electronic Press
2018 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption compared to homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high performance computing hardware. However, as a double-edged sword, the heterogeneity also brings significant programming complexities that prevent the easy and efficient usage of different such heterogeneous systems. In this thesis, we are interested in four such kinds of fundamental complexities that are associated with these heterogeneous systems: measurement complexity (efforts required to measure a metric, e.g., measuring energy), CPU-GPU selection complexity, platform complexity and data management complexity. We explore new low-cost programming abstractions to hide these complexities, and propose new optimization techniques that could be performed under the hood.

For the measurement complexity, although measuring time is trivial with native library support, measuring energy consumption, especially for systems with GPUs, is complex because of the low-level details involved, such as choosing the right measurement methods, handling the trade-off between sampling rate and accuracy, and switching to different measurement metrics. We propose a clean interface with its implementation that not only hides the complexity of energy measurement, but also unifies different kinds of measurements. The unification bridges the gap between time measurement and energy measurement, and if no metric-specific assumptions related to time optimization techniques are made, energy optimization can be performed by blindly reusing time optimization techniques.

For the CPU-GPU selection complexity, which relates to efficient utilization of heterogeneous hardware, we propose a new adaptive-sampling based construction mechanism of predictors for such selections, which can adapt to different hardware platforms automatically and shows non-trivial advantages over random sampling.

For the platform complexity, we propose a new modular platform modeling language and its implementation to formally and systematically describe a computer system, enabling zero-overhead platform information queries for high-level software tool chains and for programmers as a basis for making software adaptive.

For the data management complexity, we propose a new mechanism to enable a unified memory view on heterogeneous systems that have separate memory spaces. This mechanism enables programmers to write significantly less code, which runs as fast as expert-written code and outperforms the current commercially available solution: Nvidia's Unified Memory. We further propose two data movement optimization techniques, lazy allocation and transfer fusion optimization. The two techniques are based on adaptively merging messages to reduce data transfer latency. We show that these techniques can be potentially beneficial and we prove that our greedy fusion algorithm is optimal.

Finally, we show that our approaches to handle different complexities can be combined so that programmers could use them simultaneously.

This research was partly funded by two EU FP7 projects (PEPPHER and EXCESS) and SeRC.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 177
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1903
Keywords
CPU, GPU, GPGPU, heterogeneous systems, programming abstraction, performance optimization, energy optimization, adaptive sampling, MeterPU, TunerPU, XPDL, VectorPU
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-145304 (URN); 10.3384/diss.diva-145304 (DOI); 9789176853702 (ISBN)
Public defence
2018-04-04, Ada Lovelace, B-huset, Campus Valla, Linköping, 10:15 (English)
Funder
EU, FP7, Seventh Framework Programme, PEPPHER and EXCESS
Available from: 2018-02-28 Created: 2018-02-22 Last updated: 2019-09-30 Bibliographically approved
Li, L., Dastgeer, U. & Kessler, C. (2016). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. Parallel Computing, 51, 37-45
2016 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, p. 37-45. Article in journal (Refereed), Published
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models, which require little or no human effort, become more practical if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques to better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy. (C) 2015 Elsevier B.V. All rights reserved.
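
A one-dimensional sketch can make the pruning idea concrete. It is an illustration under simplifying assumptions, not the paper's algorithm: the context is a single integer property (for example a problem size), `fastest_at` is a hypothetical oracle that stands in for timing all variants at one point, and the convexity heuristic is reduced to "if the same variant wins at both ends of an interval, assume it wins everywhere inside and stop sampling there".

```python
def build_predictor(lo, hi, fastest_at, samples=None):
    """Return a list of (lo, hi, winner) intervals covering [lo, hi].

    `samples` caches every point that was actually measured.
    """
    samples = {} if samples is None else samples

    def winner(x):
        if x not in samples:
            samples[x] = fastest_at(x)  # one (expensive) trial execution
        return samples[x]

    w_lo, w_hi = winner(lo), winner(hi)
    # Heuristic convexity assumption: same winner at both ends, so prune.
    if lo == hi or w_lo == w_hi:
        return [(lo, hi, w_lo)]
    mid = (lo + hi) // 2
    return (build_predictor(lo, mid, fastest_at, samples)
            + build_predictor(mid + 1, hi, fastest_at, samples))


def predict(intervals, x):
    """Look up the predicted fastest variant for context value x."""
    for lo, hi, w in intervals:
        if lo <= x <= hi:
            return w
    raise ValueError("context value outside the sampled range")


# Hypothetical ground truth: the GPU variant wins from size 1000 upwards.
def oracle(n):
    return "cpu" if n < 1000 else "gpu"


measured = {}
intervals = build_predictor(1, 4096, oracle, measured)
```

Sampling concentrates around the crossover point, so only a small fraction of the 4096 possible sizes is ever measured. If the convexity assumption fails inside a pruned interval, the predictor will be wrong there, which is the accuracy versus training-time trade-off the abstract refers to.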

Place, publisher, year, edition, pages
ELSEVIER SCIENCE BV, 2016
Keywords
Smart sampling; Heterogeneous systems; Component selection
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-125830 (URN); 10.1016/j.parco.2015.09.003 (DOI); 000370093800004 ()
Note

Funding Agencies|EU; SeRC project OpCoReS

Available from: 2016-03-08 Created: 2016-03-04 Last updated: 2018-01-10
Li, L. & Kessler, C. (2015). MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection. In: Trustcom/BigDataSE/ISPA, 2015 IEEE. Paper presented at The 1st IEEE International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms held in conjunction with IEEE ISPA-15, Helsinki, Finland, August 20-22, 2015 (pp. 154-159). IEEE Press, 3
2015 (English). In: Trustcom/BigDataSE/ISPA, 2015 IEEE, IEEE Press, 2015, Vol. 3, p. 154-159. Conference paper, Published paper (Refereed)
Abstract [en]

We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g., CPU, DRAM, GPU), using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, can be easily retargeted for energy optimization, but also switching between different optimization goals for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.

Place, publisher, year, edition, pages
IEEE Press, 2015
Keywords
MeterPU, measurement abstraction, SkePU, skeleton programming, heterogeneous, GPU, C++ Trait, Energy measurement, Time measurement
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-129103 (URN); 10.1109/Trustcom.2015.625 (DOI); 000380431400020 (); 978-1-4673-7952-6 (ISBN)
Conference
The 1st IEEE International Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms held in conjunction with IEEE ISPA-15, Helsinki, Finland, August 20-22, 2015
Projects
EU FP7 EXCESS 611183 (EU Seventh Framework Programme); SeRC OpCoReS (Swedish e-Science Research Centre)
Available from: 2016-06-12 Created: 2016-06-12 Last updated: 2016-10-13 Bibliographically approved
Kessler, C., Li, L., Atalar, A. & Dobre, A. (2015). XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization. In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 sep. 2015. Paper presented at 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, Beijing, 1-4 sep. 2015 (pp. 51-60). Institute of Electrical and Electronics Engineers (IEEE)
2015 (English). In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 sep. 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 51-60. Conference paper, Published paper (Refereed)
Abstract [en]

We present XPDL, a modular, extensible platform description language for heterogeneous multicore systems and clusters. XPDL specifications provide platform metadata about hardware and installed system software that are relevant for the adaptive static and dynamic optimization of application programs and system settings for improved performance and energy efficiency. XPDL is based on XML and uses hyperlinks to create distributed libraries of platform metadata specifications. We also provide first components of a retargetable tool chain that browses and processes XPDL specifications, and generates driver code for microbenchmarking to bootstrap empirical performance and energy models at deployment time. A C++ based API enables convenient introspection of platform models, even at run-time, which allows for adaptive dynamic program optimizations such as tuned selection of implementation variants.
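
The run-time introspection described here can be pictured with a small sketch. The XML below is an invented stand-in, not the XPDL schema: the element and attribute names (`system`, `cpu`, `gpu`, `cores`) and the helper functions are all hypothetical, and the selection threshold is an arbitrary illustrative value.

```python
import xml.etree.ElementTree as ET

# Hypothetical platform description; NOT actual XPDL syntax.
PLATFORM_XML = """
<system id="example_node">
  <cpu id="cpu0" cores="8"/>
  <gpu id="gpu0" memory_gb="12"/>
</system>
"""


def load_platform(xml_text):
    """Parse a platform description into an element tree."""
    return ET.fromstring(xml_text.strip())


def has_gpu(platform):
    """True if the description lists at least one GPU."""
    return platform.find("gpu") is not None


def cpu_cores(platform):
    """Total number of CPU cores across all described CPUs."""
    return sum(int(cpu.get("cores", "0")) for cpu in platform.iter("cpu"))


def choose_variant(platform, problem_size, gpu_threshold=100_000):
    """Pick an implementation variant from platform metadata.

    The threshold is an illustrative assumption, not a tuned value.
    """
    if has_gpu(platform) and problem_size >= gpu_threshold:
        return "gpu"
    return "cpu_parallel" if cpu_cores(platform) > 1 else "cpu_serial"


platform = load_platform(PLATFORM_XML)
small_choice = choose_variant(platform, 1_000)
large_choice = choose_variant(platform, 1_000_000)
```

The point of the sketch is the separation of concerns: the application queries platform facts through a small API and never hard-codes them, so the same program can adapt when the description changes.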

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
Series
International Conference on Parallel Processing Workshops, ISSN 1530-2016
Keywords
architecture description language, computer architecture modeling, system modeling, energy optimization toolchain, retargetability, heterogeneous parallel system, platform description language, XML
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-129097 (URN); 10.1109/ICPPW.2015.17 (DOI); 000377378800008 ()
Conference
44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, Beijing, 1-4 sep. 2015
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e-Science Research Center, OpCoReS
Available from: 2016-06-11 Created: 2016-06-11 Last updated: 2018-01-10
Kessler, C., Dastgeer, U. & Li, L. (2014). Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers. In: F. Hannig and J. Teich (Ed.), Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany. Paper presented at First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany (pp. 43-48).
2014 (English). In: Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany / [ed] F. Hannig and J. Teich, 2014, p. 43-48. Conference paper, Published paper (Refereed)
Abstract [en]

In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially, GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among multiple implementation variants, generating code and providing runtime support for efficient execution on such systems.

Series
arXiv.org ; rn:Racing/2014/09
Keywords
heterogeneous multicore system, GPU, skeleton programming, parallel programming, parallel computing, program optimization, software composition
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114338 (URN)
Conference
First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
Swedish e-Science Research Center, OpCoReS; EU, FP7, Seventh Framework Programme, 611183 (EXCESS)
Available from: 2015-02-18 Created: 2015-02-18 Last updated: 2018-01-11
Li, L., Dastgeer, U. & Kessler, C. (2014). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. In: 2014 43rd International Conference on Parallel Processing Workshops (ICCPW). Paper presented at Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at 43rd International Conference of Parallel Processing (ICPP), Minneapolis, USA, 9-12 Sep. 2014 (pp. 255-264). IEEE conference proceedings
2014 (English). In: 2014 43rd International Conference on Parallel Processing Workshops (ICCPW), IEEE conference proceedings, 2014, p. 255-264. Conference paper, Published paper (Refereed)
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models based on trial executions, which require little or no human effort, become more practical if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of an adaptive pruning algorithm for efficient selection of training samples, a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context properties, and a composition tool for building the overall composed application from its components. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques to better support the convexity assumption and better control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy. Furthermore, we evaluate the effect of combinations of pruning strategies and compare our adaptive sampling method with random sampling. We also use our smart-sampling method as a preprocessor to a state-of-the-art decision tree learning algorithm and compare the result to the predictor directly calculated by our method.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Series
International Conference of Parallel Processing, Workshops, ISSN 1530-2016
Keywords
empirical performance modeling, automated performance tuning, heterogeneous multicore system, GPU computing, adaptive program optimization, machine learning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-118514 (URN); 10.1109/ICPPW.2014.42 (DOI)
Conference
Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at 43rd International Conference of Parallel Processing (ICPP), Minneapolis, USA, 9-12 Sep. 2014
Projects
EU FP7 EXCESS; SeRC-OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e-Science Research Center, OpCoReS
Available from: 2015-05-29 Created: 2015-05-29 Last updated: 2018-01-11
Dastgeer, U., Li, L. & Kessler, C. (2014). The PEPPHER composition tool: performance-aware composition for GPU-based systems. Computing, 96(12), 1195-1211
2014 (English). In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no 12, p. 1195-1211. Article in journal (Refereed), Published
Abstract [en]

The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and their benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.

Place, publisher, year, edition, pages
Springer, 2014
Keywords
PEPPHER project; Annotated multi-variant software components; GPU-based systems; Performance portability; Autotuning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109659 (URN); 10.1007/s00607-013-0371-8 (DOI); 000344169400007 ()
Projects
EU FP7 PEPPHER; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 248481; Swedish e-Science Research Center, OpCoReS
Available from: 2014-08-22 Created: 2014-08-22 Last updated: 2018-01-11
Identifiers
ORCID iD: orcid.org/0000-0001-8976-0484
