Dastgeer, Usman
Publications (10 of 23)
Maghazeh, A., Bordoloi, U. D., Dastgeer, U., Andrei, A., Eles, P. & Peng, Z. (2017). Latency-Aware Packet Processing on CPU-GPU Heterogeneous Systems. In: DAC '17 Proceedings of the 54th Annual Design Automation Conference 2017. Paper presented at 54th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, June 18-22, 2017. New York, NY, USA: Association for Computing Machinery (ACM)
Latency-Aware Packet Processing on CPU-GPU Heterogeneous Systems
2017 (English). In: DAC '17 Proceedings of the 54th Annual Design Automation Conference 2017, New York, NY, USA: Association for Computing Machinery (ACM), 2017. Conference paper, Published paper (Refereed)
Abstract [en]

In response to the tremendous growth of the Internet, towards what we call the Internet of Things (IoT), there is a need to move from costly, high-time-to-market specific-purpose hardware to flexible, low-time-to-market general-purpose devices for packet processing. Among several such devices, GPUs have attracted attention in the past, mainly because the high computing demand of packet processing applications can, potentially, be satisfied by these throughput-oriented machines. However, another important aspect of such applications is the packet latency which, if not handled carefully, will overshadow the throughput benefits. Unfortunately, until now, this aspect has been mostly ignored. To address this issue, we propose a method that considers the variable bit rate of the traffic and, depending on the current rate, minimizes the latency, while meeting the rate demand. We propose a persistent kernel based software architecture to overcome the challenges inherent in GPU implementation like kernel invocation overhead, CPU-GPU communication and memory access overhead. We have chosen packet classification as the packet processing application to demonstrate our technique. Using the proposed approach, we are able to reduce the packet latency on average by a factor of 3.5, compared to the state-of-the-art solutions, without any packet drop.
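To make the rate-dependent decision described above concrete, the following is a minimal sketch (not the authors' code): given offline-profiled throughput and latency for a set of batch sizes, pick the lowest-latency batch that can still sustain the current traffic rate. The BatchProfile structure and select_batch function are hypothetical names; the persistent-kernel GPU side and the CPU-GPU communication machinery are omitted.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical profile entry: for one batch size, the sustained throughput
// (packets per second) and per-packet latency (microseconds) measured offline.
struct BatchProfile {
    std::size_t batch_size;
    double      throughput_pps;
    double      latency_us;
};

// Among the batch sizes whose measured throughput can sustain the current
// arrival rate, pick the one with the lowest latency; if none can, fall back
// to the largest batch (maximum throughput). profiles is assumed non-empty
// and sorted by increasing batch size.
std::size_t select_batch(const std::vector<BatchProfile>& profiles,
                         double current_rate_pps) {
    const BatchProfile* best = nullptr;
    for (const BatchProfile& p : profiles) {
        if (p.throughput_pps < current_rate_pps) continue;   // cannot keep up
        if (!best || p.latency_us < best->latency_us) best = &p;
    }
    if (!best) best = &profiles.back();   // no feasible batch: use the largest
    return best->batch_size;
}
```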

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2017
Series
Design Automation Conference DAC, ISSN 0738-100X
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-141212 (URN), 10.1145/3061639.3062269 (DOI), 000424895400129 (), 2-s2.0-85023612665 (Scopus ID), 978-1-4503-4927-7 (ISBN)
Conference
54th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, June 18-22, 2017
Li, L., Dastgeer, U. & Kessler, C. (2016). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. Parallel Computing, 51, 37-45
Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems
2016 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, p. 37-45. Article in journal (Refereed), Published
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models, which require little or no human effort, become practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variant for given run-time call context property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques that better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy.
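As an illustration of the convexity-based pruning idea, the sketch below (a simplification under stated assumptions, not the paper's algorithm) recursively samples a one-dimensional context range and stops subdividing an interval once both endpoints agree on the fastest variant. The names measure and adaptive_sample are hypothetical, and the real method handles multi-dimensional call contexts.

```cpp
#include <cstddef>
#include <functional>
#include <map>

// Hypothetical interface: measure() runs all implementation variants for a
// given context value (e.g., problem size) and returns the index of the
// fastest one. This is the expensive step whose invocations we want to prune.
using Measure = std::function<int(std::size_t)>;

// Recursively sample the interval [lo, hi]. Under the convexity heuristic,
// if both endpoints agree on the winning variant, the whole interval is
// assumed to have the same winner and no further samples are taken.
void adaptive_sample(std::size_t lo, std::size_t hi, const Measure& measure,
                     std::map<std::size_t, int>& samples) {
    if (!samples.count(lo)) samples[lo] = measure(lo);
    if (!samples.count(hi)) samples[hi] = measure(hi);
    if (samples[lo] == samples[hi] || hi - lo <= 1)
        return;                                  // pruned: interval is "pure"
    std::size_t mid = lo + (hi - lo) / 2;
    adaptive_sample(lo, mid, measure, samples);
    adaptive_sample(mid, hi, measure, samples);
}
```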

Place, publisher, year, edition, pages
Elsevier Science BV, 2016
Keywords
Smart sampling; Heterogeneous systems; Component selection
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-125830 (URN), 10.1016/j.parco.2015.09.003 (DOI), 000370093800004 ()
Note

Funding Agencies|EU; SeRC project OpCoReS

Dastgeer, U. & Kessler, C. (2016). Smart Containers and Skeleton Programming for GPU-Based Systems. International journal of parallel programming, 44(3), 506-530
Smart Containers and Skeleton Programming for GPU-Based Systems
2016 (English). In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 44, no 3, p. 506-530. Article in journal (Refereed), Published
Abstract [en]

In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.
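A minimal sketch of the smart-container idea, assuming a single device and blocking transfers: the container tracks which copy (host or device) of its data is valid and transfers lazily, so consecutive GPU skeleton calls reuse the device copy without round-tripping through host memory. SmartVector and the device_* helpers are hypothetical stand-ins, not SkePU's actual interface.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Placeholder stubs standing in for real device-runtime calls (cudaMalloc,
// cudaMemcpy, ...); here they just use host memory so the sketch compiles.
static void* device_alloc(std::size_t bytes)                       { return std::malloc(bytes); }
static void  copy_to_device(void* d, const void* h, std::size_t n) { std::memcpy(d, h, n); }
static void  copy_to_host(void* h, const void* d, std::size_t n)   { std::memcpy(h, d, n); }

template <typename T>
class SmartVector {
public:
    explicit SmartVector(std::size_t n) : host_(n) {}
    ~SmartVector() { std::free(dev_); }

    // Called by a GPU skeleton before launching: ensure the device copy is valid.
    void* device_data() {
        if (!dev_) dev_ = device_alloc(host_.size() * sizeof(T));
        if (!device_valid_) {
            copy_to_device(dev_, host_.data(), host_.size() * sizeof(T));
            device_valid_ = true;
        }
        host_valid_ = false;            // conservatively assume the kernel writes
        return dev_;
    }

    // Called when host code accesses the data: copy back only if needed.
    T& operator[](std::size_t i) {
        if (!host_valid_) {
            copy_to_host(host_.data(), dev_, host_.size() * sizeof(T));
            host_valid_ = true;
        }
        device_valid_ = false;          // conservatively assume the host writes
        return host_[i];
    }

private:
    std::vector<T> host_;
    void*          dev_ = nullptr;
    bool           host_valid_ = true;
    bool           device_valid_ = false;
};
```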

Place, publisher, year, edition, pages
Springer/Plenum Publishers, 2016
Keywords
SkePU; Smart containers; Skeleton programming; Memory management; Runtime optimizations; GPU-based systems
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-128719 (URN), 10.1007/s10766-015-0357-6 (DOI), 000374897200008 ()
Note

Funding Agencies|EU; SeRC

Dastgeer, U. & Kessler, C. (2015). Performance-aware Composition Framework for GPU-based Systems. Journal of Supercomputing, 71(12), 4646-4662
Performance-aware Composition Framework for GPU-based Systems
2015 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no 12, p. 4646-4662. Article in journal (Refereed), Published
Abstract [en]

User-level components of applications can be made performance-aware by annotating them with a performance model and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection, both for cases where performance characteristics are known (or learned) beforehand and for cases where they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. The bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.
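The per-call selection step can be illustrated with a small sketch, assuming a hypothetical Variant record that carries a performance-model prediction per implementation: the greedy policy picks the cheapest variant for the current call including the operand-transfer cost, whereas the bulk policy described in the abstract would also weigh upcoming calls before committing to a device.

```cpp
#include <limits>
#include <string>
#include <vector>

// Hypothetical performance metadata for one implementation variant of a
// component: the device it runs on and its predicted execution time.
struct Variant {
    std::string name;
    int         device;        // 0 = CPU, 1 = GPU
    double      predicted_ms;  // from a performance model or earlier runs
};

// Greedy performance-aware selection for a single call: pick the variant with
// the lowest predicted time, adding the cost of moving the operand data if it
// currently resides on the other device. variants is assumed non-empty.
const Variant& select_variant(const std::vector<Variant>& variants,
                              int data_location, double transfer_ms) {
    const Variant* best = &variants.front();
    double best_cost = std::numeric_limits<double>::max();
    for (const Variant& v : variants) {
        double cost = v.predicted_ms + (v.device != data_location ? transfer_ms : 0.0);
        if (cost < best_cost) { best_cost = cost; best = &v; }
    }
    return *best;
}
```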

Place, publisher, year, edition, pages
Springer, 2015
Keywords
Global composition, Multivariant software components, Implementation selection, Hybrid parallel execution, GPU-based systems, Performance portability, Autotuning, Optimizing compiler
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109661 (URN), 10.1007/s11227-014-1105-1 (DOI), 000365185400015 ()
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e‐Science Research Center, OpCoReS
Dastgeer, U. & Kessler, C. (2014). Conditional component composition for GPU-based systems. In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014. Paper presented at Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014 Conference, Vienna, Austria, Jan. 2014. Vienna, Austria: HiPEAC NoE
Conditional component composition for GPU-based systems
2014 (English). In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014. Vienna, Austria: HiPEAC NoE, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

User-level components can expose multiple functionally equivalent implementations with different resource requirements and performance characteristics. A composition framework can then choose a suitable implementation for each component invocation, guided by an objective function (execution time, energy, etc.). In this paper, we describe the idea of conditional composition, which enables the component writer to specify constraints on the selectability of a given component implementation based on information about the target system and the properties of the component call. By incorporating such information, more informed and user-guided composition decisions can be made, and thus more efficient code can be generated, as we show with an example scenario for a GPU-based system.
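A minimal sketch of conditional composition, with hypothetical Platform, CallCtx and GuardedVariant types: each implementation variant carries a selectability predicate supplied by the component writer, and only variants whose predicate holds for the current platform and call are passed on to the usual performance-aware selection stage.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical descriptors for the target platform and for one call.
struct Platform { bool has_gpu; std::size_t gpu_memory_bytes; };
struct CallCtx  { std::size_t problem_size; };

// A variant plus a selectability constraint written by the component author,
// in the spirit of the conditional composition described in the abstract.
struct GuardedVariant {
    std::string name;
    std::function<bool(const Platform&, const CallCtx&)> selectable;
};

// Keep only the variants whose constraints are met; the remaining candidates
// would then be ranked by the performance-aware selection.
std::vector<GuardedVariant> candidates(const std::vector<GuardedVariant>& all,
                                       const Platform& p, const CallCtx& c) {
    std::vector<GuardedVariant> ok;
    for (const GuardedVariant& v : all)
        if (v.selectable(p, c)) ok.push_back(v);
    return ok;
}

// Example constraint (names and sizes illustrative): a CUDA variant is only
// selectable if a GPU is present and the operands fit in device memory.
// GuardedVariant cuda{"vec_add_cuda", [](const Platform& p, const CallCtx& c) {
//     return p.has_gpu && c.problem_size * sizeof(float) <= p.gpu_memory_bytes; }};
```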

Place, publisher, year, edition, pages
Vienna, Austria: HiPEAC NoE, 2014
Series
MULTIPROG workshop series
Keywords
heterogeneous multicore system, parallel programming, platform description language, constrained optimization, code generation, software composition, parallel computing
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114340 (URN)
Conference
Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014 Conference, Vienna, Austria, Jan. 2014
Projects
EU FP7 EXCESS; EU FP7 PEPPHER; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183 (EXCESS); Swedish e‐Science Research Center, OpCoReS
Kessler, C., Dastgeer, U. & Li, L. (2014). Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers. In: F. Hannig and J. Teich (Ed.), Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany. Paper presented at First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany (pp. 43-48).
Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers
2014 (English). In: Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany / [ed] F. Hannig and J. Teich, 2014, p. 43-48. Conference paper, Published paper (Refereed)
Abstract [en]

In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers. We also discuss the optimization problems that need to be considered when selecting among multiple implementation variants, generating code and providing runtime support for efficient execution on such systems.

Series
arXiv.org ; rn:Racing/2014/09
Keywords
heterogeneous multicore system, GPU, skeleton programming, parallel programming, parallel computing, program optimization, software composition
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114338 (URN)
Conference
First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany
Projects
EU FP7 EXCESS; SeRC OpCoReS
Funder
Swedish e‐Science Research Center, OpCoReS; EU, FP7, Seventh Framework Programme, 611183 (EXCESS)
Dastgeer, U. (2014). Performance-aware Component Composition for GPU-based systems. (Doctoral dissertation). Linköping: Linköping University Electronic Press
Performance-aware Component Composition for GPU-based systems
2014 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

This thesis addresses issues associated with efficiently programming modern heterogeneous GPU-based systems, containing multicore CPUs and one or more programmable Graphics Processing Units (GPUs). We use ideas from component-based programming to address programming, performance and portability issues of these heterogeneous systems. Specifically, we present three approaches that all use the idea of having multiple implementations for each computation; performance is achieved/retained either a) by selecting a suitable implementation for each computation on a given platform or b) by dividing the computation work across different implementations running on CPU and GPU devices in parallel.

In the first approach, we work on a skeleton programming library (SkePU) that provides high-level abstraction while making intelligent implementation selection decisions underneath, either before or during the actual program execution. In the second approach, we develop a composition tool that parses extra information (metadata) from XML files, makes certain decisions online, and, in the end, generates code for making the final decisions at runtime. The third approach is a framework that uses source-code annotations and program analysis to generate code for the runtime library to make the selection decision at runtime. With a generic performance modeling API alongside program analysis capabilities, it supports online tuning as well as complex program transformations.

These approaches differ in terms of genericity, intrusiveness, capabilities and knowledge about the program source code; however, they all demonstrate the usefulness of component programming techniques for programming GPU-based systems. With experimental evaluation, we demonstrate how all three approaches, although different in their own ways, provide good performance on different GPU-based systems for a variety of applications.
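The work-division option (b) mentioned above can be illustrated with a short sketch, assuming two hypothetical implementation variants and a split ratio obtained from earlier measurements; the frameworks discussed in the thesis perform this partitioning and the CPU/GPU dispatch through their runtime systems rather than directly as done here.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Placeholder variants so the sketch is self-contained; each just scales its
// partition, standing in for a real CPU implementation and a GPU offload.
static void cpu_impl(float* data, std::size_t n) { for (std::size_t i = 0; i < n; ++i) data[i] *= 2.0f; }
static void gpu_impl(float* data, std::size_t n) { for (std::size_t i = 0; i < n; ++i) data[i] *= 2.0f; }

// Minimal sketch of hybrid execution: split the work according to a ratio
// derived from the relative measured throughput of the two variants, run both
// partitions concurrently, and wait for the slower one.
void hybrid_run(std::vector<float>& data, double gpu_fraction) {
    std::size_t split = static_cast<std::size_t>(data.size() * gpu_fraction);
    auto gpu_part = std::async(std::launch::async, gpu_impl, data.data(), split);
    cpu_impl(data.data() + split, data.size() - split);
    gpu_part.wait();
}
```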

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2014. p. 240
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1581
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-104314 (URN), 10.3384/diss.diva-104310 (DOI), 978-91-7519-383-0 (ISBN)
Public defence
2014-05-08, Visionen, Building B, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Li, L., Dastgeer, U. & Kessler, C. (2014). Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW). Paper presented at Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at the 43rd International Conference on Parallel Processing (ICPP), Minneapolis, USA, 9-12 Sep. 2014 (pp. 255-264). IEEE conference proceedings
Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems
2014 (English). In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW), IEEE conference proceedings, 2014, p. 255-264. Conference paper, Published paper (Refereed)
Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models based on trial executions, which require little or no human effort, become practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of an adaptive pruning algorithm for efficient selection of training samples, a decision-tree based method for representing, predicting and selecting the fastest implementation variant for given run-time call context properties, and a composition tool for building the overall composed application from its components. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques that better support the convexity assumption and better control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy. Furthermore, we evaluate the effect of combinations of pruning strategies and compare our adaptive sampling method with random sampling. We also use our smart-sampling method as a preprocessor to a state-of-the-art decision tree learning algorithm and compare the result to the predictor directly computed by our method.
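To illustrate the prediction side, here is a minimal sketch (hypothetical names, a single numeric context property) of a decision-tree predictor of the kind the abstract refers to: inner nodes test a context property against a threshold, leaves record the variant that was fastest for the training samples in that region, and run-time dispatch reduces to a cheap tree walk.

```cpp
#include <memory>

// One node of a sketched decision-tree predictor. Inner nodes split a single
// context property (e.g., problem size) against a threshold; leaves store the
// index of the fastest implementation variant for that region.
struct Node {
    bool   is_leaf = false;
    int    variant = -1;                 // valid at leaves
    double threshold = 0.0;              // valid at inner nodes
    std::unique_ptr<Node> left, right;   // < threshold / >= threshold
};

// At dispatch time the predictor is just a tree walk, so the runtime
// prediction overhead stays small compared with the component execution.
int predict(const Node& n, double context_value) {
    if (n.is_leaf) return n.variant;
    const Node& next = (context_value < n.threshold) ? *n.left : *n.right;
    return predict(next, context_value);
}
```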

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Series
International Conference on Parallel Processing, Workshops, ISSN 1530-2016
Keywords
empirical performance modeling, automated performance tuning, heterogeneous multicore system, GPU computing, adaptive program optimization, machine learning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-118514 (URN), 10.1109/ICPPW.2014.42 (DOI)
Conference
Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at the 43rd International Conference on Parallel Processing (ICPP), Minneapolis, USA, 9-12 Sep. 2014
Projects
EU FP7 EXCESS; SeRC-OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 611183; Swedish e‐Science Research Center, OpCoReS
Dastgeer, U., Li, L. & Kessler, C. (2014). The PEPPHER composition tool: performance-aware composition for GPU-based systems. Computing, 96(12), 1195-1211
The PEPPHER composition tool: performance-aware composition for GPU-based systems
2014 (English). In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no 12, p. 1195-1211. Article in journal (Refereed), Published
Abstract [en]

The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.
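The role of the generated glue code can be sketched as follows, with hypothetical Impl and vector_scale names: the composition tool emits a proxy with the component's interface signature that looks up the registered implementation variants and dispatches the selected one. The actual tool hands the variants to the task-based StarPU runtime rather than calling them directly as this sketch does.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical registry entry for one implementation variant of a component
// interface; the real tool reads such information from PEPPHER metadata.
struct Impl {
    std::string device;                              // "cpu" or "cuda"
    std::function<void(float*, std::size_t)> entry;  // variant entry point
};

// Sketch of a generated proxy for an interface, say vector_scale(): the user
// keeps calling the interface name, while the generated code hides the
// variant lookup, selection and dispatch.
void vector_scale(float* data, std::size_t n,
                  const std::vector<Impl>& variants,
                  const std::string& chosen_device) {
    for (const Impl& v : variants)
        if (v.device == chosen_device) { v.entry(data, n); return; }
    variants.front().entry(data, n);   // fall back to the first variant
}
```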

Place, publisher, year, edition, pages
Springer, 2014
Keywords
PEPPHER project; Annotated multi-variant software components; GPU-based systems; Performance portability; Autotuning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109659 (URN), 10.1007/s00607-013-0371-8 (DOI), 000344169400007 ()
Projects
EU FP7 PEPPHER; SeRC OpCoReS
Funder
EU, FP7, Seventh Framework Programme, 248481; Swedish e‐Science Research Center, OpCoReS
Dastgeer, U. & Kessler, C. (2013). A Framework for Performance-aware Composition of Applications for GPU-based Systems. Paper presented at the 2013 42nd Annual International Conference on Parallel Processing (ICPP) (pp. 698-707). IEEE
A Framework for Performance-aware Composition of Applications for GPU-based Systems
2013 (English). Conference paper, Published paper (Refereed)
Abstract [en]

User-level components of applications can be made performance-aware by annotating them with performance model and other metadata. We present a component model and a composition framework for the performance-aware composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with particular focus on global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases when they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, as an important step towards global composition, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis, thus improving over the traditional greedy performance-aware policy that only considers the current call for optimization.

Place, publisher, year, edition, pages
IEEE, 2013
Series
Proceedings of the International Conference on Parallel Processing, ISSN 0190-3918
Keywords
Global composition; implementation selection; hybrid execution; GPU-based systems; performance portability
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-104651 (URN), 10.1109/ICPP.2013.83 (DOI), 000330046000074 ()
Conference
2013 42nd Annual International Conference on Parallel Processing (ICPP)