liu.se: Search for publications in DiVA
  • 1.
    Ali, Akhtar
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    OpenCL for programming shared memory multicore CPUs. 2011. In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, Vol. S. 65-70, p. 65-70. Conference paper (Other academic)
    Abstract [en]

    In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation.

  • 2.
    Ali, Akhtar
    et al.
    Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    OpenCL for programming shared memory multicore CPUs. 2012. In: Proceedings of the 5th Workshop on MULTIPROG2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, HiPEAC Network of Excellence, 2012. Conference paper (Refereed)
    Abstract [en]

    Shared memory multicore processor technology is pervasive in mainstream computing. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. OpenMP and Intel Threading Building Blocks (TBB) are two of the popular frameworks used to program these architectures. Recently, OpenCL has been defined as a standard by the Khronos Group; it focuses on programming a possibly heterogeneous set of processors with many cores, such as CPU cores, GPUs and DSP processors. In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation. The evaluation includes the effect of compiler optimizations for different frameworks, OpenCL performance on different vendors’ platforms and the performance gap between CPU-specific and GPU-specific OpenCL algorithms for execution on a modern GPU. Furthermore, a brief usability evaluation of the three frameworks is also presented.

  • 3.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Performance-aware Component Composition for GPU-based systems. 2014. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis addresses issues associated with efficiently programming modern heterogeneous GPU-based systems, containing multicore CPUs and one or more programmable Graphics Processing Units (GPUs). We use ideas from component-based programming to address programming, performance and portability issues of these heterogeneous systems. Specifically, we present three approaches that all use the idea of having multiple implementations for each computation; performance is achieved/retained either a) by selecting a suitable implementation for each computation on a given platform or b) by dividing the computation work across different implementations running on CPU and GPU devices in parallel.

    In the first approach, we work on a skeleton programming library (SkePU) that provides high-level abstraction while making intelligent implementation selection decisions underneath either before or during the actual program execution. In the second approach, we develop a composition tool that parses extra information (metadata) from XML files, makes certain decisions offline, and, in the end, generates code for making the final decisions at runtime. The third approach is a framework that uses source-code annotations and program analysis to generate code for the runtime library to make the selection decision at runtime. With a generic performance modeling API alongside program analysis capabilities, it supports online tuning as well as complex program transformations.

    These approaches differ in terms of genericity, intrusiveness, capabilities and knowledge about the program source-code; however, they all demonstrate usefulness of component programming techniques for programming GPU-based systems. With experimental evaluation, we demonstrate how all three approaches, although different in their own way, provide good performance on different GPU-based systems for a variety of applications.

  • 4.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Skeleton Programming for Heterogeneous GPU-based Systems. 2011. Licentiate thesis, monograph (Other academic)
    Abstract [en]

    In this thesis, we address issues associated with programming modern heterogeneous systems while focusing on a special kind of heterogeneous systems that include multicore CPUs and one or more GPUs, called GPU-based systems. We consider the skeleton programming approach to achieve high level abstraction for efficient and portable programming of these GPU-based systems and present our work on the SkePU library, which is a skeleton library for these systems.

    We extend the existing SkePU library with a two-dimensional (2D) data type and skeleton operations and implement several new applications using newly made skeletons. Furthermore, we consider the algorithmic choice present in SkePU and implement support to specify and automatically optimize the algorithmic choice for a skeleton call, on a given platform.

    To show how to achieve performance, we provide a case-study on optimized GPU-based skeleton implementation for 2D stencil computations and introduce two metrics to maximize resource utilization on a GPU. By devising a mechanism to automatically calculate these two metrics, performance can be retained while porting an application from one GPU architecture to another.

    Another contribution of this thesis is implementation of the runtime support for the SkePU skeleton library. This is achieved with the help of the StarPU runtime system. By this implementation, support for dynamic scheduling and load balancing for the SkePU skeleton programs is achieved. Furthermore, a capability to do hybrid execution by parallel execution on all available CPUs and GPUs in a system, even for a single skeleton invocation, is developed.

    SkePU initially supported only data-parallel skeletons. The first task-parallel skeleton (farm) in SkePU is implemented with support for performance-aware scheduling and hierarchical parallel execution by enabling all data parallel skeletons to be usable as tasks inside the farm construct.

    Experimental evaluations are carried out and presented for algorithmic selection, performance portability, dynamic scheduling and hybrid execution aspects of our work.

  • 5.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Enmyren, Johan
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Auto-tuning SkePU: A multi-backend skeleton programming framework for multi-GPU systems. 2011. In: IWMSE '11 Proceedings of the 4th International Workshop on Multicore Software Engineering, New York, NY, USA: Association for Computing Machinery (ACM), 2011, p. 25-32. Conference paper (Other academic)
    Abstract [en]

    SkePU is a C++ template library that provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Currently available skeletons in SkePU include map, reduce, mapreduce, map-with-overlap, maparray, and scan. The performance of SkePU generated code is comparable to that of hand-written code, even for more complex applications such as ODE solving.

    In this paper, we discuss initial results from auto-tuning SkePU using an off-line, machine learning approach where we adapt skeletons to a given platform using training data. The prediction mechanism at execution time uses off-line pre-calculated estimates to construct an execution plan for any desired configuration with minimal overhead. The prediction mechanism accurately predicts execution time for repetitive executions and includes a mechanism to predict execution time for user functions of different complexity. The tuning framework covers selection between different backends as well as choosing optimal parameter values for the selected backend. We will discuss our approach and initial results obtained for different skeletons (map, mapreduce, reduce).
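The skeleton idea named in this abstract (map, reduce, mapreduce) can be illustrated with a minimal sequential sketch. This is not SkePU's C++ template interface; the function names `map_skeleton`, `reduce_skeleton` and `mapreduce_skeleton` are hypothetical and chosen only for illustration. A real skeleton library would dispatch each call to a CPU, OpenMP, CUDA or OpenCL backend behind the same interface.

```python
from functools import reduce as _fold

def map_skeleton(f, xs):
    """Apply the user function f independently to every element."""
    return [f(x) for x in xs]

def reduce_skeleton(op, xs):
    """Combine all elements of a non-empty sequence with the binary operator op."""
    return _fold(op, xs)

def mapreduce_skeleton(f, op, xs, ys):
    """Map f over element pairs, then reduce the results with op (no intermediate list)."""
    return _fold(op, (f(x, y) for x, y in zip(xs, ys)))

# Dot product expressed as a single mapreduce call.
dot = mapreduce_skeleton(lambda a, b: a * b, lambda a, b: a + b,
                         [1, 2, 3], [4, 5, 6])
```

Because the user only supplies `f` and `op`, the library is free to choose how and where the computation runs, which is what makes backend selection and tuning possible.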

  • 6.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    A Framework for Performance-aware Composition of Applications for GPU-based Systems. 2013. Conference paper (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with performance model and other metadata. We present a component model and a composition framework for the performance-aware composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with particular focus on global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases when they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, as an important step towards global composition, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis, thus improving over the traditional greedy performance-aware policy that only considers the current call for optimization.

  • 7.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    A performance-portable generic component for 2D convolution computations on GPU-based systems. 2011. In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, Vol. S. 39-44, p. 39-44. Conference paper (Other academic)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton.

  • 8.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    A performance-portable generic component for 2D convolution computations on GPU-based systems. 2012. In: Proceedings of the Fifth International Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2012) at the HiPEAC-2012 conference, Paris, Jan. 2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, 2012. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton. The memory (constant and shared memory) and adaptive tiling optimizations are applied and their performance implications are evaluated on different classes of GPUs. We present two different metrics to calculate the optimal tiling factor dynamically in an automated way, which helps in retaining best performance without manual tuning while moving to new GPU architectures. With our approach, we can achieve average speedups by a factor of 3.6, 2.3, and 2.4 over an otherwise optimized (without tiling) implementation on NVIDIA C2050, GTX280 and 8800 GT GPUs respectively. Above all, the performance portability is achieved without requiring any manual changes in the skeleton program or the skeleton implementation.

  • 9.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Conditional component composition for GPU-based systems. 2014. In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014, Vienna, Austria: HiPEAC NoE, 2014. Conference paper (Refereed)
    Abstract [en]

    User-level components can expose multiple functionally equivalent implementations with different resource requirements and performance characteristics. A composition framework can then choose a suitable implementation for each component invocation guided by an objective function (execution time, energy etc.). In this paper, we describe the idea of conditional composition which enables the component writer to specify constraints on the selectability of a given component implementation based on information about the target system and component call properties. By incorporating such information, more informed and user-guided composition decisions can be made and thus more efficient code be generated, as shown with an example scenario for a GPU-based system.
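A minimal sketch of the conditional composition idea, under stated assumptions: each implementation variant carries a selectability constraint over the target system and call properties, and the composer picks the selectable variant with the lowest objective value. The `Variant` class, the `compose` function, the context keys (`n`, `has_gpu`) and all cost formulas are hypothetical illustrations, not the interface or the models used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Variant:
    name: str
    selectable: Callable[[dict], bool]  # constraint on system and call properties
    cost: Callable[[dict], float]       # estimated objective value, lower is better

def compose(variants: List[Variant], ctx: dict) -> Variant:
    """Filter variants by their constraints, then minimise the objective."""
    candidates = [v for v in variants if v.selectable(ctx)]
    if not candidates:
        raise RuntimeError("no selectable implementation variant")
    return min(candidates, key=lambda v: v.cost(ctx))

# Made-up variants: the GPU variant is only selectable when a GPU is present
# and the problem size n is large enough to be worth the launch overhead.
variants = [
    Variant("cpu_serial", lambda c: True, lambda c: 1.0 * c["n"]),
    Variant("gpu_kernel",
            lambda c: c["has_gpu"] and c["n"] >= 1000,
            lambda c: 500.0 + 0.01 * c["n"]),
]

small = compose(variants, {"n": 100, "has_gpu": True}).name
large = compose(variants, {"n": 100_000, "has_gpu": True}).name
no_gpu = compose(variants, {"n": 100_000, "has_gpu": False}).name
```

Separating the constraint from the cost estimate lets a component writer rule out a variant outright, rather than relying on the performance model to make it look unattractive.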

  • 10.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Performance-aware Composition Framework for GPU-based Systems. 2015. In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no 12, p. 4646-4662. Article in journal (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with performance models and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases when they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. The bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.
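The difference between a greedy policy and bulk composition can be shown with a small worked example. This is an illustrative sketch only: it assumes a chain of calls operating on the same data, a fixed transfer cost whenever the data moves between CPU and GPU, and brute-force enumeration of all plans. The framework described above instead uses static data-flow analysis; the function names and all timing numbers here are hypothetical.

```python
from itertools import product

def plan_cost(plan, calls, transfer, start="cpu"):
    """Total time of a plan: execution times plus a transfer whenever data moves."""
    total, loc = 0.0, start
    for device, times in zip(plan, calls):
        if device != loc:
            total += transfer
            loc = device
        total += times[device]
    return total

def greedy_plan(calls, transfer, start="cpu"):
    """Pick the cheapest device for each call, looking only at the current call."""
    plan, loc = [], start
    for times in calls:
        best = min(times, key=lambda d: times[d] + (transfer if d != loc else 0.0))
        plan.append(best)
        loc = best
    return tuple(plan)

def bulk_plan(calls, transfer, start="cpu"):
    """Consider all calls together (exhaustive search, fine for short chains)."""
    return min(product(("cpu", "gpu"), repeat=len(calls)),
               key=lambda p: plan_cost(p, calls, transfer, start))

# Three calls, each faster on the GPU, but one transfer costs as much as a CPU call.
calls = [{"cpu": 3.0, "gpu": 1.0}] * 3
greedy = greedy_plan(calls, transfer=3.0)
bulk = bulk_plan(calls, transfer=3.0)
```

With these numbers the greedy policy never pays the transfer and stays on the CPU for a total of 9.0, while the bulk plan pays the transfer once and runs all three calls on the GPU for a total of 6.0.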

  • 11.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Smart Containers and Skeleton Programming for GPU-Based Systems. 2016. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 44, no 3, p. 506-530. Article in journal (Refereed)
    Abstract [en]

    In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.
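The core mechanism behind a smart container, tracking which copy of the data is valid and transferring only when needed, can be sketched as follows. This is a simplified illustration with one host copy and one simulated device copy; the class `SmartVector` and its methods are hypothetical and do not reflect the SkePU container API or its actual consistency protocol.

```python
class SmartVector:
    """Keeps a host copy and a simulated device copy, copying lazily."""

    def __init__(self, data):
        self._host = list(data)
        self._device = None
        self._host_valid = True
        self._device_valid = False
        self.transfers = 0  # number of host<->device copies performed

    def device_read(self):
        """Return the device copy, uploading only if it is stale."""
        if not self._device_valid:
            self._device = list(self._host)
            self._device_valid = True
            self.transfers += 1
        return self._device

    def device_write(self, values):
        """A device computation produced new contents; the host copy is now stale."""
        self._device = list(values)
        self._device_valid = True
        self._host_valid = False

    def _sync_host(self):
        if not self._host_valid:
            self._host = list(self._device)
            self._host_valid = True
            self.transfers += 1

    def __getitem__(self, i):
        self._sync_host()  # implicit synchronisation on host access
        return self._host[i]

    def __setitem__(self, i, value):
        self._sync_host()
        self._host[i] = value
        self._device_valid = False  # device copy is now stale

v = SmartVector([1, 2, 3])
d = v.device_read()                    # first use on the device: one upload
v.device_read()                        # already valid on the device: no transfer
v.device_write([x * 2 for x in d])     # result stays on the device
first = v[0]                           # host access triggers one download
second = v[1]                          # host copy is valid now: no transfer
```

In this run only two transfers occur, one upload and one download, even though the data is touched five times.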

  • 12.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Thibault, Samuel
    Laboratoire Bordelais de Recherche en Informatique (LaBRI), France.
    Flexible runtime support for efficient skeleton programming on hybrid systems. 2012. In: Applications, Tools and Techniques on the Road to Exascale Computing / [ed] K. De Bosschere, E. H. D'Hollander, G. R. Joubert, D. Padua, F. Peters, Amsterdam: IOS Press, 2012, 22, p. 159-166. Chapter in book (Other academic)
    Abstract [en]

    SkePU is a skeleton programming framework for multicore CPU and multi-GPU systems. StarPU is a runtime system that provides dynamic scheduling and memory management support for heterogeneous, accelerator-based systems. We have implemented support for StarPU as a possible backend for SkePU while keeping the generic SkePU interface intact. The mapping of a SkePU skeleton call to one or more StarPU tasks allows StarPU to exploit independence between different skeleton calls as well as within a single skeleton call. Support for different StarPU features, such as data partitioning and different scheduling policies (e.g. history based performance models) is implemented and discussed in this paper. The integration proved beneficial for both StarPU and SkePU. StarPU got a high level interface to run data-parallel computations on it while SkePU has achieved dynamic scheduling and hybrid parallelism support. Several benchmarks including ODE solver, separable Gaussian blur filter, Successive Over-Relaxation (SOR) and Coulombic potential are implemented. Initial experiments show that we can even achieve super-linear speedups for realistic applications and can observe clear improvements in performance with the simultaneous use of both CPUs and GPU (hybrid execution).

  • 13.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Adaptive Implementation Selection in the SkePU Skeleton Programming Library. 2013. In: Advanced Parallel Processing Technologies (APPT-2013), Proceedings / [ed] Chengyung Wu and Albert Cohen (eds.), 2013, p. 170-183. Conference paper (Refereed)
    Abstract [en]

    In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four types of implementations (implementation variants) for each skeleton: serial C++, OpenMP, CUDA and OpenCL targeting either CPU or GPU execution respectively. Deciding which implementation would run faster for a given skeleton call depends upon the computation, problem size(s), system architecture and data locality.

    In this paper, we present our work on automatic selection between these implementation variants by an offline machine learning method which generates a compact decision tree with low training overhead. The proposed selection mechanism is flexible yet high-level allowing a skeleton programmer to control different training choices at a higher abstraction level. We have evaluated our optimization strategy with 9 applications/kernels ported to our skeleton library and achieve on average more than 94% (90%) accuracy with just 0.53% (0.58%) training space exploration on two systems. Moreover, we discuss one application scenario where local optimization considering a single skeleton call can prove sub-optimal, and propose a heuristic for bulk implementation selection considering more than one skeleton call to address such application scenarios.

  • 14.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    The PEPPHER composition tool: performance-aware composition for GPU-based systems. 2014. In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no 12, p. 1195-1211. Article in journal (Refereed)
    Abstract [en]

    The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.

  • 15.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-Based Systems. 2012. In: High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, IEEE, 2012, p. 711-720. Conference paper (Refereed)
    Abstract [en]

    The PEPPHER component model defines an environment for annotation of native C/C++ based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units might be provided, together with metadata such as explicitly exposed tunable parameters. The goal is to compose an application from its components and variants such that, depending on the run-time context, the most suitable implementation variant will be chosen automatically for each invocation. We describe and evaluate the PEPPHER composition tool, which explores the application's components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. With several applications, we demonstrate how the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath.

  • 16.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers. 2014. In: Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany / [ed] F. Hannig and J. Teich, 2014, p. 43-48. Conference paper (Refereed)
    Abstract [en]

    In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially, GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among multiple implementation variants, generating code and providing runtime support for efficient execution on such systems.

  • 17.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Majeed, Mudassar
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Furmento, Nathalie
    University of Bordeaux, INRIA, Bordeaux, France.
    Thibault, Samuel
    University of Bordeaux, INRIA, Bordeaux, France.
    Namyst, Raymond
    University of Bordeaux, INRIA, Bordeaux, France.
    Benkner, Siegfried
    University of Vienna, Austria.
    Pllana, Sabri
    University of Vienna, Austria.
    Träff, Jesper
    Technical University of Vienna, Austria.
    Wimmer, Martin
    Technical University of Vienna, Austria.
    Leveraging PEPPHER Technology for Performance Portable Supercomputing. 2013. In: High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, Salt Lake City, USA: IEEE conference proceedings, 2013, p. 1395-1396. Conference paper (Other academic)
    Abstract [en]

    PEPPHER is a 3-year EU FP7 project that develops a novel approach and framework to enhance performance portability and programmability of heterogeneous multi-core systems. Its primary target is single-node heterogeneous systems, where several CPU cores are supported by accelerators such as GPUs. This poster briefly surveys the PEPPHER framework for single-node systems, and elaborates on the prospects for leveraging the PEPPHER approach to generate performance-portable code for heterogeneous multi-node systems.

  • 18.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Thibault, Samuel
    INRIA / University of Bordeaux, France.
    Namyst, Raymond
    INRIA / University of Bordeaux, France.
    Richards, Andrew
    Codeplay Software Ltd., Edinburgh, UK.
    Dolinsky, Uwe
    Codeplay Software Ltd., Edinburgh, UK.
    Benkner, Siegfried
    University of Vienna, Austria.
    Träff, Jesper
    Technical University of Vienna, Austria.
    Pllana, Sabri
    University of Vienna, Austria.
    Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems. 2012. Conference paper (Refereed)
    Abstract [en]

    We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrational programming framework for heterogeneous multicore systems.

  • 19.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems. 2013. In: High Performance Computing for Computational Science - VECPAR 2012 / [ed] Dayde, Michel, Marques, Osni, Nakajima, Kengo, Springer, 2013, p. 329-345. Conference paper (Refereed)
    Abstract [en]

    In recent years heterogeneous multi-core systems have received much attention. However, performance optimization on these platforms remains a big challenge. Optimizations performed by compilers are often limited by a lack of dynamic information and run-time environment information, so applications are often not performance portable. One current approach is to provide multiple implementations for the same interface that can be used interchangeably depending on the call context, and to expose the composition choices to a compiler, deployment-time composition tool and/or run-time system. Off-line machine-learning techniques make it possible to improve the precision and reduce the run-time overhead of run-time composition, which improves performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool by off-line composition and present an adaptive machine learning algorithm for generating compact and efficient dispatch data structures with low training time. As the dispatch data structure we propose an adaptive decision tree structure, which implies an adaptive training algorithm that makes it possible to control the trade-off between training time, dispatch precision and run-time dispatch overhead.

    We have evaluated our optimization strategy with simple kernels (matrix multiplication and sorting) as well as applications from the RODINIA benchmark on two GPU-based heterogeneous systems. On average, the precision for composition choices reaches 83.6 percent with approximately 34 minutes of off-line training time.

  • 20.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. 2016. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, p. 37-45. Article in journal (Refereed)
    Abstract [en]

    Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models, which require little or no human effort, are more practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method by new pruning techniques to better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by a factor of up to 39 without a noticeable decrease in prediction accuracy.

  • 21.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. 2014. In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW), IEEE conference proceedings, 2014, p. 255-264. Conference paper (Refereed)
    Abstract [en]

    Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models based on trial executions, which require little or no human effort, are more practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of an adaptive pruning algorithm for efficient selection of training samples, a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context properties, and a composition tool for building the overall composed application from its components. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method by new pruning techniques to better support the convexity assumption and better control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by a factor of up to 39 without a noticeable decrease in prediction accuracy. Furthermore, we evaluate the effect of combinations of pruning strategies and compare our adaptive sampling method with random sampling. We also use our smart-sampling method as a preprocessor to a state-of-the-art decision tree learning algorithm and compare the result to the predictor directly calculated by our method.

  • 22.
    Maghazeh, Arian
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Bordoloi, Unmesh D.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering. Ericsson Sweden.
    Andrei, Alexandru
    Ericsson Sweden.
    Eles, Petru
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Peng, Zebo
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Latency-Aware Packet Processing on CPU-GPU Heterogeneous Systems. 2017. In: DAC '17 Proceedings of the 54th Annual Design Automation Conference 2017, New York, NY, USA: Association for Computing Machinery (ACM), 2017. Conference paper (Refereed)
    Abstract [en]

    In response to the tremendous growth of the Internet, towards what we call the Internet of Things (IoT), there is a need to move from costly, high-time-to-market specific-purpose hardware to flexible, low-time-to-market general-purpose devices for packet processing. Among several such devices, GPUs have attracted attention in the past, mainly because the high computing demand of packet processing applications can, potentially, be satisfied by these throughput-oriented machines. However, another important aspect of such applications is the packet latency which, if not handled carefully, will overshadow the throughput benefits. Unfortunately, until now, this aspect has been mostly ignored. To address this issue, we propose a method that considers the variable bit rate of the traffic and, depending on the current rate, minimizes the latency while meeting the rate demand. We propose a persistent-kernel based software architecture to overcome the challenges inherent in GPU implementations, such as kernel invocation overhead, CPU-GPU communication and memory access overhead. We have chosen packet classification as the packet processing application to demonstrate our technique. Using the proposed approach, we are able to reduce the packet latency on average by a factor of 3.5, compared to the state-of-the-art solutions, without dropping any packets.

  • 23.
    Majeed, Mudassar
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters. 2013. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA-2013), 2013. Conference paper (Refereed)
    Abstract [en]

    SkePU is a C++ template library with a simple and unified interface for expressing data-parallel computations in terms of generic components, called skeletons, on multi-GPU systems using CUDA and OpenCL. The smart containers in SkePU, such as Matrix and Vector, perform data management with a lazy memory copying mechanism that reduces redundant data communication. SkePU provides programmability, portability and even performance portability, but up to now applications written using SkePU could only run on a single multi-GPU node. We present the extension of SkePU for GPU clusters without the need to modify the SkePU application source code. With our prototype implementation, we performed two experiments. The first experiment demonstrates the scalability with regular algorithms for N-body simulation and electric field calculation over multiple GPU nodes. The results for the second experiment show the benefit of lazy memory copying in terms of the speedup gained for one level of Strassen’s algorithm and for a synthetic matrix sum application.
