Search for publications in DiVA (liu.se)
1 - 50 of 106
  • 1.
    Ali, Akhtar
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    OpenCL for programming shared memory multicore CPUs (2011). In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, p. 65-70. Conference paper (Other academic)
    Abstract [en]

    In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation.

  • 2.
    Ali, Akhtar
    et al.
    Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    OpenCL for programming shared memory multicore CPUs (2012). In: Proceedings of the 5th Workshop on MULTIPROG-2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, HiPEAC Network of Excellence, 2012. Conference paper (Refereed)
    Abstract [en]

    Shared memory multicore processor technology is pervasive in mainstream computing. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. OpenMP and Intel Threading Building Blocks (TBB) are two of the popular frameworks used to program these architectures. Recently, OpenCL has been defined as a standard by the Khronos Group; it focuses on programming a possibly heterogeneous set of processors with many cores, such as CPU cores, GPUs, and DSP processors. In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation. The evaluation includes the effect of compiler optimizations for different frameworks, OpenCL performance on different vendors' platforms, and the performance gap between CPU-specific and GPU-specific OpenCL algorithms for execution on a modern GPU. Furthermore, a brief usability evaluation of the three frameworks is also presented.
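
As an illustration of one of the five benchmarks named above, here is a minimal sequential sketch of the Pi value approximation in Python. The papers' kernels are written in OpenCL, OpenMP and TBB; the midpoint-rule integration formulation used here is an assumption for illustration, not necessarily the formulation benchmarked in the paper.

```python
# Sequential sketch of the "Pi value approximation" benchmark:
# approximate pi by midpoint-rule integration of 4/(1+x^2) over [0,1].
# (Assumed formulation; the paper's kernels are OpenCL/OpenMP/TBB code.)

def approximate_pi(n_steps: int) -> float:
    step = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        x = (i + 0.5) * step          # midpoint of interval i
        total += 4.0 / (1.0 + x * x)  # integrand whose integral is pi
    return total * step
```

A data-parallel version simply splits the loop range across threads or work-items and reduces the partial sums, which is why the benchmark maps naturally onto all three frameworks.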

  • 3.
    Andersson, Jesper
    et al.
    MSI Universitet Växjö, Sweden.
    Ericsson, Morgan
    MSI Universitet Växjö, Sweden.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Löwe, Welf
    MSI Universitet Växjö, Sweden.
    Profile-Guided Composition (2008). In: 7th Int. Symposium on Software Composition SC 2008, Berlin: Springer, 2008, p. 157-. Conference paper (Refereed)
    Abstract [en]

    We present an approach that generates context-aware, optimized libraries of algorithms and data structures. The search space contains all combinations of implementation variants of algorithms and data structures including dynamically switching and converting between them. Based on profiling, the best implementation for a certain context is precomputed at deployment time and selected at runtime. In our experiments, the profile-guided composition outperforms the individual variants in almost all cases.

  • 4.
    Avdic, Kenan
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Melot, Nicolas
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Keller, Jörg
    FernUniversität in Hagen.
    Pipelined parallel sorting on the Intel SCC (2011). In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, p. 96-101. Conference paper (Other academic)
    Abstract [en]

    The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 Intel IA-32 cores linked by an on-chip high-performance mesh network, as well as four DDR3 memory controllers to access an off-chip main memory. We investigate the adaptation of sorting onto the SCC as an algorithm engineering problem. We argue that a combination of pipelined mergesort and sample sort will fit best to the SCC's architecture. We also provide a mapping based on integer linear programming to address load balancing and latency considerations. We describe a prototype implementation of our proposal together with preliminary runtime measurements that indicate the usefulness of this approach. As mergesort can be considered a representative of the class of streaming applications, the techniques developed here should also apply to other problems in this class, such as many applications for parallel embedded systems (MPSoC).
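
The sample-sort half of the combination described above can be sketched sequentially as follows. This is a hypothetical simplification: the paper's implementation is parallel and pipelined, with the bucket work mapped onto SCC cores via integer linear programming.

```python
import bisect
import random

def sample_sort(data, n_buckets=4, oversample=8):
    """Minimal sequential sketch of sample sort: choose splitters from a
    random sample, partition the input into range-disjoint buckets (one
    bucket per core in the paper's SCC setting), then sort each bucket
    independently and concatenate."""
    if len(data) < n_buckets * oversample:
        return sorted(data)               # too small to sample splitters from
    sample = sorted(random.sample(data, n_buckets * oversample))
    splitters = [sample[(i + 1) * oversample] for i in range(n_buckets - 1)]
    buckets = [[] for _ in range(n_buckets)]
    for x in data:
        # bisect_left is monotone in x, so bucket ranges are disjoint
        buckets[bisect.bisect_left(splitters, x)].append(x)
    # on the SCC, each bucket would be sorted by a different core in parallel
    return [x for bucket in buckets for x in sorted(bucket)]
```

Oversampling makes the splitters approximate the true quantiles, which is what gives the per-core buckets roughly equal load.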

  • 5.
    Bednarski, Andrzej
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Energy-Optimal Integrated VLIW Code Generation (2004). In: CPC04 11th Int. Workshop on Compilers for Parallel Computers, 2004, p. 227-238. Conference paper (Other academic)
  • 6.
    Bednarski, Andrzej
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Exploiting Symmetries for Optimal Integrated Code Generation (2004). In: Int. Conf. on Embedded Systems and Applications ESA04, 2004. Conference paper (Refereed)
  • 7.
    Bednarski, Andrzej
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Integer Linear Programming versus Dynamic Programming for Optimal Integrated VLIW Code Generation (2006). In: 12th Int. Workshop on Compilers for Parallel Computers, 2006, p. 73-. Conference paper (Refereed)
  • 8.
    Bednarski, Andrzej
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Optimal integrated code generation for VLIW architectures (2003). In: Proc. of CPC'03 10th Int. Workshop on Compilers for Parallel Computers, Amsterdam, The Netherlands, January 2003, Leiden, The Netherlands: Leiden Institute of Advanced Computer Science, 2003. Conference paper (Refereed)
  • 9.
    Bednarski, Andrzej
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Optimal integrated VLIW code generation with Integer Linear Programming (2006). In: Euro-Par 2006 Parallel Processing: 12th International Euro-Par Conference, Dresden, Germany, August 28 - September 1, 2006. Proceedings / [ed] Wolfgang E. Nagel, Wolfgang V. Walter and Wolfgang Lehner, Springer Berlin/Heidelberg, 2006, Vol. 4128, p. 461-472. Chapter in book (Refereed)
    Abstract [en]

    We give an Integer Linear Programming (ILP) solution that fully integrates all steps of code generation, i.e. instruction selection, register allocation and instruction scheduling, on the basic block level for VLIW processors.

    In earlier work, we contributed a dynamic programming (DP) based method for optimal integrated code generation, implemented in our retargetable code generator OPTIMIST. In this paper we give first results to evaluate and compare our ILP formulation with our DP method on a VLIW processor. We also demonstrate how to precondition the ILP model by a heuristic relaxation of the DP method to improve ILP optimization time.
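
The flavor of the DP approach mentioned above can be sketched with a heavily simplified model. Assumed simplifications: unit latencies, a pure issue-width constraint, and no instruction selection or register allocation, so this is only a toy cousin of the OPTIMIST state space, not the paper's formulation.

```python
from functools import lru_cache
from itertools import combinations

def optimal_schedule_length(n, deps, width):
    """Toy DP over scheduling states for one basic block: a state is the
    bitmask of already-scheduled instructions, and each DP step fills one
    issue cycle with up to `width` ready instructions. Returns the minimal
    number of cycles. `deps` is a list of (pred, succ) dependence pairs."""
    preds = [0] * n
    for p, s in deps:
        preds[s] |= 1 << p

    @lru_cache(maxsize=None)
    def best(done):
        if done == (1 << n) - 1:
            return 0                      # everything scheduled
        ready = [i for i in range(n)
                 if not done >> i & 1 and preds[i] & done == preds[i]]
        shortest = float("inf")
        for k in range(1, min(width, len(ready)) + 1):
            for group in combinations(ready, k):
                mask = done
                for i in group:
                    mask |= 1 << i        # issue this group in the next cycle
                shortest = min(shortest, 1 + best(mask))
        return shortest

    return best(0)
```

The state space is exponential in the block size, which is exactly why the papers study both compression of the DP (symmetries, relaxations) and ILP as an alternative exact method.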

  • 10.
    Benkner, Siegfried
    et al.
    University of Vienna.
    Pllana, Sabri
    University of Vienna.
    Larsson Träff, Jesper
    University of Vienna.
    Tsigas, Philippas
    Chalmers.
    Dolinsky, Uwe
    Codeplay Software.
    Augonnet, Cèdric
    INRIA Bordeaux.
    Bachmayer, Beverly
    Intel GmbH.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Moloney, David
    Movidius.
    Osipov, Vitaly
    Karlsruhe Institute of Technology.
    PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems (2011). In: IEEE Micro, ISSN 0272-1732, E-ISSN 1937-4143, Vol. 31, no 5, p. 28-41. Article in journal (Refereed)
    Abstract [en]

    PEPPHER, a three-year European FP7 project, addresses efficient utilization of hybrid (heterogeneous) computer systems consisting of multicore CPUs with GPU-type accelerators. This article outlines the PEPPHER performance-aware component model, performance prediction means, runtime system, and other aspects of the project. A larger example demonstrates performance portability with the PEPPHER approach across hybrid systems with one to four GPUs.

  • 11.
    Brenner, Jürgen
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Executing PRAM Programs on GPUs (2012). In: Procedia Computer Science, ISSN 1877-0509, E-ISSN 1877-0509, Vol. 9, p. 1799-1806. Article in journal (Refereed)
    Abstract [en]

    We present a framework to transform PRAM programs from the PRAM programming language Fork to CUDA C, so that they can be compiled and executed on a Graphics Processor (GPU). This allows parallel algorithmics to be explored on a scale beyond toy problems, to which the previous, sequential PRAM simulator restricted practical use. We explain the design decisions and evaluate a prototype implementation consisting of a runtime library and a set of rules to transform simple Fork programs, which we for now apply by hand. The resulting CUDA code is almost 100 times faster than the previous simulator for compiled Fork programs and can handle larger data sizes. Compared to a sequential program for the same problem, the GPU code might be faster or slower, depending on the Fork program structure, i.e. on the overhead incurred. We also give an outlook on how future GPUs might notably reduce the overhead.
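
A synchronous PRAM step of the kind Fork expresses can be illustrated by simulating it in Python. This is an illustrative simulation of a Hillis-Steele prefix-sum scan, not code produced by the paper's Fork-to-CUDA transformation.

```python
def pram_prefix_sums(values):
    """Simulate a synchronous PRAM computing inclusive prefix sums
    (Hillis-Steele scan): in round r, processor i reads x[i - 2**r] and
    adds it to its own value. All reads happen before all writes,
    mimicking the lockstep semantics of Fork's synchronous mode."""
    x = list(values)
    n = len(x)
    d = 1
    while d < n:
        # "read phase": every processor reads simultaneously
        read = [x[i - d] if i >= d else 0 for i in range(n)]
        # "write phase": every processor writes simultaneously
        x = [x[i] + read[i] for i in range(n)]
        d *= 2
    return x
```

Mapping such lockstep rounds onto CUDA's weaker synchronization model (barriers within a block, kernel launches across blocks) is the kind of issue the transformation rules must address.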

  • 12.
    Chalabine, Mikhail
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    A Formal Framework for Automated Round-trip Software Engineering in Static Aspect Weaving and Transformations (2007). In: The 29th Int. Conference on Software Engineering ICSE 2007, USA: IEEE, 2007. Conference paper (Refereed)
    Abstract [en]

    We present a formal framework for a recently introduced approach to Automated Round-trip Software Engineering (ARE) in source-level aspect weaving systems. Along with the formalization we improve the original method and suggest a new concept of weaving transactions in Aspect-oriented Programming (AOP). As the major contribution we formally show how, given a tree-shaped intermediate representation of a program and an ancillary transposition tree, manual edits in statically woven code can consistently be mapped back to their proper source of origin, which is either in the application core or in an element in the aspect space. The presented formalism is constructive. It frames AOP by generalizing static aspect weaving to classical tree transformations.

  • 13.
    Chalabine, Mikhail
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    A Survey of Reasoning in Parallelization (2007). In: The 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing SNPD 2007, China: IEEE, 2007. Conference paper (Refereed)
    Abstract [en]

    We elaborate on reasoning in contemporary (semi-)automatic parallelizing refactoring. As the main contribution, we summarize contemporary approaches and show that all attempts to reason in parallelization thus far have amounted to local code analysis given data and control dependencies. We conclude that, by retaining only this perspective, parallelization continues to exploit merely a subset of the reasoning methods available today and is likely to remain limited. To address this problem, we suggest expanding the local analyses so that they take into account the relations between individual local parallelizing transformations. We argue that such a coupling makes it possible to handle sparser parallelizable constructs, such as producer-consumer coordination. We identify questions to be addressed to put this principle into action and report on-going work on (reasoning) mechanisms able to support this.

  • 14.
    Chalabine, Mikhail
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Crosscutting Concerns in Parallelization by Invasive Software Composition and Aspect Weaving (2006). In: 39th Hawaii International Conference on System Sciences HICSS 2006, HI, USA: IEEE, 2006. Conference paper (Refereed)
  • 15.
    Chalabine, Mikhail
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Parallelisation of sequential programs by invasive composition and aspect weaving (2005). In: Advanced Parallel Processing Technologies: 6th International Workshop, APPT 2005, Hong Kong, China, October 27-28, 2005. Proceedings / [ed] Jiannong Cao, Wolfgang Nejdl and Ming Xu, Springer Berlin/Heidelberg, 2005, Vol. 3756, p. 131-140. Chapter in book (Refereed)
    Abstract [en]

    We propose a new method of interactively parallelising programs that is based on aspect weaving and invasive software composition. This can be seen as an alternative to skeleton programming. We give motivating examples for how our method could be applied.

  • 16.
    Chalabine, Mikhail
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Bunus, Peter
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Automated Round-trip Software Engineering in Aspect Weaving Systems (2006). In: 21st IEEE/ACM International Conference on Automated Software Engineering, 2006. ASE '06, Tokyo, Japan: IEEE/ACM, 2006, p. 305-308. Conference paper (Refereed)
    Abstract [en]

    We suggest an approach to Automated Round-trip Software Engineering in source-level aspect weaving systems that allows for transparent mapping of manual edits in the woven program back to the appropriate source of origin, which is either the application core or the aspect space.

  • 17.
    Chalabine, Mikhail
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Wiklund, Staffan
    Ericsson.
    Optimising Intensive Interprocess Communication in a Parallelised Telecommunication Traffic Simulator (2003). In: High Performance Computing Symposium, part of the Advanced Simulation Technology conference, 2003. Conference paper (Refereed)
    Abstract [en]

    This paper focuses on an efficient user-level handling of intensive interprocess communication in distributed parallel applications that are characterised by a high rate of data exchange. A common feature of such systems is that any parallelisation strategy focusing on the largest parallelisable fraction results in the highest possible rate of interprocess communication, compared to other parallelisation strategies. An example of such applications is the class of telecommunication traffic simulators, where the partition-communication phenomenon reveals itself due to the strong data interdependencies among the major parallelisable tasks, namely, encoding of messages, decoding of messages, and interpretation of messages.

  • 18.
    Cichowski, Patrick
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Energy-efficient Mapping of Task Collections onto Manycore Processors (2013). In: Proceedings of the MULTIPROG'13 workshop at HiPEAC'13 / [ed] E. Ayguade et al., 2013. Conference paper (Refereed)
    Abstract [en]

    Streaming applications consist of a number of tasks that all run concurrently and that process data at certain rates. On manycore processors, the tasks of the streaming application must be mapped onto the cores. While load balancing of such applications has been considered, especially in the MPSoC community, we investigate energy-efficient mapping of such task collections onto manycore processors. We first derive rules that guide the mapping process and show that, as long as dynamic power consumption dominates static power consumption, the latter can be ignored and the problem reduces to load balancing. When, however, static power consumption becomes a notable fraction of total power consumption, as expected in the coming years, an energy-efficient mapping must take it into account, e.g. by temporarily shutting down cores or by restricting the number of cores used. We validate our findings with synthetic and real-world applications on the Intel SCC manycore processor.
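
The static-power rule above can be sketched with a toy model. The energy model below, with constant static power per active core over the whole makespan and mapping-independent dynamic energy, is an assumption for illustration and not the paper's exact model.

```python
def lpt_makespan(loads, m):
    """Longest-processing-time-first list scheduling of task loads onto
    m cores; returns the makespan (load of the most loaded core)."""
    cores = [0.0] * m
    for t in sorted(loads, reverse=True):
        cores[cores.index(min(cores))] += t
    return max(cores)

def best_core_count(loads, p_static, p_dyn_per_unit):
    """Pick the number of active cores minimizing total energy under an
    assumed model: every active core draws p_static for the whole
    makespan, while dynamic energy is proportional to total work and
    hence independent of the mapping."""
    e_dyn = p_dyn_per_unit * sum(loads)
    best_m, best_e = 1, float("inf")
    for m in range(1, len(loads) + 1):
        e = m * p_static * lpt_makespan(loads, m) + e_dyn
        if e < best_e:
            best_m, best_e = m, e
    return best_m
```

When `p_static` is negligible, the energy differences between core counts vanish and only load balancing (the makespan) matters; when it is notable, using fewer cores can win even though the makespan grows, which is the trade-off the paper investigates.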

  • 19.
    Cichowski, Patrick
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Modelling Power Consumption of the Intel SCC (2012). In: Proceedings of the 6th Many-core Applications Research Community (MARC) Symposium / [ed] Eric Noulard, HAL Archives Ouvertes, 2012. Conference paper (Refereed)
    Abstract [en]

    The Intel SCC manycore processor supports energy-efficient computing by dynamic voltage and frequency scaling of cores on a fine-grained level. In order to enable the use of that feature in application-level energy optimizations, we report on experiments to measure power consumption in different situations. We process those measurements by a least-squares error analysis to derive the parameters of popular models for power consumption which are used on an algorithmic level. Thus, we provide a link between the worlds of hardware and high-level algorithmics.
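
The least-squares step described above can be sketched as follows. The concrete model form P = P_static + c * f * V^2 is one common "popular model"; choosing it here is an assumption, as is the synthetic data, since the paper fits real SCC measurements.

```python
def fit_power_model(samples):
    """Least-squares fit of the power model P = P_static + c * f * V^2
    to (f, V, P) measurements, via the normal equations of simple linear
    regression with regressor x = f * V^2."""
    xs = [f * v * v for f, v, _ in samples]   # regressor: f * V^2
    ys = [p for _, _, p in samples]
    n = len(samples)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    c = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    p_static = (sy - c * sx) / n
    return p_static, c
```

With real measurements the residual after the fit indicates how well the chosen model form captures the hardware's behaviour, which is the link between hardware and algorithm-level models the paper aims for.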

  • 20.
    Dale, Nell
    et al.
    University of Texas at Austin.
    Bishop, Judith
    University of Pretoria.
    Barnes, David
    University of Kent at Canterbury.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    A dialog between authors and teachers (2002). In: Proc. ACM SIGCSE ITiCSE'02 7th Annual Conf. on Information Technology in Computer Science Education, Aarhus, Denmark, June 2002, New York: ACM, 2002. Conference paper (Refereed)
  • 21.
    Danylenko, Antonina
    et al.
    Linnaeus University, Växjö.
    Löwe, Welf
    Linnaeus University, Växjö.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Comparing Machine Learning Approaches for Context-Aware Composition (2011). In: Software Composition / [ed] Sven Apel, Ethan Jackson, Springer, 2011, p. 18-33. Conference paper (Refereed)
    Abstract [en]

    Context-Aware Composition allows optimal variants of algorithms, data structures, and schedules to be selected automatically at runtime using generalized dynamic Dispatch Tables. These tables grow exponentially with the number of significant context attributes. To make Context-Aware Composition scale, we suggest four alternative implementations of Dispatch Tables, all well known in the field of machine learning: Decision Trees, Decision Diagrams, Naive Bayes and Support Vector Machine classifiers. We assess their decision overhead and memory consumption theoretically and practically in a number of experiments on different hardware platforms. Decision Diagrams turn out to be more compact than Dispatch Tables, almost as accurate, and faster in decision making. Using Decision Diagrams in Context-Aware Composition leads to better scalability, i.e., Context-Aware Composition can be applied at more program points and can regard more context attributes than before.
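
The compaction effect that makes Decision Diagrams smaller than Dispatch Tables can be shown in miniature. This is hypothetical toy code, not the paper's implementation: it only merges identical sub-tables by dropping attributes that never influence the decision.

```python
from itertools import product

def build_dispatch_table(attrs, decide):
    """Full dispatch table: one entry per combination of context-attribute
    values -- its size grows exponentially with the number of attributes."""
    names = [n for n, _ in attrs]
    domains = [d for _, d in attrs]
    return {combo: decide(dict(zip(names, combo))) for combo in product(*domains)}

def compress(table, attrs):
    """Toy sketch of the decision-diagram idea: repeatedly drop any
    attribute all of whose sub-tables are identical (its value never
    changes the decision), shrinking the structure without changing
    any lookup result. Returns (reduced table, remaining attributes)."""
    for i, (name, domain) in enumerate(attrs):
        sub_tables = {
            v: frozenset((k[:i] + k[i + 1:], d) for k, d in table.items() if k[i] == v)
            for v in domain
        }
        if len(set(sub_tables.values())) == 1:    # attribute is irrelevant
            reduced = {k[:i] + k[i + 1:]: d for k, d in table.items() if k[i] == domain[0]}
            return compress(reduced, attrs[:i] + attrs[i + 1:])
    return table, attrs
```

Real decision diagrams also share isomorphic subtrees below differing decisions, which compresses much more aggressively than this attribute-dropping sketch.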

  • 22.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Enmyren, Johan
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Auto-tuning SkePU: A multi-backend skeleton programming framework for multi-GPU systems (2011). In: IWMSE '11: Proceedings of the 4th International Workshop on Multicore Software Engineering, New York, NY, USA: Association for Computing Machinery (ACM), 2011, p. 25-32. Conference paper (Other academic)
    Abstract [en]

    SkePU is a C++ template library that provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Currently available skeletons in SkePU include map, reduce, mapreduce, map-with-overlap, maparray, and scan. The performance of SkePU generated code is comparable to that of hand-written code, even for more complex applications such as ODE solving.

    In this paper, we discuss initial results from auto-tuning SkePU using an off-line, machine learning approach where we adapt skeletons to a given platform using training data. The prediction mechanism at execution time uses off-line pre-calculated estimates to construct an execution plan for any desired configuration with minimal overhead. The prediction mechanism accurately predicts execution time for repetitive executions and includes a mechanism to predict execution time for user functions of different complexity. The tuning framework covers selection between different backends as well as choosing optimal parameter values for the selected backend. We will discuss our approach and initial results obtained for different skeletons (map, mapreduce, reduce).
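
The off-line tuning idea above can be sketched in a few lines. This is hypothetical toy code: synthetic cost functions stand in for SkePU's timed training runs, and the "execution plan" is reduced to a size-to-backend lookup table; SkePU's actual interface and plan format differ.

```python
import bisect

def build_execution_plan(variants, train_sizes):
    """Offline tuning sketch: 'measure' each backend variant at the
    training sizes and record the winner per size. Here `variants` maps a
    backend name to a synthetic cost function of the problem size."""
    return sorted((n, min(variants, key=lambda v: variants[v](n)))
                  for n in train_sizes)

def select_backend(plan, n):
    """Runtime selection with minimal overhead: pick the winner recorded
    for the nearest trained size at or below n (fall back to the smallest)."""
    sizes = [s for s, _ in plan]
    i = max(0, bisect.bisect_right(sizes, n) - 1)
    return plan[i][1]
```

The point of precomputing the plan off-line is that the runtime decision is just a table lookup, so the selection overhead stays negligible for repetitive executions.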

  • 23.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    A Framework for Performance-aware Composition of Applications for GPU-based Systems (2013). Conference paper (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with a performance model and other metadata. We present a component model and a composition framework for the performance-aware composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with particular focus on global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases where they are learned at runtime. We also demonstrate the hybrid execution capabilities of our framework on real applications. Furthermore, as an important step towards global composition, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data-flow information extracted from the source program by static analysis, thus improving over the traditional greedy performance-aware policy that only considers the current call for optimization.

  • 24.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    A performance-portable generic component for 2D convolution computations on GPU-based systems (2011). In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, p. 39-44. Conference paper (Other academic)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton.

  • 25.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    A performance-portable generic component for 2D convolution computations on GPU-based systems (2012). In: Proceedings of the Fifth International Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2012) at the HiPEAC-2012 conference, Paris, Jan. 2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, 2012. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton. The memory (constant and shared memory) and adaptive tiling optimizations are applied and their performance implications are evaluated on different classes of GPUs. We present two different metrics to calculate the optimal tiling factor dynamically in an automated way, which helps in retaining the best performance without manual tuning when moving to new GPU architectures. With our approach, we can achieve average speedups by a factor of 3.6, 2.3, and 2.4 over an otherwise optimized (without tiling) implementation on NVIDIA C2050, GTX280 and 8800 GT GPUs, respectively. Above all, the performance portability is achieved without requiring any manual changes in the skeleton program or the skeleton implementation.
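
The underlying computation of the component is plain 2D convolution; a naive reference sketch is shown below, without the tiling and shared/constant-memory optimizations that are exactly what the paper adds on the GPU. This is illustrative code, not the SkePU MapOverlap implementation.

```python
def convolve2d(image, kernel):
    """Naive 2D convolution over the interior of `image` (no boundary
    padding): each output element is the weighted sum of its input
    neighbourhood -- the per-element computation a MapOverlap-style
    skeleton performs in parallel."""
    kh, kw = len(kernel), len(kernel[0])
    rh, rw = kh // 2, kw // 2            # kernel radii
    h, w = len(image), len(image[0])
    out = []
    for y in range(rh, h - rh):
        row = []
        for x in range(rw, w - rw):
            acc = 0
            for dy in range(kh):
                for dx in range(kw):
                    acc += kernel[dy][dx] * image[y + dy - rh][x + dx - rw]
            row.append(acc)
        out.append(row)
    return out
```

On a GPU, each output element reads a (kh x kw) window that overlaps its neighbours' windows, which is why caching tiles of the input in shared memory pays off and why the tiling factor matters for performance.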

  • 26.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Conditional component composition for GPU-based systems2014In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014, Vienna, Austria: HiPEAC NoE , 2014Conference paper (Refereed)
    Abstract [en]

    User-level components can expose multiple functionally equivalent implementations with different resource requirements and performance characteristics. A composition framework can then choose a suitable implementation for each component invocation, guided by an objective function (execution time, energy, etc.). In this paper, we describe the idea of conditional composition, which enables the component writer to specify constraints on the selectability of a given component implementation based on information about the target system and component call properties. By incorporating such information, more informed and user-guided composition decisions can be made and thus more efficient code can be generated, as shown with an example scenario for a GPU-based system.
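
    The conditional-composition idea can be sketched as implementation variants that each carry a selectability predicate: the framework filters out non-selectable variants for the current call context, then picks the expected-cheapest remaining one. All names, predicates and cost numbers below are invented for illustration.

    ```cpp
    #include <functional>
    #include <limits>
    #include <string>
    #include <vector>

    // Call/platform properties available at composition time (illustrative).
    struct CallContext { std::size_t problem_size; bool has_gpu; };

    // A variant carries the component writer's selectability constraint and
    // an estimated-cost model used as the objective function.
    struct Variant {
        std::string name;
        std::function<bool(const CallContext&)>   selectable;
        std::function<double(const CallContext&)> est_cost;
    };

    // Filter by selectability, then pick the cheapest remaining variant.
    std::string compose(const std::vector<Variant>& vs, const CallContext& ctx) {
        std::string best = "none";
        double best_cost = std::numeric_limits<double>::infinity();
        for (const Variant& v : vs)
            if (v.selectable(ctx) && v.est_cost(ctx) < best_cost) {
                best = v.name;
                best_cost = v.est_cost(ctx);
            }
        return best;
    }
    ```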

  • 27.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Performance-aware Composition Framework for GPU-based Systems2015In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no 12, p. 4646-4662Article in journal (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with a performance model and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases where they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. The bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.

  • 28.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Smart Containers and Skeleton Programming for GPU-Based Systems2016In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 44, no 3, p. 506-530Article in journal (Refereed)
    Abstract [en]

    In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.
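
    The core of the smart-container idea can be sketched as an operand wrapper that tracks where the valid copy of its data lives and copies lazily on access. The class, the two-memory model (a host vector standing in for GPU memory), and the transfer counter below are illustrative assumptions, not the SkePU implementation.

    ```cpp
    #include <utility>
    #include <vector>

    // Sketch of a "smart container": data transfers happen only when the
    // accessed memory unit does not already hold a valid copy.
    class SmartVector {
        std::vector<float> host_, device_;   // device_ stands in for GPU memory
        bool host_valid_ = true, device_valid_ = false;
        int transfers_ = 0;                  // copies actually performed
    public:
        explicit SmartVector(std::vector<float> v) : host_(std::move(v)) {}

        // Called before a (simulated) GPU skeleton call that writes the data.
        std::vector<float>& on_device() {
            if (!device_valid_) { device_ = host_; device_valid_ = true; ++transfers_; }
            host_valid_ = false;             // device copy will be modified
            return device_;
        }
        // Called when the CPU side reads the result back.
        std::vector<float>& on_host() {
            if (!host_valid_) { host_ = device_; host_valid_ = true; ++transfers_; }
            return host_;
        }
        int transfers() const { return transfers_; }
    };
    ```

    Two consecutive device-side skeleton calls on the same container trigger only one upload; the download happens only when the host actually reads the data.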

  • 29.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Thibault, Samuel
    Laboratoire Bordelais de Recherche en Informatique (LaBRI), France.
    Flexible runtime support for efficient skeleton programming on hybrid systems2012In: Applications, Tools and Techniques on the Road to Exascale Computing / [ed] K. De Bosschere, E. H. D'Hollander, G. R. Joubert, D. Padua, F. Peters., Amsterdam: IOS Press, 2012, 22, p. 159-166Chapter in book (Other academic)
    Abstract [en]

    SkePU is a skeleton programming framework for multicore CPU and multi-GPU systems. StarPU is a runtime system that provides dynamic scheduling and memory management support for heterogeneous, accelerator-based systems. We have implemented support for StarPU as a possible backend for SkePU while keeping the generic SkePU interface intact. The mapping of a SkePU skeleton call to one or more StarPU tasks allows StarPU to exploit independence between different skeleton calls as well as within a single skeleton call. Support for different StarPU features, such as data partitioning and different scheduling policies (e.g., history-based performance models), is implemented and discussed in this paper. The integration proved beneficial for both StarPU and SkePU: StarPU gained a high-level interface for running data-parallel computations, while SkePU gained dynamic scheduling and hybrid parallelism support. Several benchmarks including an ODE solver, separable Gaussian blur filter, Successive Over-Relaxation (SOR) and Coulombic potential are implemented. Initial experiments show that we can even achieve super-linear speedups for realistic applications and can observe clear improvements in performance with the simultaneous use of both CPUs and GPU (hybrid execution).

  • 30.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Adaptive Implementation Selection in the SkePU Skeleton Programming Library2013In: Advanced Parallel Processing Technologies (APPT-2013), Proceedings / [ed] Chengyung Wu and Albert Cohen (eds.), 2013, p. 170-183Conference paper (Refereed)
    Abstract [en]

    In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four types of implementations (implementation variants) for each skeleton: serial C++, OpenMP, CUDA and OpenCL targeting either CPU or GPU execution respectively. Deciding which implementation would run faster for a given skeleton call depends upon the computation, problem size(s), system architecture and data locality.

    In this paper, we present our work on automatic selection between these implementation variants by an offline machine learning method which generates a compact decision tree with low training overhead. The proposed selection mechanism is flexible yet high-level allowing a skeleton programmer to control different training choices at a higher abstraction level. We have evaluated our optimization strategy with 9 applications/kernels ported to our skeleton library and achieve on average more than 94% (90%) accuracy with just 0.53% (0.58%) training space exploration on two systems. Moreover, we discuss one application scenario where local optimization considering a single skeleton call can prove sub-optimal, and propose a heuristic for bulk implementation selection considering more than one skeleton call to address such application scenarios.
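
    The outcome of such offline training is a compact decision function over call properties. The toy stand-in below illustrates the shape of a generated decision tree; the thresholds and variant names are invented, whereas a real tree is learned from timed training runs.

    ```cpp
    #include <cstddef>
    #include <string>

    // Toy stand-in for an offline-trained decision tree that maps call
    // context (here: problem size and GPU availability) to one of the
    // skeleton library's implementation variants. Thresholds are invented.
    std::string select_variant(std::size_t n, bool gpu_available) {
        if (!gpu_available)
            return n < 10000 ? "serial_cpp" : "openmp";
        return n < 100000 ? "openmp" : "cuda";
    }
    ```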

  • 31.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    The PEPPHER composition tool: performance-aware composition for GPU-based systems2014In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no 12, p. 1195-1211Article in journal (Refereed)
    Abstract [en]

    The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.

  • 32.
    Dastgeer, Usman
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-Based Systems2012In: High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, IEEE, 2012, p. 711-720Conference paper (Refereed)
    Abstract [en]

    The PEPPHER component model defines an environment for annotation of native C/C++ based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units might be provided, together with metadata such as explicitly exposed tunable parameters. The goal is to compose an application from its components and variants such that, depending on the run-time context, the most suitable implementation variant will be chosen automatically for each invocation. We describe and evaluate the PEPPHER composition tool, which explores the application's components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. With several applications, we demonstrate how the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath.

  • 33.
    Ericsson, Morgan
    et al.
    MSI Universitet Växjö, Sweden.
    Löwe, Welf
    MSI Universitet Växjö, Sweden.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Andersson, Jesper
    MSI Universitet Växjö, Sweden.
    Composition and Optimization2008In: Int. Workshop on Component-Based High Performance Computing CBHPC-2008,2008, New York, USA: ACM , 2008Conference paper (Refereed)
  • 34.
    Eriksson, Mattias
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Integrated Code Generation for Loops2012In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 11, no 1Article in journal (Refereed)
    Abstract [en]

    Code generation in a compiler is commonly divided into several phases: instruction selection, scheduling, register allocation, spill code generation, and, in the case of clustered architectures, cluster assignment. These phases are interdependent; for instance, a decision in the instruction selection phase affects how an operation can be scheduled. We examine the effect of this separation of phases on the quality of the generated code. To study this we have formulated optimal methods for code generation with integer linear programming; first for acyclic code, and then we extend this method to modulo scheduling of loops. In our experiments we compare optimal modulo scheduling, where all phases are integrated, to modulo scheduling where instruction selection and cluster assignment are done in a separate phase. The results show that, for an architecture with two clusters, the integrated method finds a better solution than the nonintegrated method for 27% of the instances.

  • 35.
    Eriksson, Mattias
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Integrated modulo scheduling for clustered VLIW architectures2009In: High Performance Embedded Architectures and Compilers: Fourth International Conference, HiPEAC 2009, Paphos, Cyprus, January 25-28, 2009. Proceedings / [ed] André Seznec, Joel Emer, Michael O’Boyle, Margaret Martonosi and Theo Ungerer, Springer Berlin/Heidelberg, 2009, Vol. 5409 LNCS, p. 65-79Chapter in book (Refereed)
    Abstract [en]

    We solve the problem of integrating modulo scheduling with instruction selection (including cluster assignment), instruction scheduling and register allocation, with optimal spill code generation and scheduling. Our method is based on integer linear programming. We prove that our algorithm delivers optimal results in finite time for a certain class of architectures. We believe that these results are interesting both from a theoretical point of view and as a reference point when devising heuristic methods.

  • 36.
    Eriksson, Mattias
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Integrated Offset Assignment2011In: Proceedings 9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9) / [ed] George Cai and Tom van der Aa, 2011, p. 47-54Conference paper (Refereed)
    Abstract [en]

    One important part of generating code for DSP processors is to make good use of the address generation unit (AGU). In this paper we divide the code generation into three parts: (1) scheduling, (2) address register assignment, and (3) storage layout. The goal is to find out if solving these three subproblems as one big integrated problem gives better results compared to when scheduling or address register assignment is solved separately. We present optimal dynamic programming algorithms for both integrated and non-integrated code generation for DSP processors. In our experiments we find that integration is beneficial when the AGU has 1 or 2 address registers; for the other cases existing heuristics are near optimal. We also find that integrating address register assignment and storage layout gives slightly better results than integrating scheduling and storage layout. I.e., address register assignment is more important than scheduling.

  • 37.
    Eriksson, Mattias
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Chalabine, Mikhail
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Load balancing of irregular parallel divide-and-conquer algorithms in group-SPMD programming environments2006In: 8th Workshop on Parallel Systems and Algorithms PASA 2006 / [ed] Wolfgang Karl, Jürgen Becker, Karl-Erwin Großpietsch, Christian Hochberger and Erik Maehle, Frankfurt/Main, Germany, 2006, p. 313-322Conference paper (Refereed)
    Abstract [en]

    We study strategies for local load balancing of irregular parallel divide-and-conquer algorithms such as Quicksort and Quickhull in SPMD-parallel environments such as MPI and Fork that allow exploiting nested parallelism by dynamic group splitting. We propose two new local strategies, repivoting and serialisation, and develop a hybrid local load balancing strategy, which is calibrated by parameters that are derived off-line from a dynamic programming optimisation. While the approach is generic, we have implemented and evaluated our method for two very different parallel platforms. We found that our local strategy is superior to global dynamic load balancing on a Linux cluster, while the latter performs better on a tightly synchronised shared-memory platform with nonblocking, cheap task queue access.

  • 38.
    Eriksson, Mattias
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Skoog, Oskar
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Optimal vs. Heuristic Integrated Code Generation for Clustered VLIW Architectures.2008In: 11th ACM SIGBED Int. Workshop on Software and Compilers for Embedded Systems SCOPES 2008,2008, New York, USA: Association for Computing Machinery (ACM), 2008, p. 11-20Conference paper (Refereed)
    Abstract [en]

    In this paper we present two algorithms for integrated code generation for clustered VLIW architectures. One algorithm is a heuristic based on genetic algorithms; the other is based on integer linear programming. The performance of the algorithms is compared on a portion of the Mediabench benchmark suite. We found the results of the genetic algorithm to be within one or two clock cycles of optimal for the cases where the optimum is known. In addition, the heuristic algorithm produces results in predictable time even when the optimal integer linear program fails.

  • 39.
    Ernstsson, August
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    SkePU 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems2018In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 46, no 1, p. 62-80Article in journal (Refereed)
    Abstract [en]

    In this article we present SkePU 2, the next generation of the SkePU C++ skeleton programming framework for heterogeneous parallel systems. We critically examine the design and limitations of the SkePU 1 programming interface. We present a new, flexible and type-safe interface for skeleton programming in SkePU 2, and a source-to-source transformation tool which knows about SkePU 2 constructs such as skeletons and user functions. We demonstrate how the source-to-source compiler transforms programs to enable efficient execution on parallel heterogeneous systems. We show how SkePU 2 enables new use-cases and applications by increasing the flexibility from SkePU 1, and how programming errors can be caught earlier and more easily thanks to improved type safety. We propose a new skeleton, Call, unique in the sense that it does not impose any predefined skeleton structure and can encapsulate arbitrary user-defined multi-backend computations. We also discuss how the source-to-source compiler can enable a new optimization opportunity by selecting among multiple user function specializations when building a parallel program. Finally, we show that the performance of our prototype SkePU 2 implementation closely matches that of SkePU 1.
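
    The type-safety gain from building skeletons on user functions can be illustrated by the simplest skeleton, Map, written as a generic C++ higher-order function. This is a sketch of the idea only, not the actual SkePU 2 interface: the compiler checks the user function's type against the element type, so an ill-typed user function is a compile-time error rather than a runtime surprise.

    ```cpp
    #include <vector>

    // Map skeleton as a generic higher-order function (sketch, not SkePU 2).
    // A multi-backend library would dispatch this loop to OpenMP, CUDA or
    // OpenCL; the data-parallel structure is what the skeleton captures.
    template <typename T, typename F>
    std::vector<T> map_skeleton(const std::vector<T>& in, F userfunc) {
        std::vector<T> out;
        out.reserve(in.size());
        for (const T& x : in)
            out.push_back(userfunc(x));   // apply user function per element
        return out;
    }
    ```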

  • 40.
    Forsell, Martti
    et al.
    Platform Architectures Team, VTT Technical Research Centre of Finland, Finland.
    Hansson, Erik
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Mäkelä, Jari-Matti
    Information Technology, University of Turku, Finland.
    Leppänen, Ville
    Information Technology, University of Turku, Finland.
    Hardware and Software Support for NUMA Computing on Configurable Emulated Shared Memory Architectures2013In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE conference proceedings, 2013, p. 640-647Conference paper (Refereed)
    Abstract [en]

    The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of computation to programmers as well as avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with sequential legacy code or the NUMA mechanism. Unfortunately the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. by leaving the memories for different modes isolated, and therefore the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA-shared memory access mechanisms and the software ones provide a mechanism to integrate NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

  • 41.
    Forsell, Martti
    et al.
    Platform Architectures Team, VTT Technical Research Centre of Finland, Finland.
    Hansson, Erik
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Mäkelä, Jari-Matti
    Department of Information Technology, University of Turku, Finland.
    Leppänen, Ville
    Department of Information Technology, University of Turku, Finland.
    NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures2014In: International Journal of Networking and Computing, ISSN 2185-2839, E-ISSN 2185-2847, Vol. 4, no 1, p. 189-206Article in journal (Refereed)
    Abstract [en]

    The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of computation to programmers as well as avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with sequential legacy code and the NUMA mechanism. Unfortunately the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. by leaving the memories for different modes isolated, and therefore the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

  • 42.
    Hansson, Erik
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Alnervik, Erik
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Forsell, Martti
    VTT Technical Research Centre of Finland.
    A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs2014In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Lübeck, Germany: VDE Verlag GmbH, 2014, p. 27-33Conference paper (Refereed)
    Abstract [en]

    The performance of current multicore CPUs and GPUs is limited in computations making frequent use of communication/synchronization between the subtasks executed in parallel. This is because the directory-based cache systems scale weakly and/or the cost of synchronization is high. The Emulated Shared Memory (ESM) architectures relying on multithreading and efficient synchronization mechanisms have been developed to solve these problems affecting both performance and programmability of current machines. In this paper, we preliminarily compare the performance of three hardware-implemented ESM architectures with state-of-the-art multicore CPUs and GPUs. The benchmarks are selected to cover different patterns of parallel computation and therefore reveal the performance potential of ESM architectures with respect to current multicores.

  • 43.
    Hansson, Erik
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Optimized selection of runtime mode for the reconfigurable PRAM-NUMA architecture REPLICA using machine-learning2014In: Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II / [ed] Luis Lopes et al., Springer-Verlag New York, 2014, p. 133-145Conference paper (Refereed)
    Abstract [en]

    The massively hardware multithreaded VLIW emulated shared memory (ESM) architecture REPLICA has a dynamically reconfigurable on-chip network that offers two execution modes: PRAM and NUMA. PRAM mode is mainly suitable for applications with a high amount of thread level parallelism (TLP) while NUMA mode is mainly for accelerating execution of sequential programs or programs with low TLP. Also, some types of regular data parallel algorithms execute faster in NUMA mode. It is not obvious in which mode a given program region shows the best performance. In this study we focus on generic stencil-like computations exhibiting regular control flow and memory access pattern. We use two state-of-the-art machine-learning methods, C5.0 (decision trees) and Eureqa Pro (symbolic regression), to select which mode to use. We use these methods to derive different predictors based on the same training data and compare their results. The best derived predictors reach 95% accuracy and are generated by both C5.0 and Eureqa Pro, although the latter can in some cases be more sensitive to the training data. The average speedup gained due to mode switching ranges between 1.92 and 2.23 for all generated predictors on the evaluation test cases, and using a majority voting algorithm, based on the three best predictors, we can eliminate all misclassifications.
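
    The voting step over the three best predictors can be sketched as a simple majority function; the predictors themselves are stand-ins for the trained C5.0/Eureqa Pro models, and the boolean encoding of the two modes is an illustrative assumption.

    ```cpp
    // Majority vote among three mode predictors for a program region
    // (true = run in PRAM mode, false = run in NUMA mode). A single
    // predictor may misclassify; the majority can correct it.
    bool vote_mode(bool p1, bool p2, bool p3) {
        return (static_cast<int>(p1) + p2 + p3) >= 2;
    }
    ```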

  • 44.
    Hansson, Erik
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Optimized variant-selection code generation for loops on heterogeneous multicore systems2016In: Parallel Computing: On the Road to Exascale / [ed] Gerhard R. Joubert, Hugh Leather, Mark Parsons, Frans Peters, Mark Sawyer, IOS Press, 2016, p. 103-112Chapter in book (Refereed)
    Abstract [en]

    We consider the general problem of generating code for the automated selection of the expected best implementation variants for multiple subcomputations on a heterogeneous multicore system, where the program's control flow between the subcomputations is structured by sequencing and loops. A naive greedy approach, as applied in previous work on multi-variant selection code generation, would determine the locally best variant for each subcomputation instance but might miss globally better solutions. We present a formalization and a fast algorithm for the global variant selection problem for loop-based programs. We also show that loop unrolling can additionally improve performance, and prove an upper bound on the unroll factor which allows keeping the run-time space overhead for the variant-dispatch data structure low. We evaluate our method in case studies using an ARM big.LITTLE-based system and a GPU-based system, where we consider optimization for both energy and performance.
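The gap between greedy and global variant selection can be seen in a toy model. The costs and the fixed switch penalty below are hypothetical, and the dynamic program is only a sketch of the general idea (the paper's actual formalization and cost model differ): when switching variants between consecutive subcomputations costs something, e.g. for data movement, the locally cheapest variant per step is not necessarily globally optimal.

```python
# Sketch: greedy (locally best) vs. global dynamic-programming variant
# selection over a sequence of subcomputations, with a hypothetical fixed
# cost for switching variants (e.g. moving data between CPU and GPU).

SWITCH_COST = 5  # hypothetical cost of changing variant between steps

# Hypothetical per-variant execution cost of each subcomputation.
costs = [
    {"cpu": 4, "gpu": 3},   # gpu slightly faster here...
    {"cpu": 4, "gpu": 3},
    {"cpu": 2, "gpu": 10},  # ...but cpu much faster here
]

def greedy_total(costs):
    """Pick the locally cheapest variant at every step."""
    total, prev = 0, None
    for step in costs:
        best = min(step, key=step.get)
        total += step[best] + (SWITCH_COST if prev not in (None, best) else 0)
        prev = best
    return total

def global_total(costs):
    """DP: best[v] = minimum cost of the prefix ending in variant v."""
    best = dict(costs[0])
    for step in costs[1:]:
        best = {v: min(best[u] + (0 if u == v else SWITCH_COST)
                       for u in best) + c
                for v, c in step.items()}
    return min(best.values())

print(greedy_total(costs))  # 13: greedy pays a switch before the last step
print(global_total(costs))  # 10: staying on the CPU throughout is cheaper
```

Here greedy picks the GPU twice, then must pay the switch cost to reach the much faster CPU variant, while the global solution runs everything on the CPU.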

  • 45.
    Johansson, Daniel
    et al.
    NSC Linköpings universitet.
    Eriksson, Mattias
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Bulk-synchronous parallel computing on the CELL processor2007In: PARS-2007 21. PARS - Workshop, Hamburg, Germany, May 31-Jun 1, 2007. GI/ITG-Fachgruppe Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware PARS.,2007, GI Gesellschaft für Informatik e.V. , 2007Conference paper (Refereed)
    Abstract [en]

    In order to ease programming of heterogeneous architectures with explicitly managed memory hierarchies such as the CELL processor, we propose a solution adopting the BSP model as implemented in the parallel programming language NestStep. This allows the programmer to write programs with a global address space and run them on the slave processors (SPEs) of CELL while keeping the large data structures in main memory. We have implemented the run-time system of NestStep on CELL and report on performance results that demonstrate the feasibility of the approach. The test programs scale very well as their execution time is mostly dominated by calculations, and only a fraction is spent in the various parts of the NestStep runtime system library. The library also has a relatively small memory footprint in the SPE's software managed local memory.
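The BSP execution model underlying NestStep can be sketched in a few lines: computation proceeds in supersteps, each consisting of local computation, communication of updates to shared variables, and a barrier after which all updates become visible. The toy simulation below runs the "processors" sequentially in one process; the real NestStep runtime on Cell distributes the work across the SPEs, so all names here are illustrative only.

```python
# Minimal sequential sketch of BSP supersteps with combining of shared
# updates at the barrier. This only illustrates the programming model,
# not the NestStep runtime system itself.

def bsp_run(num_procs, supersteps, shared):
    """Run a list of superstep functions; each takes (pid, snapshot) and
    returns a dict of additive updates to shared variables, which are
    combined at the barrier before the next superstep begins."""
    for step in supersteps:
        snapshot = dict(shared)  # every processor sees the same state
        pending = [step(pid, snapshot) for pid in range(num_procs)]
        for updates in pending:  # barrier: merge all updates
            for key, val in updates.items():
                shared[key] = shared.get(key, 0) + val
    return shared

# Example: a parallel sum over a conceptually global address space.
data = list(range(8))

def partial_sum(pid, shared):
    chunk = data[pid * 2:(pid + 1) * 2]  # each processor owns a slice
    return {"total": sum(chunk)}

result = bsp_run(4, [partial_sum], {"total": 0})
print(result["total"])  # 28
```

The snapshot-then-combine structure mirrors the BSP guarantee that within a superstep all processors read consistent values and writes take effect only at the barrier.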

  • 46.
    Keller, Jörg
    et al.
    Dept. of Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Optimized Pipelined Parallel Merge Sort on the Cell BE2008In: 2nd Int. Workshop on Highly Parallel Processing on a Chip HPPC-2008,2008, Berlin: Springer , 2008Conference paper (Refereed)
    Abstract [en]

    Chip multiprocessors designed for streaming applications such as Cell BE offer impressive peak performance but suffer from limited bandwidth to off-chip main memory. As the number of cores is expected to rise further, this bottleneck will become more critical in the coming years. Hence, memory-efficient algorithms are required. As a case study, we investigate parallel sorting on Cell BE as a problem of great importance and as a challenge where the ratio between computation and memory transfer is very low. Our previous work led to a parallel mergesort that reduces memory bandwidth requirements by pipelining between SPEs, but the allocation of SPEs was rather ad-hoc. In our present work, we investigate mappings of merger nodes to SPEs. The mappings are designed to provide optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multi-objective optimization problem by deriving an integer linear programming formulation and compute Pareto-optimal solutions for the mapping of merge trees with up to 127 merger nodes. For mapping larger trees, we give a fast divide-and-conquer based approximation algorithm. We evaluate the sorting algorithm resulting from our mappings by a discrete event simulation.

  • 47.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Optimized pipelined parallel merge sort on the cell BE2009In: Euro-Par 2008 Workshops - Parallel Processing: VHPC 2008, UNICORE 2008, HPPC 2008, SGS 2008, PROPER 2008, ROIA 2008, and DPA 2008, Las Palmas de Gran Canaria, Spain, August 25-26, 2008, Revised Selected Papers / [ed] Eduardo César, Michael Alexander, Achim Streit, Jesper Larsson Träff, Christophe Cérin, Andreas Knüpfer, Dieter Kranzlmüller and Shantenu Jha, Springer Berlin/Heidelberg, 2009, Vol. 5415 LNCS, p. 131-140Chapter in book (Refereed)
    Abstract [en]

    Chip multiprocessors designed for streaming applications such as Cell BE offer impressive peak performance but suffer from limited bandwidth to off-chip main memory. As the number of cores is expected to rise further, this bottleneck will become more critical in the coming years. Hence, memory-efficient algorithms are required. As a case study, we investigate parallel sorting on Cell BE as a problem of great importance and as a challenge where the ratio between computation and memory transfer is very low. Our previous work led to a parallel mergesort that reduces memory bandwidth requirements by pipelining between SPEs, but the allocation of SPEs was rather ad-hoc. In our present work, we investigate mappings of merger nodes to SPEs. The mappings are designed to provide optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multi-objective optimization problem by deriving an integer linear programming formulation and compute Pareto-optimal solutions for the mapping of merge trees with up to 127 merger nodes. For mapping larger trees, we give a fast divide-and-conquer based approximation algorithm. We evaluate the sorting algorithm resulting from our mappings by a discrete event simulation.
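The mapping trade-off that the abstract's ILP optimizes can be illustrated by scoring a given assignment of merge-tree nodes to cores on two of the named objectives: load balance (maximum per-core work) and communication (data crossing core boundaries). The volume model below (each node processes the data of its subtree, i.e. `total / 2^depth`) and the heap-style node numbering are simplifying assumptions for the sketch; the paper's actual multi-objective ILP also accounts for buffer memory.

```python
# Sketch: evaluate a node-to-core mapping of a binary merge tree by
# (max per-core load, cross-core communication volume). Simplified model.

def merge_tree(levels):
    """Binary merge tree as {node: (left, right)}; leaves have None.
    Heap-style ids: root is 1, children of n are 2n and 2n+1."""
    size = 2 ** levels
    return {n: ((2 * n, 2 * n + 1) if 2 * n < size else None)
            for n in range(1, size)}

def node_volume(node, total=1.0):
    # A node at depth d processes total / 2^d of the data.
    depth = node.bit_length() - 1
    return total / (2 ** depth)

def evaluate(tree, mapping):
    load = {}
    comm = 0.0
    for node, children in tree.items():
        core = mapping[node]
        load[core] = load.get(core, 0.0) + node_volume(node)
        if children:
            for child in children:
                if mapping[child] != core:
                    comm += node_volume(child)  # child's output crosses cores
    return max(load.values()), comm

tree = merge_tree(3)  # 7 merger nodes
# Put the root and its left subtree on core 0, the right subtree on core 1.
mapping = {1: 0, 2: 0, 4: 0, 5: 0, 3: 1, 6: 1, 7: 1}
print(evaluate(tree, mapping))  # (2.0, 0.5): imbalanced load, little traffic
```

Different mappings move along the trade-off curve: spreading the root's subtree across cores balances load but raises on-chip traffic, which is exactly the Pareto front the ILP explores.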

  • 48.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Fak. Math. und Informatik, Hagen, Germany.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    König, Kalle
    Technische Universität Darmstadt, FB Informatik, Darmstadt, Germany.
    Heenes, Wolfgang
    Technische Universität Darmstadt, FB Informatik, Darmstadt, Germany.
    Hybrid Parallel Sort on the Cell Processor2008In: 9th Workshop on Parallel Systems and Algorithms (PASA) / [ed] Wolfgang E. Nagel, Rolf Hoffmann, Andreas Koch, Bonn, Germany: Gesellschaft für Informatik, 2008, p. 107-112Conference paper (Refereed)
    Abstract [en]

    Sorting large data sets has always been an important application and hence has been one of the benchmark applications on new parallel architectures. We present a parallel sorting algorithm for the Cell processor that combines elements of bitonic sort and merge sort and reduces the bandwidth to main memory by pipelining. We present runtime results of a partial prototype implementation and simulation results for the complete sorting algorithm, which promise performance advantages over previous implementations.
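The hybrid structure described above can be sketched sequentially: sort small fixed-size blocks with a bitonic sorting network, whose data-independent compare-exchange pattern maps well to SIMD units such as the Cell's SPEs, then combine the blocks by merging. This is a plain single-threaded illustration of the combination of the two algorithms, not the pipelined Cell implementation.

```python
# Sketch of a hybrid sort: bitonic sort for power-of-two-sized blocks,
# merge sort to combine the sorted blocks. Sequential illustration only.
import heapq

def bitonic_sort(block):
    """In-place iterative bitonic sorting network; len must be a power of 2."""
    n = len(block)
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Compare-exchange in the direction of this subsequence.
                    if (block[i] > block[partner]) == ascending:
                        block[i], block[partner] = block[partner], block[i]
            j //= 2
        k *= 2
    return block

def hybrid_sort(data, block=4):
    blocks = [bitonic_sort(data[i:i + block])
              for i in range(0, len(data), block)]
    return list(heapq.merge(*blocks))

print(hybrid_sort([7, 3, 6, 2, 5, 1, 8, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The bitonic network executes the same comparisons regardless of the input values, which is what makes it attractive for SIMD execution despite doing more comparisons than mergesort.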

  • 49.
    Keller, Jörg
    et al.
    Dept. of Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Träff, Jesper
    NEC Europe Ltd, NEC CC Research Center, St. Augustin, Germany.
    Practical PRAM Programming2001 (ed. 1)Book (Other academic)
    Abstract [en]

    Although the PRAM (Parallel Random Access Machine) is a well-known topic in parallel computing, its practical application has rarely been explored. This groundbreaking work changes all that. Written by world experts on this technology, it explains how to use the PRAM model to design algorithms for parallel computers and includes a number of PRAM implementations. Readers can also use the book as a self-study guide to parallel programming in general.
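A classic example of the PRAM style of algorithm design is the logarithmic-round prefix-sums computation: in each synchronous step, every "processor" reads the previous round's array and writes one element of the next. The sequential simulation below is only an illustration of the lockstep step semantics, not code from the book.

```python
# Sketch of a PRAM-style inclusive prefix sum in O(log n) synchronous
# rounds. All processors conceptually execute each round in lockstep,
# reading the old array and writing the new one.

def pram_prefix_sums(values):
    a = list(values)
    shift = 1
    while shift < len(a):
        # One PRAM step: every position i reads a[i] and a[i - shift]
        # from the previous round, then all writes take effect at once.
        a = [a[i] + (a[i - shift] if i >= shift else 0)
             for i in range(len(a))]
        shift *= 2
    return a

print(pram_prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

The read-all-then-write-all structure is the defining feature of the PRAM model: no processor observes a partially updated round, which is what makes such algorithms easy to reason about.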

  • 50.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Germany.
    Kessler, Christoph W.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Hultén, Rikard
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy2012In: Journal of Universal Computer Science, ISSN 0948-695X, Vol. 18, no 14, p. 1987-2023Article in journal (Refereed)
    Abstract [en]

    Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization.

    On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead.

    In this article, we consider parallel mergesort as a representative memory-intensive application in detail, and focus on the global merging phase, which is dominating the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique, and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping of other streaming task graphs.

    We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that, for realistic problem sizes, optimized on-chip pipelining indeed speeds up merging by up to 70% on QS20 and by 143% on PS3 compared to the merge phase of CellSort, which was previously the fastest merge sort implementation on Cell.
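The on-chip pipelining idea described above can be sketched with lazy streams: each merger node consumes its children's outputs element by element, so intermediate results are forwarded directly rather than written back to main memory, and only small buffers are needed per node. Python generators stand in for the bounded forwarding channels between cores; this is a model of the data flow, not of the DMA-based Cell implementation.

```python
# Sketch of on-chip pipelining as a tree of streaming mergers: partial
# results flow between nodes lazily instead of through main memory.
import heapq

def merger(left, right):
    """A merge node: lazily combine two sorted streams into one."""
    yield from heapq.merge(left, right)

def merge_tree(sorted_runs):
    """Build a tree of streaming mergers pairwise over the sorted runs."""
    streams = [iter(run) for run in sorted_runs]
    while len(streams) > 1:
        streams = [merger(streams[i], streams[i + 1])
                   if i + 1 < len(streams) else streams[i]  # odd one passes up
                   for i in range(0, len(streams), 2)]
    return streams[0]

runs = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
print(list(merge_tree(runs)))  # [1, 2, 3, ..., 12]
```

Because every generator produces elements on demand, pulling one element from the root pulls at most one element through each level of the tree, which models how pipelining bounds per-node buffer space while avoiding round trips to off-chip memory.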
