  • 51.
    Keller, Jörg
    et al.
    Dept. of Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Träff, Jesper
    NEC Europe Ltd, NEC CC Research Center, St. Augustin, Germany.
    Practical PRAM Programming. 2001 (ed. 1). Book (Other academic)
    Abstract [en]

    Although PRAM (Parallel Random Access Memory) is a well-known topic in parallel computing, its practical application has rarely been explored. This groundbreaking work changes all that. Written by world experts on this technology, it explains how to use PRAM to design algorithms for parallel computers and includes a number of PRAM implementations. Readers can also use the book as a self-study guide to parallel programming in general.

  • 52.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Germany.
    Kessler, Christoph W.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Hultén, Rikard
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy. 2012. In: Journal of Universal Computer Science, ISSN 0948-695X, Vol. 18, no 14, p. 1987-2023. Article in journal (Refereed)
    Abstract [en]

    Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization.

    On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead.

    In this article, we consider parallel mergesort as a representative memory-intensive application in detail, and focus on the global merging phase, which is dominating the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique, and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping of other streaming task graphs.

    We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that, for realistic problem sizes, optimized on-chip pipelining indeed speeds up merging times by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was until now the fastest merge sort implementation on Cell.

  • 53.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Germany.
    Majeed, Mudassar
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Balancing CPU Load for Irregular MPI Applications. 2012. In: Advances in Parallel Computing: Applications, Tools and Techniques on the Road to Exascale Computing / [ed] Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Joubert, David Padua, Frans Peters, Mark Sawyer, IOS Press, 2012, p. 307-316. Conference paper (Refereed)
    Abstract [en]

    MPI applications typically are designed to be run on a parallel machine with one process per core. If processes exhibit different computational load, either the code must be rewritten for load balancing, with negative side-effects on readability and maintainability, or the one-process-per-core philosophy leads to a low utilization of many processor cores. If several processes are mapped per core to increase CPU utilization, the load might still be unevenly distributed among the cores if the mapping is unaware of the process characteristics.

    Therefore, similarly to the MPI_Graph_create() function where the program gives hints on communication patterns so that MPI processes can be placed favorably, we propose a MPI_Load_create() function where the program supplies information on the relative loads of the MPI processes, such that processes can be favorably grouped and mapped onto processor cores. In order to account for scalability and restricted knowledge of individual MPI processes, we also propose an extension MPI_Dist_load_create() similar to MPI_Dist_graph_create(), where each individual MPI process only knows the loads of a subset of the MPI processes.

    We detail how to implement both variants on top of MPI, and provide experimental performance results both for synthetic and numeric example applications. The results indicate that load balancing is favorable in both cases.

  • 54.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    A practical access to the theory of parallel algorithms. 2004. In: ACM SIGCSE04 Symposium on Computer Science Education, 2004, 2004. Conference paper (Refereed)
  • 55.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Compiling for VLIW DSPs. 2010. In: Handbook of signal processing systems / [ed] Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer Leupers, Jarmo Takala, New York: Springer, 2010, 1, p. 603-638. Chapter in book (Refereed)
  • 56.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Compiling for VLIW DSPs. 2013. In: Handbook of signal processing systems / [ed] Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer Leupers, Jarmo Takala, New York: Springer, 2013, 2, p. 1177-1214. Chapter in book (Refereed)
    Abstract [en]

    This handbook, organized into four parts, provides the reader with a comprehensive and standalone overview of signal processing systems. It contains a comprehensive index for ease of use, and an extensive bibliography for further reading.

  • 57.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Managing distributed shared arrays in a bulk-synchronous parallel programming environment. 2004. In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 16, no 2-3, p. 133-153. Article in journal (Refereed)
    Abstract [en]

    NestStep is a parallel programming language for the BSP (bulk-synchronous parallel) programming model. In this article we describe the concept of distributed shared arrays in NestStep and its implementation on top of MPI. In particular, we present a novel method for runtime scheduling of irregular, direct remote accesses to sections of distributed shared arrays. Our method, which is fully parallelized, uses conventional two-sided message passing and thus avoids the overhead of a standard implementation of direct remote memory access based on one-sided communication. The main prerequisite is that the given program is structured in a BSP-compliant way. Copyright (C) 2004 John Wiley & Sons, Ltd.

  • 58.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Programming Techniques for the Cell Processor. 2011. In: it - Information Technology, ISSN 1611-2776, Vol. 53, no 2, p. 66-74. Article in journal (Refereed)
    Abstract [en]

    Cell Broadband Engine is a heterogeneous multicore processor designed mainly for applications in scientific computing, graphics, and gaming with high performance requirements. We give an overview of its architecture, review some selected development tools and programming frameworks, and describe techniques for writing efficient programs for Cell.

  • 59.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Programming the Cell Processor. 2010. In: Fundamentals of Multicore Software Development / [ed] Victor Pankratius, Ali-Reza Adl-Tabatabai, Walter Tichy, CRC Press, 2010, p. 155-198. Chapter in book (Other academic)
  • 60.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    StASy: Datorstödd administration för stora studierektorsområden. 2006. In: Centrum för Undervisning och Lärande, CUL-report no. 10: Nya villkor för lärande och undervisning. 9:e Universitetspedagogiska konferensen vid Linköpings Universitet, 2006, Linköping, Sweden: Linköpings universitet, 2006, p. 103-113. Conference paper (Other academic)
  • 61.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Teaching parallel programming early. 2006. In: Workshop on Developing Computer Science Education -- How Can It Be Done, 2006, Linköping, Sweden: Linköpings universitet, 2006. Conference paper (Refereed)
  • 62.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Bednarski, Andrzej
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    A Dynamic Programming Approach to Optimal Integrated Code Generation. 2001. In: ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems LCTES2001, 2001, New York, USA: ACM, 2001, p. 165-174. Conference paper (Refereed)
    Abstract [en]

    Phase-decoupled methods for code generation are the state of the art in compilers for standard processors but generally produce code of poor quality for irregular target architectures such as many DSPs. In that case, the generation of efficient code requires the simultaneous solution of the main subproblems instruction selection, instruction scheduling, and register allocation, as an integrated optimization problem. In contrast to compilers for standard processors, code generation for DSPs can afford to spend much higher resources in time and space on optimizations. Today, most approaches to optimal code generation are based on integer linear programming, but these are either not integrated or not able to produce optimal solutions except for very small problem instances. We report on research in progress on a novel method for fully integrated code generation that is based on dynamic programming. In particular, we introduce the concept of a time profile. We focus on the basic block level where the data dependences among the instructions form a DAG. Our algorithm aims at combining time-optimal scheduling with optimal instruction selection, given a limited number of general-purpose registers. An extension for irregular register sets, spilling of register contents, and intricate structural constraints on code compaction based on register usage is currently under development, as well as a generalization for global code generation. A prototype implementation is operational, and we present first experimental results that show that our algorithm is practical also for medium-size problem instances. Our implementation is intended to become the core of a future, retargetable code generation system.

  • 63.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Bednarski, Andrzej
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Classification and generation of schedules for VLIW processors. 2006. In: 2th Int. Workshop on Compilers for Parallel Computers, 2006, 2006, p. 60-. Conference paper (Refereed)
  • 64.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Bednarski, Andrzej
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Optimal integrated code generation for clustered VLIW architectures. 2002. In: joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems LCTES-SCOPES02, 2002, New York, USA: ACM, 2002, p. 102-. Conference paper (Refereed)
    Abstract [en]

    In contrast to standard compilers, generating code for DSPs can afford spending considerable resources in time and space on optimizations. Generating efficient code for irregular architectures requires an integrated method that optimizes simultaneously for instruction selection, instruction scheduling, and register allocation. We describe a method for fully integrated optimal code generation based on dynamic programming. We introduce the concept of residence classes and space profiles, which allows us to describe and optimize for irregular register and memory structures. In order to obtain a retargetable framework we introduce a structured architecture description language, ADML, which is based on XML. We implemented a prototype of such a retargetable system for optimal code generation. Results for variants of the TI C62x show that our method can produce optimal solutions to small but nontrivial problem instances with a reasonable amount of time and space.

  • 65.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Bednarski, Andrzej
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Optimal integrated code generation for VLIW architectures. 2006. In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 18, no 11, p. 1353-1390. Article in journal (Refereed)
    Abstract [en]

    We present a dynamic programming method for optimal integrated code generation for basic blocks that minimizes execution time. It can be applied to single-issue pipelined processors, in-order-issue superscalar processors, VLIW architectures with a single homogeneous register set, and clustered VLIW architectures with multiple register sets. For the case of a single register set, our method simultaneously copes with instruction selection, instruction scheduling, and register allocation. For clustered VLIW architectures, we also integrate the optimal partitioning of instructions, allocation of registers for temporary variables, and scheduling of data transfer operations between clusters. Our method is implemented in the prototype of a retargetable code generation framework for digital signal processors (DSPs), called OPTIMIST. We present results for the processors ARM9E, TI C62x, and a single-cluster variant of C62x. Our results show that the method can produce optimal solutions for small and (in the case of a single register set) medium-sized problem instances with a reasonable amount of time and space. For larger problem instances, our method can be seamlessly changed into a heuristic. Copyright (c) 2006 John Wiley & Sons, Ltd.

  • 66.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Bednarski, Andrzej
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Eriksson, Mattias
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Classification and generation of schedules for VLIW processors. 2007. In: Concurrency, ISSN 1040-3108, E-ISSN 1096-9128, Vol. 19, p. 2369-2389. Article in journal (Refereed)
    Abstract [en]

    We identify and analyze different classes of schedules for instruction-level parallel processor architectures. The classes are induced by various common techniques for generating or enumerating them, such as integer linear programming or list scheduling with backtracking. In particular, we study the relationship between VLIW schedules and their equivalent linearized forms (which may be used, e.g., with superscalar processors), and we identify classes of VLIW schedules that can be created from a linearized form using an in-order VLIW compaction heuristic, which is just the static equivalent of the dynamic instruction dispatch algorithm of in-order issue superscalar processors. We formulate and give a proof of the dominance of greedy schedules for instruction-level parallel architectures where all instructions have multiblock reservation tables, and we show how scheduling anomalies can occur in the presence of instructions with non-multiblock reservation tables. We also show that, in certain situations, certain schedules generally cannot be constructed by incremental scheduling algorithms that are based on topological sorting of the data dependence graph. We also discuss properties of strongly linearizable schedules, out-of-order schedules and non-dawdling schedules, and show their relationships to greedy schedules and to general schedules. We summarize our findings as a hierarchy of classes of VLIW schedules. Finally we provide an experimental evaluation showing the sizes of schedule classes in the above hierarchy, for different benchmarks and example VLIW architectures, including a single-cluster version of the TI C62x DSP processor and variants of that. Our results can sharpen the interpretation of the term optimality used with various methods for optimal VLIW scheduling, and help to identify sets of schedules that can be safely ignored when searching for a time-optimal schedule.

  • 67.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers. 2014. In: Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany / [ed] F. Hannig and J. Teich, 2014, p. 43-48. Conference paper (Refereed)
    Abstract [en]

    In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially, GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among multiple implementation variants, generating code and providing runtime support for efficient execution on such systems.

  • 68.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Majeed, Mudassar
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Furmento, Nathalie
    University of Bordeaux, INRIA, Bordeaux, France.
    Thibault, Samuel
    University of Bordeaux, INRIA, Bordeaux, France.
    Namyst, Raymond
    University of Bordeaux, INRIA, Bordeaux, France.
    Benkner, Siegfried
    University of Vienna, Austria.
    Pllana, Sabri
    University of Vienna, Austria.
    Träff, Jesper
    Technical University of Vienna, Austria.
    Wimmer, Martin
    Technical University of Vienna, Austria.
    Leveraging PEPPHER Technology for Performance Portable Supercomputing. 2013. In: High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, Salt Lake City, USA: IEEE conference proceedings, 2013, p. 1395-1396. Conference paper (Other academic)
    Abstract [en]

    PEPPHER is a 3-year EU FP7 project that develops a novel approach and framework to enhance performance portability and programmability of heterogeneous multi-core systems. Its primary target is single-node heterogeneous systems, where several CPU cores are supported by accelerators such as GPUs. This poster briefly surveys the PEPPHER framework for single-node systems, and elaborates on the prospectives for leveraging the PEPPHER approach to generate performance-portable code for heterogeneous multi-node systems.

  • 69.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Thibault, Samuel
    INRIA / University of Bordeaux, France.
    Namyst, Raymond
    INRIA / University of Bordeaux, France.
    Richards, Andrew
    Codeplay Software Ltd., Edinburgh, UK.
    Dolinsky, Uwe
    Codeplay Software Ltd., Edinburgh, UK.
    Benkner, Siegfried
    University of Vienna, Austria.
    Träff, Jesper
    Technical University of Vienna, Austria.
    Pllana, Sabri
    University of Vienna, Austria.
    Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems. 2012. Conference paper (Refereed)
    Abstract [en]

    We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrational programming framework for heterogeneous multicore systems.

  • 70.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Eitschberger, Patrick
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Energy-Efficient Static Scheduling of Streaming Task Collections with Malleable Tasks. 2013. In: Proc. 25th PARS-Workshop, Gesellschaft für Informatik, 2013, p. 37-46. Conference paper (Refereed)
    Abstract [en]

    We investigate the energy-efficiency of streaming task collections with parallelizable or malleable tasks on a manycore processor with frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin on user level. A stream of data flows through the tasks and intermediate results are forwarded to other tasks like in a pipelined task graph. We first show the equivalence of task mapping for streaming task collections and normal task collections in the case of continuous frequency scaling, under reasonable assumptions for the user-level scheduler, if a makespan, i.e. a throughput requirement of the streaming application, is given and the energy consumed is to be minimized. We then show that in the case of discrete frequency scaling, it might be necessary for processors to switch frequencies, and that idle times still can occur, in contrast to continuous frequency scaling. We formulate the mapping of (streaming) task collections on a manycore processor with discrete frequency levels as an integer linear program. Finally, we propose two heuristics to reduce energy consumption compared to the previous results by improved load balancing through the parallel execution of a parallelizable task. We evaluate the effects of the heuristics analytically and experimentally on the Intel SCC.

  • 71.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Fritzson, Peter
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Eriksson, Mattias
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    NestStepModelica - Mathematical Modeling and Bulk-Synchronous Parallel Simulation. 2007. In: 8th International Workshop, PARA 2006, Umeå, Sweden, June 18-21, 2006, Revised Selected Papers, Berlin, Heidelberg: Springer, 2007, p. 1006-1015. Conference paper (Refereed)
    Abstract [en]

    Many parallel computing applications are used for simulation of complex engineering applications and/or for visualization. To handle their complexity, there is a need for raising the level of abstraction in specifying such applications using high level mathematical modeling techniques, such as the Modelica language and technology. However, with the increased complexity of modeled systems, it becomes increasingly important to use today's and tomorrow's parallel hardware efficiently. Automatic parallelization is convenient, but may need to be combined with easy-to-use methods for parallel programming. In this context, we propose to combine the abstraction power of Modelica with support for shared memory bulk-synchronous parallel programming including nested parallelism (NestStepModelica), which is both flexible (can be mapped to many different parallel architectures) and simple (offers a shared address space, structured parallelism, deterministic computation, and is deadlock-free). We describe NestStepModelica and report on first results obtained with a prototype implementation.

  • 72.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Hansson, Erik
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Flexible scheduling and thread allocation for synchronous parallel tasks. 2012. In: ARCS-2012 Workshops / [ed] G. Mühl, J. Richling, A. Herkersdorf, Gesellschaft für Informatik, 2012, p. 517-528. Conference paper (Refereed)
    Abstract [en]

    We describe a task model and dynamic scheduling and resource allocation mechanism for synchronous parallel tasks to be executed on SPMD-programmed synchronous shared memory MIMD parallel architectures with uniform, unit-time memory access and strict memory consistency, also known in the literature as PRAMs (Parallel Random Access Machines). Our task model provides a two-tier programming model for PRAMs that flexibly combines SPMD and fork-join parallelism within the same application. It offers flexibility by dynamic scheduling and late resource binding while preserving the PRAM execution properties within each task, the only limitation being that the maximum number of threads that can be assigned to a task is limited to what the underlying architecture provides. In particular, our approach opens for automatic performance tuning at run-time by controlling the thread allocation for tasks based on run-time predictions. By a prototype implementation of a synchronous parallel task API in the SPMD-based PRAM language Fork and experimental evaluation with example programs on the SBPRAM simulator, we show that a realization of the task model on a SPMD-programmable PRAM machine is feasible with moderate runtime overhead per task.

  • 73.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Keller, Jörg
    Mathematik und Informatik Fernuniversität Hagen, Germany.
    Models for Parallel Computing: Review and Perspectives2007In: Mitteilungen - Gesellschaft für Informatik e.V., Parallel-Algorithmen und Rechnerstrukturen, ISSN 0177-0454, Vol. 24, p. 13-29Article in journal (Other academic)
    Abstract [en]

    As parallelism on different levels becomes ubiquitous in today's computers, it seems worthwhile to provide a review of the wealth of models for parallel computation that have evolved over the last decades. We refrain from a comprehensive survey and concentrate on models with some practical relevance, together with a perspective on models with potential future relevance. Besides presenting the models, we also refer to languages, implementations, and tools.

  • 74.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Optimized On-Chip Pipelining of Memory-Intensive Computations on the Cell BE2008In: ACM SIGARCH Computer Architecture News, ISSN 0163-5964, Vol. 36, no 5, p. 36-45Article in journal (Refereed)
    Abstract [en]

    Multiprocessors-on-chip, such as the Cell BE processor, regularly suffer from restricted bandwidth to off-chip main memory. We propose to reduce memory bandwidth requirements, and thus increase performance, by expressing our application as a task graph, by running dependent tasks concurrently and by pipelining results directly from task to task where possible, instead of buffering in off-chip memory. To maximize bandwidth savings and balance load simultaneously, we solve a mapping problem of tasks to SPEs on the Cell BE. We present three approaches: an integer linear programming formulation that allows to compute Pareto-optimal mappings for smaller task graphs, general heuristics, and a problem-specific approximation algorithm. We validate the mappings for data-parallel computations and sorting.

  • 75.
    Kessler, Christoph
    et al.
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Keller, Jörg
    Dept. of Mathematics and Computer Science FernUniversität in Hagen, Germany.
    Optimized On-Chip Pipelining of Memory-Intensive Computations on the Cell BE2008In: MCC-2008 First Swedish Workshop on Multicore Computing,2008, Ronneby: Blekinge Institute of Technology , 2008, p. 50-59Conference paper (Refereed)
    Abstract [en]

    Multiprocessors-on-chip, such as the Cell BE processor, regularly suffer from restricted bandwidth to off-chip main memory. We propose to reduce memory bandwidth requirements, and thus increase performance, by expressing our application as a task graph, by running dependent tasks concurrently and by pipelining results directly from task to task where possible, instead of buffering in off-chip memory. To maximize bandwidth savings and balance load simultaneously, we solve a mapping problem of tasks to SPEs on the Cell BE. We present three approaches: an integer linear programming formulation that allows to compute Pareto-optimal mappings for smaller task graphs, general heuristics, and a problem-specific approximation algorithm. We validate the mappings for data-parallel computations and sorting.

  • 76.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Li, Lu
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Atalar, Aras
    Chalmers University of Technology, Gothenburg, Sweden.
    Dobre, Alin
    Movidius, Dublin, Ireland.
    XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization2015In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 sep. 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 51-60Conference paper (Refereed)
    Abstract [en]

    We present XPDL, a modular, extensible platform description language for heterogeneous multicore systems and clusters. XPDL specifications provide platform metadata about hardware and installed system software that are relevant for the adaptive static and dynamic optimization of application programs and system settings for improved performance and energy efficiency. XPDL is based on XML and uses hyperlinks to create distributed libraries of platform metadata specifications. We also provide first components of a retargetable tool chain that browses and processes XPDL specifications, and generates driver code for microbenchmarking to bootstrap empirical performance and energy models at deployment time. A C++ based API enables convenient introspection of platform models, even at run-time, which allows for adaptive dynamic program optimizations such as tuned selection of implementation variants.

  • 77.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Lowe, W
    Linnaeus University.
    Optimized composition of performance-aware parallel components2012In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 24, no 5, p. 481-498Article in journal (Refereed)
    Abstract [en]

    We describe the principles of a novel framework for performance-aware composition of sequential and explicitly parallel software components with implementation variants. Automatic composition results in a table-driven implementation that, for each parallel call of a performance-aware component, looks up the expected best implementation variant, processor allocation and schedule given the current problem, and processor group sizes. The dispatch tables are computed off-line at component deployment time by an interleaved dynamic programming algorithm from time-prediction meta-code provided by the component supplier.

  • 78.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Löwe, Welf
    MSI Växjö universitet, Sweden.
    A Framework for Performance-Aware Composition of Explicitly Parallel Components2008In: Parallel Computing: Architectures, Algorithms and Applications, IOS Press, 2008, p. 227-234Conference paper (Refereed)
    Abstract [en]

    We describe the principles of a novel framework for performance-aware composition of explicitly parallel software components with implementation variants. Automatic composition results in a table-driven implementation that, for each parallel call of a performance-aware component, looks up the expected best implementation variant, processor allocation and schedule given the current problem and processor group sizes. The dispatch tables are computed off-line at component deployment time by an interleaved dynamic programming algorithm from time-prediction meta-code provided by the component supplier.

  • 79.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Melot, Nicolas
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Eitschberger, Patrick
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks2013In: Proceedings of the 23rd International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2013 / [ed] Jörg Henkel and Alex Yakovlev (eds.), IEEE Computer Society Digital Library, 2013, p. 215-222Conference paper (Refereed)
    Abstract [en]

    We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or malleable tasks on a generic many-core processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data driven way. A stream of data flows through the tasks and intermediate results are forwarded to other tasks like in a pipelined task graph. In this paper we present crown scheduling, a novel technique for the combined optimization of resource allocation, mapping and discrete voltage/frequency scaling for malleable streaming task sets in order to optimize energy efficiency given a throughput constraint. We present optimal off-line algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). We also propose extensions for dynamic rescaling to automatically adapt a given crown schedule in situations where not all tasks are data ready. Our energy model considers both static idle power and dynamic power consumption of the processor cores. Our experimental evaluation of the ILP models for a generic manycore architecture shows that at least for small and medium sized task sets even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds.

  • 80.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Nadjm-Tehrani, Simin
    Linköping University, Department of Computer and Information Science, RTSLAB - Real-Time Systems Laboratory. Linköping University, The Institute of Technology.
    Mid-term Course Evaluations with Muddy Cards2002In: ACM SIGCSE ITiCSE'02 Int. Conf. on Information technology in computer science education, Aarhus (Denmark), June 2002, New York: ACM , 2002Conference paper (Refereed)
  • 81.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems.
    Pllana, Sabri
    Linnaeus University, Växjö.
    Proceedings of the 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS-2013)2013Conference proceedings (editor) (Refereed)
  • 82.
    Kessler, Christoph
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Schamai, Wladimir
    EADS InnovationWorks, Hamburg, Germany.
    Fritzson, Peter
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Platform-independent modeling of explicitly parallel programs2010Conference paper (Refereed)
    Abstract [en]

    We propose a model-driven approach to parallel programming of SPMD-style, explicitly parallel computations. We define an executable, platform-independent modeling language with explicitly parallel control and data flow for an abstract parallel computer with shared address space, and implement it as an extension of UML2 activity diagrams and a generator for Fork source code that can be compiled and executed on a high-level abstract parallel machine simulator. We also sketch how to refine the modeling language to address more realistic parallel platforms.

  • 83.
    Kessler, Martin
    et al.
    Linköping University.
    Hansson, Erik
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Åkesson, Daniel
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Exploiting Instruction Level Parallelism for REPLICA - A Configurable VLIW Architecture With Chained Functional Units2012In: : Volume II, Las Vegas, Nevada, USA: CSREA Press, 2012, p. 275-281Conference paper (Other academic)
    Abstract [en]

    In this paper we present a scheduling algorithm for VLIW architectures with chained functional units. We show how our algorithm can help speed up programs at the instruction level, for an architecture called REPLICA, a configurable emulated shared memory (CESM) architecture whose computation model is based on the PRAM model. Since our LLVM-based compiler is parameterizable in the number of different functional units, read and write ports to the register file, etc., we can generate code for different REPLICA architectures that have different functional unit configurations. We show for a set of different configurations how our implementation can produce high-quality code, and we argue that the high parametrization of the compiler makes it, together with the simulator, useful for hardware/software co-design.

  • 84.
    Leha, Andreas
    et al.
    TU München, Germany.
    Chalabine, Mikhail
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Kessler, Christoph
    Linköping University, The Institute of Technology. Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory.
    Parallelizing Scientific Code with Invasive Interactive Parallelization - A Case Study with Reuseware2008In: Int. Workshop on Component-Based High Performance Computing CBHPC-2008,2008, New York, USA: ACM , 2008Conference paper (Refereed)
    Abstract [en]

    We present a case study of parallelizing serial legacy code using Invasive Interactive Parallelization (IIP) - a compositional approach to parallelizing code refactoring rooted in the Invasive Software Composition (ISC) and the Separation of Concerns (SoC). The study focuses on scientific code, in particular, Gaussian elimination where parallelization neither requires nor incurs serious changes in the algorithmic structure. As the major contribution we show how parallelization of Gaussian elimination can be automatized with reusable parallelization recipes implemented as composers in Reuseware. We consider parallelization for both shared- and distributed-memory systems with OpenMP and MPI respectively. We present the speed-ups achieved and discuss gains in code reusability.

  • 85.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems2013In: High Performance Computing for Computational Science - VECPAR 2012 / [ed] Dayde, Michel, Marques, Osni, Nakajima, Kengo, Springer, 2013, p. 329-345Conference paper (Refereed)
    Abstract [en]

    In recent years heterogeneous multi-core systems have been given much attention. However, performance optimization on these platforms remains a big challenge. Optimizations performed by compilers are often limited due to lack of dynamic information and run time environment, which makes applications often not performance portable. One current approach is to provide multiple implementations for the same interface that could be used interchangeably depending on the call context, and expose the composition choices to a compiler, deployment-time composition tool and/or run-time system. Using off-line machine-learning techniques allows to improve the precision and reduce the run-time overhead of run-time composition and leads to an improvement of performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool by off-line composition and present an adaptive machine learning algorithm for generating compact and efficient dispatch data structures with low training time. As dispatch data structure we propose an adaptive decision tree structure, which implies an adaptive training algorithm that allows to control the trade-off between training time, dispatch precision and run-time dispatch overhead.

    We have evaluated our optimization strategy with simple kernels (matrix-multiplication and sorting) as well as applications from RODINIA benchmark on two GPU-based heterogeneous systems. On average, the precision for composition choices reaches 83.6 percent with approximately 34 minutes off-line training time.

  • 86.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems2016In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, p. 37-45Article in journal (Refereed)
    Abstract [en]

    Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models which require no or little human efforts show more practical feasibility if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call contexts property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method by new pruning techniques to better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without noticeable prediction accuracy decrease.

  • 87.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems2014In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW), IEEE conference proceedings, 2014, p. 255-264Conference paper (Refereed)
    Abstract [en]

    Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models based on trial executions which require no or little human efforts show more practical feasibility if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive pruning algorithm for efficient selection of training samples, a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context properties, and a composition tool for building the overall composed application from its components. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method by new pruning techniques to better support the convexity assumption and better control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without noticeable prediction accuracy decrease. Furthermore, we evaluate the effect of combinations of pruning strategies and compare our adaptive sampling method with random sampling. We also use our smart-sampling method as a preprocessor to a state-of-the-art decision tree learning algorithm and compare the result to the predictor directly calculated by our method.

  • 88.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Lazy Allocation and Transfer Fusion Optimization for GPU-based Heterogeneous Systems2018In: 2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), IEEE , 2018, p. 311-315Conference paper (Refereed)
    Abstract [en]

    We present two memory optimization techniques which improve the efficiency of data transfer over PCIe bus for GPU-based heterogeneous systems, namely lazy allocation and transfer fusion optimization. Both are based on merging data transfers so that less overhead is incurred, thereby increasing transfer throughput and making accelerator usage profitable also for smaller operand sizes. We provide the design and prototype implementation of the two techniques in CUDA. Microbenchmarking results show that especially for smaller and medium-sized operands significant speedups can be achieved. We also prove that our transfer fusion optimization algorithm is optimal.

  • 89.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection2015In: Trustcom/BigDataSE/ISPA, 2015 IEEE, IEEE Press, 2015, Vol. 3, p. 154-159Conference paper (Refereed)
    Abstract [en]

    We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g. CPU, DRAM, GPU), using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, can be easily retargeted for energy optimization, but also switching different optimization goals for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.

  • 90.
    Li, Lu
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    MeterPU: a generic measurement abstraction API: Enabling energy-tuned skeleton backend selection2018In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 11, p. 5643-5658Article in journal (Refereed)
    Abstract [en]

    We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g., CPU, DRAM, GPU) in a heterogeneous computer system, using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, can be easily retargeted for energy optimization, but also switching between measurement metrics or techniques for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.

  • 91.
    Lundvall, Håkan
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Stavåker, Kristian
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Fritzson, Peter
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Automatic Parallelization of Simulation Code for Equation-based Models with Software Pipelining and Measurements on Three Platforms2008In: Proceedings from the First Swedish Workshop on Multi-Core Computing, MCC-08, November 27-28, 2008, Ronneby, Sweden / [ed] Håkan Grahn, Ronneby, Sweden: Blekinge Institute of Technology , 2008, p. 60-69Conference paper (Refereed)
    Abstract [en]

    In this work we report results from a new integrated method of automatically generating parallel code from Modelica models by combining parallelization at two levels of abstraction. Performing inline expansion of a Runge-Kutta solver combined with fine-grained automatic parallelization of the right-hand side of the resulting equation system opens up new possibilities for generating high performance code, which is becoming increasingly relevant when multi-core computers are becoming commonplace. An implementation, in the form of a backend module for the OpenModelica compiler, has been developed and used for measurements on two architectures: Intel Xeon and SGI Altix 3700 Bx2. This paper also contains some very recent results of a prototype implementation of this parallelization approach on the Cell BE processor architecture.

  • 92.
    Lundvall, Håkan
    et al.
    Linköping University, Department of Computer and Information Science.
    Stavåker, Kristian
    Linköping University, Department of Computer and Information Science.
    Fritzson, Peter
    Linköping University, Department of Computer and Information Science.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science.
    Automatic Parallelization of Simulation Code for Equation-based Models with Software Pipelining and Measurements on Three Platforms.2008In: SIGARCH Computer Architecture News, ISSN 0163-5964, E-ISSN 1943-5851, Vol. 36, no 5Article in journal (Refereed)
    Abstract [en]

    In this work we report results from a new integrated method of automatically generating parallel code from Modelica models by combining parallelization at two levels of abstraction. Performing inline expansion of a Runge-Kutta solver combined with fine-grained automatic parallelization of the right-hand side of the resulting equation system opens up new possibilities for generating high performance code, which is becoming increasingly relevant when multi-core computers are becoming commonplace. An implementation, in the form of a backend module for the OpenModelica compiler, has been developed and used for measurements on two architectures: Intel Xeon and SGI Altix 3700 Bx2. This paper also contains some very recent results of a prototype implementation of this parallelization approach on the Cell BE processor architecture.

  • 93.
    Majeed, Mudassar
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Dastgeer, Usman
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters2013In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA-2013), 2013Conference paper (Refereed)
    Abstract [en]

    SkePU is a C++ template library with a simple and unified interface for expressing data parallel computations in terms of generic components, called skeletons, on multi-GPU systems using CUDA and OpenCL. The smart containers in SkePU, such as Matrix and Vector, perform data management with a lazy memory copying mechanism that reduces redundant data communication. SkePU provides programmability, portability and even performance portability, but up to now applications written using SkePU could only run on a single multi-GPU node. We present the extension of SkePU for GPU clusters without the need to modify the SkePU application source code. With our prototype implementation, we performed two experiments. The first experiment demonstrates the scalability with regular algorithms for N-body simulation and electric field calculation over multiple GPU nodes. The results for the second experiment show the benefit of lazy memory copying in terms of speedup gained for one level of Strassen’s algorithm and another synthetic matrix sum application.

  • 94.
    Mattson, Håkan
    et al.
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Nyström, Kaj
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Fritzson, Peter
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    GridModelica: Modeling and Simulating on the Grid. 2005. Conference paper (Refereed)
  • 95.
    Mattsson, Håkan
    et al.
    Gotland University, Sweden.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, PELAB - Programming Environment Laboratory. Linköping University, The Institute of Technology.
    Towards a bulk-synchronous distributed shared memory programming environment for grids. 2006. In: Applied Parallel Computing. State of the Art in Scientific Computing: 7th International Workshop, PARA 2004, Lyngby, Denmark, June 20-23, 2004. Revised Selected Papers / [ed] Jack Dongarra, Kaj Madsen and Jerzy Wasniewski, Springer Berlin/Heidelberg, 2006, Vol. 3732, p. 519-526. Chapter in book (Refereed)
    Abstract [en]

    The current practice in grid programming uses message passing, which unfortunately leads to code that is difficult to understand, debug and optimize. Hence, for grids to become commonly accepted, also as general-purpose parallel computation platforms, suitable parallel programming environments need to be developed. In this paper we propose an approach to realize a distributed shared memory programming environment for computational grids called GridNestStep, by adopting NestStep, a structured parallel programming language based on the Bulk Synchronous Parallel model of parallel computation.

  • 96.
    Melot, Nicolas
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Avdic, Kenan
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Keller, J.
    Fern Universität in Hagen, Fac. of Math. and Computer Science, 58084 Hagen, Germany.
    Investigation of main memory bandwidth on Intel Single-Chip Cloud Computer. 2011. In: 3rd Many-Core Applications Research Community Symposium, MARC 2011, 2011, p. 107-110. Conference paper (Refereed)
    Abstract [en]

    The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 x86 cores linked by an on-chip high performance network, as well as four DDR3 memory controllers to access an off-chip main memory of up to 64GiB. This work evaluates the performance of the SCC when accessing the off-chip memory. The focus of this study is not on taxing the bare hardware. Instead, we are interested in the performance of applications that run on the Linux operating system and use the SCC as it is provided. We see that the per-core read memory bandwidth is largely independent of the number of cores accessing the memory simultaneously, but that the write memory access performance drops when more cores write simultaneously to the memory. In addition, the global and per-core memory bandwidth, both writing and reading, depends strongly on the memory access pattern.

  • 97.
    Melot, Nicolas
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Avdic, Kenan
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Keller, Jörg
    FernUniversität in Hagen, Fac. of Math. and Computer Science, Hagen, Germany.
    Memory-intensive parallel computing on the Single Chip Cloud Computer: A case study with Mergesort. 2011. Conference paper (Refereed)
    Abstract [en]

    The Single Chip Cloud Computer (SCC) is an experimental processor from Intel Labs with 48 cores connected by a 2D mesh on-chip network. We evaluate the performance of the SCC regarding off-chip memory accesses and communication capabilities. As a benchmark, we use the merging phase of mergesort, a representative memory-access-intensive algorithm. Mergesort is parallelized and adapted in 4 variants, each leveraging different features of the SCC, in order to assess and compare their performance impact. Our results motivate considering on-chip pipelined mergesort on the SCC, which is the subject of ongoing work.

  • 98.
    Melot, Nicolas
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Avdic, Kenan
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Keller, Jörg
    Fern Universität in Hagen, Fac. of Math. and Computer Science, Hagen, Germany.
    Parallel sorting on Intel Single-Chip Cloud computer. 2011. In: 3rd Many-core Applications Research Community (MARC) Symposium / [ed] Diana Göhringer, Michael Hübner and Jürgen Becker, Karlsruhe: KIT Scientific Publishing, 2011, p. 11, p. 107-110. Conference paper (Refereed)
    Abstract [en]

    The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 x86 cores linked by an on-chip high performance network, as well as four DDR3 memory controllers to access an off-chip main memory of up to 64GiB. This work evaluates the performance of the SCC when accessing the off-chip memory. The focus of this study is not on taxing the bare hardware. Instead, we are interested in the performance of applications that run on the Linux operating system and use the SCC as it is provided. We see that the per-core read memory bandwidth is largely independent of the number of cores accessing the memory simultaneously, but that the write memory access performance drops when more cores write simultaneously to the memory. In addition, the global and per-core memory bandwidth, both writing and reading, depends strongly on the memory access pattern.

  • 99.
    Melot, Nicolas
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Janzen, Johan
    Uppsala University, Sweden.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures. 2015. In: 2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS, IEEE, 2015, p. 146-155. Conference paper (Refereed)
    Abstract [en]

    Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details about the scheduling problem instances they use in their evaluation section, the code that allows them to transform the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Also, many published scheduling algorithms are not tested against a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools used to produce scheduling papers. We propose Schedeval, which integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval in running streaming applications on the Intel Single-Chip Cloud Computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications when running on manycore architectures.

  • 100.
    Melot, Nicolas
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Avdic, Kenan
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Cichowski, Patrick
    Fac. Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Keller, Jörg
    Fac. Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Engineering parallel sorting for the Intel SCC. 2012. In: Procedia Computer Science, ISSN 1877-0509, E-ISSN 1877-0509, Vol. 9, p. 1890-1899. Article in journal (Refereed)
    Abstract [en]

    The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 Intel-x86 cores linked by an on-chip high performance mesh network, as well as four DDR3 memory controllers to access an off-chip main memory. We investigate the adaptation of sorting onto the SCC as an algorithm engineering problem. We argue that a combination of pipelined mergesort and sample sort will best fit the SCC's architecture. We also provide a mapping based on integer linear programming to address load balancing and latency considerations. We describe a prototype implementation of our proposal together with preliminary runtime measurements that indicate the usefulness of this approach. As mergesort can be considered a representative of the class of streaming applications, the techniques developed here should also apply to other problems in this class, such as many applications for parallel embedded systems, i.e. MPSoC.
