liu.se: Search publications in DiVA
1 - 50 of 110 hits
  • 1.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Extending smart containers for data locality-aware skeleton programming (2019). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 31, no. 5, article id e5003. Journal article (Refereed)
    Abstract [en]

    We present an extension for the SkePU skeleton programming framework to improve the performance of sequences of transformations on smart containers. By using lazy evaluation, SkePU records skeleton invocations and dependencies as directed by smart container operands. When a partial result is required by a different part of the program, the run-time system will process the entire lineage of skeleton invocations; tiling is applied to keep chunks of container data in the working set for the whole sequence of transformations. The approach is inspired by big data frameworks operating on large clusters where good data locality is crucial. We also consider benefits other than data locality with the increased run-time information given by the lineage structures, such as backend selection for heterogeneous systems. Experimental evaluation of example applications shows potential for performance improvements due to better cache utilization, as long as the overhead of lineage construction and management is kept low.
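
    The lazy-evaluation idea in the abstract above can be illustrated with a toy model (hypothetical types, not the actual SkePU API): skeleton invocations on a container are recorded in a lineage and only flushed when a result is actually read.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Toy sketch of lazy skeleton evaluation: each elementwise "Map"
    // invocation is recorded in a lineage instead of being run eagerly.
    struct LazyVector {
        std::vector<int> data;
        std::vector<std::function<void(std::vector<int>&)>> lineage;  // pending ops

        explicit LazyVector(std::vector<int> d) : data(std::move(d)) {}

        // Record an elementwise transformation; nothing executes yet.
        void map(std::function<int(int)> f) {
            lineage.push_back([f](std::vector<int>& v) {
                for (int& x : v) x = f(x);
            });
        }

        // Reading an element forces the whole recorded lineage. A real
        // implementation would additionally tile the data so each chunk
        // stays cache-resident across the full sequence of transformations.
        int at(std::size_t i) {
            for (auto& op : lineage) op(data);
            lineage.clear();
            return data[i];
        }
    };
    ```

    For example, two recorded `map` calls run back-to-back only once `at()` is called, which is what makes cross-invocation tiling possible in the first place.
    
    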

  • 2.
    Henrio, Ludovic
    et al.
    Univ Cote Azur, France.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Ensuring Memory Consistency in Heterogeneous Systems Based on Access Mode Declarations (2018). In: Proceedings 2018 International Conference on High Performance Computing and Simulation (HPCS), IEEE, 2018, pp. 716-723. Conference paper (Refereed)
    Abstract [en]

Running a program on disjoint memory spaces requires addressing memory consistency issues and performing transfers so that the program always accesses the right data. Several approaches exist to ensure the consistency of the memory accessed; here we are interested in the verification of a declarative approach where each component of a computation is annotated with an access mode declaring which part of the memory is read or written by the component. The programming framework uses the component annotations to guarantee the validity of the memory accesses. This is the mechanism used in VectorPU, a C++ library for programming CPU-GPU heterogeneous systems, and this article proves the correctness of the software cache-coherence mechanism used in the library. Beyond the scope of VectorPU, this article can be considered a simple and effective formalisation of memory consistency mechanisms based on the explicit declaration of the effect of each component on each memory space.
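
    The access-mode mechanism described above can be sketched in miniature (hypothetical types, not VectorPU's real interface): every access declares read/write intent and a memory space, and the runtime inserts transfers and invalidations accordingly.

    ```cpp
    #include <cassert>
    #include <vector>

    // Toy model of mode-driven software cache coherence for a buffer
    // replicated between a host and a device memory space.
    enum class Mode { R, W, RW };

    struct Buffer {
        std::vector<int> host, dev;
        bool host_valid = true, dev_valid = false;
        int transfers = 0;  // copies actually performed

        explicit Buffer(std::vector<int> d) : host(std::move(d)), dev(host.size()) {}

        std::vector<int>& on_device(Mode m) {
            if ((m == Mode::R || m == Mode::RW) && !dev_valid) {
                dev = host;          // host -> device transfer
                ++transfers;
            }
            dev_valid = true;
            if (m != Mode::R) host_valid = false;  // device copy is now newer
            return dev;
        }

        std::vector<int>& on_host(Mode m) {
            if ((m == Mode::R || m == Mode::RW) && !host_valid) {
                host = dev;          // device -> host transfer
                ++transfers;
            }
            host_valid = true;
            if (m != Mode::R) dev_valid = false;
            return host;
        }
    };
    ```

    Note how a write-only access (`Mode::W`) skips the copy-in entirely; proving that such shortcuts never let a stale copy be observed is exactly what the paper's formalisation is about.
    
    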

  • 3.
    Soudris, Dimitrios
    et al.
    Natl Tech Univ Athens, Greece.
    Papadopoulos, Lazaros
    Natl Tech Univ Athens, Greece.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kehagias, Dionysios D.
    CERTH, Greece.
    Papadopoulos, Athanasios
    CERTH, Greece.
    Seferlis, Panos
    CERTH, Greece.
    Chatzigeorgiou, Alexander
    CERTH, Greece.
    Ampatzoglou, Apostolos
    CERTH, Greece.
    Thibault, Samuel
    Inria Bordeaux, France.
    Namyst, Raymond
    Inria Bordeaux, France.
    Pleiter, Dirk
    Forschungszentrum Julich, Germany.
    Gaydadjiev, Georgi
    Maxeler Technol Ltd, England.
    Becker, Tobias
    Maxeler Technol Ltd, England.
    Haefele, Matthieu
    Univ Paris Sud, France.
EXA2PRO programming environment: Architecture and Applications (2018). In: 2018 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XVIII), Association for Computing Machinery, 2018, pp. 202-209. Conference paper (Refereed)
    Abstract [en]

The EXA2PRO programming environment will integrate a set of tools and methodologies that make it possible to systematically address many exascale computing challenges, including performance, performance portability, programmability, abstraction and reusability, fault tolerance and technical debt. The EXA2PRO tool-chain will enable the efficient deployment of applications on exascale computing systems by integrating high-level software abstractions that offer performance portability and efficient exploitation of exascale systems' heterogeneity, tools for efficient memory management, optimizations based on trade-offs between various metrics, and fault-tolerance support. Hence, by addressing various aspects of productivity challenges, EXA2PRO is expected to have significant impact on the transition to exascale computing, as well as impact from the perspective of applications. The evaluation will be based on 4 applications from 4 different domains, which will be deployed at the JUELICH supercomputing center. EXA2PRO will generate exploitable results in the form of a tool-chain that supports diverse exascale heterogeneous supercomputing centers, and concrete improvements in various exascale computing challenges.

  • 4.
    Li, Lu
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Lazy Allocation and Transfer Fusion Optimization for GPU-based Heterogeneous Systems (2018). In: 2018 26th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP 2018), IEEE, 2018, pp. 311-315. Conference paper (Refereed)
    Abstract [en]

We present two memory optimization techniques which improve the efficiency of data transfer over the PCIe bus for GPU-based heterogeneous systems, namely lazy allocation and transfer fusion optimization. Both are based on merging data transfers so that less overhead is incurred, thereby increasing transfer throughput and making accelerator usage profitable also for smaller operand sizes. We provide the design and a prototype implementation of the two techniques in CUDA. Microbenchmarking results show that significant speedups can be achieved, especially for small and medium-sized operands. We also prove that our transfer fusion optimization algorithm is optimal.
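
    The core idea behind transfer fusion can be sketched as follows (a simplified illustration, not the paper's actual algorithm): nearby copy requests are merged into fewer, larger transfers, because each transfer pays a fixed latency (driver/PCIe overhead) on top of the per-byte cost.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // A transfer request is a half-open byte range [begin, end).
    using Range = std::pair<std::size_t, std::size_t>;

    // Merge sorted, overlapping or near-adjacent ranges into fewer transfers.
    // max_gap > 0 trades some redundantly copied bytes for fewer transfers,
    // which pays off when the per-transfer overhead dominates.
    std::vector<Range> fuse(std::vector<Range> rs, std::size_t max_gap) {
        std::sort(rs.begin(), rs.end());
        std::vector<Range> out;
        for (auto r : rs) {
            if (!out.empty() && r.first <= out.back().second + max_gap)
                out.back().second = std::max(out.back().second, r.second);
            else
                out.push_back(r);
        }
        return out;
    }
    ```

    With `max_gap = 0` only touching ranges are merged; widening the gap threshold is the knob that makes fusion profitable for many small operands.
    
    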

  • 5.
    Li, Lu
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
MeterPU: a generic measurement abstraction API: Enabling energy-tuned skeleton backend selection (2018). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no. 11, pp. 5643-5658. Journal article (Refereed)
    Abstract [en]

    We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g., CPU, DRAM, GPU) in a heterogeneous computer system, using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, can be easily retargeted for energy optimization, but also switching between measurement metrics or techniques for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.
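
    The measurement-abstraction idea can be sketched like this (hypothetical names, not MeterPU's real interface): each metric is a pluggable policy class behind a common start/stop/get protocol, so tuning code switches from time to energy measurement by changing one template argument.

    ```cpp
    #include <cassert>
    #include <chrono>

    // Generic meter: the measured metric is a compile-time policy.
    template <typename Metric>
    struct Meter {
        typename Metric::State s;
        void start() { Metric::start(s); }
        void stop()  { Metric::stop(s); }
        typename Metric::Value get() const { return Metric::get(s); }
    };

    // One concrete metric: wall-clock time in seconds. An energy metric
    // would keep the same protocol but read e.g. a hardware energy counter.
    struct CpuTime {
        struct State {
            std::chrono::steady_clock::time_point t0;
            double elapsed = 0;
        };
        using Value = double;
        static void start(State& s) { s.t0 = std::chrono::steady_clock::now(); }
        static void stop(State& s) {
            s.elapsed = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - s.t0).count();
        }
        static Value get(const State& s) { return s.elapsed; }
    };
    ```

    An autotuner written against `Meter<Metric>` is then metric-agnostic, which is what makes "energy-tuned" backend selection a drop-in change rather than a rewrite.
    
    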

  • 6.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
SkePU 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems (2018). In: International Journal of Parallel Programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 46, no. 1, pp. 62-80. Journal article (Refereed)
    Abstract [en]

In this article we present SkePU 2, the next generation of the SkePU C++ skeleton programming framework for heterogeneous parallel systems. We critically examine the design and limitations of the SkePU 1 programming interface. We present a new, flexible and type-safe interface for skeleton programming in SkePU 2, and a source-to-source transformation tool which knows about SkePU 2 constructs such as skeletons and user functions. We demonstrate how the source-to-source compiler transforms programs to enable efficient execution on parallel heterogeneous systems. We show how SkePU 2 enables new use cases and applications by increasing the flexibility over SkePU 1, and how programming errors can be caught earlier and more easily thanks to improved type safety. We propose a new skeleton, Call, unique in the sense that it does not impose any predefined skeleton structure and can encapsulate arbitrary user-defined multi-backend computations. We also discuss how the source-to-source compiler can enable a new optimization opportunity by selecting among multiple user function specializations when building a parallel program. Finally, we show that the performance of our prototype SkePU 2 implementation closely matches that of SkePU 1.

  • 7.
    Torggler, Manfred
    et al.
    Fernuniv, Germany.
    Keller, Joerg
    Fernuniv, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Asymmetric Crown Scheduling (2017). In: 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2017), IEEE, 2017, pp. 421-425. Conference paper (Refereed)
    Abstract [en]

Streaming applications are often used on embedded and high-performance multi- and manycore processors. High throughput without wasted energy can be achieved by static scheduling of parallelizable tasks with frequency scaling. We present asymmetric crown scheduling, which improves on the static crown scheduling approach by allowing flexible split ratios when subdividing processor groups. We formulate the scheduler as an integer linear program and evaluate it with synthetic task sets. The results demonstrate that a small number of split ratios improves the energy efficiency of crown schedules by up to 12%, at slightly higher scheduling time.

  • 8.
    Thorarensen, Sebastian
    et al.
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska fakulteten.
    Cuello, Rosandra
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Barry, Brendan
    Movidius Ltd, Ireland.
Efficient Execution of SkePU Skeleton Programs on the Low-power Multicore Processor Myriad2 (2016). In: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP), IEEE, 2016, pp. 398-402. Conference paper (Refereed)
    Abstract [en]

SkePU is a state-of-the-art skeleton programming library for high-level portable programming and efficient execution on heterogeneous parallel computer systems, with a publicly available implementation for general-purpose multicore CPU and multi-GPU systems. This paper presents the design, implementation and evaluation of a new back-end of the SkePU skeleton programming library for the new low-power multicore processor Myriad2 by Movidius Ltd. This enables seamless code portability of SkePU applications across both HPC and embedded (Myriad2) parallel computing systems, with decent performance on these architecturally very diverse types of execution platforms.

  • 9.
    Melot, Nicolas
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
Improving Energy-Efficiency of Static Schedules by Core Consolidation and Switching Off Unused Cores (2016). In: Parallel Computing: On the Road to Exascale / [ed] Gerhard R. Joubert; Hugh Leather; Mark Parsons; Frans Peters; Mark Sawyer, IOS Press, 2016, pp. 285-294. Conference paper (Refereed)
    Abstract [en]

    We demonstrate how static, energy-efficient schedules for independent, parallelizable tasks on parallel machines can be improved by modeling idle power if the static power consumption of a core comprises a notable fraction of the core's total power, which more and more often is the case. The improvement is achieved by optimally packing cores when deciding about core allocation, mapping and DVFS for each task so that all unused cores can be switched off and overall energy usage is minimized. We evaluate our proposal with a benchmark suite of task collections, and compare the resulting schedules with an optimal scheduler that does however not take idle power and core switch-off into account. We find that we can reduce energy consumption by 66% for mostly sequential tasks on many cores and by up to 91% for a realistic multicore processor model.

  • 10.
    Hansson, Erik
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Optimized variant-selection code generation for loops on heterogeneous multicore systems (2016). In: Parallel Computing: On the Road to Exascale / [ed] Gerhard R. Joubert; Hugh Leather; Mark Parsons; Frans Peters; Mark Sawyer, IOS Press, 2016, pp. 103-112. Conference paper (Refereed)
    Abstract [en]

We consider the general problem of generating code for the automated selection of the expected best implementation variants for multiple subcomputations on a heterogeneous multicore system, where the program's control flow between the subcomputations is structured by sequencing and loops. A naive greedy approach as applied in previous works on multi-variant selection code generation would determine the locally best variant for each subcomputation instance but might miss globally better solutions. We present a formalization and a fast algorithm for the global variant selection problem for loop-based programs. We also show that loop unrolling can additionally improve performance, and prove an upper bound on the unroll factor which allows us to keep the run-time space overhead for the variant-dispatch data structure low. We evaluate our method in case studies using an ARM big.LITTLE based system and a GPU based system where we consider optimization for both energy and performance.

  • 11.
    Sjöström, Oskar
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Ko, Soon Heum
    Linköpings universitet, Nationellt superdatorcentrum (NSC).
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Portable Parallelization of the EDGE CFD Application for GPU-based Systems using the SkePU Skeleton Programming Library (2016). In: Parallel Computing: On the Road to Exascale / [ed] Gerhard R. Joubert; Hugh Leather; Mark Parsons; Frans Peters; Mark Sawyer, IOS Press, 2016, pp. 135-144. Conference paper (Refereed)
    Abstract [en]

EDGE is a complex application for computational fluid dynamics used e.g. for aerodynamic simulations in the avionics industry. In this work we present the portable, high-level parallelization of EDGE for execution on multicore CPU and GPU based systems by using the multi-backend skeleton programming library SkePU. We first expose the challenges of applying portable high-level parallelization to a complex scientific application for a heterogeneous (GPU-based) system using (SkePU) skeletons and discuss the encountered flexibility problems that usually do not show up in toy skeleton programs. We then identify and implement necessary improvements in SkePU to make it applicable for applications containing computations on complex data structures and with irregular data access. In particular, we improve the MapArray skeleton and provide a new MultiVector container for operand data that can be used with unstructured grid data structures. Although there is no SkePU skeleton specifically dedicated to handling computations on unstructured grids and their data structures, we still obtain portable speedup of EDGE with both multicore CPU and GPU execution by using the improved MapArray skeleton of SkePU.

  • 12.
    Li, Lu
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems (2016). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 51, pp. 37-45. Journal article (Refereed)
    Abstract [en]

Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on hardware architecture and runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models, which require no or little human effort, show more practical feasibility if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of adaptive sampling for efficient exploration and selection of training samples, which yields a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context property values. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method by new pruning techniques to better support the convexity assumption and control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without noticeable decrease in prediction accuracy.

  • 13.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Smart Containers and Skeleton Programming for GPU-Based Systems (2016). In: International Journal of Parallel Programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 44, no. 3, pp. 506-530. Journal article (Refereed)
    Abstract [en]

    In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.

  • 14.
    Melot, Nicolas
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Keller, Joerg
    FernUniversität in Hagen, Germany (Parallelität und VLSI).
    Eitschberger, Patrick
    FernUniversität in Hagen, Germany (Parallelität und VLSI).
Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Many-Core Systems (2015). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, Vol. 11, no. 4, pp. 62-. Journal article (Refereed)
    Abstract [en]

Effectively exploiting massively parallel architectures is a major challenge that stream programming can help address. We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or moldable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks run concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data-driven way. A stream of data flows through the tasks and intermediate results may be forwarded to other tasks, as in a pipelined task graph. In this article, we consider crown scheduling, a novel technique for the combined optimization of resource allocation, mapping, and discrete voltage/frequency scaling for moldable streaming task collections in order to optimize energy efficiency given a throughput constraint. We first present optimal offline algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). We make no restricting assumption about speedup behavior. We introduce the fast heuristic Longest Task, Lowest Group (LTLG) as a generalization of the Longest Processing Time (LPT) algorithm to achieve a load-balanced mapping of parallel tasks, and the Height heuristic for crown frequency scaling. We use them in feedback-loop heuristics based on binary search and simulated annealing to optimize crown allocation.

    Our experimental evaluation of the ILP models for a generic manycore architecture shows that at least for small and medium-sized streaming task collections even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds. Our heuristics produce makespan and energy consumption close to optimality within the limits of the phase-separated crown scheduling technique and the crown structure. Their optimization time is longer than the one of other algorithms we test, but our heuristics consistently produce better solutions.

  • 15.
    Li, Lu
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection (2015). In: Trustcom/BigDataSE/ISPA, 2015 IEEE, IEEE Press, 2015, Vol. 3, pp. 154-159. Conference paper (Refereed)
    Abstract [en]

We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g. CPU, DRAM, GPU), using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only can legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, be easily retargeted for energy optimization, but also switching between optimization goals for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.

  • 16.
    Melot, Nicolas
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Janzen, Johan
    Uppsala University, Sweden.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures (2015). In: 2015 44th International Conference on Parallel Processing Workshops, IEEE, 2015, pp. 146-155. Conference paper (Refereed)
    Abstract [en]

Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details about the scheduling problem instances they use in their evaluation section, the code that allows them to transform the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Also, many published scheduling algorithms are not tested against a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools employed to elaborate scheduling papers. We propose Schedeval, which integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud computer (SCC), and we demonstrate the usefulness of our tool-chain to compare existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications when running on manycore architectures.

  • 17.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
Performance-aware Composition Framework for GPU-based Systems (2015). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no. 12, pp. 4646-4662. Journal article (Refereed)
    Abstract [en]

User-level components of applications can be made performance-aware by annotating them with a performance model and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection, both with performance characteristics being known (or learned) beforehand and in cases where they are learned at runtime. We also demonstrate the hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. The bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.

  • 18.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Atalar, Aras
    Chalmers University of Technology, Gothenburg, Sweden.
    Dobre, Alin
    Movidius, Dublin, Ireland.
XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization (2015). In: Proc. 44th International Conference on Parallel Processing Workshops, ICPP-EMS Embedded Multicore Systems, in conjunction with ICPP-2015, Beijing, 1-4 Sep. 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 51-60. Conference paper (Refereed)
    Abstract [en]

We present XPDL, a modular, extensible platform description language for heterogeneous multicore systems and clusters. XPDL specifications provide platform metadata about hardware and installed system software that are relevant for the adaptive static and dynamic optimization of application programs and system settings for improved performance and energy efficiency. XPDL is based on XML and uses hyperlinks to create distributed libraries of platform metadata specifications. We also provide the first components of a retargetable tool chain that browses and processes XPDL specifications, and generates driver code for microbenchmarking to bootstrap empirical performance and energy models at deployment time. A C++ based API enables convenient introspection of platform models, even at run-time, which allows for adaptive dynamic program optimizations such as tuned selection of implementation variants.

  • 19.
    Hansson, Erik
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Alnervik, Erik
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Forsell, Martti
    VTT Technical Research Centre of Finland.
    A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs (2014). In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany: VDE Verlag GmbH, 2014, pp. 27-33. Conference paper (Refereed)
    Abstract [en]

    The performance of current multicore CPUs and GPUs is limited in computations that make frequent use of communication and synchronization between the subtasks executed in parallel. This is because the directory-based cache systems scale weakly and/or the cost of synchronization is high. Emulated Shared Memory (ESM) architectures, relying on multithreading and efficient synchronization mechanisms, have been developed to solve these problems, which affect both the performance and the programmability of current machines. In this paper, we present a preliminary comparison of the performance of three hardware-implemented ESM architectures with state-of-the-art multicore CPUs and GPUs. The benchmarks are selected to cover different patterns of parallel computation and therefore reveal the performance potential of ESM architectures with respect to current multicores.

  • 20.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Conditional component composition for GPU-based systems (2014). In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014, Vienna, Austria: HiPEAC NoE, 2014. Conference paper (Refereed)
    Abstract [en]

    User-level components can expose multiple functionally equivalent implementations with different resource requirements and performance characteristics. A composition framework can then choose a suitable implementation for each component invocation, guided by an objective function (execution time, energy, etc.). In this paper, we describe the idea of conditional composition, which enables the component writer to specify constraints on the selectability of a given component implementation based on information about the target system and component call properties. By incorporating such information, more informed and user-guided composition decisions can be made, and thus more efficient code can be generated, as shown with an example scenario for a GPU-based system.

  • 21.
    Forsell, Martti
    et al.
    Platform Architectures Team, VTT Technical Research Centre of Finland, Finland.
    Hansson, Erik
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Mäkelä, Jari-Matti
    Department of Information Technology, University of Turku, Finland.
    Leppänen, Ville
    Department of Information Technology, University of Turku, Finland.
    NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures (2014). In: International Journal of Networking and Computing, ISSN 2185-2839, E-ISSN 2185-2847, Vol. 4, no. 1, pp. 189-206. Journal article (Refereed)
    Abstract [en]

    Emulated shared memory (ESM) architectures are good candidates for future general-purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel synchronous model of computation to programmers, as well as to avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with legacy sequential code and NUMA mechanisms. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes of the architecture well, e.g. it leaves the memories for the different modes isolated, so that the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms, and the software ones provide a mechanism to integrate and optimize NUMA computation within the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

  • 22.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers (2014). In: Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany / [ed] F. Hannig and J. Teich, 2014, pp. 43-48. Conference paper (Refereed)
    Abstract [en]

    In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially, GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among multiple implementation variants, generating code and providing runtime support for efficient execution on such systems.

  • 23.
    Hansson, Erik
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Optimized selection of runtime mode for the reconfigurable PRAM-NUMA architecture REPLICA using machine-learning (2014). In: Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II / [ed] Luis Lopes et al., Springer-Verlag New York, 2014, pp. 133-145. Conference paper (Refereed)
    Abstract [en]

    The massively hardware-multithreaded VLIW emulated shared memory (ESM) architecture REPLICA has a dynamically reconfigurable on-chip network that offers two execution modes: PRAM and NUMA. PRAM mode is mainly suitable for applications with a high amount of thread-level parallelism (TLP), while NUMA mode mainly accelerates the execution of sequential programs or programs with low TLP. Also, some types of regular data-parallel algorithms execute faster in NUMA mode. It is not obvious in which mode a given program region shows the best performance. In this study we focus on generic stencil-like computations exhibiting regular control flow and memory access patterns. We use two state-of-the-art machine-learning methods, C5.0 (decision trees) and Eureqa Pro (symbolic regression), to select which mode to use. We use these methods to derive different predictors based on the same training data and compare their results. The best derived predictors reach 95% accuracy and are generated by both C5.0 and Eureqa Pro, although the latter can in some cases be more sensitive to the training data. The average speedup gained due to mode switching ranges from 1.92 to 2.23 for all generated predictors on the evaluation test cases, and by using a majority voting algorithm based on the three best predictors, we can eliminate all misclassifications.

  • 24.
    Li, Lu
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems (2014). In: 2014 43rd International Conference on Parallel Processing Workshops (ICPPW), IEEE conference proceedings, 2014, pp. 255-264. Conference paper (Refereed)
    Abstract [en]

    Adaptive program optimizations, such as automatic selection of the expected fastest implementation variant for a computation component depending on runtime context, are important especially for heterogeneous computing systems but require good performance models. Empirical performance models based on trial executions, which require little or no human effort, become practically feasible if the sampling and training cost can be reduced to a reasonable level. In previous work we proposed an early version of an adaptive pruning algorithm for the efficient selection of training samples, a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context properties, and a composition tool for building the overall composed application from its components. For adaptive pruning we use a heuristic convexity assumption. In this paper we consolidate and improve the method with new pruning techniques that better support the convexity assumption and better control the trade-off between sampling time, prediction accuracy and runtime prediction overhead. Our results show that the training time can be reduced by up to 39 times without a noticeable decrease in prediction accuracy. Furthermore, we evaluate the effect of combinations of pruning strategies and compare our adaptive sampling method with random sampling. We also use our smart-sampling method as a preprocessor to a state-of-the-art decision tree learning algorithm and compare the result to the predictor directly calculated by our method.

  • 25.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    The PEPPHER composition tool: performance-aware composition for GPU-based systems (2014). In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no. 12, pp. 1195-1211. Journal article (Refereed)
    Abstract [en]

    The PEPPHER (EU FP7 project) component model defines the notions of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code for optimized performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations, and optimizing data transfers at runtime. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.

  • 26.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    A Framework for Performance-aware Composition of Applications for GPU-based Systems (2013). Conference paper (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with performance models and other metadata. We present a component model and a composition framework for the performance-aware composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with particular focus on global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection both when performance characteristics are known (or learned) beforehand and when they are learned at runtime. We also demonstrate the hybrid execution capabilities of our framework on real applications. Furthermore, as an important step towards global composition, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data-flow information extracted from the source program by static analysis, thus improving over the traditional greedy performance-aware policy, which considers only the current call for optimization.

  • 27.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Adaptive Implementation Selection in the SkePU Skeleton Programming Library (2013). In: Advanced Parallel Processing Technologies (APPT-2013), Proceedings / [ed] Chengyung Wu and Albert Cohen (eds.), 2013, pp. 170-183. Conference paper (Refereed)
    Abstract [en]

    In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four implementation variants for each skeleton: serial C++, OpenMP, CUDA and OpenCL, targeting either CPU or GPU execution. Which implementation runs fastest for a given skeleton call depends upon the computation, problem size(s), system architecture and data locality.

    In this paper, we present our work on automatic selection between these implementation variants by an offline machine learning method which generates a compact decision tree with low training overhead. The proposed selection mechanism is flexible yet high-level, allowing a skeleton programmer to control different training choices at a higher abstraction level. We have evaluated our optimization strategy with 9 applications/kernels ported to our skeleton library and achieve on average more than 94% (90%) accuracy with just 0.53% (0.58%) training space exploration on two systems. Moreover, we discuss one application scenario where local optimization considering a single skeleton call can prove sub-optimal, and propose a heuristic for bulk implementation selection that considers more than one skeleton call to address such scenarios.

  • 28.
    Li, Lu
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Adaptive Off-Line Tuning for Optimized Composition of Components for Heterogeneous Many-Core Systems (2013). In: High Performance Computing for Computational Science - VECPAR 2012 / [ed] Dayde, Michel, Marques, Osni, Nakajima, Kengo, Springer, 2013, pp. 329-345. Conference paper (Refereed)
    Abstract [en]

    In recent years heterogeneous multi-core systems have received much attention. However, performance optimization on these platforms remains a big challenge. Optimizations performed by compilers are often limited due to a lack of dynamic information about the run-time environment, which often makes applications not performance portable. One current approach is to provide multiple implementations of the same interface that can be used interchangeably depending on the call context, and to expose the composition choices to a compiler, a deployment-time composition tool and/or a run-time system. Using off-line machine-learning techniques makes it possible to improve the precision and reduce the run-time overhead of run-time composition, leading to improved performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool by off-line composition and present an adaptive machine learning algorithm for generating compact and efficient dispatch data structures with low training time. As the dispatch data structure we propose an adaptive decision tree structure, which implies an adaptive training algorithm that allows controlling the trade-off between training time, dispatch precision and run-time dispatch overhead.

    We have evaluated our optimization strategy with simple kernels (matrix multiplication and sorting) as well as applications from the Rodinia benchmark suite on two GPU-based heterogeneous systems. On average, the precision of composition choices reaches 83.6 percent, with approximately 34 minutes of off-line training time.

  • 29.
    Majeed, Mudassar
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters (2013). In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA-2013), 2013. Conference paper (Refereed)
    Abstract [en]

    SkePU is a C++ template library with a simple and unified interface for expressing data-parallel computations in terms of generic components, called skeletons, on multi-GPU systems using CUDA and OpenCL. The smart containers in SkePU, such as Matrix and Vector, perform data management with a lazy memory copying mechanism that reduces redundant data communication. SkePU provides programmability, portability and even performance portability, but until now applications written using SkePU could only run on a single multi-GPU node. We present an extension of SkePU for GPU clusters that requires no modification of the SkePU application source code. With our prototype implementation, we performed two experiments. The first experiment demonstrates scalability with regular algorithms for N-body simulation and electric field calculation over multiple GPU nodes. The results of the second experiment show the benefit of lazy memory copying in terms of the speedup gained for one level of Strassen’s algorithm and for a synthetic matrix sum application.

  • 30.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Compiling for VLIW DSPs (2013). In: Handbook of signal processing systems / [ed] Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer Leupers, Jarmo Takala, New York: Springer, 2013, 2, pp. 1177-1214. Book chapter (Refereed)
    Abstract [en]

    This handbook, organized into four parts, provides the reader with a comprehensive and standalone overview of signal processing systems. It contains a comprehensive index for ease of use, and an extensive bibliography for further reading.

  • 31.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Melot, Nicolas
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Eitschberger, Patrick
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks (2013). In: Proceedings of the 23rd International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2013 / [ed] Jörg Henkel and Alex Yakovlev (eds.), IEEE Computer Society Digital Library, 2013, pp. 215-222. Conference paper (Refereed)
    Abstract [en]

    We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or malleable tasks on a generic many-core processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks run concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data-driven way. A stream of data flows through the tasks, and intermediate results are forwarded to other tasks as in a pipelined task graph. In this paper we present crown scheduling, a novel technique for the combined optimization of resource allocation, mapping and discrete voltage/frequency scaling for malleable streaming task sets, in order to optimize energy efficiency given a throughput constraint. We present optimal off-line algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). We also propose extensions for dynamic rescaling to automatically adapt a given crown schedule in situations where not all tasks are data-ready. Our energy model considers both the static idle power and the dynamic power consumption of the processor cores. Our experimental evaluation of the ILP models for a generic manycore architecture shows that, at least for small and medium-sized task sets, even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds.

  • 32.
    Cichowski, Patrick
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Energy-efficient Mapping of Task Collections onto Manycore Processors (2013). In: Proceedings of MULTIPROG'13 workshop at HiPEAC'13 / [ed] E. Ayguade et al. (eds.), 2013. Conference paper (Refereed)
    Abstract [en]

    Streaming applications consist of a number of tasks that all run concurrently and that process data at certain rates. On manycore processors, the tasks of the streaming application must be mapped onto the cores. While load balancing of such applications has been considered, especially in the MPSoC community, we investigate energy-efficient mapping of such task collections onto manycore processors. We first derive rules that guide the mapping process and show that as long as dynamic power consumption dominates static power consumption, the latter can be ignored and the problem reduces to load balancing. When, however, static power consumption becomes a notable fraction of total power consumption, as expected in the coming years, an energy-efficient mapping must take it into account, e.g. by temporarily shutting down cores or by restricting the number of cores used. We validate our findings with synthetic and real-world applications on the Intel SCC manycore processor.

  • 33.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Eitschberger, Patrick
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Energy-Efficient Static Scheduling of Streaming Task Collections with Malleable Tasks (2013). In: Proc. 25th PARS-Workshop, Gesellschaft für Informatik, 2013, pp. 37-46. Conference paper (Refereed)
    Abstract [en]

    We investigate the energy efficiency of streaming task collections with parallelizable or malleable tasks on a manycore processor with frequency scaling. Streaming task collections differ from classical task sets in that all tasks run concurrently, so that cores typically run several tasks that are scheduled round-robin at user level. A stream of data flows through the tasks, and intermediate results are forwarded to other tasks as in a pipelined task graph. We first show the equivalence of task mapping for streaming task collections and normal task collections in the case of continuous frequency scaling, under reasonable assumptions for the user-level scheduler, if a makespan, i.e. a throughput requirement of the streaming application, is given and the energy consumed is to be minimized. We then show that in the case of discrete frequency scaling, it might be necessary for processors to switch frequencies, and that idle times can still occur, in contrast to continuous frequency scaling. We formulate the mapping of (streaming) task collections onto a manycore processor with discrete frequency levels as an integer linear program. Finally, we propose two heuristics that reduce energy consumption compared to the previous results by improving load balancing through the parallel execution of a parallelizable task. We evaluate the effects of the heuristics analytically and experimentally on the Intel SCC.

  • 34.
    Shafiee Sarvestani, Amin
    et al.
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Hansson, Erik
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Extensible Recognition of Algorithmic Patterns in DSP Programs for Automatic Parallelization (2013). In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 41, no. 6, pp. 806-824. Journal article (Refereed)
    Abstract [en]

    We introduce an extensible knowledge based tool for idiom (pattern) recognition in DSP (digital signal processing) programs. Our tool utilizes functionality provided by the Cetus compiler infrastructure for detecting certain computation patterns that frequently occur in DSP code. We focus on recognizing patterns for for-loops and statements in their bodies as these often are the performance critical constructs in DSP applications for which replacement by highly optimized, target-specific parallel algorithms will be most profitable. For better structuring and efficiency of pattern recognition, we classify patterns by different levels of complexity such that patterns in higher levels are defined in terms of lower level patterns. The tool works statically on the intermediate representation. For better extensibility and abstraction, most of the structural part of recognition rules is specified in XML form to separate the tool implementation from the pattern specifications. Information about detected patterns will later be used for optimized code generation by local algorithm replacement e.g. for the low-power high-throughput multicore DSP architecture ePUMA.

  • 35.
    Forsell, Martti
    et al.
    Platform Architectures Team, VTT Technical Research Centre of Finland, Finland.
    Hansson, Erik
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Mäkelä, Jari-Matti
    Information Technology, University of Turku, Finland.
    Leppänen, Ville
    Information Technology, University of Turku, Finland.
    Hardware and Software Support for NUMA Computing on Configurable Emulated Shared Memory Architectures (2013). In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE conference proceedings, 2013, pp. 640-647. Conference paper (Refereed)
    Abstract [en]

    Emulated shared memory (ESM) architectures are good candidates for future general-purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel synchronous model of computation to programmers, as well as to avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with legacy sequential code or NUMA mechanisms. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes of the architecture well, e.g. it leaves the memories for the different modes isolated, so that the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms, and the software ones provide a mechanism to integrate NUMA computation within the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

  • 36.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Majeed, Mudassar
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Furmento, Nathalie
    University of Bordeaux, INRIA, Bordeaux, France.
    Thibault, Samuel
    University of Bordeaux, INRIA, Bordeaux, France.
    Namyst, Raymond
    University of Bordeaux, INRIA, Bordeaux, France.
    Benkner, Siegfried
    University of Vienna, Austria.
    Pllana, Sabri
    University of Vienna, Austria.
    Träff, Jesper
    Technical University of Vienna, Austria.
    Wimmer, Martin
    Technical University of Vienna, Austria.
    Leveraging PEPPHER Technology for Performance Portable Supercomputing (2013). In: High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, Salt Lake City, USA: IEEE conference proceedings, 2013, pp. 1395-1396. Conference paper (Other academic)
    Abstract [en]

    PEPPHER is a 3-year EU FP7 project that develops a novel approach and framework to enhance the performance portability and programmability of heterogeneous multi-core systems. Its primary target is single-node heterogeneous systems, where several CPU cores are supported by accelerators such as GPUs. This poster briefly surveys the PEPPHER framework for single-node systems and elaborates on the prospects of leveraging the PEPPHER approach to generate performance-portable code for heterogeneous multi-node systems.

  • 37.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    A performance-portable generic component for 2D convolution computations on GPU-based systems (2012). In: Proceedings of the Fifth International Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2012) at the HiPEAC-2012 conference, Paris, Jan. 2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, 2012. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application implemented using the newly developed skeleton. Memory (constant and shared memory) and adaptive tiling optimizations are applied, and their performance implications are evaluated on different classes of GPUs. We present two different metrics to calculate the optimal tiling factor dynamically in an automated way, which helps retain the best performance without manual tuning when moving to new GPU architectures. With our approach, we achieve average speedups by a factor of 3.6, 2.3, and 2.4 over an otherwise optimized (without tiling) implementation on NVIDIA C2050, GTX280, and 8800 GT GPUs, respectively. Above all, this performance portability is achieved without requiring any manual changes in the skeleton program or the skeleton implementation.

  • 38.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Germany.
    Majeed, Mudassar
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Balancing CPU Load for Irregular MPI Applications (2012). In: Advances in Parallel Computing: Applications, Tools and Techniques on the Road to Exascale Computing / [ed] Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Joubert, David Padua, Frans Peters, Mark Sawyer, IOS Press, 2012, pp. 307-316. Conference paper (Refereed)
    Abstract [en]

    MPI applications are typically designed to be run on a parallel machine with one process per core. If the processes exhibit different computational loads, either the code must be rewritten for load balancing, with negative side effects on readability and maintainability, or the one-process-per-core philosophy leads to low utilization of many processor cores. If several processes are mapped to each core to increase CPU utilization, the load might still be unevenly distributed among the cores if the mapping is unaware of the process characteristics.

    Therefore, similarly to the MPI_Graph_create() function, where the program gives hints on communication patterns so that MPI processes can be placed favorably, we propose an MPI_Load_create() function where the program supplies information on the relative loads of the MPI processes, such that processes can be favorably grouped and mapped onto processor cores. To account for scalability and the restricted knowledge of individual MPI processes, we also propose an extension MPI_Dist_load_create(), similar to MPI_Dist_graph_create(), where each individual MPI process only knows the loads of a subset of the MPI processes.

    We detail how to implement both variants on top of MPI, and provide experimental performance results both for synthetic and numeric example applications. The results indicate that load balancing is favorable in both cases.
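    The grouping idea behind the proposed MPI_Load_create() can be illustrated with a simple greedy heuristic. The sketch below is hypothetical (plain Python with invented names, not the paper's MPI-level mechanism): given relative process loads, each process is placed, heaviest first, onto the currently least-loaded core.

    ```python
    import heapq

    def map_processes_to_cores(loads, num_cores):
        """Greedy longest-processing-time mapping: assign each process,
        heaviest first, to the currently least-loaded core.
        Returns a list mapping process index -> core index."""
        # Min-heap of (accumulated load, core index) pairs.
        cores = [(0.0, c) for c in range(num_cores)]
        heapq.heapify(cores)
        mapping = [None] * len(loads)
        # Visit processes in order of decreasing load.
        for p in sorted(range(len(loads)), key=lambda i: -loads[i]):
            total, c = heapq.heappop(cores)
            mapping[p] = c
            heapq.heappush(cores, (total + loads[p], c))
        return mapping

    # Eight processes with uneven loads mapped onto two cores.
    loads = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
    mapping = map_processes_to_cores(loads, 2)
    per_core = [sum(l for l, c in zip(loads, mapping) if c == k) for k in range(2)]
    print(mapping, per_core)
    ```

    For this example the rule yields two groups with total load 18 each, i.e. a perfectly balanced mapping despite the uneven per-process loads.
    
    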

  • 39.
    Mäkelä, Jari-Matti
    et al.
    University of Turku, Finland.
    Hansson, Erik
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Åkesson, Daniel
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Forsell, Martti
    VTT Oulu, Finland.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Leppänen, Ville
    University of Turku, Finland.
    Design of the Language Replica for Hybrid PRAM-NUMA Many-core Architectures (2012). In: Parallel and Distributed Processing with Applications (ISPA), 2012, IEEE conference proceedings, 2012, pp. 697-704. Conference paper (Refereed)
    Abstract [en]

    Parallel programming is widely considered very demanding for the average programmer due to the inherent asynchrony of the underlying parallel architectures. In this paper we describe the main design principles and core features of Replica -- a parallel language aimed at high-level programming of a new paradigm of reconfigurable, scalable, and powerful synchronous shared memory architectures, which promise to make parallel programming radically easier with the help of strict memory consistency and deterministic synchronous execution of hardware threads and multi-operations.

  • 40.
    Melot, Nicolas
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Keller, Jörg
    FernUniversität in Hagen, Fac. of Math. and Computer Science, Hagen, Germany.
    Efficient On-Chip Pipelined Streaming Computations on Scalable Manycore Architectures (2012). Conference paper (Other academic)
    Abstract [en]

    The performance of manycore processors is limited by programs' use of off-chip main memory. Streaming computation organized in a pipeline restricts main memory accesses to the tasks at the boundaries of the pipeline, which read from or write to main memory. The Single Chip Cloud computer (SCC) offers 48 cores linked by a high-speed on-chip network and allows such an on-chip pipelining technique to be implemented. We assess the performance and constraints of the SCC and investigate on-chip pipelined mergesort as a case study for streaming computations. We found that our on-chip pipelined mergesort yields significant speedup over classic parallel mergesort on the SCC. The technique should bring improvements in power consumption and should be portable to other manycore, network-on-chip architectures such as Tilera's processors.
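    The streaming idea described above — pipeline stages that consume and immediately forward partial results instead of round-tripping through main memory — can be sketched with lazy generators. This is plain Python with invented names, not SCC code; it only illustrates the dataflow of a pipelined merge tree:

    ```python
    def merge_task(left, right):
        """One pipeline stage: lazily merge two sorted input streams,
        forwarding the merged output one element at a time."""
        l = next(left, None)
        r = next(right, None)
        while l is not None or r is not None:
            if r is None or (l is not None and l <= r):
                yield l
                l = next(left, None)
            else:
                yield r
                r = next(right, None)

    def merge_tree(streams):
        """Build a binary tree of merge stages over sorted leaf streams,
        mimicking a merge tree mapped onto pipelined on-chip cores."""
        streams = list(streams)
        while len(streams) > 1:
            nxt = [merge_task(streams[i], streams[i + 1])
                   for i in range(0, len(streams) - 1, 2)]
            if len(streams) % 2:      # odd stream passes through unchanged
                nxt.append(streams[-1])
            streams = nxt
        return streams[0]

    # Four pre-sorted chunks are merged without materializing any
    # intermediate array; each stage buffers only one element per input.
    chunks = [[1, 5, 9], [2, 6], [3, 7, 8], [0, 4]]
    result = list(merge_tree(iter(c) for c in chunks))
    print(result)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    ```

    On real hardware the per-element forwarding would of course be batched into buffers sent over the on-chip network; the generator chain only models the "partial results flow through, only the root writes back" structure.
    
    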

  • 41.
    Melot, Nicolas
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Avdic, Kenan
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Cichowski, Patrick
    Fac. Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Keller, Jörg
    Fac. Mathematics and Computer Science, FernUniversität in Hagen, Germany.
    Engineering parallel sorting for the Intel SCC (2012). In: Procedia Computer Science, ISSN 1877-0509, E-ISSN 1877-0509, Vol. 9, pp. 1890-1899. Journal article (Refereed)
    Abstract [en]

    The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 Intel x86 cores linked by an on-chip high-performance mesh network, as well as four DDR3 memory controllers to access off-chip main memory. We investigate the adaptation of sorting onto the SCC as an algorithm engineering problem. We argue that a combination of pipelined mergesort and sample sort fits the SCC's architecture best. We also provide a mapping based on integer linear programming to address load balancing and latency considerations. We describe a prototype implementation of our proposal together with preliminary runtime measurements that indicate the usefulness of this approach. As mergesort can be considered a representative of the class of streaming applications, the techniques developed here should also apply to other problems in this class, such as many applications for parallel embedded systems, e.g. MPSoCs.

  • 42.
    Brenner, Jürgen
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Executing PRAM Programs on GPUs (2012). In: Procedia Computer Science, ISSN 1877-0509, E-ISSN 1877-0509, Vol. 9, pp. 1799-1806. Journal article (Refereed)
    Abstract [en]

    We present a framework to transform PRAM programs from the PRAM programming language Fork to CUDA C, so that they can be compiled and executed on a graphics processor (GPU). This makes it possible to explore parallel algorithmics on a scale beyond the toy problems to which the previous, sequential PRAM simulator restricted practical use. We explain the design decisions and evaluate a prototype implementation consisting of a runtime library and a set of rules to transform simple Fork programs, which we for now apply by hand. The resulting CUDA code is almost 100 times faster than the previous simulator for compiled Fork programs and can handle larger data sizes. Compared to a sequential program for the same problem, the GPU code might be faster or slower, depending on the structure of the Fork program, i.e. on the overhead incurred. We also give an outlook on how future GPUs might notably reduce this overhead.

  • 43.
    Kessler, Martin
    et al.
    Linköpings universitet.
    Hansson, Erik
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Åkesson, Daniel
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Exploiting Instruction Level Parallelism for REPLICA - A Configurable VLIW Architecture With Chained Functional Units (2012). In: Volume II, Las Vegas, Nevada, USA: CSREA Press, 2012, pp. 275-281. Conference paper (Other academic)
    Abstract [en]

    In this paper we present a scheduling algorithm for VLIW architectures with chained functional units. We show how our algorithm can help speed up programs at the instruction level for an architecture called REPLICA, a configurable emulated shared memory (CESM) architecture whose computation model is based on the PRAM model. Since our LLVM-based compiler is parameterizable in the number of different functional units, the read and write ports to the register file, etc., we can generate code for different REPLICA architectures that have different functional unit configurations. We show, for a set of different configurations, that our implementation can produce high-quality code, and we argue that the high parameterizability of the compiler makes it, together with the simulator, useful for hardware/software co-design.

  • 44.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Thibault, Samuel
    Laboratoire Bordelais de Recherche en Informatique (LaBRI), France.
    Flexible runtime support for efficient skeleton programming on hybrid systems (2012). In: Applications, Tools and Techniques on the Road to Exascale Computing / [ed] K. De Bosschere, E. H. D'Hollander, G. R. Joubert, D. Padua, F. Peters, Amsterdam: IOS Press, 2012, 22, pp. 159-166. Book chapter (Other academic)
    Abstract [en]

    SkePU is a skeleton programming framework for multicore CPU and multi-GPU systems. StarPU is a runtime system that provides dynamic scheduling and memory management support for heterogeneous, accelerator-based systems. We have implemented support for StarPU as a possible backend for SkePU while keeping the generic SkePU interface intact. The mapping of a SkePU skeleton call to one or more StarPU tasks allows StarPU to exploit independence between different skeleton calls as well as within a single skeleton call. Support for different StarPU features, such as data partitioning and different scheduling policies (e.g. history-based performance models), is implemented and discussed in this paper. The integration proved beneficial for both StarPU and SkePU: StarPU gained a high-level interface for running data-parallel computations, while SkePU gained dynamic scheduling and support for hybrid parallelism. Several benchmarks, including an ODE solver, a separable Gaussian blur filter, Successive Over-Relaxation (SOR), and Coulombic potential, are implemented. Initial experiments show that we can even achieve super-linear speedups for realistic applications, and we observe clear improvements in performance with the simultaneous use of both CPUs and a GPU (hybrid execution).

  • 45.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Hansson, Erik
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Flexible scheduling and thread allocation for synchronous parallel tasks (2012). In: ARCS-2012 Workshops / [ed] G. Mühl, J. Richling, A. Herkersdorf, Gesellschaft für Informatik, 2012, pp. 517-528. Conference paper (Refereed)
    Abstract [en]

    We describe a task model and a dynamic scheduling and resource allocation mechanism for synchronous parallel tasks to be executed on SPMD-programmed synchronous shared memory MIMD parallel architectures with uniform, unit-time memory access and strict memory consistency, also known in the literature as PRAMs (Parallel Random Access Machines). Our task model provides a two-tier programming model for PRAMs that flexibly combines SPMD and fork-join parallelism within the same application. It offers flexibility through dynamic scheduling and late resource binding while preserving the PRAM execution properties within each task, the only limitation being that the maximum number of threads that can be assigned to a task is limited to what the underlying architecture provides. In particular, our approach opens up automatic performance tuning at run-time by controlling the thread allocation for tasks based on run-time predictions. With a prototype implementation of a synchronous parallel task API in the SPMD-based PRAM language Fork and an experimental evaluation with example programs on the SBPRAM simulator, we show that a realization of the task model on an SPMD-programmable PRAM machine is feasible with moderate runtime overhead per task.

  • 46.
    Eriksson, Mattias
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Integrated Code Generation for Loops (2012). In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 11, no. 1. Journal article (Refereed)
    Abstract [en]

    Code generation in a compiler is commonly divided into several phases: instruction selection, scheduling, register allocation, spill code generation, and, in the case of clustered architectures, cluster assignment. These phases are interdependent; for instance, a decision in the instruction selection phase affects how an operation can be scheduled. We examine the effect of this separation of phases on the quality of the generated code. To study this, we have formulated optimal methods for code generation with integer linear programming, first for acyclic code; we then extend this method to modulo scheduling of loops. In our experiments we compare optimal modulo scheduling, where all phases are integrated, to modulo scheduling where instruction selection and cluster assignment are done in a separate phase. The results show that, for an architecture with two clusters, the integrated method finds a better solution than the non-integrated method for 27% of the instances.

  • 47.
    Cichowski, Patrick
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Modelling Power Consumption of the Intel SCC (2012). In: Proceedings of the 6th Many-core Applications Research Community (MARC) Symposium / [ed] Eric Noulard, HAL Archives Ouvertes, 2012. Conference paper (Refereed)
    Abstract [en]

    The Intel SCC manycore processor supports energy-efficient computing by dynamic voltage and frequency scaling of cores at a fine-grained level. To enable the use of that feature in application-level energy optimizations, we report on experiments measuring power consumption in different situations. We process those measurements with a least-squares error analysis to derive the parameters of popular power consumption models that are used at the algorithmic level. Thus, we provide a link between the worlds of hardware and high-level algorithmics.
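    The least-squares step can be sketched in a few lines. The abstract does not state which power model is fitted, so the example below assumes the common algorithmic-level form P(f) = P_static + c·f³ and uses made-up measurements (not actual SCC data); the closed-form normal equations need no libraries.

    ```python
    def fit_power_model(freqs, powers):
        """Fit P(f) = p_static + c * f**3 by ordinary least squares
        on the basis (1, f^3), via the closed-form normal equations."""
        xs = [f ** 3 for f in freqs]
        n = len(xs)
        sx, sy = sum(xs), sum(powers)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, powers))
        c = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        p_static = (sy - c * sx) / n
        return p_static, c

    # Synthetic "measurements" generated from P(f) = 25 + 0.04 f^3
    # (invented numbers); the fit recovers the parameters.
    freqs = [100.0, 200.0, 400.0, 533.0, 800.0]
    powers = [25 + 0.04 * f ** 3 for f in freqs]
    p_static, c = fit_power_model(freqs, powers)
    print(round(p_static, 6), round(c, 8))
    ```

    With noisy real measurements the same formulas apply; only the residual, rather than zero, changes.
    
    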

  • 48.
    Ali, Akhtar
    et al.
    Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    OpenCL for programming shared memory multicore CPUs (2012). In: Proceedings of the 5th Workshop on MULTIPROG-2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, HiPEAC Network of Excellence, 2012. Conference paper (Refereed)
    Abstract [en]

    Shared memory multicore processor technology is pervasive in mainstream computing. This architecture challenges programmers to write code that scales over many cores to exploit the full computational power of these machines. OpenMP and Intel Threading Building Blocks (TBB) are two of the popular frameworks used to program these architectures. Recently, OpenCL has been defined as a standard by the Khronos Group, focusing on programming a possibly heterogeneous set of processors with many cores, such as CPU cores, GPUs, and DSP processors. In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiplication, LU decomposition, 2D image convolution, Pi value approximation, and image histogram generation. The evaluation includes the effect of compiler optimizations for the different frameworks, OpenCL performance on different vendors' platforms, and the performance gap between CPU-specific and GPU-specific OpenCL algorithms for execution on a modern GPU. Furthermore, a brief usability evaluation of the three frameworks is also presented.

  • 49.
    Kessler, Christoph
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Lowe, W
    Linnaeus University.
    Optimized composition of performance-aware parallel components (2012). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 24, no. 5, pp. 481-498. Journal article (Refereed)
    Abstract [en]

    We describe the principles of a novel framework for performance-aware composition of sequential and explicitly parallel software components with implementation variants. Automatic composition results in a table-driven implementation that, for each parallel call of a performance-aware component, looks up the expected best implementation variant, processor allocation, and schedule given the current problem and processor group sizes. The dispatch tables are computed off-line at component deployment time by an interleaved dynamic programming algorithm from time-prediction meta-code provided by the component supplier.

  • 50.
    Keller, Jörg
    et al.
    FernUniversität in Hagen, Germany.
    Kessler, Christoph W.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Hultén, Rikard
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy (2012). In: Journal of Universal Computer Science, ISSN 0948-695X, Vol. 18, no. 14, pp. 1987-2023. Journal article (Refereed)
    Abstract [en]

    Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization.

    On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead.

    In this article, we consider parallel mergesort in detail as a representative memory-intensive application, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique, and present several algorithms for the optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for the mapping of other streaming task graphs.

    We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We experimentally evaluate the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed speeds up merging, for realistic problem sizes, by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was previously the fastest mergesort implementation on the Cell.
