  • 1.
    Ali, Akhtar
    et al.
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    OpenCL for programming shared memory multicore CPUs. 2011. In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, pp. 65-70. Conference paper (Other academic)
    Abstract [en]

    In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation.

  • 2.
    Ali, Akhtar
    et al.
    Linköpings universitet, Tekniska högskolan.
    Dastgeer, Usman
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    OpenCL for programming shared memory multicore CPUs. 2012. In: Proceedings of the 5th Workshop on MULTIPROG2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, HiPEAC Network of Excellence, 2012. Conference paper (Refereed)
    Abstract [en]

    Shared memory multicore processor technology is pervasive in mainstream computing. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. OpenMP and Intel Threading Building Blocks (TBB) are two of the popular frameworks used to program these architectures. Recently, OpenCL has been defined as a standard by the Khronos Group; it focuses on programming a possibly heterogeneous set of processors with many cores, such as CPU cores, GPUs, and DSP processors. In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation. The evaluation includes the effect of compiler optimizations for different frameworks, OpenCL performance on different vendors’ platforms and the performance gap between CPU-specific and GPU-specific OpenCL algorithms for execution on a modern GPU. Furthermore, a brief usability evaluation of the three frameworks is also presented.

  • 3.
    Amaral, Vasco
    et al.
    Univ Nova Lisboa, Portugal.
    Norberto, Beatriz
    Univ Nova Lisboa, Portugal.
    Goulao, Miguel
    Univ Nova Lisboa, Portugal.
    Aldinucci, Marco
    Univ Torino, Italy.
    Benkner, Siegfried
    Univ Vienna, Austria.
    Bracciali, Andrea
    Univ Stirling, Scotland.
    Carreira, Paulo
    Univ Lisbon, Portugal.
    Celms, Edgars
    Univ Latvia, Latvia.
    Correia, Luis
    Univ Lisbon, Portugal.
    Grelck, Clemens
    Univ Amsterdam, Netherlands.
    Karatza, Helen
    Aristotle Univ Thessaloniki, Greece.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kilpatrick, Peter
    Queens Univ Belfast, North Ireland.
    Martiniano, Hugo
    Univ Lisbon, Portugal.
    Mavridis, Ilias
    Aristotle Univ Thessaloniki, Greece.
    Pllana, Sabri
    Linnaeus Univ, Sweden.
    Respicio, Ana
    Univ Lisbon, Portugal.
    Simao, Jose
    Inst Politecn Lisboa, Portugal.
    Veiga, Luis
    Univ Lisbon, Portugal.
    Visa, Ari
    Tampere Univ, Finland.
    Programming languages for data-intensive HPC applications: A systematic mapping study. 2020. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 91, article ID UNSP 102584. Journal article (Refereed)
    Abstract [en]

    A major challenge in modelling and simulation is the need to combine expertise in both software technologies and a given scientific domain. When High-Performance Computing (HPC) is required to solve a scientific problem, software development becomes a problematic issue. Considering the complexity of the software for HPC, it is useful to identify programming languages that can be used to alleviate this issue. Because the existing literature on the topic of HPC is very dispersed, we performed a Systematic Mapping Study (SMS) in the context of the European COST Action cHiPSet. This literature study maps characteristics of various programming languages for data-intensive HPC applications, including category, typical user profiles, effectiveness, and type of articles. We organised the SMS in two phases. In the first phase, relevant articles are identified employing an automated keyword-based search in eight digital libraries. This led to an initial sample of 420 papers, which was then narrowed down in a second phase by human inspection of article abstracts, titles and keywords to 152 relevant articles published in the period 2006-2018. The analysis of these articles enabled us to identify 26 programming languages referred to in 33 of relevant articles. We compared the outcome of the mapping study with results of our questionnaire-based survey that involved 57 HPC experts. The mapping study and the survey revealed that the desired features of programming languages for data-intensive HPC applications are portability, performance and usability. Furthermore, we observed that the majority of the programming languages used in the context of data-intensive HPC applications are text-based general-purpose programming languages. Typically these have a steep learning curve, which makes them difficult to adopt. We believe that the outcome of this study will inspire future research and development in programming languages for data-intensive HPC applications.
(C) 2019 Elsevier B.V. All rights reserved.

  • 4.
    Andersson, Jesper
    et al.
    MSI Universitet Växjö, Sweden.
    Ericsson, Morgan
    MSI Universitet Växjö, Sweden.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Löwe, Welf
    MSI Universitet Växjö, Sweden.
    Profile-Guided Composition. 2008. In: 7th Int. Symposium on Software Composition SC 2008, Berlin: Springer, 2008, pp. 157-. Conference paper (Refereed)
    Abstract [en]

    We present an approach that generates context-aware, optimized libraries of algorithms and data structures. The search space contains all combinations of implementation variants of algorithms and data structures including dynamically switching and converting between them. Based on profiling, the best implementation for a certain context is precomputed at deployment time and selected at runtime. In our experiments, the profile-guided composition outperforms the individual variants in almost all cases.

  • 5.
    Avdic, Kenan
    et al.
    Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska högskolan.
    Melot, Nicolas
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Keller, Jörg
    FernUniversität in Hagen.
    Pipelined parallel sorting on the Intel SCC. 2011. In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, pp. 96-101. Conference paper (Other academic)
    Abstract [en]

    The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 Intel-IA32 cores linked by an on-chip high performance mesh network, as well as four DDR3 memory controllers to access an off-chip main memory. We investigate the adaptation of sorting onto SCC as an algorithm engineering problem. We argue that a combination of pipelined mergesort and sample sort will fit best to SCC's architecture. We also provide a mapping based on integer linear programming to address load balancing and latency considerations. We describe a prototype implementation of our proposal together with preliminary runtime measurements that indicate the usefulness of this approach. As mergesort can be considered as a representative of the class of streaming applications, the techniques developed here should also apply to the other problems in this class, such as many applications for parallel embedded systems, i.e. MPSoC.

  • 6.
    Bednarski, Andrzej
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Energy-Optimal Integrated VLIW Code Generation. 2004. In: CPC04 11th Int. Workshop on Compilers for Parallel Computers, 2004, pp. 227-238. Conference paper (Other academic)
  • 7.
    Bednarski, Andrzej
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Exploiting Symmetries for Optimal Integrated Code Generation. 2004. In: Int. Conf. on Embedded Systems and Applications ESA04, 2004. Conference paper (Refereed)
  • 8.
    Bednarski, Andrzej
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Integer Linear Programming versus Dynamic Programming for Optimal Integrated VLIW Code Generation. 2006. In: 12th Int. Workshop on Compilers for Parallel Computers, 2006, pp. 73-. Conference paper (Refereed)
  • 9.
    Bednarski, Andrzej
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Optimal integrated code generation for VLIW architectures. 2003. In: Proc. of CPC'03 10th Int. Workshop on Compilers for Parallel Computers, Amsterdam, The Netherlands, January 2003, Leiden, The Netherlands: Leiden Institute of Advanced Computer Science, 2003. Conference paper (Refereed)
  • 10.
    Bednarski, Andrzej
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Optimal integrated VLIW code generation with Integer Linear Programming. 2006. In: Euro-Par 2006 Parallel Processing 12th International Euro-Par Conference, Dresden, Germany, August 28 – September 1, 2006. Proceedings / [ed] Wolfgang E. Nagel, Wolfgang V. Walter and Wolfgang Lehner, Springer Berlin/Heidelberg, 2006, Vol. 4128, pp. 461-472. Chapter in book, part of anthology (Refereed)
    Abstract [en]

    We give an Integer Linear Programming (ILP) solution that fully integrates all steps of code generation, i.e. instruction selection, register allocation and instruction scheduling, on the basic block level for VLIW processors.

    In earlier work, we contributed a dynamic programming (DP) based method for optimal integrated code generation, implemented in our retargetable code generator OPTIMIST. In this paper we give first results to evaluate and compare our ILP formulation with our DP method on a VLIW processor. We also demonstrate how to precondition the ILP model by a heuristic relaxation of the DP method to improve ILP optimization time.

  • 11.
    Benkner, Siegfried
    et al.
    University of Vienna.
    Pllana, Sabri
    University of Vienna.
    Larsson Träff, Jesper
    University of Vienna.
    Tsigas, Philippas
    Chalmers.
    Dolinsky, Uwe
    Codeplay Software.
    Augonnet, Cèdric
    INRIA Bordeaux.
    Bachmayer, Beverly
    Intel GmbH.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Moloney, David
    Movidius.
    Osipov, Vitaly
    Karlsruhe Institute of Technology.
    PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems. 2011. In: IEEE Micro, ISSN 0272-1732, E-ISSN 1937-4143, Vol. 31, no. 5, pp. 28-41. Journal article (Refereed)
    Abstract [en]

    PEPPHER, a three-year European FP7 project, addresses efficient utilization of hybrid (heterogeneous) computer systems consisting of multicore CPUs with GPU-type accelerators. This article outlines the PEPPHER performance-aware component model, performance prediction means, runtime system, and other aspects of the project. A larger example demonstrates performance portability with the PEPPHER approach across hybrid systems with one to four GPUs.

  • 12.
    Brenner, Jürgen
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Executing PRAM Programs on GPUs. 2012. In: Procedia Computer Science, E-ISSN 1877-0509, Vol. 9, pp. 1799-1806. Journal article (Refereed)
    Abstract [en]

    We present a framework to transform PRAM programs from the PRAM programming language Fork to CUDA C, so that they can be compiled and executed on a Graphics Processor (GPU). This makes it possible to explore parallel algorithmics on a scale beyond toy problems, to which the previous, sequential PRAM simulator restricted practical use. We explain the design decisions and evaluate a prototype implementation consisting of a runtime library and a set of rules to transform simple Fork programs, which we for now apply by hand. The resulting CUDA code is almost 100 times faster than the previous simulator for compiled Fork programs and can handle larger data sizes. Compared to a sequential program for the same problem, the GPU code might be faster or slower, depending on the Fork program structure, i.e. on the overhead incurred. We also give an outlook on how future GPUs might notably reduce the overhead.

  • 13.
    Chalabine, Mikhail
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    A Formal Framework for Automated Round-trip Software Engineering in Static Aspect Weaving and Transformations. 2007. In: the 29th Int. Conference on Software Engineering ICSE 2007, USA: IEEE, 2007. Conference paper (Refereed)
    Abstract [en]

    We present a formal framework for a recently introduced approach to Automated Round-trip Software Engineering (ARE) in source-level aspect weaving systems. Along with the formalization we improve the original method and suggest a new concept of weaving transactions in Aspect-oriented Programming (AOP). As the major contribution we formally show how, given a tree-shaped intermediate representation of a program and an ancillary transposition tree, manual edits in statically woven code can consistently be mapped back to their proper source of origin, which is either in the application core or in an element in the aspect space. The presented formalism is constructive. It frames AOP by generalizing static aspect weaving to classical tree transformations.

  • 14.
    Chalabine, Mikhail
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    A Survey of Reasoning in Parallelization. 2007. In: the 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing SNPD 2007, China: IEEE, 2007. Conference paper (Refereed)
    Abstract [en]

    We elaborate on reasoning in contemporary (semi-)automatic parallelizing refactoring. As the main contribution, we summarize contemporary approaches and show that all attempts to reason in parallelization thus far have amounted to local code analysis given data and control dependencies. We conclude that, by retaining this perspective only, parallelization continues to exploit merely a subset of the reasoning methods available today and is likely to remain limited. To address this problem, we suggest expanding the local analyses so that they take seriously the relations between individual local parallelizing transformations. We argue that such a coupling allows sparser parallelizable constructs, such as Producer-Consumer Coordination, to be processed. We identify questions to be addressed to put this principle into action and report ongoing work on (reasoning) mechanisms able to support this.

  • 15.
    Chalabine, Mikhail
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Crosscutting Concerns in Parallelization by Invasive Software Composition and Aspect Weaving. 2006. In: 39th Hawaii International Conference on System Sciences HICSS 2006, HI, USA: IEEE, 2006. Conference paper (Refereed)
  • 16.
    Chalabine, Mikhail
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Parallelisation of sequential programs by invasive composition and aspect weaving. 2005. In: Advanced Parallel Processing Technologies: 6th International Workshop, APPT 2005, Hong Kong, China, October 27-28, 2005. Proceedings / [ed] Jiannong Cao, Wolfgang Nejdl and Ming Xu, Springer Berlin/Heidelberg, 2005, Vol. 3756, pp. 131-140. Chapter in book, part of anthology (Refereed)
    Abstract [en]

    We propose a new method of interactively parallelising programs that is based on aspect weaving and invasive software composition. This can be seen as an alternative to skeleton programming. We give motivating examples for how our method could be applied.

  • 17.
    Chalabine, Mikhail
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Bunus, Peter
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Automated Round-trip Software Engineering in Aspect Weaving Systems. 2006. In: 21st IEEE/ACM International Conference on Automated Software Engineering (ASE '06), Tokyo, Japan: IEEE/ACM, 2006, pp. 305-308. Conference paper (Refereed)
    Abstract [en]

    We suggest an approach to Automated Round-trip Software Engineering in source-level aspect weaving systems that allows for transparent mapping of manual edits in the woven program back to the appropriate source of origin, which is either the application core or the aspect space.

  • 18.
    Chalabine, Mikhail
    et al.
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Wiklund, Staffan
    Ericsson.
    Optimising Intensive Interprocess Communication in a Parallelised Telecommunication Traffic Simulator. 2003. In: High Performance Computing Symposium, part of the Advanced Simulation Technology conference, 2003. Conference paper (Refereed)
    Abstract [en]

    This paper focuses on an efficient user-level handling of intensive interprocess communication in distributed parallel applications that are characterised by a high rate of data exchange. A common feature of such systems is that any parallelisation strategy focusing on the largest parallelisable fraction results in the highest possible rate of interprocess communication, compared to other parallelisation strategies. An example of such applications is the class of telecommunication traffic simulators, where the partition-communication phenomenon reveals itself due to the strong data interdependencies among the major parallelisable tasks, namely, encoding of messages, decoding of messages, and interpretation of messages.

  • 19.
    Cichowski, Patrick
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Energy-efficient Mapping of Task Collections onto Manycore Processors. 2013. In: Proceedings of MULTIPROG'13 workshop at HiPEAC'13 / [ed] E. Ayguade et al., 2013. Conference paper (Refereed)
    Abstract [en]

    Streaming applications consist of a number of tasks that all run concurrently, and that process data at certain rates. On manycore processors, the tasks of the streaming application must be mapped onto the cores. While load balancing of such applications has been considered, especially in the MPSoC community, we investigate energy-efficient mapping of such task collections onto manycore processors. We first derive rules that guide the mapping process and show that as long as dynamic power consumption dominates static power consumption, the latter can be ignored and the problem reduces to load balancing. However, when static power consumption becomes a notable fraction of total power consumption, as expected in the coming years, an energy-efficient mapping must take it into account, e.g. by temporary shutdown of cores or by restricting the number of cores. We validate our findings with synthetic and real-world applications on the Intel SCC manycore processor.

  • 20.
    Cichowski, Patrick
    et al.
    FernUniversität in Hagen, Germany.
    Keller, Jörg
    FernUniversität in Hagen, Germany.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Modelling Power Consumption of the Intel SCC. 2012. In: Proceedings of the 6th Many-core Applications Research Community (MARC) Symposium / [ed] Eric Noulard, HAL Archives Ouvertes, 2012. Conference paper (Refereed)
    Abstract [en]

    The Intel SCC manycore processor supports energy-efficient computing by dynamic voltage and frequency scaling of cores on a fine-grained level. In order to enable the use of that feature in application-level energy optimizations, we report on experiments to measure power consumption in different situations. We process those measurements by a least-squares error analysis to derive the parameters of popular models for power consumption which are used on an algorithmic level. Thus, we provide a link between the worlds of hardware and high-level algorithmics.

  • 21.
    Dale, Nell
    et al.
    University of Texas at Austin.
    Bishop, Judith
    University of Pretoria.
    Barnes, David
    University of Kent at Canterbury.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    A dialog between authors and teachers. 2002. In: Proc. ACM SIGCSE ITiCSE'02 7th Annual Conf. on Information Technology in Computer Science Education, Aarhus, Denmark, June 2002, New York: ACM, 2002. Conference paper (Refereed)
  • 22.
    Danylenko, Antonina
    et al.
    Linnaeus University, Växjö.
    Löwe, Welf
    Linnaeus University, Växjö.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Comparing Machine Learning Approaches for Context-Aware Composition. 2011. In: Software Composition / [ed] Sven Apel, Ethan Jackson, Springer, 2011, pp. 18-33. Conference paper (Refereed)
    Abstract [en]

    Context-Aware Composition allows the automatic selection of optimal variants of algorithms, data structures, and schedules at runtime using generalized dynamic Dispatch Tables. These tables grow exponentially with the number of significant context attributes. To make Context-Aware Composition scale, we suggest four alternative implementations to Dispatch Tables, all well-known in the field of machine learning: Decision Trees, Decision Diagrams, Naive Bayes and Support Vector Machines classifiers. We assess their decision overhead and memory consumption theoretically and practically in a number of experiments on different hardware platforms. Decision Diagrams turn out to be more compact than Dispatch Tables, almost as accurate, and faster in decision making. Using Decision Diagrams in Context-Aware Composition leads to better scalability, i.e., Context-Aware Composition can be applied at more program points and regard more context attributes than before.

  • 23.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Enmyren, Johan
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Auto-tuning SkePU: A multi-backend skeleton programming framework for multi-GPU systems. 2011. In: IWMSE '11 Proceedings of the 4th International Workshop on Multicore Software Engineering, New York, NY, USA: Association for Computing Machinery (ACM), 2011, pp. 25-32. Conference paper (Other academic)
    Abstract [en]

    SkePU is a C++ template library that provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Currently available skeletons in SkePU include map, reduce, mapreduce, map-with-overlap, maparray, and scan. The performance of SkePU generated code is comparable to that of hand-written code, even for more complex applications such as ODE solving.

    In this paper, we discuss initial results from auto-tuning SkePU using an off-line, machine learning approach where we adapt skeletons to a given platform using training data. The prediction mechanism at execution time uses off-line pre-calculated estimates to construct an execution plan for any desired configuration with minimal overhead. The prediction mechanism accurately predicts execution time for repetitive executions and includes a mechanism to predict execution time for user functions of different complexity. The tuning framework covers selection between different backends as well as choosing optimal parameter values for the selected backend. We will discuss our approach and initial results obtained for different skeletons (map, mapreduce, reduce).

  • 24.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    A Framework for Performance-aware Composition of Applications for GPU-based Systems. 2013. Conference paper (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with a performance model and other metadata. We present a component model and a composition framework for the performance-aware composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with particular focus on global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection, both when performance characteristics are known (or learned) beforehand and when they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, as an important step towards global composition, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis, thus improving over the traditional greedy performance-aware policy that only considers the current call for optimization.

  • 25.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    A performance-portable generic component for 2D convolution computations on GPU-based systems. 2011. In: Fourth Swedish Workshop on Multi-Core Computing MCC-2011: November 23-25, 2011, Linköping University, Linköping, Sweden / [ed] Christoph Kessler, Linköping: Linköping University, 2011, Vol. S. 39-44, p. 39-44. Conference paper (Other academic)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton.

  • 26.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    A performance-portable generic component for 2D convolution computations on GPU-based systems. 2012. In: Proceedings of the Fifth International Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2012) at the HiPEAC-2012 conference, Paris, Jan. 2012 / [ed] E. Ayguade, B. Gaster, L. Howes, P. Stenström, O. Unsal, 2012. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton. The memory (constant and shared memory) and adaptive tiling optimizations are applied and their performance implications are evaluated on different classes of GPUs. We present two different metrics to calculate the optimal tiling factor dynamically in an automated way which helps in retaining best performance without manual tuning while moving to new GPU architectures. With our approach, we can achieve average speedups by a factor of 3.6, 2.3, and 2.4 over an otherwise optimized (without tiling) implementation on NVIDIA C2050, GTX280 and 8800 GT GPUs respectively. Above all, the performance portability is achieved without requiring any manual changes in the skeleton program or the skeleton implementation.

  • 27.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Conditional component composition for GPU-based systems. 2014. In: Proc. Seventh Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2014) at HiPEAC-2014, Vienna, Austria, Jan. 2014, Vienna, Austria: HiPEAC NoE, 2014. Conference paper (Refereed)
    Abstract [en]

    User-level components can expose multiple functionally equivalent implementations with different resource requirements and performance characteristics. A composition framework can then choose a suitable implementation for each component invocation guided by an objective function (execution time, energy etc.). In this paper, we describe the idea of conditional composition which enables the component writer to specify constraints on the selectability of a given component implementation based on information about the target system and component call properties. By incorporating such information, more informed and user-guided composition decisions can be made and thus more efficient code be generated, as shown with an example scenario for a GPU-based system.

  • 28.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Performance-aware Composition Framework for GPU-based Systems. 2015. In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no. 12, p. 4646-4662. Article in journal (Refereed)
    Abstract [en]

    User-level components of applications can be made performance-aware by annotating them with performance models and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases when they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. The bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.

  • 29.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Smart Containers and Skeleton Programming for GPU-Based Systems. 2016. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 44, no. 3, p. 506-530. Article in journal (Refereed)
    Abstract [en]

    In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers can help in achieving asynchronous execution for skeleton calls while providing implicit synchronization capabilities in a data consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.

  • 30.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Thibault, Samuel
    Laboratoire Bordelais de Recherche en Informatique (LaBRI), France.
    Flexible runtime support for efficient skeleton programming on hybrid systems. 2012. In: Applications, Tools and Techniques on the Road to Exascale Computing / [ed] K. De Bosschere, E. H. D'Hollander, G. R. Joubert, D. Padua, F. Peters., Amsterdam: IOS Press, 2012, 22, p. 159-166. Chapter in book (Other academic)
    Abstract [en]

    SkePU is a skeleton programming framework for multicore CPU and multi-GPU systems. StarPU is a runtime system that provides dynamic scheduling and memory management support for heterogeneous, accelerator-based systems. We have implemented support for StarPU as a possible backend for SkePU while keeping the generic SkePU interface intact. The mapping of a SkePU skeleton call to one or more StarPU tasks allows StarPU to exploit independence between different skeleton calls as well as within a single skeleton call. Support for different StarPU features, such as data partitioning and different scheduling policies (e.g. history based performance models) is implemented and discussed in this paper. The integration proved beneficial for both StarPU and SkePU. StarPU got a high level interface to run data-parallel computations on it while SkePU has achieved dynamic scheduling and hybrid parallelism support. Several benchmarks including ODE solver, separable Gaussian blur filter, Successive Over-Relaxation (SOR) and Coulombic potential are implemented. Initial experiments show that we can even achieve super-linear speedups for realistic applications and can observe clear improvements in performance with the simultaneous use of both CPUs and GPU (hybrid execution).

  • 31.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system.
    Adaptive Implementation Selection in the SkePU Skeleton Programming Library. 2013. In: Advanced Parallel Processing Technologies (APPT-2013), Proceedings / [ed] Chengyung Wu and Albert Cohen (eds.), 2013, p. 170-183. Conference paper (Refereed)
    Abstract [en]

    In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four types of implementations (implementation variants) for each skeleton: serial C++, OpenMP, CUDA and OpenCL targeting either CPU or GPU execution respectively. Deciding which implementation would run faster for a given skeleton call depends upon the computation, problem size(s), system architecture and data locality.

    In this paper, we present our work on automatic selection between these implementation variants by an offline machine learning method which generates a compact decision tree with low training overhead. The proposed selection mechanism is flexible yet high-level allowing a skeleton programmer to control different training choices at a higher abstraction level. We have evaluated our optimization strategy with 9 applications/kernels ported to our skeleton library and achieve on average more than 94% (90%) accuracy with just 0.53% (0.58%) training space exploration on two systems. Moreover, we discuss one application scenario where local optimization considering a single skeleton call can prove sub-optimal, and propose a heuristic for bulk implementation selection considering more than one skeleton call to address such application scenarios.

  • 32.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    The PEPPHER composition tool: performance-aware composition for GPU-based systems. 2014. In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 96, no. 12, p. 1195-1211. Article in journal (Refereed)
    Abstract [en]

    The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.

  • 33.
    Dastgeer, Usman
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-Based Systems. 2012. In: High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, IEEE, 2012, p. 711-720. Conference paper (Refereed)
    Abstract [en]

    The PEPPHER component model defines an environment for annotation of native C/C++ based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units might be provided, together with metadata such as explicitly exposed tunable parameters. The goal is to compose an application from its components and variants such that, depending on the run-time context, the most suitable implementation variant will be chosen automatically for each invocation. We describe and evaluate the PEPPHER composition tool, which explores the application's components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. With several applications, we demonstrate how the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath.

  • 34.
    Ericsson, Morgan
    et al.
    MSI Universitet Växjö, Sweden.
    Löwe, Welf
    MSI Universitet Växjö, Sweden.
    Kessler, Christoph
    Linköpings universitet, Tekniska högskolan. Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar.
    Andersson, Jesper
    MSI Universitet Växjö, Sweden.
    Composition and Optimization. 2008. In: Int. Workshop on Component-Based High Performance Computing CBHPC-2008, 2008, New York, USA: ACM, 2008. Conference paper (Refereed)
  • 35.
    Eriksson, Mattias
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Integrated Code Generation for Loops. 2012. In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 11, no. 1. Article in journal (Refereed)
    Abstract [en]

    Code generation in a compiler is commonly divided into several phases: instruction selection, scheduling, register allocation, spill code generation, and, in the case of clustered architectures, cluster assignment. These phases are interdependent; for instance, a decision in the instruction selection phase affects how an operation can be scheduled. We examine the effect of this separation of phases on the quality of the generated code. To study this we have formulated optimal methods for code generation with integer linear programming; first for acyclic code and then we extend this method to modulo scheduling of loops. In our experiments we compare optimal modulo scheduling, where all phases are integrated, to modulo scheduling, where instruction selection and cluster assignment are done in a separate phase. The results show that, for an architecture with two clusters, the integrated method finds a better solution than the nonintegrated method for 27% of the instances.

  • 36.
    Eriksson, Mattias
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Integrated modulo scheduling for clustered VLIW architectures. 2009. In: High Performance Embedded Architectures and Compilers: Fourth International Conference, HiPEAC 2009, Paphos, Cyprus, January 25-28, 2009. Proceedings / [ed] André Seznec, Joel Emer, Michael O’Boyle, Margaret Martonosi and Theo Ungerer, Springer Berlin/Heidelberg, 2009, Vol. 5409 LNCS, p. 65-79. Chapter in book (Refereed)
    Abstract [en]

    We solve the problem of integrating modulo scheduling with instruction selection (including cluster assignment), instruction scheduling and register allocation, with optimal spill code generation and scheduling. Our method is based on integer linear programming. We prove that our algorithm delivers optimal results in finite time for a certain class of architectures. We believe that these results are interesting both from a theoretical point of view and as a reference point when devising heuristic methods.

  • 37.
    Eriksson, Mattias
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska högskolan.
    Integrated Offset Assignment. 2011. In: Proceedings 9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9) / [ed] George Cai and Tom van der Aa, 2011, p. 47-54. Conference paper (Refereed)
    Abstract [en]

    One important part of generating code for DSP processors is to make good use of the address generation unit (AGU). In this paper we divide the code generation into three parts: (1) scheduling, (2) address register assignment, and (3) storage layout. The goal is to find out if solving these three subproblems as one big integrated problem gives better results compared to when scheduling or address register assignment is solved separately. We present optimal dynamic programming algorithms for both integrated and non-integrated code generation for DSP processors. In our experiments we find that integration is beneficial when the AGU has 1 or 2 address registers; for the other cases existing heuristics are near optimal. We also find that integrating address register assignment and storage layout gives slightly better results than integrating scheduling and storage layout. I.e. address register assignment is more important than scheduling.

  • 38.
    Eriksson, Mattias
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Chalabine, Mikhail
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Load balancing of irregular parallel divide-and-conquer algorithms in group-SPMD programming environments. 2006. In: 8th Workshop on Parallel Systems and Algorithms PASA 2006 / [ed] Wolfgang Karl, Jürgen Becker, Karl-Erwin Großpietsch, Christian Hochberger and Erik Maehle, Frankfurt/Main, Germany, 2006, p. 313-322. Conference paper (Refereed)
    Abstract [en]

    We study strategies for local load balancing of irregular parallel divide-and-conquer algorithms such as Quicksort and Quickhull in SPMD-parallel environments such as MPI and Fork that allow exploiting nested parallelism by dynamic group splitting. We propose two new local strategies, repivoting and serialisation, and develop a hybrid local load balancing strategy, which is calibrated by parameters that are derived off-line from a dynamic programming optimisation. While the approach is generic, we have implemented and evaluated our method for two very different parallel platforms. We found that our local strategy is superior to global dynamic load balancing on a Linux cluster, while the latter performs better on a tightly synchronised shared-memory platform with nonblocking, cheap task queue access.

  • 39.
    Eriksson, Mattias
    et al.
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Skoog, Oskar
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, PELAB - Laboratoriet för programmeringsomgivningar. Linköpings universitet, Tekniska högskolan.
    Optimal vs. Heuristic Integrated Code Generation for Clustered VLIW Architectures. 2008. In: 11th ACM SIGBED Int. Workshop on Software and Compilers for Embedded Systems SCOPES 2008, 2008, New York, USA: Association for Computing Machinery (ACM), 2008, p. 11-20. Conference paper (Refereed)
    Abstract [en]

    In this paper we present two algorithms for integrated code generation for clustered VLIW architectures. One algorithm is a heuristic based on genetic algorithms, the other algorithm is based on integer linear programming. The performance of the algorithms is compared on a portion of the Mediabench benchmark suite. We found the results of the genetic algorithm to be within one or two clock cycles from optimal for the cases where the optimum is known. In addition the heuristic algorithm produces results in predictable time also when the optimal integer linear program fails.

  • 40.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Ahlqvist, Johan
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Zouzoula, Stavroula
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    SkePU 3: Portable High-Level Programming of Heterogeneous Systems and HPC Clusters. 2021. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 49, no. 6, p. 846-866. Article in journal (Refereed)
    Abstract [en]

    We present the third generation of the C++-based open-source skeleton programming framework SkePU. Its main new features include new skeletons, new data container types, support for returning multiple objects from skeleton instances and user functions, support for specifying alternative platform-specific user functions to exploit e.g. custom SIMD instructions, generalized scheduling variants for the multicore CPU backends, and a new cluster-backend targeting the custom MPI interface provided by the StarPU task-based runtime system. We have also revised the smart data containers' memory consistency model for automatic data sharing between main and device memory. The new features are the result of a two-year co-design effort collecting feedback from HPC application partners in the EU H2020 project EXA2PRO, and target especially the HPC application domain and HPC platforms. We evaluate the performance effects of the new features on high-end multicore CPU and GPU systems and on HPC clusters.

  • 41.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Griebler, Dalvan
    Pontif Catholic Univ Rio Grande do Sul PUCRS, Brazil.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems. 2023. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 51, p. 61-82. Article in journal (Refereed)
    Abstract [en]

    We analyze the performance portability of the skeleton-based, single-source multi-backend high-level programming framework SkePU across multiple different CPU-GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of a part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80% and in particular for CPU systems, SkePU can outperform hand-written code.

  • 42.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Extending smart containers for data locality-aware skeleton programming. 2019. In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 31, no. 5, article id e5003. Article in journal (Refereed)
    Abstract [en]

    We present an extension for the SkePU skeleton programming framework to improve the performance of sequences of transformations on smart containers. By using lazy evaluation, SkePU records skeleton invocations and dependencies as directed by smart container operands. When a partial result is required by a different part of the program, the run-time system will process the entire lineage of skeleton invocations; tiling is applied to keep chunks of container data in the working set for the whole sequence of transformations. The approach is inspired by big data frameworks operating on large clusters where good data locality is crucial. We also consider benefits other than data locality with the increased run-time information given by the lineage structures, such as backend selection for heterogeneous systems. Experimental evaluation of example applications shows potential for performance improvements due to better cache utilization, as long as the overhead of lineage construction and management is kept low.

  • 43.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Multi-Variant User Functions for Platform-Aware Skeleton Programming. 2020. In: PARALLEL COMPUTING: TECHNOLOGY TRENDS, IOS PRESS, 2020, Vol. 36, p. 475-484. Conference paper (Refereed)
    Abstract [en]

    Today's computer architectures are increasingly specialized and heterogeneous configurations of computational units are common. To provide efficient programming of these systems while still achieving good performance, including performance portability across platforms, high-level parallel programming libraries and tool-chains are used, such as the skeleton programming framework SkePU. SkePU works on heterogeneous systems by automatically generating program components, "user functions", for multiple different execution units in the system, such as CPU and GPU, from a high-level C++ program. This work extends this multi-backend approach by providing the possibility for the programmer to provide additional variants of these user functions tailored for different scenarios, such as platform constraints. This paper introduces the overall approach of multi-variant user functions, provides several use cases including explicit SIMD vectorization for supported hardware, and evaluates the result of these optimizations that can be achieved using this extension.

  • 44.
    Ernstsson, August
    et al.
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Li, Lu
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    Kessler, Christoph
    Linköpings universitet, Institutionen för datavetenskap, Programvara och system. Linköpings universitet, Tekniska fakulteten.
    SkePU 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems. 2018. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 46, no. 1, p. 62-80. Article in journal (Refereed)
    Abstract [en]

    In this article we present SkePU 2, the next generation of the SkePU C++ skeleton programming framework for heterogeneous parallel systems. We critically examine the design and limitations of the SkePU 1 programming interface. We present a new, flexible and type-safe, interface for skeleton programming in SkePU 2, and a source-to-source transformation tool which knows about SkePU 2 constructs such as skeletons and user functions. We demonstrate how the source-to-source compiler transforms programs to enable efficient execution on parallel heterogeneous systems. We show how SkePU 2 enables new use-cases and applications by increasing the flexibility from SkePU 1, and how programming errors can be caught earlier and easier thanks to improved type safety. We propose a new skeleton, Call, unique in the sense that it does not impose any predefined skeleton structure and can encapsulate arbitrary user-defined multi-backend computations. We also discuss how the source-to-source compiler can enable a new optimization opportunity by selecting among multiple user function specializations when building a parallel program. Finally, we show that the performance of our prototype SkePU 2 implementation closely matches that of SkePU 1.

  • 45.
    Ernstsson, August
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Vandenbergen, Nicolas
    Julich Supercomp Ctr, Germany.
    Keller, Jörg
    Fernuniv, Germany.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    A Deterministic Portable Parallel Pseudo-Random Number Generator for Pattern-Based Programming of Heterogeneous Parallel Systems. 2022. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 50, pp. 319-340. Journal article (Refereed)
    Abstract [en]

    SkePU is a pattern-based high-level programming model for transparent program execution on heterogeneous parallel computing systems. A key feature of SkePU is that, in general, the selection of the execution platform for a skeleton-based function call need not be determined statically. On single-node systems, SkePU can select among CPU, multithreaded CPU, single or multi-GPU execution. Many scientific applications use pseudo-random number generators (PRNGs) as part of the computation. In the interest of correctness and debugging, deterministic parallel execution is a desirable property, which however requires a deterministically parallelized pseudo-random number generator. We present the API and implementation of a deterministic, portable parallel PRNG extension to SkePU that is scalable by design and exhibits the same behavior regardless of where and with how many resources it is executed. We evaluate it with four probabilistic applications and show that the PRNG enables scalability on both multi-core CPU and GPU resources, and hence supports the universal portability of SkePU code even in the presence of PRNG calls, while source code complexity is reduced.
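
    The property described above (identical results regardless of how many resources execute the computation) can be illustrated with a small sketch. This is not the SkePU PRNG API or its implementation; it assumes a simple index-based scheme in which the i-th random value is computed directly from a seed and the index i using the SplitMix64 mixing function. The names `rand_unit`, `count_hits` and `estimate_pi` are hypothetical, and the "workers" are simulated sequentially.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # SplitMix64 increment


def mix64(z):
    """SplitMix64 step: add the increment, then scramble the 64-bit value."""
    z = (z + GOLDEN) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def rand_unit(seed, index):
    """Uniform float in [0, 1) that depends only on (seed, index)."""
    state = (seed + index * GOLDEN) & MASK64
    return (mix64(state) >> 11) / float(1 << 53)


def count_hits(seed, start, stop):
    """Count sample points in [start, stop) that fall inside the unit circle."""
    hits = 0
    for i in range(start, stop):
        x = rand_unit(seed, 2 * i)      # each sample owns two stream indices
        y = rand_unit(seed, 2 * i + 1)
        if x * x + y * y < 1.0:
            hits += 1
    return hits


def estimate_pi(seed, samples, workers):
    """Monte Carlo estimate of pi; workers get contiguous chunks of samples.

    Workers are simulated one after another here; because every random
    value is a pure function of (seed, index), the chunking does not
    affect the result.
    """
    bounds = [samples * w // workers for w in range(workers + 1)]
    hits = sum(count_hits(seed, bounds[w], bounds[w + 1]) for w in range(workers))
    return 4.0 * hits / samples
```

    Calling `estimate_pi(42, 10000, 1)` and `estimate_pi(42, 10000, 7)` yields the same value, since the partitioning only changes which worker evaluates which index, not the values themselves.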

  • 46.
    Forsell, Martti
    et al.
    Platform Architectures Team, VTT Technical Research Centre of Finland, Finland.
    Hansson, Erik
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Mäkelä, Jari-Matti
    Information Technology, University of Turku, Finland.
    Leppänen, Ville
    Information Technology, University of Turku, Finland.
    Hardware and Software Support for NUMA Computing on Configurable Emulated Shared Memory Architectures. 2013. In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE conference proceedings, 2013, pp. 640-647. Conference paper (Refereed)
    Abstract [en]

    The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of computation to programmers as well as avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited or for direct compatibility for legacy code sequential computing or NUMA mechanism. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. by leaving the memories for different modes isolated, and therefore the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA-shared memory access mechanisms and the software ones provide a mechanism to integrate NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

  • 47.
    Forsell, Martti
    et al.
    Platform Architectures Team, VTT Technical Research Centre of Finland, Finland.
    Hansson, Erik
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Mäkelä, Jari-Matti
    Department of Information Technology, University of Turku, Finland.
    Leppänen, Ville
    Department of Information Technology, University of Turku, Finland.
    NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures. 2014. In: International Journal of Networking and Computing, ISSN 2185-2839, E-ISSN 2185-2847, Vol. 4, no. 1, pp. 189-206. Journal article (Refereed)
    Abstract [en]

    The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of computation to programmers as well as avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited or for direct compatibility for legacy code sequential computing and NUMA mechanism. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. by leaving the memories for different modes isolated, and therefore the programming interface is non-integrated. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

  • 48.
    Hansson, Erik
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Alnervik, Erik
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Forsell, Martti
    VTT Technical Research Centre of Finland.
    A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs. 2014. In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Lübeck, Germany: VDE Verlag GmbH, 2014, pp. 27-33. Conference paper (Refereed)
    Abstract [en]

    The performance of current multicore CPUs and GPUs is limited in computations making frequent use of communication/synchronization between the subtasks executed in parallel. This is because the directory-based cache systems scale weakly and/or the cost of synchronization is high. The Emulated Shared Memory (ESM) architectures relying on multithreading and efficient synchronization mechanisms have been developed to solve these problems affecting both performance and programmability of current machines. In this paper, we present a preliminary comparison of the performance of three hardware-implemented ESM architectures with state-of-the-art multicore CPUs and GPUs. The benchmarks are selected to cover different patterns of parallel computation and therefore reveal the performance potential of ESM architectures with respect to current multicores.

  • 49.
    Hansson, Erik
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, The Institute of Technology.
    Optimized selection of runtime mode for the reconfigurable PRAM-NUMA architecture REPLICA using machine-learning. 2014. In: Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II / [ed] Luis Lopes et al., Springer-Verlag New York, 2014, pp. 133-145. Conference paper (Refereed)
    Abstract [en]

    The massively hardware multithreaded VLIW emulated shared memory (ESM) architecture REPLICA has a dynamically reconfigurable on-chip network that offers two execution modes: PRAM and NUMA. PRAM mode is mainly suitable for applications with a high amount of thread-level parallelism (TLP), while NUMA mode is mainly for accelerating execution of sequential programs or programs with low TLP. Also, some types of regular data parallel algorithms execute faster in NUMA mode. It is not obvious in which mode a given program region shows the best performance. In this study we focus on generic stencil-like computations exhibiting regular control flow and memory access patterns. We use two state-of-the-art machine-learning methods, C5.0 (decision trees) and Eureqa Pro (symbolic regression), to select which mode to use. We use these methods to derive different predictors based on the same training data and compare their results. The best derived predictors reach an accuracy of 95% and are generated by both C5.0 and Eureqa Pro, although the latter can in some cases be more sensitive to the training data. The average speedup gained due to mode switching ranges from 1.92 to 2.23 for all generated predictors on the evaluation test cases, and using a majority voting algorithm, based on the three best predictors, we can eliminate all misclassifications.
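
    The majority-voting step mentioned at the end of the abstract can be sketched as follows. The three predictors, their feature names (`threads`, `elements`) and all thresholds are hypothetical placeholders invented for illustration; they are not the C5.0 or Eureqa Pro models from the paper.

```python
from collections import Counter


def majority_vote(predictors, features):
    """Return the mode chosen by most predictors.

    With an odd number of predictors and two possible modes,
    a strict majority always exists.
    """
    votes = Counter(predict(features) for predict in predictors)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for three learned predictors (assumed thresholds).
def by_parallelism(f):
    return "PRAM" if f["threads"] >= 64 else "NUMA"


def by_problem_size(f):
    return "PRAM" if f["elements"] >= 4096 else "NUMA"


def by_work_per_thread(f):
    return "PRAM" if f["elements"] / f["threads"] <= 256 else "NUMA"


PREDICTORS = [by_parallelism, by_problem_size, by_work_per_thread]
```

    For a region with `{"threads": 128, "elements": 1024}`, one placeholder predictor votes NUMA and two vote PRAM, so `majority_vote(PREDICTORS, ...)` selects PRAM; a single misclassifying predictor is outvoted.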

  • 50.
    Hansson, Erik
    et al.
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Kessler, Christoph
    Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
    Optimized variant-selection code generation for loops on heterogeneous multicore systems. 2016. In: Parallel Computing: On the Road to Exascale / [ed] Gerhard R. Joubert; Hugh Leather; Mark Parsons; Frans Peters; Mark Sawyer, IOS Press, 2016, pp. 103-112. Conference paper (Refereed)
    Abstract [en]

    We consider the general problem of generating code for the automated selection of the expected best implementation variants for multiple subcomputations on a heterogeneous multicore system, where the program's control flow between the subcomputations is structured by sequencing and loops. A naive greedy approach, as applied in previous works on multi-variant selection code generation, would determine the locally best variant for each subcomputation instance but might miss globally better solutions. We present a formalization and a fast algorithm for the global variant selection problem for loop-based programs. We also show that loop unrolling can additionally improve performance, and prove an upper bound on the unroll factor, which makes it possible to keep the run-time space overhead for the variant-dispatch data structure low. We evaluate our method in case studies using an ARM big.LITTLE based system and a GPU based system, where we consider optimization for both energy and performance.
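
    The difference between locally greedy and global variant selection can be illustrated with a small sketch. It is not the paper's formalization or algorithm: it assumes a straight-line sequence of subcomputations (no loops), a fixed cost for switching between execution units, and made-up cost values. The function names and the cost model are hypothetical.

```python
def greedy_selection(costs, switch_cost):
    """Pick, step by step, the variant that is cheapest right now."""
    path, total, prev = [], 0, None
    for step in costs:
        def step_cost(v):
            # Pay a penalty when leaving the previously chosen unit.
            penalty = switch_cost if prev is not None and v != prev else 0
            return step[v] + penalty
        choice = min(step, key=step_cost)
        total += step_cost(choice)
        path.append(choice)
        prev = choice
    return path, total


def global_selection(costs, switch_cost):
    """Dynamic programming over the sequence: cheapest overall plan."""
    # best[v] = (cost, path) of the cheapest plan so far that ends in variant v
    best = {v: (c, [v]) for v, c in costs[0].items()}
    for step in costs[1:]:
        new_best = {}
        for v, c in step.items():
            prev_cost, prev_path = min(
                ((pc + (switch_cost if u != v else 0), pp)
                 for u, (pc, pp) in best.items()),
                key=lambda t: t[0],
            )
            new_best[v] = (prev_cost + c, prev_path + [v])
        best = new_best
    total, path = min(best.values(), key=lambda t: t[0])
    return path, total


# Assumed example costs (arbitrary units) for three subcomputations.
EXAMPLE_COSTS = [
    {"cpu": 4, "gpu": 5},
    {"cpu": 10, "gpu": 2},
    {"cpu": 10, "gpu": 2},
]
```

    With `EXAMPLE_COSTS` and a switch cost of 3, the greedy strategy starts on the CPU and pays for a later switch (total 11), whereas the global search starts on the GPU despite its higher first-step cost (total 9).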
