Hansson, Erik
Publications (9 of 9)
Hansson, E., Alnervik, E., Kessler, C. & Forsell, M. (2014). A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs. In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany. Paper presented at 27th International Conference on Architecture of Computing Systems (ARCS) 2014, PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Feb. 2014 (pp. 27-33). Lübeck, Germany: VDE Verlag GmbH
A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs
2014 (English) In: 27th International Conference on Architecture of Computing Systems (ARCS), 2014, ARCS Workshops: Proc. PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany: VDE Verlag GmbH, 2014, p. 27-33. Conference paper, Published paper (Refereed)
Abstract [en]

The performance of current multicore CPUs and GPUs is limited in computations that make frequent use of communication and synchronization between the subtasks executed in parallel. This is because the directory-based cache systems scale weakly and/or the cost of synchronization is high. Emulated shared memory (ESM) architectures, relying on multithreading and efficient synchronization mechanisms, have been developed to solve these problems, which affect both the performance and the programmability of current machines. In this paper, we present a preliminary comparison of the performance of three hardware-implemented ESM architectures with state-of-the-art multicore CPUs and GPUs. The benchmarks are selected to cover different patterns of parallel computation and therefore reveal the performance potential of ESM architectures with respect to current multicores.
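To illustrate the kind of computation where synchronization cost dominates, the C sketch below is our own illustration (not one of the paper's benchmarks) and assumes a compiler with OpenMP support: a tree reduction with a full barrier after every step, so an architecture with expensive synchronization pays roughly log2(N) barrier costs.

/* Hypothetical illustration: barrier-per-step tree reduction.
 * Each of the log2(N) steps ends with an implicit OpenMP barrier, so
 * synchronization cost dominates on machines where barriers are expensive.
 * Build e.g. with: gcc -fopenmp reduce.c */
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)
static double a[N];

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Tree reduction: in step s, element i accumulates element i + s. */
    for (int s = 1; s < N; s <<= 1) {
        #pragma omp parallel for        /* implicit barrier at loop end */
        for (int i = 0; i < N; i += 2 * s)
            a[i] += a[i + s];
    }
    printf("sum = %f\n", a[0]);         /* expect N = 1048576 */
    return 0;
}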

Place, publisher, year, edition, pages
Lübeck, Germany: VDE Verlag GmbH, 2014
Series
PARS-Mitteilungen, ISSN 0177-0454 ; 31
Keywords
Parallel computing, performance analysis, GPU, chip multiprocessor, shared memory
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114341 (URN); 978-3-8007-3579-2 (ISBN)
Conference
27th International Conference on Architecture of Computing Systems (ARCS) 2014, PASA-2014 11th Workshop on Parallel Systems and Algorithms, Lübeck, Germany, Feb. 2014
Projects
REPLICA; SeRC OpCoReS
Funder
Swedish e‐Science Research Center, OpCoReS
Available from: 2015-02-18 Created: 2015-02-18 Last updated: 2018-01-11
Hansson, E. (2014). Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture. (Licentiate dissertation). Linköping: Linköping University Electronic Press
Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture
2014 (English) Licentiate thesis, monograph (Other academic)
Abstract [en]

In this thesis we describe techniques for code generation and global optimization for a PRAM-NUMA multicore architecture. We specifically focus on the REPLICA architecture, a family of massively multithreaded very long instruction word (VLIW) chip multiprocessors with chained functional units and a reconfigurable emulated shared on-chip memory. The on-chip memory system supports two execution modes, PRAM and NUMA, which can be switched between at run-time. PRAM mode is considered the standard execution mode and mainly targets applications with very high thread-level parallelism (TLP). In contrast, NUMA mode is optimized for sequential legacy applications and applications with a low amount of TLP. Different versions of the REPLICA architecture have different numbers of cores, hardware threads and functional units. In order to utilize the REPLICA architecture efficiently we have made several contributions to the development of a compiler for REPLICA target code generation. It supports code generation for both PRAM mode and NUMA mode and can generate code for different versions of the processor pipeline (i.e. for different numbers of functional units). It includes optimization phases to increase the utilization of the available functional units. We have also contributed to the quantitative evaluation of PRAM and NUMA mode. The results show that PRAM mode often suits programs with irregular memory access patterns and control flow best, while NUMA mode suits regular programs better. However, for a particular program it is not always obvious which mode, PRAM or NUMA, will show the best performance. To tackle this we contributed a case study for generic stencil computations that uses machine-learning-derived cost models to automatically select at runtime which mode to execute in. We extended this approach to also cover sequences of kernels.
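As a rough illustration of the run-time mode selection described for the stencil case study, the C sketch below uses hypothetical names and thresholds (it is not the REPLICA toolchain or its trained cost models): a stencil kernel is dispatched to a PRAM-mode or NUMA-mode code version according to a simple predictor.

#include <stddef.h>

typedef enum { MODE_PRAM, MODE_NUMA } exec_mode_t;

/* Hypothetical predictor standing in for the machine-learning-derived
 * cost model: decides from simple kernel features which mode to run in. */
static exec_mode_t predict_mode(size_t n, int stencil_points, int hw_threads)
{
    double work_per_thread = (double)n / hw_threads;
    /* Illustrative rule only -- real thresholds come from training data. */
    return (work_per_thread > 4.0 && stencil_points >= 3) ? MODE_PRAM
                                                          : MODE_NUMA;
}

/* One 3-point stencil sweep; in the real system the two mode-specific
 * versions below would be separate code generated for PRAM and NUMA mode. */
static void stencil_sweep(const double *in, double *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

static void stencil_pram_mode(const double *in, double *out, size_t n)
{ stencil_sweep(in, out, n); }
static void stencil_numa_mode(const double *in, double *out, size_t n)
{ stencil_sweep(in, out, n); }

/* Dispatch at run time according to the predicted mode. */
void run_stencil(const double *in, double *out, size_t n, int hw_threads)
{
    if (predict_mode(n, 3, hw_threads) == MODE_PRAM)
        stencil_pram_mode(in, out, n);
    else
        stencil_numa_mode(in, out, n);
}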

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2014. p. 101
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 1688
Keywords
PRAM; NUMA; multicore; reconfigurable; code generation; optimized composition
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-111333 (URN); 10.3384/lic.diva-111333 (DOI); 978-91-7519-189-8 (ISBN)
Presentation
2014-12-16, Alan Turing, Hus E, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2014-11-17 Created: 2014-10-14 Last updated: 2018-01-11. Bibliographically approved
Forsell, M., Hansson, E., Kessler, C., Mäkelä, J.-M. & Leppänen, V. (2014). NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures. International Journal of Networking and Computing, 4(1), 189-206
NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures
2014 (English) In: International Journal of Networking and Computing, ISSN 2185-2839, E-ISSN 2185-2847, Vol. 4, no 1, p. 189-206. Article in journal (Refereed) Published
Abstract [en]

The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel synchronous model of computation to programmers and to avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with sequential legacy code and existing NUMA mechanisms. Unfortunately the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. it leaves the memories for the different modes isolated, and therefore the programming interface is not integrated either. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

Keywords
Parallel Random Access Machine; NUMA; Shared Memory; Multicore Architecture; Reconfigurable Processor; Multithreaded Processor; PRAM Emulation; Parallel Computing; Parallel Computer Architecture
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-109660 (URN)
Projects
REPLICA
Available from: 2014-08-22 Created: 2014-08-22 Last updated: 2018-01-11. Bibliographically approved
Hansson, E. & Kessler, C. (2014). Optimized selection of runtime mode for the reconfigurable PRAM-NUMA architecture REPLICA using machine-learning. In: Luis Lopes et al. (Ed.), Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II. Paper presented at Euro-Par 2014 Conference (pp. 133-145). Springer-Verlag New York
Optimized selection of runtime mode for the reconfigurable PRAM-NUMA architecture REPLICA using machine-learning
2014 (English) In: Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II / [ed] Luis Lopes et al., Springer-Verlag New York, 2014, p. 133-145. Conference paper, Published paper (Refereed)
Abstract [en]

The massively hardware multithreaded VLIW emulated shared memory (ESM) architecture REPLICA has a dynamically reconfigurable on-chip network that offers two execution modes: PRAM and NUMA. PRAM mode is mainly suitable for applications with a high amount of thread-level parallelism (TLP), while NUMA mode is mainly intended for accelerating the execution of sequential programs or programs with low TLP. Also, some types of regular data parallel algorithms execute faster in NUMA mode. It is not obvious in which mode a given program region shows the best performance. In this study we focus on generic stencil-like computations exhibiting regular control flow and memory access patterns. We use two state-of-the-art machine-learning methods, C5.0 (decision trees) and Eureqa Pro (symbolic regression), to select which mode to use. We use these methods to derive different predictors from the same training data and compare their results. The best derived predictors reach an accuracy of 95% and are generated by both C5.0 and Eureqa Pro, although the latter can in some cases be more sensitive to the training data. The average speedup gained by mode switching ranges from 1.92 to 2.23 for all generated predictors on the evaluation test cases, and by using a majority voting algorithm over the three best predictors we can eliminate all misclassifications.
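The majority-voting step can be sketched as follows in C; the three placeholder decision rules and the feature vector below are invented for illustration and merely stand in for the predictors actually derived with C5.0 and Eureqa Pro.

#include <stdio.h>

typedef enum { MODE_PRAM, MODE_NUMA } exec_mode_t;

typedef struct {                 /* hypothetical per-kernel feature vector */
    double problem_size;
    double ops_per_point;
    double threads;
} features_t;

typedef exec_mode_t (*predictor_fn)(const features_t *);

/* Placeholder decision rules; the real predictors are a decision tree
 * (C5.0) and symbolic-regression expressions (Eureqa Pro).             */
static exec_mode_t p1(const features_t *f)
{ return f->problem_size / f->threads > 8.0 ? MODE_PRAM : MODE_NUMA; }
static exec_mode_t p2(const features_t *f)
{ return f->ops_per_point * f->threads > 64.0 ? MODE_PRAM : MODE_NUMA; }
static exec_mode_t p3(const features_t *f)
{ return f->problem_size > 1024.0 ? MODE_PRAM : MODE_NUMA; }

/* Run in whichever mode gets at least two of the three votes. */
static exec_mode_t vote_mode(const features_t *f)
{
    predictor_fn preds[3] = { p1, p2, p3 };
    int pram_votes = 0;
    for (int i = 0; i < 3; i++)
        if (preds[i](f) == MODE_PRAM)
            pram_votes++;
    return pram_votes >= 2 ? MODE_PRAM : MODE_NUMA;
}

int main(void)
{
    features_t f = { 4096.0, 5.0, 64.0 };
    printf("selected mode: %s\n",
           vote_mode(&f) == MODE_PRAM ? "PRAM" : "NUMA");
    return 0;
}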

Place, publisher, year, edition, pages
Springer-Verlag New York, 2014
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 8806
Keywords
parallel computing, reconfigurable architecture, chip multiprocessor, machine learning, program optimization, performance analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-114342 (URN); 10.1007/978-3-319-14313-2_12 (DOI); 000354785000012 (); 978-3-319-14312-5 (ISBN); 978-3-319-14313-2 (ISBN)
Conference
Euro-Par 2014 Conference
Projects
REPLICA; SeRC-OpCoReS
Funder
Swedish e‐Science Research Center, OpCoReS
Available from: 2015-02-18 Created: 2015-02-18 Last updated: 2018-02-21
Shafiee Sarvestani, A., Hansson, E. & Kessler, C. (2013). Extensible Recognition of Algorithmic Patterns in DSP Programs for Automatic Parallelization. International journal of parallel programming, 41(6), 806-824
Extensible Recognition of Algorithmic Patterns in DSP Programs for Automatic Parallelization
2013 (English) In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 41, no 6, p. 806-824. Article in journal (Refereed) Published
Abstract [en]

We introduce an extensible knowledge-based tool for idiom (pattern) recognition in DSP (digital signal processing) programs. Our tool utilizes functionality provided by the Cetus compiler infrastructure for detecting certain computation patterns that frequently occur in DSP code. We focus on recognizing patterns for for-loops and the statements in their bodies, as these are often the performance-critical constructs in DSP applications for which replacement by highly optimized, target-specific parallel algorithms will be most profitable. For better structuring and efficiency of pattern recognition, we classify patterns by different levels of complexity such that patterns at higher levels are defined in terms of lower-level patterns. The tool works statically on the intermediate representation. For better extensibility and abstraction, most of the structural part of the recognition rules is specified in XML form to separate the tool implementation from the pattern specifications. Information about detected patterns will later be used for optimized code generation by local algorithm replacement, e.g. for the low-power high-throughput multicore DSP architecture ePUMA.
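As an example of the kind of loop idiom such a recognizer targets (our own illustration in C, not a kernel taken from the paper), the FIR filter below combines a for-loop, a scalar reduction and a multiply-accumulate statement; once these patterns are identified in the intermediate representation, the loop nest is a candidate for replacement by an optimized, target-specific parallel implementation.

/* Plain FIR filter: a typical DSP for-loop idiom built from a reduction
 * and a multiply-accumulate statement in the inner loop body.           */
void fir(const float *x, const float *h, float *y, int n, int taps)
{
    for (int i = 0; i + taps <= n; i++) {
        float acc = 0.0f;                 /* scalar reduction idiom       */
        for (int k = 0; k < taps; k++)
            acc += h[k] * x[i + k];       /* multiply-accumulate idiom    */
        y[i] = acc;
    }
}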

Place, publisher, year, edition, pages
Springer Verlag (Germany), 2013
Keywords
Automatic parallelization, Algorithmic pattern recognition, Cetus, DSP, DSP code parallelization, Compiler frameworks
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-97429 (URN); 10.1007/s10766-012-0229-2 (DOI); 000322726300005 ()
Note

Funding Agencies: SSF; SeRC

Available from: 2013-09-12 Created: 2013-09-12 Last updated: 2017-12-06
Forsell, M., Hansson, E., Kessler, C., Mäkelä, J.-M. & Leppänen, V. (2013). Hardware and Software Support for NUMA Computing on Configurable Emulated Shared Memory Architectures. In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). Paper presented at 15th Workshop on Advances on Parallel and Distributed Computational Models (APDCM 2013), in conjunction with 2013 IEEE 27th International Parallel and Distributed Processing Symposium, 20-24 May 2013, Boston, Massachusetts USA (pp. 640-647). IEEE conference proceedings
Hardware and Software Support for NUMA Computing on Configurable Emulated Shared Memory Architectures
2013 (English) In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE conference proceedings, 2013, p. 640-647. Conference paper, Published paper (Refereed)
Abstract [en]

The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel synchronous model of computation to programmers and to avoid most performance bottlenecks present in current multicore architectures. In order to achieve full performance the applications must, however, have enough thread-level parallelism (TLP). To solve this problem, in our earlier work we have introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with sequential legacy code and existing NUMA mechanisms. Unfortunately the earlier proposed CESM architecture does not integrate the different modes of the architecture well together, e.g. it leaves the memories for the different modes isolated, and therefore the programming interface is not integrated either. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA-shared memory access mechanisms and the software ones provide a mechanism to integrate NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2013
Keywords
Parallel computing, Multicore architecture, Parallel programming model, PRAM, GPU, Benchmarking, Performance analysis
National Category
Natural Sciences; Computer Sciences
Identifiers
urn:nbn:se:liu:diva-102597 (URN); 10.1109/IPDPSW.2013.146 (DOI); 978-0-7695-4979-8 (ISBN)
Conference
15th Workshop on Advances on Parallel and Distributed Computational Models (APDCM 2013), in conjunction with 2013 IEEE 27th International Parallel and Distributed Processing Symposium, 20-24 May 2013, Boston, Massachusetts USA
Projects
REPLICA
Available from: 2013-12-16 Created: 2013-12-16 Last updated: 2018-01-11
Mäkelä, J.-M., Hansson, E., Åkesson, D., Forsell, M., Kessler, C. & Leppänen, V. (2012). Design of the Language Replica for Hybrid PRAM-NUMA Many-core Architectures. In: Parallel and Distributed Processing with Applications (ISPA), 2012. Paper presented at 10th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2012), Leganes, Madrid, Spain, July 10-13, 2012 (pp. 697-704). IEEE conference proceedings
Design of the Language Replica for Hybrid PRAM-NUMA Many-core Architectures
2012 (English) In: Parallel and Distributed Processing with Applications (ISPA), 2012, IEEE conference proceedings, 2012, p. 697-704. Conference paper, Published paper (Refereed)
Abstract [en]

Parallel programming is widely considered very demanding for an average programmer due to the inherent asynchrony of the underlying parallel architectures. In this paper we describe the main design principles and core features of Replica -- a parallel language aimed at high-level programming of a new paradigm of reconfigurable, scalable and powerful synchronous shared memory architectures that promise to make parallel programming radically easier with the help of strict memory consistency and deterministic synchronous execution of hardware threads and multi-operations.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2012
Keywords
multi-core, parallel computing, programming languages, parallel random access machine, PRAM model, NUMA shared memory, parallel computer architecture, many-core architecture, compiler, code generation
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-91970 (URN); 10.1109/ISPA.2012.103 (DOI); 978-1-4673-1631-6 (ISBN)
Conference
10th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2012), Leganes, Madrid, Spain, July 10-13, 2012
Projects
REPLICA
Available from: 2013-05-06 Created: 2013-05-06 Last updated: 2018-01-11. Bibliographically approved
Kessler, M., Hansson, E., Åkesson, D. & Kessler, C. (2012). Exploiting Instruction Level Parallelism for REPLICA - A Configurable VLIW Architecture With Chained Functional Units. In: Volume II. Paper presented at International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'12), 16-19 July 2012, Las Vegas, Nevada, USA (pp. 275-281). Las Vegas, Nevada, USA: CSREA Press
Exploiting Instruction Level Parallelism for REPLICA - A Configurable VLIW Architecture With Chained Functional Units
2012 (English) In: Volume II, Las Vegas, Nevada, USA: CSREA Press, 2012, p. 275-281. Conference paper, Published paper (Other academic)
Abstract [en]

In this paper we present a scheduling algorithm for VLIW architectures with chained functional units. We show how our algorithm can help speed up programs at the instruction level for an architecture called REPLICA, a configurable emulated shared memory (CESM) architecture whose computation model is based on the PRAM model. Since our LLVM-based compiler is parameterizable in the number of different functional units, read and write ports to the register file, etc., we can generate code for different REPLICA architectures that have different functional-unit configurations. We show for a set of different configurations how our implementation can produce high-quality code, and we argue that the high degree of parameterization makes the compiler, together with the simulator, useful for hardware/software co-design.
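The effect of chained functional units on scheduling can be illustrated with the toy greedy scheduler below (a deliberately simplified C sketch with an invented instruction set, not the algorithm from the paper): with chaining, a dependent operation may be placed in a later slot of the same instruction word as its producer, instead of having to wait for the next word as on a classical VLIW.

/* Toy greedy packer for a 3-slot VLIW with chained functional units:
 * an operation is ready if each producer sits in an earlier word, or in
 * an earlier slot of the current word (that is the chaining rule).      */
#include <stdio.h>

#define NOPS  6
#define SLOTS 3                      /* functional-unit slots per word   */

typedef struct {
    const char *name;
    int dep[2];                      /* up to two producer indices, -1 = none */
} op_t;

/* Small dependence DAG for a*b + c*d + *p stored to *q.                 */
static const op_t ops[NOPS] = {
    { "mul t1,a,b",   {-1, -1} },
    { "mul t2,c,d",   {-1, -1} },
    { "add t3,t1,t2", { 0,  1} },
    { "ld  t4,[p]",   {-1, -1} },
    { "add t5,t3,t4", { 2,  3} },
    { "st  [q],t5",   { 4, -1} },
};

int main(void)
{
    int word[NOPS], slot[NOPS], done[NOPS] = {0};
    int scheduled = 0, w = 0;

    while (scheduled < NOPS) {
        int s = 0;
        for (int i = 0; i < NOPS && s < SLOTS; i++) {
            if (done[i]) continue;
            int ok = 1;
            for (int d = 0; d < 2; d++) {
                int p = ops[i].dep[d];
                if (p < 0) continue;
                /* earlier word, or earlier slot of this word (chaining) */
                if (!done[p] || word[p] > w || (word[p] == w && slot[p] >= s))
                    ok = 0;
            }
            if (!ok) continue;
            word[i] = w; slot[i] = s++; done[i] = 1; scheduled++;
            printf("word %d slot %d: %s\n", w, s - 1, ops[i].name);
        }
        w++;
    }
    return 0;
}

With chaining the six operations fit into two instruction words (the dependent add issues in the same word as the two multiplies); without the chaining clause in the readiness test, each dependent operation would be pushed to a later word.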

Place, publisher, year, edition, pages
Las Vegas, Nevada, USA: CSREA Press, 2012
Keywords
instruction level parallelism, chained VLIW architecture, code generation, instruction scheduling, configurable architecture, LLVM compiler, compiler backend
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-91975 (URN); 1-60132-228-3 (ISBN)
Conference
International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'12), 16-19 July 2012, Las Vegas, Nevada, USA
Projects
REPLICA
Available from: 2013-05-06 Created: 2013-05-06 Last updated: 2018-01-11. Bibliographically approved
Kessler, C. & Hansson, E. (2012). Flexible scheduling and thread allocation for synchronous parallel tasks. In: G. Mühl, J. Richling, A. Herkersdorf (Ed.), ARCS-2012 Workshops. Paper presented at 10th Workshop on Parallel Systems and Algorithms (PASA 2012), Munich, Germany, February 29, 2012 (pp. 517-528). Gesellschaft für Informatik
Flexible scheduling and thread allocation for synchronous parallel tasks
2012 (English) In: ARCS-2012 Workshops / [ed] G. Mühl, J. Richling, A. Herkersdorf, Gesellschaft für Informatik, 2012, p. 517-528. Conference paper, Published paper (Refereed)
Abstract [en]

We describe a task model and a dynamic scheduling and resource allocation mechanism for synchronous parallel tasks to be executed on SPMD-programmed synchronous shared memory MIMD parallel architectures with uniform, unit-time memory access and strict memory consistency, also known in the literature as PRAMs (Parallel Random Access Machines). Our task model provides a two-tier programming model for PRAMs that flexibly combines SPMD and fork-join parallelism within the same application. It offers flexibility through dynamic scheduling and late resource binding while preserving the PRAM execution properties within each task, the only restriction being that the number of threads that can be assigned to a task is limited to what the underlying architecture provides. In particular, our approach opens up for automatic performance tuning at run-time by controlling the thread allocation for tasks based on run-time predictions. With a prototype implementation of a synchronous parallel task API in the SPMD-based PRAM language Fork and an experimental evaluation with example programs on the SBPRAM simulator, we show that a realization of the task model on an SPMD-programmable PRAM machine is feasible with moderate run-time overhead per task.
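A minimal sketch of the task descriptor and late thread binding is shown below in C with invented names (the paper's prototype is an API in the PRAM language Fork, not reproduced here): a task declares a minimum and maximum number of threads, and the run-time system decides the group size only when the task is started, capped by what the machine provides.

#include <stdio.h>

#define MACHINE_THREADS 2048          /* what the underlying PRAM provides */

typedef struct {
    const char *name;
    int min_threads;                  /* below this the task is deferred   */
    int max_threads;                  /* upper bound requested by the task */
    void (*body)(int rank, int group_size);   /* SPMD task body            */
} sync_task_t;

static int free_threads = MACHINE_THREADS;

/* Late resource binding: the group size is decided only when the task
 * starts, never exceeding what the architecture provides.               */
static int allocate_group(const sync_task_t *t)
{
    int want = t->max_threads < MACHINE_THREADS ? t->max_threads
                                                : MACHINE_THREADS;
    int got  = want < free_threads ? want : free_threads;
    if (got < t->min_threads)
        return 0;                     /* defer: not enough threads free    */
    free_threads -= got;
    return got;
}

static void release_group(int got) { free_threads += got; }

static void hello_body(int rank, int group_size)
{ printf("thread %d of %d\n", rank, group_size); }

int main(void)
{
    sync_task_t t = { "hello", 2, 8, hello_body };
    int got = allocate_group(&t);
    if (got > 0) {
        for (int r = 0; r < got; r++) /* stand-in for a synchronous SPMD run */
            t.body(r, got);
        release_group(got);
    }
    return 0;
}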

Place, publisher, year, edition, pages
Gesellschaft für Informatik, 2012
Series
Lecture Notes in Informatics, ISSN 1617-5468 ; 29
Keywords
Malleable tasks, mapping, PRAM, parallel computing, scheduling, thread allocation, run-time system, parallel task queue
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-91572 (URN); 978-1-4577-2145-8 (ISBN)
Conference
10th Workshop on Parallel Systems and Algorithms (PASA 2012), Munich, Germany, February 29, 2012
Projects
REPLICA
Available from: 2013-04-26 Created: 2013-04-26 Last updated: 2018-01-11