liu.seSearch for publications in DiVA
Change search
Link to record
Permanent link

Direct link
Eriksson, Mattias
Publications (10 of 11) Show all publications
Eriksson, M. & Kessler, C. (2012). Integrated Code Generation for Loops. ACM Transactions on Embedded Computing Systems, 11(1)
Open this publication in new window or tab >>Integrated Code Generation for Loops
2012 (English)In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 11, no 1Article in journal (Refereed) Published
Abstract [en]

Code generation in a compiler is commonly divided into several phases: instruction selection, scheduling, register allocation, spill code generation, and, in the case of clustered architectures, cluster assignment. These phases are interdependent; for instance, a decision in the instruction selection phase affects how an operation can be scheduled We examine the effect of this separation of phases on the quality of the generated code. To study this we have formulated optimal methods for code generation with integer linear programming; first for acyclic code and then we extend this method to modulo scheduling of loops. In our experiments we compare optimal modulo scheduling, where all phases are integrated, to modulo scheduling, where instruction selection and cluster assignment are done in a separate phase. The results show that, for an architecture with two clusters, the integrated method finds a better solution than the nonintegrated method for 27% of the instances.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2012
Keywords
Algorithms, Experimentation, Performance, Theory, Code generation, clustered VLIW architectures, modulo scheduling
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-81509 (URN)10.1145/2180887.2180896 (DOI)000307050900009 ()
Available from: 2012-09-18 Created: 2012-09-18 Last updated: 2017-12-07
Eriksson, M. (2011). Integrated Code Generation. (Doctoral dissertation). Linköping: Linköping University Electronic Press
Open this publication in new window or tab >>Integrated Code Generation
2011 (English)Doctoral thesis, monograph (Other academic)
Abstract [en]

Code generation in a compiler is commonly divided into several phases: instruction selection, scheduling, register allocation, spill code generation, and, in the case of clustered architectures, cluster assignment. These phases are interdependent; for instance, a decision in the instruction selection phase affects how an operation can be scheduled. We examine the effect of this separation of phases on the quality of the generated code. To study this we have formulated optimal methods for code generation with integer linear programming; first for acyclic code and then we extend this method to modulo scheduling of loops. In our experiments we compare optimal modulo scheduling, where all phases are integrated, to modulo scheduling where instruction selection and cluster assignment are done in a separate phase. The results show that, for an architecture with two clusters, the integrated method finds a better solution than the non-integrated method for 39% of the instances.

Our algorithm for modulo scheduling iteratively considers schedules with increasing number of schedule slots. A problem with such an iterative method is that if the initiation interval is not equal to the lower bound there is no way to determine whether the found solution is optimal or not. We have proven that for a class of architectures that we call transfer free, we can set an upper bound on the schedule length. I.e., we can prove when a found modulo schedule with initiation interval larger than the lower bound is optimal.

Another code generation problem that we study is how to optimize the usage of the address generation unit in simple processors that have very limited addressing modes. In this problem the subtasks are: scheduling, address register assignment and stack layout. Also for this problem we compare the results of integrated methods to the results of non-integrated methods, and we find that integration is beneficial when there are only a few (1 or 2) address registers available.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2011. p. 169
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1375
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-67471 (URN)978-91-7393-147-2 (ISBN)
Public defence
2011-06-07, Visionen, Hus B, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2011-05-12 Created: 2011-04-13 Last updated: 2019-12-19Bibliographically approved
Eriksson, M. & Kessler, C. (2011). Integrated Offset Assignment. In: George Cai and Tom van der Aa (Ed.), Proceedings 9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9): . Paper presented at 9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9), co-located with CGO-2011, Chamonix, France, 2 April 2011 (pp. 47-54).
Open this publication in new window or tab >>Integrated Offset Assignment
2011 (English)In: Proceedings 9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9) / [ed] George Cai and Tom van der Aa, 2011, p. 47-54Conference paper, Published paper (Refereed)
Abstract [en]

One important part of generating code for DSP processors is to make good use of the address generation unit (AGU). In this paper we divide the code generation into three parts: (1) scheduling, (2) address register assignment, and (3) storage layout. The goal is to nd out if solving these three subproblems as one big integrated problem gives better results compared to when scheduling or address register assignment is solved separately. We present optimal dynamic programming algorithms for both integrated and non-integrated code generation for DSP processors. In our experiments we nd that integrationis benecial when the AGU has 1 or 2 address registers; for the other cases existing heuristics are near optimal. We also nd that integrating address register assignment and storage layout gives slightly better results than integrating scheduling and storage layout. I.e. address register assignment is more important than scheduling.

Keywords
Integrated offset assignment, address code optimization, code generation, digital signal processor (DSP), address generation unit, instruction scheduling
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-93375 (URN)
Conference
9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9), co-located with CGO-2011, Chamonix, France, 2 April 2011
Projects
OPTIMISTePUMAIntegrated Software Pipelining
Funder
Swedish Research CouncilSwedish Foundation for Strategic Research
Available from: 2013-05-31 Created: 2013-05-31 Last updated: 2018-01-11Bibliographically approved
Eriksson, M. & Kessler, C. (2009). Integrated modulo scheduling for clustered VLIW architectures. In: André Seznec, Joel Emer, Michael O’Boyle, Margaret Martonosi and Theo Ungerer (Ed.), High Performance Embedded Architectures and Compilers: Fourth International Conference, HiPEAC 2009, Paphos, Cyprus, January 25-28, 2009. Proceedings (pp. 65-79). Springer Berlin/Heidelberg, 5409 LNCS
Open this publication in new window or tab >>Integrated modulo scheduling for clustered VLIW architectures
2009 (English)In: High Performance Embedded Architectures and Compilers: Fourth International Conference, HiPEAC 2009, Paphos, Cyprus, January 25-28, 2009. Proceedings / [ed] André Seznec, Joel Emer, Michael O’Boyle, Margaret Martonosi and Theo Ungerer, Springer Berlin/Heidelberg, 2009, Vol. 5409 LNCS, p. 65-79Chapter in book (Refereed)
Abstract [en]

We solve the problem of integrating modulo scheduling with instruction selection (including cluster assignment), instruction scheduling and register allocation, with optimal spill code generation and scheduling. Our method is based on integer linear programming. We prove that our algorithm delivers optimal results in finite time for a certain class of architectures. We believe that these results are interesting both from a theoretical point of view and as a reference point when devising heuristic methods.

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2009
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 5409
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-18840 (URN)10.1007/978-3-540-92990-1_7 (DOI)978-3-540-92989-5 (ISBN)3-540-92989-4 (ISBN)978-3-540-92990-1 (ISBN)
Available from: 2009-06-05 Created: 2009-06-05 Last updated: 2018-02-20Bibliographically approved
Eriksson, M. (2009). Integrated Software Pipelining. (Licentiate dissertation). Linköping: Linköping University Electronic Press
Open this publication in new window or tab >>Integrated Software Pipelining
2009 (English)Licentiate thesis, monograph (Other academic)
Abstract [en]

In this thesis we address the problem of integrated software pipelining for clustered VLIW architectures. The phases that are integrated and solved as one combined problem are: cluster assignment, instruction selection, scheduling, register allocation and spilling.

As a first step we describe two methods for integrated code generation of basic blocks. The first method is optimal and based on integer linear programming. The second method is a heuristic based on genetic algorithms.

We then extend the integer linear programming model to modulo scheduling. To the best of our knowledge this is the first time anybody has optimally solved the modulo scheduling problem for clustered architectures with instruction selection and cluster assignment integrated.

We also show that optimal spilling is closely related to optimal register allocation when the register files are clustered. In fact, optimal spilling is as simple as adding an additional virtual register file representing the memory and have transfer instructions to and from this register file corresponding to stores and loads.

Our algorithm for modulo scheduling iteratively considers schedules with increasing number of schedule slots. A problem with such an iterative method is that if the initiation interval is not equal to the lower bound there is no way to determine whether the found solution is optimal or not. We have proven that for a class of architectures that we call transfer free, we can set an upper bound on the schedule length. I.e., we can prove when a found modulo schedule with initiation interval larger than the lower bound is optimal.

Experiments have been conducted to show the usefulness and limitations of our optimal methods. For the basic block case we compare the optimal method to the heuristic based on genetic algorithms.

This work has been supported by The Swedish national graduate school in computer science (CUGS) and Vetenskapsrådet (VR).

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2009. p. 88
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 1393
Keywords
Code generation, compilers, instruction scheduling, register allocation, spill code generation, modulo scheduling, integer linear programming, genetic programming.
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-16170 (URN)LIU-TEK-LIC-2009:1 (Local ID)9789173936996 (ISBN)LIU-TEK-LIC-2009:1 (Archive number)LIU-TEK-LIC-2009:1 (OAI)
Presentation
2009-02-16, Alan Turing, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2009-01-14 Created: 2009-01-07 Last updated: 2020-08-20Bibliographically approved
Ålind, M., Eriksson, M. & Kessler, C. (2008). BlockLib: A Skeleton Library for Cell Broadband Engine. In: Proceedings - International Conference on Software Engineering: . Paper presented at 30th International Conference on Software Engineering, ICSE 2008 Co-located Workshops - 1st International Workshop on Multicore Software Engineering, IWMSE (pp. 7-14). New York, USA: ACM
Open this publication in new window or tab >>BlockLib: A Skeleton Library for Cell Broadband Engine
2008 (English)In: Proceedings - International Conference on Software Engineering, New York, USA: ACM , 2008, p. 7-14Conference paper, Published paper (Refereed)
Abstract [en]

Cell Broadband Engine is a heterogeneous multicore processor for high-performance computing and gaming. Its architecture allows for an impressive peak performance but, at the same time, makes it very hard to write efficient code. The need to simultaneously exploit SIMD instructions, coordinate parallel execution of the slave processors, overlap DMA memory traffic with computation, keep data properly aligned in memory, and explicitly manage the very small on-chip memory buffers of the slave processors, leads to very complex code. In this work, we adopt the skeleton programming approach to abstract from much of the complexity of Cell programming while maintaining high performance. The abstraction is achieved through a library of parallel generic building blocks, called BlockLib. Macro-based generative programming is used to reduce the overhead of genericity in skeleton functions and control code size expansion. We demonstrate the library usage with a parallel ODE solver application. Our experimental results show that BlockLib code achieves performance close to hand-written code and even outperforms the native IBM BLAS library in cases where several slave processors are used.

Place, publisher, year, edition, pages
New York, USA: ACM, 2008
Keywords
generic parallel programming, parallel computing, generative programming, software library, multicore processor, software components
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-43692 (URN)10.1145/1370082.1370088 (DOI)74556 (Local ID)978-1-60558-031-9 (ISBN)74556 (Archive number)74556 (OAI)
Conference
30th International Conference on Software Engineering, ICSE 2008 Co-located Workshops - 1st International Workshop on Multicore Software Engineering, IWMSE
Available from: 2009-10-10 Created: 2009-10-10 Last updated: 2018-01-12
Eriksson, M., Skoog, O. & Kessler, C. (2008). Optimal vs. Heuristic Integrated Code Generation for Clustered VLIW Architectures.. In: 11th ACM SIGBED Int. Workshop on Software and Compilers for Embedded Systems SCOPES 2008,2008: . Paper presented at 11th international workshop on Software & compilers for embedded systems, Munich, Germany, 13-14 March 2008 (pp. 11-20). New York, USA: Association for Computing Machinery (ACM)
Open this publication in new window or tab >>Optimal vs. Heuristic Integrated Code Generation for Clustered VLIW Architectures.
2008 (English)In: 11th ACM SIGBED Int. Workshop on Software and Compilers for Embedded Systems SCOPES 2008,2008, New York, USA: Association for Computing Machinery (ACM), 2008, p. 11-20Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present two algorithms for integrated code generation for clustered VLIW architectures. One algorithm is a heuristic based on genetic algorithms, the other algorithm is based on integer linear programming. The performance of the algorithms are compared on a portion of the Mediabench benchmark suite. We found the results of the genetic algorithm to be within one or two clock cycles from optimal for the cases where the optimum is known. In addition the heuristic algorithm produces results in predictable time also when the optimal integer linear program fails.

Place, publisher, year, edition, pages
New York, USA: Association for Computing Machinery (ACM), 2008
Keywords
compiler technology, integrated code generation, genetic programming, integer linear programming
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-43690 (URN)10.1145/1361096.1361099 (DOI)74553 (Local ID)74553 (Archive number)74553 (OAI)
Conference
11th international workshop on Software & compilers for embedded systems, Munich, Germany, 13-14 March 2008
Available from: 2009-10-10 Created: 2009-10-10 Last updated: 2018-01-12Bibliographically approved
Johansson, D., Eriksson, M. & Kessler, C. (2007). Bulk-synchronous parallel computing on the CELL processor. In: PARS-2007 21. PARS - Workshop, Hamburg, Germany, May 31-Jun 1, 2007. GI/ITG-Fachgruppe Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware PARS.,2007. GI Gesellschaft für Informatik e.V.
Open this publication in new window or tab >>Bulk-synchronous parallel computing on the CELL processor
2007 (English)In: PARS-2007 21. PARS - Workshop, Hamburg, Germany, May 31-Jun 1, 2007. GI/ITG-Fachgruppe Parallel-Algorithmen, -Rechnerstrukturen und -Systemsoftware PARS.,2007, GI Gesellschaft für Informatik e.V. , 2007Conference paper, Published paper (Refereed)
Abstract [en]

In order to ease programming of heterogeneous architectures with explicitly managed memory hierarchies such as the CELL processor, we propose a solution adopting the BSP model as implemented in the parallel programming language NestStep. This allows the programmer to write programs with a global address space and run them on the slave processors (SPEs) of CELL while keeping the large data structures in main memory. We have implemented the run-time system of NestStep on CELL and report on performance results that demonstrate the feasibility of the approach. The test programs scale very well as their execution time is mostly dominated by calculations, and only a fraction is spent in the various parts of the NestStep runtime system library. The library also has a relatively small memory footprint in the SPE's software managed local memory.

Place, publisher, year, edition, pages
GI Gesellschaft für Informatik e.V., 2007
Keywords
multicore processor architecture, parallel programming, bulk-synchronous parallelism
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-40732 (URN)54009 (Local ID)54009 (Archive number)54009 (OAI)
Available from: 2009-10-10 Created: 2009-10-10 Last updated: 2018-01-13
Kessler, C., Bednarski, A. & Eriksson, M. (2007). Classification and generation of schedules for VLIW processors. Concurrency, 19, 2369-2389
Open this publication in new window or tab >>Classification and generation of schedules for VLIW processors
2007 (English)In: Concurrency, ISSN 1040-3108, E-ISSN 1096-9128, Vol. 19, p. 2369-2389Article in journal (Refereed) Published
Abstract [en]

We identify and analyze different classes of schedules for instruction-level parallel processor architectures. The classes are induced by various common techniques for generating or enumerating them, such as integer linear programming or list scheduling with backtracking. In particular, we study the relationship between VLIW schedules and their equivalent linearized forms (which may be used, e.g., with superscalar processors), and we identify classes of VLIW schedules that can be created from a linearized form using an in-order VLIW compaction heuristic, which is just the static equivalent of the dynamic instruction dispatch algorithm of in-order issue superscalar processors. We formulate and give a proof of the dominance of greedy schedules for instruction-level parallel architectures where all instructions have multiblock reservation tables, and we show how scheduling anomalies can occur in the presence of instructions with non-multiblock reservation tables. We also show that, in certain situations, certain schedules generally cannot be constructed by incremental scheduling algorithms that are based on topological sorting of the data dependence graph. We also discuss properties of strongly linearizable schedules, out-of-order schedules and non-dawdling schedules, and show their relationships to greedy schedules and to general schedules. We summarize our findings as a hierarchy of classes of VLIW schedules. Finally we provide an experimental evaluation showing the sizes of schedule classes in the above hierarchy, for different benchmarks and example VLIW architectures, including a single-cluster version of the TI C62x DSP processor and variants of that. Our results can sharpen the interpretation of the term optimality used with various methods for optimal VLIW scheduling, and help to identify sets of schedules that can be safely ignored when searching for a time-optimal schedule.

Keywords
instruction-level parallelism, instruction scheduling, code generation, code compaction, integer linear programming, VLIW architecture, superscalar processor
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-40156 (URN)10.1002/cpe.1175 (DOI)52451 (Local ID)52451 (Archive number)52451 (OAI)
Available from: 2009-10-10 Created: 2009-10-10 Last updated: 2018-01-13
Kessler, C., Fritzson, P. & Eriksson, M. (2007). NestStepModelica - Mathematical Modeling and Bulk-Synchronous Parallel Simulation. In: 8th International Workshop, PARA 2006, Umeå, Sweden, June 18-21, 2006, Revised Selected Papers: . Paper presented at PARA06 - Workshop on State-of-the-art in Scientific and Parallel Computing, June 18-21, 2006, Umeå, Sweden (pp. 1006-1015). Berlin, Heidelberg: Springer
Open this publication in new window or tab >>NestStepModelica - Mathematical Modeling and Bulk-Synchronous Parallel Simulation
2007 (English)In: 8th International Workshop, PARA 2006, Umeå, Sweden, June 18-21, 2006, Revised Selected Papers, Berlin, Heidelberg: Springer , 2007, p. 1006-1015Conference paper, Published paper (Refereed)
Abstract [en]

Many parallel computing applications are used for simulation of complex engineering applications and/or for visualization. To handle their complexity, there is a need for raising the level of abstraction in specifying such applications using high level mathematical modeling techniques, such as the Modelica language and technology. However, with the increased complexity of modeled systems, it becomes increasingly important to use today-s and tomorrow-s parallel hardware efficiently. Automatic parallelization is convenient, but may need to be combined with easy-to-use methods for parallel programming. In this context, we propose to combine the abstraction power of Modelica with support for shared memory bulk-synchronous parallel programming including nested parallelism (NestStepModelica), which is both flexible (can be mapped to many different parallel architectures) and simple (offers a shared address space, structured parallelism, deterministic computation, and is deadlock-free). We describe NestStepModelica and report on first results obtained with a prototype implementation.

Place, publisher, year, edition, pages
Berlin, Heidelberg: Springer, 2007
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; Volume 4699
Keywords
parallel programming, Modelica, NestStep, Bulk-synchronous parallel computing, Linux cluster
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-40157 (URN)10.1007/978-3-540-75755-9_118 (DOI)52452 (Local ID)978-3-540-75754-2 (ISBN)978-3-540-75755-9 (ISBN)52452 (Archive number)52452 (OAI)
Conference
PARA06 - Workshop on State-of-the-art in Scientific and Parallel Computing, June 18-21, 2006, Umeå, Sweden
Available from: 2009-10-10 Created: 2009-10-10 Last updated: 2018-01-26Bibliographically approved
Organisations

Search in DiVA

Show all publications