Karlsson, Andréas
Publications (10 of 15)
Karlsson, A. (2016). Design of Energy-Efficient High-Performance ASIP-DSP Platforms. (Doctoral dissertation). Linköping: Linköping University Electronic Press
Design of Energy-Efficient High-Performance ASIP-DSP Platforms
2016 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

In the last ten years, limited clock frequency scaling and increasing power density have shifted IC design focus towards parallelism, heterogeneity and energy efficiency. Improving energy efficiency is by no means simple, and it calls for a reevaluation of old design choices in processor architecture and, perhaps more importantly, the development of new programming methodologies that exploit the features of modern architectures.

This thesis discusses the design of energy-efficient digital signal processors with application-specific instruction sets, so-called ASIP-DSPs, and their programming tools. Target applications for such processors include, but are not limited to, communications, multimedia, image processing, intelligent vision and radar. These applications are often implemented by a limited set of kernel algorithms, whose performance and efficiency are critical to the application's success. At the same time, the extreme non-recurring engineering cost of system-on-chip designs means that product lifetime must be kept as long as possible. Neither general-purpose processors nor non-programmable ASICs can meet both the flexibility and the efficiency requirements, and ASIPs may instead offer the best trade-off between the conflicting goals.

The focus of traditional superscalar and VLIW processor design has been to improve the throughput of fine-grained instructions, which results in high flexibility but also high energy consumption. SIMD architectures, on the other hand, are often restricted by inefficient data access. The result is architectures that spend more energy and/or time on supporting operations than on actual computing.

This thesis defines the performance limit of an architecture with an N-way parallel datapath as consuming 2N elements of compute data per clock cycle. To approach this performance, this work proposes coarse-grained higher-order functional (HOF) instructions, which encode the most frequently executed compute, data-access and control sequences into single many-cycle instructions, reducing the overhead of instruction delivery while maintaining orthogonality. The work further investigates opportunities for operation fusion to improve computing performance, and proposes a flexible memory subsystem for conflict-free parallel memory access with permutation and lookup-table-based addressing, to ensure that high computing throughput can be sustained even in the presence of irregular data access patterns. These concepts are studied extensively by implementing a large library of typical DSP kernels, to demonstrate their effectiveness and adequacy. Compared to contemporary VLIW DSP solutions, our solution can practically eliminate instruction-fetch energy in many scenarios, significantly reduce control-path switching, simplify the implementation of kernels and reduce code size, sometimes by as much as 30 times.

The techniques proposed in this thesis have been implemented in the DSP platform ePUMA (embedded Parallel DSP processor with Unique Memory Access), a configurable control-compute heterogeneous platform with distributed memory, optimized for low-power, predictable DSP computing. Hardware evaluation has been done with FPGA prototypes. In addition, several VLSI layouts have been created for energy and area estimation, including smaller designs as well as a large 73-core design, capable of 1280 integer GOPS or 256 GFLOPS at 500 MHz and measuring 45 mm² in 28 nm FD-SOI technology.

In addition to the hardware design, this thesis also discusses a parallel programming flow for distributed-memory architectures and ePUMA application implementation. A DSP kernel programming language and its compiler are presented, demonstrating how kernels written in a high-level language can be translated into HOF instructions for very high processing efficiency.
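To make the lookup-table-based, conflict-free parallel access idea concrete, the following is a minimal Python sketch of a skewed multi-bank placement: the bank of an address is derived from a small permutation table, so that both row-wise and column-wise accesses to a row-major matrix hit all banks exactly once. The names and the particular skewing table are illustrative assumptions, not the actual ePUMA address generator, which supports far more general permutation and lookup-table patterns.

```python
# Minimal sketch of LUT-skewed multi-bank placement (illustration only;
# not the ePUMA memory subsystem).
N_BANKS = 8
SKEW_LUT = list(range(N_BANKS))          # hypothetical per-row rotation table

def bank_of(addr):
    """Bank index of a linear address under a LUT-based row skew."""
    row = addr // N_BANKS
    return (addr % N_BANKS + SKEW_LUT[row % N_BANKS]) % N_BANKS

def conflict_free(addrs):
    """A parallel access is conflict-free if all addresses map to distinct banks."""
    banks = [bank_of(a) for a in addrs]
    return len(set(banks)) == len(banks)

# Row-major N_BANKS x N_BANKS matrix: both a row access and a column access
# (stride N_BANKS, which conflicts under plain modulo interleaving) come out conflict-free.
row_access    = [3 * N_BANKS + c for c in range(N_BANKS)]   # row 3
column_access = [r * N_BANKS + 5 for r in range(N_BANKS)]   # column 5
assert conflict_free(row_access) and conflict_free(column_access)
```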

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2016. p. 340
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1772
National Category
Computer Engineering; Computer Systems; Computer Sciences; Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:liu:diva-130723 (URN); 10.3384/diss.diva-130723 (DOI); 9789176857458 (ISBN)
Public defence
2016-09-07, Visionen, B-huset, Campus Valla, Linköping, 10:15
Available from: 2016-08-22 Created: 2016-08-22 Last updated: 2018-01-10. Bibliographically approved
Gebrewahid, E., Ali Arslan, M., Karlsson, A. & Ul-Abdin, Z. (2016). Support for Data Parallelism in the CAL Actor Language. In: PROCEEDINGS OF THE 2016 3RD WORKSHOP ON PROGRAMMING MODELS FOR SIMD/VECTOR PROCESSING (WPMVP 2016): . Paper presented at 3rd Workshop on Programming Models for SIMD/Vector Processing (WPMVP) (pp. 1-8). New York, NY: Association for Computing Machinery (ACM)
Support for Data Parallelism in the CAL Actor Language
2016 (English) In: PROCEEDINGS OF THE 2016 3RD WORKSHOP ON PROGRAMMING MODELS FOR SIMD/VECTOR PROCESSING (WPMVP 2016), New York, NY: Association for Computing Machinery (ACM), 2016, p. 1-8. Conference paper, Published paper (Refereed)
Abstract [en]

With the arrival of heterogeneous manycores comprising various features to support task-, data- and instruction-level parallelism, developing applications that take full advantage of the hardware's parallel features has become a major challenge. In this paper, we present an extension to our CAL compilation framework (CAL2Many) that supports data parallelism in the CAL Actor Language. Our compilation framework makes it possible to program architectures with SIMD support using a high-level language and provides efficient code generation. We support general SIMD instructions, but the code generation backend is currently implemented for two custom architectures, namely ePUMA and EIT. Our experiments were carried out on these two custom SIMD processor architectures using two applications.

The experiments show that performance comparable to hand-written machine code can be achieved with much less programming effort.

Place, publisher, year, edition, pages
New York, NY: Association for Computing Machinery (ACM), 2016
Keywords
SIMD, CAL Actor Language, QRD
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-130905 (URN); 10.1145/2870650.2870656 (DOI); 000390594100002 (ISI); 978-1-4503-4060-1 (ISBN)
Conference
3rd Workshop on Programming Models for SIMD/Vector Processing (WPMVP)
Projects
HiPEC
Funder
Swedish Foundation for Strategic Research; ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Available from: 2016-08-30 Created: 2016-08-30 Last updated: 2018-01-10. Bibliographically approved
Karlsson, A., Sohl, J. & Liu, D. (2015). Cost-efficient Mapping of 3- and 5-point DFTs to General Baseband Processors. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015: . Paper presented at IEEE International Conference on Digital Signal Processing (DSP) (pp. 780-784). Institute of Electrical and Electronics Engineers (IEEE)
Cost-efficient Mapping of 3- and 5-point DFTs to General Baseband Processors
2015 (English) In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 780-784. Conference paper, Published paper (Refereed)
Abstract [en]

Discrete Fourier transforms of 3 and 5 points are essential building blocks in FFT implementations for standards such as 3GPP-LTE. In addition to being more complex than 2- and 4-point DFTs, these DFTs also cause problems with data access in SDR-DSPs, since the data access width is, in general, a power of 2. This work derives mappings of these DFTs to a 4-way SIMD datapath that has been designed with 2- and 4-point DFTs in mind. Our instruction set proposals, based on the modified Winograd DFT, achieve single-cycle execution of 3-point DFTs and an average of 2.25 cycles for 5-point DFTs in a cost-effective manner by reusing the already available arithmetic units. This represents a speed-up of approximately 3 times compared to an SDR-DSP with only MAC support. In contrast to our more general design, we also demonstrate that a typical single-purpose, FFT-specialized 5-way architecture only delivers 9% to 25% extra performance on average, while requiring 85% more arithmetic units and a more expensive memory subsystem.
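For reference, the 3-point case can be reduced to the textbook small-DFT (Winograd-style) factorization sketched below: one shared sum plus one real-constant and one imaginary-constant multiplication. This is only the underlying arithmetic, not the paper's exact modified variant or its mapping onto the 4-way SIMD datapath.

```python
import numpy as np

def dft3(x0, x1, x2):
    """3-point DFT via the standard small-DFT (Winograd-style) factorization:
    two constant multiplications instead of a general 3x3 complex matrix product."""
    t = x1 + x2
    m = x0 - 0.5 * t                          # real-constant multiply
    d = -1j * (np.sqrt(3) / 2) * (x1 - x2)    # imaginary-constant multiply
    return np.array([x0 + t, m + d, m - d])

x = np.random.randn(3) + 1j * np.random.randn(3)
assert np.allclose(dft3(*x), np.fft.fft(x))   # sanity check against a reference FFT
```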

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120397 (URN); 10.1109/ICDSP.2015.7251982 (DOI); 000380506600164 (ISI); 978-1-4799-8058-1 (ISBN)
Conference
IEEE International Conference on Digital Signal Processing (DSP)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Karlsson, A., Sohl, J. & Liu, D. (2015). Energy-efficient sorting with the distributed memory architecture ePUMA. In: IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA): . Paper presented at IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) (pp. 116-123). Institute of Electrical and Electronics Engineers (IEEE)
Energy-efficient sorting with the distributed memory architecture ePUMA
2015 (English) In: IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 116-123. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents the novel heterogeneous DSP architecture ePUMA and demonstrates its features through an implementation of sorting of larger data sets. We derive a sorting algorithm with fixed-size merging tasks suitable for distributed memory architectures, which allows very simple scheduling and predictable, data-independent sorting time. The implementation on ePUMA utilizes the architecture's specialized compute cores and control cores, and local memory parallelism, to separate and overlap sorting with data access and control for close to stall-free sorting. Penalty-free unaligned and out-of-order local memory access is used in combination with proposed application-specific sorting instructions to derive highly efficient local sorting and merging kernels used by the system-level algorithm. Our evaluation shows that the proposed implementation can rival the sorting performance of high-performance commercial CPUs and GPUs, with two orders of magnitude higher energy efficiency, which would allow high-performance sorting on low-power devices.
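The algorithmic skeleton behind "fixed-size merging tasks" can be illustrated with a plain Python sketch: data is split into equal blocks, each block is sorted locally, and sorted runs are then combined by merge steps that always emit output in blocks of the same fixed size, so task sizes (and thus schedules) are data-independent. The function names are assumptions, and the sketch deliberately ignores the ePUMA-specific overlap of DMA, control and compute across cores.

```python
import random

def merge_blockwise(run_a, run_b, block):
    """Merge two sorted runs, emitting the result as fixed-size output blocks."""
    out, buf, i, j = [], [], 0, 0
    while i < len(run_a) or j < len(run_b):
        if j == len(run_b) or (i < len(run_a) and run_a[i] <= run_b[j]):
            buf.append(run_a[i]); i += 1
        else:
            buf.append(run_b[j]); j += 1
        if len(buf) == block:            # one fixed-size merge task completed
            out.extend(buf); buf = []
    out.extend(buf)
    return out

def block_sort(data, block=256):
    """Sort each fixed-size block locally, then merge runs pairwise round by round."""
    runs = [sorted(data[i:i + block]) for i in range(0, len(data), block)]
    while len(runs) > 1:
        runs = [merge_blockwise(runs[k], runs[k + 1], block) if k + 1 < len(runs)
                else runs[k] for k in range(0, len(runs), 2)]
    return runs[0] if runs else []

data = [random.randrange(10**6) for _ in range(10_000)]
assert block_sort(data) == sorted(data)
```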

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120398 (URN); 10.1109/Trustcom.2015.620 (DOI); 000380431400015 (ISI); 978-1-4673-7952-6 (ISBN)
Conference
IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Karlsson, A., Sohl, J. & Liu, D. (2015). ePUMA: A Processor Architecture for Future DSP. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015: . Paper presented at IEEE International Conference on Digital Signal Processing (DSP) (pp. 253-257).
ePUMA: A Processor Architecture for Future DSP
2015 (English) In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015, 2015, p. 253-257. Conference paper, Published paper (Refereed)
Abstract [en]

Since the breakdown of Dennard scaling, the primary design goal for processors has shifted from increasing performance to increasing performance per Watt. The ePUMA platform is a flexible and configurable DSP platform that addresses many of the problems of traditional DSP designs in order to increase performance while using less power. We trade the flexibility of traditional VLIW DSP designs for a simpler single-instruction-issue scheme and instead make sure that each instruction can perform more work. Multi-cycle instructions can operate directly on vectors and matrices in memory, and the datapaths implement common DSP subgraphs directly in hardware, for high compute throughput. Memory bottlenecks, which are common in other architectures, are handled with flexible LUT-based multi-bank memory addressing and memory parallelism. A major contributor to energy consumption, data movement, is reduced by using a heterogeneous interconnect and by clustering compute resources around local memories for simple data sharing. To evaluate ePUMA, we have implemented the majority of the kernel library from a commercial VLIW DSP manufacturer for comparison. Our results not only show good performance, but also an order-of-magnitude increase in energy and area efficiency. In addition, the kernel code size is reduced by 91% on average compared to the VLIW DSP. These benefits make ePUMA an attractive solution for future DSP.
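As a rough illustration of what "multi-cycle instructions that operate directly on vectors and matrices in memory" means, the toy sketch below simulates a descriptor-driven vector multiply-accumulate: one instruction carries base addresses, strides and a length, and iterates over memory itself rather than being issued once per element. The descriptor fields and semantics are hypothetical, chosen only to contrast with a per-element instruction stream; they are not the actual ePUMA instruction encoding.

```python
from dataclasses import dataclass

@dataclass
class VMacDescriptor:
    """Hypothetical operand descriptor for a single multi-cycle vector MAC."""
    src_a: int      # base address of the first vector
    stride_a: int
    src_b: int      # base address of the second vector
    stride_b: int
    length: int
    dst: int        # address receiving the accumulated result

def exec_vmac(mem, d):
    """One 'instruction': iterates over memory for d.length cycles, with no
    per-element fetch/decode in between."""
    acc = 0
    for i in range(d.length):
        acc += mem[d.src_a + i * d.stride_a] * mem[d.src_b + i * d.stride_b]
    mem[d.dst] = acc

mem = list(range(32)) + [0]
exec_vmac(mem, VMacDescriptor(src_a=0, stride_a=1, src_b=16, stride_b=1, length=16, dst=32))
assert mem[32] == sum(i * (i + 16) for i in range(16))
```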

National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120396 (URN); 10.1109/ICDSP.2015.7251870 (DOI); 000380506600052 (ISI); 978-1-4799-8058-1 (ISBN)
Conference
IEEE International Conference on Digital Signal Processing (DSP)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Karlsson, A., Sohl, J. & Liu, D. (2015). Software-based QPP Interleaving for Baseband DSPs with LUT-accelerated Addressing. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015: . Paper presented at IEEE International Conference on Digital Signal Processing (DSP). Institute of Electrical and Electronics Engineers (IEEE)
Software-based QPP Interleaving for Baseband DSPs with LUT-accelerated Addressing
2015 (English) In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015. Conference paper, Published paper (Refereed)
Abstract [en]

This paper demonstrates how QPP interleaving and de-interleaving for Turbo decoding in 3GPP-LTE can be implemented efficiently on baseband processors with lookup-table (LUT) based addressing support for multi-bank memory. We introduce a LUT-compression technique that reduces the LUT size to 1% of what would otherwise be needed to store the full data access patterns for all LTE block sizes. By reusing the already existing program memory of a baseband processor to store the LUTs and using our proposed general address generator, our 8-way data access path can reach the same throughput as a dedicated 8-way interleaving ASIC implementation. This avoids adding a dedicated interleaving address generator to the processor, which, according to ASIC synthesis, would be 75% larger than our proposed address generator. Since our software implementation only involves the address generator, the processor's datapaths are free to perform the other operations of Turbo decoding in parallel with interleaving. Our software implementation ensures programmability and flexibility and is the fastest software-based implementation of QPP interleaving known to us.
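The QPP interleaver in LTE is defined by pi(i) = (f1*i + f2*i^2) mod K, and the quadratic term can be generated recursively with additions only, which is what makes a lightweight software or address-generator implementation attractive. The sketch below is a plain reference implementation of that recurrence, not the paper's LUT-compressed multi-bank address generator; the (f1, f2) values are our reading of the 3GPP TS 36.212 table entry for K = 40, and the bijection check in the code does not depend on them being exactly those.

```python
def qpp_interleaver(K, f1, f2):
    """Addresses pi(i) = (f1*i + f2*i^2) mod K, computed with additions only:
    pi(i+1) = pi(i) + g(i),  g(i+1) = g(i) + 2*f2   (all mod K)."""
    addrs, pi, g = [], 0, (f1 + f2) % K
    step = (2 * f2) % K
    for _ in range(K):
        addrs.append(pi)
        pi = (pi + g) % K
        g = (g + step) % K
    return addrs

K, f1, f2 = 40, 3, 10          # assumed LTE table entry for block size 40
addrs = qpp_interleaver(K, f1, f2)
assert addrs == [(f1 * i + f2 * i * i) % K for i in range(K)]   # matches the direct formula
assert sorted(addrs) == list(range(K))                          # valid permutation
```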

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120395 (URN); 10.1109/ICDSP.2015.7251983 (DOI); 000380506600165 (ISI); 978-1-4799-8058-1 (ISBN)
Conference
IEEE International Conference on Digital Signal Processing (DSP)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Sohl, J., Karlsson, A. & Liu, D. (2013). Conflict-free data access for multi-bank memory architectures using padding. In: High Performance Computing (HiPC), 2013: . Paper presented at 20th International Conference on High Performance Computing (HiPC 2013), 18-21 December 2013, Bangalore, India (pp. 425-432). IEEE
Conflict-free data access for multi-bank memory architectures using padding
2013 (English) In: High Performance Computing (HiPC), 2013, IEEE, 2013, p. 425-432. Conference paper, Published paper (Refereed)
Abstract [en]

For high-performance computation, memory access is a major issue. Whether it is a supercomputer, a GPGPU device, or an Application-Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that, from a computational point of view, can be implemented efficiently on parallel architectures fail to achieve significant speed-ups. The reason is very often that the available execution units are poorly utilized due to inefficient data access. This paper presents a method for improving the access time for data sequences that are completely static, at the cost of extra memory, by resolving memory conflicts using padding. The method can be applied automatically and is shown to significantly reduce the data access time for sorting and FFTs. The execution time for the FFT is improved by up to a factor of 3.4, and for sorting by a factor of up to 8.
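The core effect, resolving bank conflicts for static access patterns by padding, can be shown in a few lines: with 8 banks and a row-major 8x8 matrix, a column access (stride 8) lands entirely in one bank, but padding each row with one unused element spreads the same logical access over all banks. The sketch below only illustrates this classical effect under a simple cost model; the paper's contribution is choosing such paddings automatically for arbitrary static sequences.

```python
N_BANKS, ROWS, COLS = 8, 8, 8

def banks_hit(row_pitch, col):
    """Banks touched when reading one column of a row-major matrix with the given row pitch."""
    return [(r * row_pitch + col) % N_BANKS for r in range(ROWS)]

def access_cycles(banks):
    """Simple cost model: a parallel access takes as many cycles as the most-loaded bank."""
    return max(banks.count(b) for b in set(banks))

print(access_cycles(banks_hit(row_pitch=COLS, col=5)))      # 8 cycles: every element in one bank
print(access_cycles(banks_hit(row_pitch=COLS + 1, col=5)))  # 1 cycle: padding spreads the banks
```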

Place, publisher, year, edition, pages
IEEE, 2013
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-106937 (URN); 10.1109/HiPC.2013.6799112 (DOI)
Conference
20th International Conference on High Performance Computing (HiPC 2013), 18-21 December 2013, Bangalore, India
Available from: 2014-05-27 Created: 2014-05-27 Last updated: 2014-06-03. Bibliographically approved
Karlsson, A., Sohl, J., Wang, J. & Liu, D. (2013). ePUMA: A unique memory access based parallel DSP processor for SDR and CR. In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE: . Paper presented at 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), 3-5 December 2013, Austin, TX, USA (pp. 1234-1237). IEEE
ePUMA: A unique memory access based parallel DSP processor for SDR and CR
2013 (English) In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, IEEE, 2013, p. 1234-1237. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents ePUMA, a master-slave heterogeneous DSP processor for communications and multimedia. We introduce the ePUMA VPE, a vector processing slave core designed for heavy DSP workloads, and demonstrate how its features can be used to implement DSP kernels that efficiently overlap computing, data access and control to achieve maximum datapath utilization. The efficiency is evaluated by implementing a basic set of kernels commonly used in SDR. The experiments show that all kernels asymptotically reach above 90% effective datapath utilization, while many approach 100%; thus the design effectively overlaps computing, data access and control. Compared to popular VLIW solutions, the need for a large register file with many ports is eliminated, saving power and chip area. When compared to a commercial VLIW solution, our solution also achieves code size reductions of up to 30 times and a significantly simplified kernel implementation.

Place, publisher, year, edition, pages
IEEE, 2013
Keywords
digital signal processing chips; parallel processing; CR; DSP kernels; SDR; data access; ePUMA VPE; master-slave heterogeneous DSP processor; maximum datapath utilization; memory access based parallel DSP processor; vector processing element; vector processing slave-core; Assembly; Computer architecture; Digital signal processing; Kernel; Registers; VLIW; Vectors; DSP; SDR; VPE; ePUMA
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-109322 (URN); 10.1109/GlobalSIP.2013.6737131 (DOI); 978-147990248-4 (ISBN)
Conference
1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), 3-5 December 2013, Austin, TX, USA
Available from: 2014-08-13 Created: 2014-08-13 Last updated: 2018-01-11. Bibliographically approved
Sohl, J., Wang, J., Karlsson, A. & Liu, D. (2012). Automatic Permutation for Arbitrary Static Access Patterns. In: Parallel and Distributed Processing with Applications (ISPA), 2012: . Paper presented at 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 10-13 July 2012, Madrid, Spain (pp. 215-222). IEEE
Automatic Permutation for Arbitrary Static Access Patterns
2012 (English) In: Parallel and Distributed Processing with Applications (ISPA), 2012, IEEE, 2012, p. 215-222. Conference paper, Published paper (Refereed)
Abstract [en]

A significant portion of the execution time on current SIMD and VLIW processors is spent on data access rather than on instructions that perform actual computations. The ePUMA architecture provides features that allow arbitrary data elements to be accessed in parallel as long as the elements reside in different memory banks. By using permutation to move data elements that are accessed in parallel, the overhead from memory access can be greatly reduced and, in many cases, completely removed. This paper presents a practical method for automatic permutation based on Integer Linear Programming (ILP). No assumptions are made about the structure of the access patterns other than their static nature. Methods for speeding up the solution time for periodic access patterns and for reusing existing solutions are also presented. Benchmarks for, e.g., FFTs show speed-ups of up to 3.4 when using permutation compared to regular implementations.
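The constraint that the ILP formulation encodes can be stated independently of the solver: every group of elements accessed in the same cycle must end up in pairwise distinct memory banks. The sketch below solves a tiny instance of that placement constraint by backtracking instead of ILP, purely to make the problem concrete; the paper's method, which handles arbitrary static patterns, periodicity and solution reuse, is the ILP formulation itself, and the function names here are illustrative assumptions.

```python
def assign_banks(n_elements, n_banks, access_groups):
    """Find bank[e] such that every access group touches pairwise distinct banks.
    Tiny backtracking illustration of the constraint the paper encodes as an ILP."""
    bank = [None] * n_elements

    def ok(e, b):
        return all(bank[other] != b
                   for group in access_groups if e in group
                   for other in group if bank[other] is not None)

    def solve(e):
        if e == n_elements:
            return True
        for b in range(n_banks):
            if ok(e, b):
                bank[e] = b
                if solve(e + 1):
                    return True
                bank[e] = None
        return False

    return bank if solve(0) else None

# Four elements, two banks: element pairs (0,1), (2,3) and (0,2) are each read in parallel.
print(assign_banks(4, 2, [(0, 1), (2, 3), (0, 2)]))   # e.g. [0, 1, 1, 0]
```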

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
integer programming;linear programming;multiprocessing systems;parallel architectures;storage management;ILP;SIMD processor;VLIW processor;arbitrary data element;arbitrary static access pattern;automatic permutation;data access;ePUMA architecture;execution time;integer linear programming;memory access;memory banks;periodic access pattern;Discrete cosine transforms;Equations;Hardware;Mathematical model;Memory management;Program processors;Vectors;integer linear programming;multi-bank memories;parallel data access;permutation
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100377 (URN); 10.1109/ISPA.2012.36 (DOI); 978-1-4673-1631-6 (ISBN)
Conference
2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 10-13 July 2012, Madrid, Spain
Projects
ePUMA; HiPEC
Funder
Swedish Foundation for Strategic Research
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11
Wang, J., Karlsson, A., Sohl, J. & Liu, D. (2012). Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory. In: Digital System Design (DSD), 2012: . Paper presented at 15th Euromicro Conference on Digital System Design (DSD 2012), 5-8 September 2012, Izmir, Turkey (pp. 529-532). IEEE
Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory
2012 (English) In: Digital System Design (DSD), 2012, IEEE, 2012, p. 529-532. Conference paper, Published paper (Refereed)
Abstract [en]

The Single Instruction Multiple Data (SIMD) architecture has proven to be a suitable parallel processor architecture for media and communication signal processing. However, computing overheads such as memory access latency and vector data permutation limit the performance of conventional SIMD processors. Solutions such as combined VLIW and SIMD architectures come at the cost of increased complexity in compiler design and assembly programming. This paper introduces the SIMD processor in the ePUMA platform, which uses a deep execution pipeline and flexible parallel memory to achieve high computing performance. The deep pipeline can execute combined operations in one cycle, and the parallel memory architecture supports conflict-free parallel data access. It solves the problem of large vector permutations in a short-vector SIMD machine more efficiently than conventional vector permutation instructions. We evaluate the architecture by implementing the soft-decision Viterbi algorithm for convolutional decoding. The result is compared with other architectures, including TI C54x, CEVA TeakLike III, and PowerPC AltiVec, to show ePUMA’s computing efficiency advantage.
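For readers unfamiliar with the kernel being accelerated, the following is a plain, scalar soft-decision Viterbi decoder for the standard rate-1/2, constraint-length-3 (7,5) convolutional code, with the add-compare-select (ACS) recursion written out explicitly. It only shows the reference algorithm whose ACS steps and permutation-heavy data movement a SIMD machine would vectorize; nothing here reflects the ePUMA implementation itself, and the function names are illustrative.

```python
import numpy as np

G = (0b111, 0b101)   # generator polynomials (7, 5) octal, constraint length 3
N_STATES = 4         # state = (u[t-1], u[t-2])

def parity(x):
    return bin(x).count("1") & 1

def encode(bits):
    """Rate-1/2 convolutional encoder; two coded bits per input bit."""
    state, coded = 0, []
    for u in bits:
        reg = (u << 2) | state
        coded += [parity(reg & g) for g in G]
        state = ((u << 1) | (state >> 1)) & 0b11
    return coded

def viterbi_soft(soft, n_bits):
    """Soft-decision Viterbi decode; soft values are BPSK symbols (+1 for bit 0, -1 for bit 1).
    Each stage is an add-compare-select over all states, with register-exchange survivors."""
    metric = np.full(N_STATES, -np.inf)
    metric[0] = 0.0
    survivors = [[] for _ in range(N_STATES)]
    for t in range(n_bits):
        r = soft[2 * t:2 * t + 2]
        new_metric = np.full(N_STATES, -np.inf)
        new_survivors = [None] * N_STATES
        for s in range(N_STATES):
            if metric[s] == -np.inf:
                continue
            for u in (0, 1):
                reg = (u << 2) | s
                bm = sum(r[k] * (1 - 2 * parity(reg & g)) for k, g in enumerate(G))  # correlation
                ns = ((u << 1) | (s >> 1)) & 0b11
                m = metric[s] + bm                      # add
                if m > new_metric[ns]:                  # compare / select
                    new_metric[ns] = m
                    new_survivors[ns] = survivors[s] + [u]
        metric, survivors = new_metric, new_survivors
    return survivors[int(np.argmax(metric))]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
soft = np.array([1.0 - 2 * c for c in encode(bits)])    # noiseless channel for the sanity check
assert viterbi_soft(soft, len(bits)) == bits
```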

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
convolutional codes;decoding;parallel processing;random-access storage;CEVA TeakLike III;PowerPC AltiVec;TI C54x;VLIW architecture;assembly programming;communication signal processing;compiler design;conflict free parallel data access;convolutional decoding;deep-pipelined SIMD processor;ePUMA1 platform;flexible parallel memory;media signal processing;memory access latency;parallel processor architecture;pipeline parallel memory;short vector SIMD machine;single instruction multiple data architecture;soft decision Viterbi algorithm;vector data permutation limit;vector permutation instruction;Computer architecture;Convolution;Convolutional codes;Decoding;Measurement;Vectors;Viterbi algorithm;Convolutional decoding;Parallel memory;SIMD;Viterbi;ePUMA
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100380 (URN); 10.1109/DSD.2012.34 (DOI); 978-1-4673-2498-4 (ISBN)
Conference
15th Euromicro Conference on Digital System Design (DSD 2012), 5-8 September 2012, Izmir, Turkey
Projects
ePUMA
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11. Bibliographically approved