Sohl, Joar
Publications (10 of 17)
Karlsson, A., Sohl, J. & Liu, D. (2015). Cost-efficient Mapping of 3- and 5-point DFTs to General Baseband Processors. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July 2015. Paper presented at the IEEE International Conference on Digital Signal Processing (DSP) (pp. 780-784). Institute of Electrical and Electronics Engineers (IEEE)
Cost-efficient Mapping of 3- and 5-point DFTs to General Baseband Processors
2015 (English). In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 780-784. Conference paper, Published paper (Refereed)
Abstract [en]

Discrete Fourier transforms of 3 and 5 points are essential building blocks in FFT implementations for standards such as 3GPP-LTE. In addition to being more complex than 2- and 4-point DFTs, these DFTs also cause problems with data access in SDR-DSPs, since the data access width is, in general, a power of 2. This work derives mappings of these DFTs to a 4-way SIMD datapath that was designed with 2- and 4-point DFTs in mind. Our instruction set proposals, based on a modified Winograd DFT, achieve single-cycle execution of 3-point DFTs and 2.25-cycle average execution of 5-point DFTs in a cost-effective manner by reusing the already available arithmetic units. This represents an approximate speed-up of 3 times compared to an SDR-DSP with only MAC support. In contrast to our more general design, we also demonstrate that a typical single-purpose FFT-specialized 5-way architecture delivers only 9% to 25% extra performance on average, while requiring 85% more arithmetic units and a more expensive memory subsystem.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120397 (URN); 10.1109/ICDSP.2015.7251982 (DOI); 000380506600164; 978-1-4799-8058-1 (ISBN)
Conference
IEEE International Conference on Digital Signal Processing (DSP)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Sohl, J. (2015). Efficient Compilation for Application Specific Instruction set DSP Processors with Multi-bank Memories. (Doctoral dissertation). Linköping: Linköping University Electronic Press
Efficient Compilation for Application Specific Instruction set DSP Processors with Multi-bank Memories
2015 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

Modern signal processing systems require more and more processing capacity as time goes on. Previously, large increases in speed and power efficiency have come from process technology improvements. Lately, however, the gain from process improvements has been greatly reduced. Currently, the way forward for high-performance systems is to use specialized hardware and/or parallel designs.

Application Specific Integrated Circuits (ASICs) have long been used to accelerate the processing of tasks that are too computationally heavy for more general processors. The problem with ASICs is that they are costly to develop and verify, and their product lifetime can be cut short by newer standards. Because they are so specific, their applicable domain is very narrow.

More general processors are more flexible and can easily adapt to perform the functions of ASIC-based designs. However, the generality comes with a performance cost that renders general designs unusable for some tasks. The question then becomes: how general can a processor be while still being power-efficient and fast enough for a particular domain?

Application Specific Instruction set Processors (ASIPs) are processors that target a specific application domain, and can offer enough performance with power efficiency and silicon cost comparable to ASICs. The flexibility allows the same hardware design to be used across several system designs, and also for multiple functions in the same system, if those functions are not used simultaneously.

One problem with ASIPs is that they are more difficult to program than a general purpose processor, given that we want efficient software. Utilizing all of the features that give an ASIP its performance advantage can be difficult at times, and new tools and methods for programming them are needed.

This thesis presents ePUMA (embedded Parallel DSP platform with Unique Memory Access), an ASIP architecture that targets algorithms with predictable data access. Such algorithms are very common in, e.g., baseband processing and multimedia applications. The primary focus is on the specific features of ePUMA that are utilized to achieve high performance, and on how tools can utilize them automatically. The most significant features include data permutation for conflict-free data access, and address generation features for overhead-free code execution. Exploiting these sometimes requires specific information, for example the exact sequences of memory addresses that are accessed, or the knowledge that some operations may be performed in parallel. This information is not always available when code is written in the traditional way with traditional languages such as C, and extracting it automatically is still a very active research topic. In the near future at least, the way software is written needs to change to exploit all hardware features, but in many cases this is a change for the better: often the problem with current methods is that code is overly specific, and more general abstractions are actually easier to generate code from.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2015. p. 188
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1641
National Category
Computer Engineering; Signal Processing
Identifiers
urn:nbn:se:liu:diva-113702 (URN); 10.3384/diss.diva-113702 (DOI); 978-91-7519-151-5 (ISBN)
Public defence
2015-02-27, Visionen, Hus B, Campus Valla, Linköpings universitet, Linköping, 10:15 (Swedish)
Opponent
Supervisors
Available from: 2015-01-29 Created: 2015-01-29 Last updated: 2018-01-11. Bibliographically approved
Karlsson, A., Sohl, J. & Liu, D. (2015). Energy-efficient sorting with the distributed memory architecture ePUMA. In: IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). Paper presented at the IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) (pp. 116-123). Institute of Electrical and Electronics Engineers (IEEE)
Energy-efficient sorting with the distributed memory architecture ePUMA
2015 (English). In: IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 116-123. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents the novel heterogeneous DSP architecture ePUMA and demonstrates its features through an implementation of sorting of larger data sets. We derive a sorting algorithm with fixed-size merging tasks suitable for distributed memory architectures, which allows very simple scheduling and predictable, data-independent sorting time. The implementation on ePUMA utilizes the architecture's specialized compute cores and control cores, and local memory parallelism, to separate and overlap sorting with data access and control for close to stall-free sorting. Penalty-free unaligned and out-of-order local memory access is used in combination with proposed application-specific sorting instructions to derive highly efficient local sorting and merging kernels used by the system-level algorithm. Our evaluation shows that the proposed implementation can rival the sorting performance of high-performance commercial CPUs and GPUs, with two orders of magnitude higher energy efficiency, which would allow high-performance sorting on low-power devices.
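The system-level idea of fixed-size merging tasks can be sketched in plain Python. This is our own illustration of the scheduling-friendly structure the abstract describes, not the paper's ePUMA kernels; all names are hypothetical:

```python
def sort_with_fixed_tasks(data, block=4):
    """Sort by (1) sorting fixed-size blocks, then (2) repeatedly merging
    pairs of equal-length sorted runs. Every merge task has a known,
    data-independent size, so task scheduling can be fixed statically."""
    runs = [sorted(data[i:i + block]) for i in range(0, len(data), block)]
    while len(runs) > 1:
        merged = []
        for a, b in zip(runs[::2], runs[1::2]):
            # one merge task: produces exactly len(a) + len(b) outputs
            out, i, j = [], 0, 0
            while i < len(a) and j < len(b):
                if a[i] <= b[j]:
                    out.append(a[i]); i += 1
                else:
                    out.append(b[j]); j += 1
            out += a[i:] + b[j:]
            merged.append(out)
        if len(runs) % 2:            # odd run carried to the next round
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []
```

Because every task's input and output sizes are known up front, both the number of merge rounds and the work per round are independent of the data values, which is what makes the predictable sorting time possible.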

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120398 (URN); 10.1109/Trustcom.2015.620 (DOI); 000380431400015; 978-1-4673-7952-6 (ISBN)
Conference
IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Karlsson, A., Sohl, J. & Liu, D. (2015). ePUMA: A Processor Architecture for Future DSP. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July 2015. Paper presented at the IEEE International Conference on Digital Signal Processing (DSP) (pp. 253-257).
ePUMA: A Processor Architecture for Future DSP
2015 (English). In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July 2015, 2015, pp. 253-257. Conference paper, Published paper (Refereed)
Abstract [en]

Since the breakdown of Dennard scaling, the primary design goal for processors has shifted from increasing performance to increasing performance per watt. The ePUMA platform is a flexible and configurable DSP platform that addresses many of the problems of traditional DSP designs, increasing performance while using less power. We trade the flexibility of traditional VLIW DSP designs for a simpler single instruction issue scheme and instead make sure that each instruction can perform more work. Multi-cycle instructions can operate directly on vectors and matrices in memory, and the datapaths implement common DSP subgraphs directly in hardware, for high compute throughput. Memory bottlenecks, common in other architectures, are handled with flexible LUT-based multi-bank memory addressing and memory parallelism. A major contributor to energy consumption, data movement, is reduced by using a heterogeneous interconnect and by clustering compute resources around local memories for simple data sharing. To evaluate ePUMA we have implemented the majority of the kernel library from a commercial VLIW DSP manufacturer for comparison. Our results show not only good performance, but also an order of magnitude increase in energy and area efficiency. In addition, the kernel code size is reduced by 91% on average compared to the VLIW DSP. These benefits make ePUMA an attractive solution for future DSP.

National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120396 (URN); 10.1109/ICDSP.2015.7251870 (DOI); 000380506600052; 978-1-4799-8058-1 (ISBN)
Conference
IEEE International Conference on Digital Signal Processing (DSP)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Karlsson, A., Sohl, J. & Liu, D. (2015). Software-based QPP Interleaving for Baseband DSPs with LUT-accelerated Addressing. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July 2015. Paper presented at the IEEE International Conference on Digital Signal Processing (DSP). Institute of Electrical and Electronics Engineers (IEEE)
Software-based QPP Interleaving for Baseband DSPs with LUT-accelerated Addressing
2015 (English). In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015. Conference paper, Published paper (Refereed)
Abstract [en]

This paper demonstrates how QPP interleaving and de-interleaving for Turbo decoding in 3GPP-LTE can be implemented efficiently on baseband processors with lookup-table (LUT) based addressing support for multi-bank memory. We introduce a LUT-compression technique that reduces the LUT size to 1% of what would otherwise be needed to store the full data access patterns for all LTE block sizes. By reusing the already existing program memory of a baseband processor to store the LUTs and using our proposed general address generator, our 8-way data access path can reach the same throughput as a dedicated 8-way interleaving ASIC implementation. This avoids adding a dedicated interleaving address generator to the processor, which, according to ASIC synthesis, would be 75% larger than our proposed address generator. Since our software implementation only involves the address generator, the processor's datapaths are free to perform the other operations of Turbo decoding in parallel with interleaving. Our software implementation ensures programmability and flexibility, and is the fastest software-based implementation of QPP interleaving known to us.
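For background, the 3GPP-LTE QPP interleaver is defined by pi(i) = (f1·i + f2·i²) mod K, and its addresses can be generated with two running adders instead of multiplications, which is what makes a lightweight address generator practical. A sketch of ours (we believe f1 = 3, f2 = 10 are the LTE table values for block size K = 40, but treat them as example parameters):

```python
def qpp_interleave(K, f1, f2):
    """Generate the QPP permutation pi(i) = (f1*i + f2*i*i) mod K
    incrementally: pi(i+1) - pi(i) = f1 + f2*(2i + 1), so two running
    adders (pi and g) replace the multiplications."""
    pi = 0
    g = (f1 + f2) % K            # g = pi(1) - pi(0)
    addrs = []
    for _ in range(K):
        addrs.append(pi)
        pi = (pi + g) % K        # first running adder
        g = (g + 2 * f2) % K     # second running adder
    return addrs
```

Valid LTE (f1, f2) pairs make pi a permutation of 0..K-1, so the generated sequence can directly drive interleaved reads or writes across memory banks.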

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-120395 (URN); 10.1109/ICDSP.2015.7251983 (DOI); 000380506600165; 978-1-4799-8058-1 (ISBN)
Conference
IEEE International Conference on Digital Signal Processing (DSP)
Projects
HiPEC
Available from: 2015-08-04 Created: 2015-08-04 Last updated: 2018-01-11
Sohl, J., Karlsson, A. & Liu, D. (2013). Conflict-free data access for multi-bank memory architectures using padding. In: High Performance Computing (HiPC), 2013. Paper presented at the 20th International Conference on High Performance Computing (HiPC 2013), 18-21 December 2013, Bangalore, India (pp. 425-432). IEEE
Conflict-free data access for multi-bank memory architectures using padding
2013 (English). In: High Performance Computing (HiPC), 2013, IEEE, 2013, pp. 425-432. Conference paper, Published paper (Refereed)
Abstract [en]

For high-performance computation, memory access is a major issue. Whether in a supercomputer, a GPGPU device, or an Application Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that, from a computational point of view, could be implemented efficiently on parallel architectures fail to achieve significant speed-ups. The reason is very often that the available execution units are poorly utilized due to inefficient data access. This paper presents a method for improving the access time for data sequences that are completely static, at the cost of extra memory: memory bank conflicts are resolved by padding. The method can be applied automatically, and it is shown to significantly reduce the data access time for sorting and FFTs. The execution time is improved by a factor of up to 3.4 for the FFT and up to 8 for sorting.
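The mechanism behind padding is easy to demonstrate: in a low-interleaved P-bank memory, a stride-s access touches only P/gcd(s, P) distinct banks, and padding each row changes the effective stride. A small sketch of ours (an illustration of the principle, not the paper's full method):

```python
from math import gcd

def banks(addresses, P=8):
    """Bank index of each address in a P-bank, low-interleaved memory."""
    return [a % P for a in addresses]

# Column access of an 8x8 row-major matrix: stride 8 in an 8-bank memory.
P, cols = 8, 8
column = [r * cols for r in range(8)]
# All 8 addresses land in the same bank: an 8-way conflict.
assert len(set(banks(column, P))) == P // gcd(cols, P)   # == 1

# Pad each row with one unused element: effective stride becomes 9,
# and gcd(9, 8) == 1 spreads the column across all 8 banks.
padded_column = [r * (cols + 1) for r in range(8)]
assert len(set(banks(padded_column, P))) == P
```

The cost is exactly the extra memory the abstract mentions: one unused element per row in this example.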

Place, publisher, year, edition, pages
IEEE, 2013
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-106937 (URN); 10.1109/HiPC.2013.6799112 (DOI)
Conference
20th International Conference on High Performance Computing (HiPC 2013), 18-21 December 2013, Bangalore, India
Available from: 2014-05-27 Created: 2014-05-27 Last updated: 2014-06-03. Bibliographically approved
Karlsson, A., Sohl, J., Wang, J. & Liu, D. (2013). ePUMA: A unique memory access based parallel DSP processor for SDR and CR. In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. Paper presented at the 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), 3-5 December 2013, Austin, TX, USA (pp. 1234-1237). IEEE
ePUMA: A unique memory access based parallel DSP processor for SDR and CR
2013 (English). In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, IEEE, 2013, pp. 1234-1237. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents ePUMA, a master-slave heterogeneous DSP processor for communications and multimedia. We introduce the ePUMA VPE, a vector-processing slave core designed for heavy DSP workloads, and demonstrate how its features can be used to implement DSP kernels that efficiently overlap computing, data access, and control to achieve maximum datapath utilization. The efficiency is evaluated by implementing a basic set of kernels commonly used in SDR. The experiments show that all kernels asymptotically reach above 90% effective datapath utilization, while many approach 100%; thus the design effectively overlaps computing, data access, and control. Compared to popular VLIW solutions, the need for a large register file with many ports is eliminated, saving power and chip area. When compared to a commercial VLIW solution, our solution also achieves code size reductions of up to 30 times and a significantly simplified kernel implementation.

Place, publisher, year, edition, pages
IEEE, 2013
Keywords
digital signal processing chips; parallel processing; CR; DSP kernels; SDR; data access; ePUMA VPE; master-slave heterogeneous DSP processor; maximum datapath utilization; memory access based parallel DSP processor; vector processing element; vector processing slave-core; Assembly; Computer architecture; Digital signal processing; Kernel; Registers; VLIW; Vectors; DSP; SDR; VPE; ePUMA
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-109322 (URN); 10.1109/GlobalSIP.2013.6737131 (DOI); 978-147990248-4 (ISBN)
Conference
1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), 3-5 December 2013, Austin, TX, USA
Available from: 2014-08-13 Created: 2014-08-13 Last updated: 2018-01-11. Bibliographically approved
Sohl, J., Wang, J., Karlsson, A. & Liu, D. (2012). Automatic Permutation for Arbitrary Static Access Patterns. In: Parallel and Distributed Processing with Applications (ISPA), 2012. Paper presented at the 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 10-13 July 2012, Madrid, Spain (pp. 215-222). IEEE
Automatic Permutation for Arbitrary Static Access Patterns
2012 (English). In: Parallel and Distributed Processing with Applications (ISPA), 2012, IEEE, 2012, pp. 215-222. Conference paper, Published paper (Refereed)
Abstract [en]

A significant portion of the execution time on current SIMD and VLIW processors is spent on data access rather than on instructions that perform actual computations. The ePUMA architecture provides features that allow arbitrary data elements to be accessed in parallel, as long as the elements reside in different memory banks. By using permutation to place data elements that are accessed in parallel, the overhead from memory access can be greatly reduced and, in many cases, completely removed. This paper presents a practical method for automatic permutation based on Integer Linear Programming (ILP). No assumptions are made about the structure of the access patterns other than their static nature. Methods for speeding up the solution time for periodic access patterns and for reusing existing solutions are also presented. Benchmarks for, e.g., FFTs show speed-ups of up to 3.4 when using permutation compared to regular implementations.
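The constraint the ILP formulation encodes can be shown on a toy instance: assign each data element to a bank such that every parallel access pattern touches pairwise-distinct banks. A brute-force sketch of ours (the paper solves this with ILP at realistic sizes; brute force is only feasible for tiny cases like this one):

```python
from itertools import product

def find_assignment(n_elems, n_banks, patterns):
    """Exhaustively search for a bank assignment in which every access
    pattern (a tuple of element indices read in parallel) touches
    pairwise-distinct banks. Returns one assignment, or None."""
    for assign in product(range(n_banks), repeat=n_elems):
        if all(len({assign[e] for e in pat}) == len(pat) for pat in patterns):
            return list(assign)
    return None

# A 2x4 matrix stored as elements 0..7, with both rows and columns
# accessed in parallel; a 4-bank memory must serve both pattern sets.
rows = [(0, 1, 2, 3), (4, 5, 6, 7)]
cols = [(0, 4), (1, 5), (2, 6), (3, 7)]
assign = find_assignment(8, 4, rows + cols)
```

Note that no assumption about pattern structure is needed here either: any static set of index tuples can be fed in, which mirrors the paper's "arbitrary static access patterns" setting.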

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
integer programming; linear programming; multiprocessing systems; parallel architectures; storage management; ILP; SIMD processor; VLIW processor; arbitrary data element; arbitrary static access pattern; automatic permutation; data access; ePUMA architecture; execution time; integer linear programming; memory access; memory banks; periodic access pattern; Discrete cosine transforms; Equations; Hardware; Mathematical model; Memory management; Program processors; Vectors; integer linear programming; multi-bank memories; parallel data access; permutation
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100377 (URN); 10.1109/ISPA.2012.36 (DOI); 978-1-4673-1631-6 (ISBN)
Conference
2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 10-13 July 2012, Madrid, Spain
Projects
ePUMA, HiPEC
Funder
Swedish Foundation for Strategic Research
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11
Wang, J., Karlsson, A., Sohl, J. & Liu, D. (2012). Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory. In: Digital System Design (DSD), 2012. Paper presented at the 15th Euromicro Conference on Digital System Design (DSD 2012), 5-8 September 2012, Izmir, Turkey (pp. 529-532). IEEE
Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory
2012 (English). In: Digital System Design (DSD), 2012, IEEE, 2012, pp. 529-532. Conference paper, Published paper (Refereed)
Abstract [en]

The Single Instruction Multiple Data (SIMD) architecture has proven to be a suitable parallel processor architecture for media and communication signal processing. However, computing overheads such as memory access latency and vector data permutation limit the performance of conventional SIMD processors. Solutions such as combined VLIW and SIMD architectures bring increased complexity in compiler design and assembly programming. This paper introduces the SIMD processor in the ePUMA platform, which uses a deep execution pipeline and flexible parallel memory to achieve high computing performance. Its deep pipeline can execute combined operations in one cycle, and the parallel memory architecture supports conflict-free parallel data access. It solves the problem of large vector permutation in a short-vector SIMD machine more efficiently than a conventional vector permutation instruction. We evaluate the architecture by implementing the soft-decision Viterbi algorithm for convolutional decoding. The result is compared with other architectures, including the TI C54x, CEVA TeakLite III, and PowerPC AltiVec, to show ePUMA's computing efficiency advantage.
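For reference, the add-compare-select (ACS) recursion at the heart of Viterbi decoding can be sketched for a rate-1/2, constraint-length-3 code with the common (7, 5) generator polynomials. This is a hard-decision software illustration of ours, not the paper's soft-decision SIMD implementation (soft decoding replaces the Hamming metric with soft bit metrics):

```python
def viterbi_decode(bits, K=3, polys=(0b111, 0b101)):
    """Hard-decision Viterbi decoding of a rate-1/2 convolutional code.
    The inner loop is the add-compare-select (ACS) step that SIMD
    datapaths accelerate; the state is the last K-1 input bits."""
    n_states = 1 << (K - 1)
    INF = float("inf")
    metric = [0.0] + [INF] * (n_states - 1)   # start in state 0
    paths = [[] for _ in range(n_states)]
    for i in range(0, len(bits), 2):
        r = (bits[i], bits[i + 1])
        new_metric = [INF] * n_states
        new_paths = [None] * n_states
        for s in range(n_states):
            for b in (0, 1):
                reg = (b << (K - 1)) | s       # shift-register contents
                nxt = reg >> 1                 # successor state
                out = tuple(bin(reg & p).count("1") & 1 for p in polys)
                # add: branch metric is the Hamming distance to r
                m = metric[s] + (out[0] != r[0]) + (out[1] != r[1])
                if m < new_metric[nxt]:        # compare-select
                    new_metric[nxt] = m
                    new_paths[nxt] = paths[s] + [b]
        metric, paths = new_metric, new_paths
    return paths[metric.index(min(metric))]
```

The per-state independence of the ACS updates is what makes the algorithm a natural fit for a parallel datapath.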

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
convolutional codes; decoding; parallel processing; random-access storage; CEVA TeakLite III; PowerPC AltiVec; TI C54x; VLIW architecture; assembly programming; communication signal processing; compiler design; conflict free parallel data access; convolutional decoding; deep-pipelined SIMD processor; ePUMA platform; flexible parallel memory; media signal processing; memory access latency; parallel processor architecture; pipeline parallel memory; short vector SIMD machine; single instruction multiple data architecture; soft decision Viterbi algorithm; vector data permutation; vector permutation instruction; Computer architecture; Convolution; Convolutional codes; Decoding; Measurement; Vectors; Viterbi algorithm; Convolutional decoding; Parallel memory; SIMD; Viterbi; ePUMA
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100380 (URN); 10.1109/DSD.2012.34 (DOI); 978-1-4673-2498-4 (ISBN)
Conference
15th Euromicro Conference on Digital System Design (DSD 2012), 5-8 September 2012, Izmir, Turkey
Projects
ePUMA
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11. Bibliographically approved
Wang, J., Karlsson, A., Sohl, J., Pettersson, M. & Liu, D. (2011). A multi-level arbitration and topology free streaming network for chip multiprocessor. In: ASIC (ASICON), 2011. Paper presented at the IEEE 9th International Conference on ASIC (ASICON 2011), 25-28 October 2011, Xiamen, China (pp. 153-158). IEEE
A multi-level arbitration and topology free streaming network for chip multiprocessor
2011 (English). In: ASIC (ASICON), 2011, IEEE, 2011, pp. 153-158. Conference paper, Published paper (Refereed)
Abstract [en]

Predictable computing is common in embedded signal processing, which is characterized by data-independent memory access and long streaming data transfers. This paper presents StreamNet, a streaming network-on-chip (NoC) for chip multiprocessor (CMP) platforms targeting predictable signal processing. The network is based on circuit switching and uses a two-level arbitration scheme: the first level uses fast hardware arbitration, and the second level is programmable software arbitration. Its communication protocol is designed to support a free choice of network topology. Together with its scheduling tool, the network can achieve high communication efficiency and improve parallel computing performance. This NoC architecture is used to design the ring network in the ePUMA multiprocessor DSP. Evaluation with a multi-user signal processing application for an LTE base station shows the low parallel computing overhead of the ePUMA multiprocessor platform.
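The two-level arbitration idea can be modeled conceptually: a fast fixed-priority pick inside each requester group, underneath a software-programmed rotation between groups. This is our own simplified behavioral model with hypothetical names, not the StreamNet hardware:

```python
class TwoLevelArbiter:
    """Conceptual two-level arbiter: level 1 is a fixed-priority pick
    inside each requester group (modeling the fast hardware level);
    level 2 rotates between groups in a software-programmed order
    (modeling the programmable level)."""

    def __init__(self, groups, group_order):
        self.groups = groups        # group id -> requesters, priority order
        self.order = group_order    # software-programmed group sequence
        self.rr = 0                 # rotation pointer over `order`

    def grant(self, requests):
        """Return the requester granted this cycle, or None."""
        for k in range(len(self.order)):
            g = self.order[(self.rr + k) % len(self.order)]
            # level 1: first requester in the group's fixed priority list
            for req in self.groups[g]:
                if req in requests:
                    # advance past the granted group for fairness
                    self.rr = (self.rr + k + 1) % len(self.order)
                    return req
        return None
```

Splitting the decision this way keeps the per-cycle hardware choice cheap while leaving the inter-group policy reprogrammable, which is the property the abstract attributes to the two-level scheme.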

Place, publisher, year, edition, pages
IEEE, 2011
Keywords
multiprocessing systems; network-on-chip; parallel processing; protocols; scheduling; signal processing; LTE base-station; StreamNet; chip multiprocessor; communication characteristics; communication protocol; data independent memory access; ePUMA multiprocessor platform; ePUMA multiprocessor DSP; embedded signal processing; long streaming data transfer; multilevel arbitration; multiuser signal processing; parallel computing performance; programmable software arbitration; ring network; scheduling tool; streaming network-on-chip; topology free streaming network; Bandwidth; Communications technology; Computer architecture; Digital signal processing; Manuals; Software; Switches; Multiprocessor; Network-on-chip; Predictable computing; Streaming network
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100381 (URN); 10.1109/ASICON.2011.6157145 (DOI); 978-1-61284-192-2 (ISBN); 978-1-61284-191-5 (ISBN)
Conference
IEEE 9th International Conference on ASIC (ASICON 2011), 25-28 October 2011, Xiamen, China
Projects
ePUMA
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11. Bibliographically approved