Wang, Jian
Publications (10 of 11)
Wang, J. (2014). Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor. (Doctoral dissertation). Linköping: Linköping University Electronic Press
Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor
2014 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

Physical scaling following Moore’s law is saturating while the demand for computing keeps growing. The gain from improved silicon technology is now mainly a smaller silicon area, as speed-power scaling has almost stopped in the last two years. This calls for new parallel computing architectures and new parallel programming methods.

Traditional ASIC (Application Specific Integrated Circuit) hardware has been used to accelerate Digital Signal Processing (DSP) subsystems on a SoC (System-on-Chip). Embedded systems have become more complicated: more functions, more applications, and more features must be integrated into one ASIC chip to keep up with market requirements. At the same time, the product lifetime of an ASIC-based SoC has shrunk considerably because of the dynamic market. The design lifetime of a typical ASIC-accelerated main chip in a mobile phone is about half a year, and its NRE (Non-Recurring Engineering) cost can be well over 50 million US$.

The current situation therefore calls for a different solution than ASIC. An ASIP (Application Specific Instruction set Processor) offers power consumption and silicon cost comparable to an ASIC, and its greatest advantage is functional flexibility within a predefined application domain. An ASIP-based SoC enables software upgrades without changing the hardware, so the product lifetime can be 5-10 times longer than that of an ASIC-based SoC.

This dissertation presents an ASIP-based SoC, a new unified parallel DSP subsystem named ePUMA (embedded Parallel DSP Platform with Unique Memory Access), targeting embedded signal processing in communication and multimedia applications. The unified DSP subsystem can further reduce the hardware cost, especially the memory cost, of embedded SoC processors and, most importantly, provides full programmability for a wide range of DSP applications. The ePUMA processor is based on a master-slave heterogeneous multi-core architecture: one master core performs the central control, and multiple Single Instruction Multiple Data (SIMD) coprocessors work in parallel to provide the majority of the computing power.

The focus and the main contribution of this thesis are on the memory subsystem design of ePUMA. The multi-core system uses a distributed memory architecture based on scratchpad memories and software controlled data movement. It is suitable for the data access properties of streaming applications and the kernel based multi-core computing model. The essential techniques include the conflict free access parallel memory architecture, the multi-layer interconnection network, the non-address stream data transfer, the transitioned memory buffers, and the lookup table based parallel memory addressing. The goal of the design is to minimize the hardware cost, simplify the software protocol for inter-processor communication, and increase the arithmetic computing efficiency.
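To make the lookup table based, conflict-free parallel addressing concrete, the following toy sketch (written in Python purely for illustration; the bank count, the skewing formula standing in for the lookup table, and the helper names are assumptions, not the thesis design) shows how a bank-assignment function can spread both a matrix row and a matrix column over distinct memory banks, so that one wide vector access never hits the same bank twice:

    # Illustrative sketch: 8 banks, skewed bank assignment, simple helpers (all assumed).
    # A lookup-table/skewing function assigns each matrix element to a bank so that
    # both a full row and a full column can be fetched in one conflict-free parallel
    # access, i.e. the 8 requested elements land in 8 different banks.
    N_BANKS = 8

    def bank_of(row, col):
        # The "LUT" is realised here as a skewing formula; a real design would store
        # a table in hardware and could hold arbitrary application-specific patterns.
        return (row + col) % N_BANKS

    def conflict_free(addresses):
        banks = [bank_of(r, c) for (r, c) in addresses]
        return len(set(banks)) == len(banks)   # every element in its own bank

    row_access = [(3, c) for c in range(N_BANKS)]      # one matrix row
    col_access = [(r, 5) for r in range(N_BANKS)]      # one matrix column

    print(conflict_free(row_access))   # True: row spread over all 8 banks
    print(conflict_free(col_access))   # True: column also conflict-free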

Application results so far show that most DSP algorithms, such as filters, vector/matrix operations, transforms, and arithmetic functions, can achieve a computing efficiency of over 70% on the ePUMA platform, and that the non-address stream network provides equivalent communication bandwidth at less than 30% of the implementation cost of a crossbar interconnect.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2014. p. 190
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1532
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-105866 (URN), 10.3384/diss.diva-105866 (DOI), 978-91-7519-556-8 (ISBN)
Public defence
2014-05-09, Visionen, Hus B, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2014-04-11 Created: 2014-04-11 Last updated: 2015-02-18. Bibliographically approved
Karlsson, A., Sohl, J., Wang, J. & Liu, D. (2013). ePUMA: A unique memory access based parallel DSP processor for SDR and CR. In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE: . Paper presented at 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), 3-5 December 2013, Austin, TX, USA (pp. 1234-1237). IEEE
ePUMA: A unique memory access based parallel DSP processor for SDR and CR
2013 (English) In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, IEEE, 2013, p. 1234-1237. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents ePUMA, a master-slave heterogeneous DSP processor for communications and multimedia. We introduce the ePUMA VPE, a vector processing slave core designed for heavy DSP workloads, and demonstrate how its features can be used to implement DSP kernels that efficiently overlap computing, data access, and control to achieve maximum datapath utilization. The efficiency is evaluated by implementing a basic set of kernels commonly used in SDR. The experiments show that all kernels asymptotically reach above 90% effective datapath utilization, while many approach 100%; thus the design effectively overlaps computing, data access, and control. Compared to popular VLIW solutions, the need for a large register file with many ports is eliminated, saving power and chip area. Compared to a commercial VLIW solution, our solution also achieves code size reductions of up to 30 times and a significantly simplified kernel implementation.
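As a rough illustration of what overlapping computing, data access and control means in practice, the following generic double-buffering sketch (illustrative Python with made-up helper names, not the ePUMA VPE kernel code) starts the DMA load of the next data block before processing the current one, so the datapath rarely waits for memory:

    # Generic double-buffering sketch; helper names are illustrative stand-ins, not ePUMA APIs.
    def start_dma_load(block, into):   # stand-in for a non-blocking DMA transfer
        print(f"DMA: loading block {block} into buffer {into}")

    def wait_dma(buf):                 # stand-in for DMA-completion synchronisation
        pass

    def process(buf):                  # stand-in for the arithmetic kernel
        print(f"compute: processing buffer {buf}")

    def run_kernel(blocks):
        start_dma_load(blocks[0], into=0)                 # prefetch first block
        for i in range(len(blocks)):
            cur, nxt = i % 2, (i + 1) % 2
            wait_dma(cur)                                 # input for block i is ready
            if i + 1 < len(blocks):
                start_dma_load(blocks[i + 1], into=nxt)   # fetch block i+1 ...
            process(cur)                                  # ... while computing block i

    run_kernel(list(range(4)))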

Place, publisher, year, edition, pages
IEEE, 2013
Keywords
digital signal processing chips; parallel processing; CR; DSP kernels; SDR; data access; ePUMA VPE; master-slave heterogeneous DSP processor; maximum datapath utilization; memory access based parallel DSP processor; vector processing element; vector processing slave-core; Assembly; Computer architecture; Digital signal processing; Kernel; Registers; VLIW; Vectors; DSP; SDR; VPE; ePUMA
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-109322 (URN), 10.1109/GlobalSIP.2013.6737131 (DOI), 978-147990248-4 (ISBN)
Conference
1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), 3-5 December 2013, Austin, TX, USA
Available from: 2014-08-13 Created: 2014-08-13 Last updated: 2018-01-11. Bibliographically approved
Sohl, J., Wang, J., Karlsson, A. & Liu, D. (2012). Automatic Permutation for Arbitrary Static Access Patterns. In: Parallel and Distributed Processing with Applications (ISPA), 2012: . Paper presented at 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 10-13 July 2012, Madrid, Spain (pp. 215-222). IEEE
Automatic Permutation for Arbitrary Static Access Patterns
2012 (English) In: Parallel and Distributed Processing with Applications (ISPA), 2012, IEEE, 2012, p. 215-222. Conference paper, Published paper (Refereed)
Abstract [en]

A significant portion of the execution time on current SIMD and VLIW processors is spent on data access rather than on instructions that perform actual computations. The ePUMA architecture provides features that allow arbitrary data elements to be accessed in parallel as long as the elements reside in different memory banks. By using permutation to place the data elements that are accessed in parallel, the overhead from memory access can be greatly reduced and, in many cases, completely removed. This paper presents a practical method for automatic permutation based on Integer Linear Programming (ILP). No assumptions are made about the structure of the access patterns other than their static nature. Methods for speeding up the solution time for periodic access patterns and for reusing existing solutions are also presented. Benchmarks, e.g. for FFTs, show speedups of up to 3.4 when using permutation compared to regular implementations.
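The bank-assignment problem behind such automatic permutation can be sketched as a small ILP, for example with the PuLP solver as below; the element count, bank count, access patterns and the conflict-counting objective are illustrative assumptions and not the paper's actual formulation:

    # Toy ILP for conflict-free bank assignment, using the PuLP library.
    # The element set, bank count and access patterns below are made up for illustration.
    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, LpInteger

    elements = range(8)                                        # 8 data elements
    banks = range(4)                                           # 4 memory banks
    accesses = [[0, 1, 2, 3], [0, 4, 2, 6], [1, 3, 5, 7]]      # parallel vector accesses

    prob = LpProblem("bank_assignment", LpMinimize)
    x = LpVariable.dicts("x", (elements, banks), cat=LpBinary)     # element -> bank choice
    y = LpVariable.dicts("y", (range(len(accesses)), banks),
                         lowBound=0, cat=LpInteger)                # per-access conflict count

    for e in elements:                          # every element lives in exactly one bank
        prob += lpSum(x[e][b] for b in banks) == 1

    for a, acc in enumerate(accesses):          # extra cycles whenever more than one element
        for b in banks:                         # of the same access falls into the same bank
            prob += y[a][b] >= lpSum(x[e][b] for e in acc) - 1

    prob += lpSum(y[a][b] for a in range(len(accesses)) for b in banks)   # minimise conflicts

    prob.solve()
    assignment = {e: next(b for b in banks if x[e][b].value() > 0.5) for e in elements}
    print(assignment)      # a placement with zero bank conflicts, if one exists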

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
integer programming;linear programming;multiprocessing systems;parallel architectures;storage management;ILP;SIMD processor;VLIW processor;arbitrary data element;arbitrary static access pattern;automatic permutation;data access;ePUMA architecture;execution time;integer linear programming;memory access;memory banks;periodic access pattern;Discrete cosine transforms;Equations;Hardware;Mathematical model;Memory management;Program processors;Vectors;integer linear programming;multi-bank memories;parallel data access;permutation
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100377 (URN), 10.1109/ISPA.2012.36 (DOI), 978-1-4673-1631-6 (ISBN)
Conference
2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 10-13 July 2012, Madrid, Spain
Projects
ePUMAHiPEC
Funder
Swedish Foundation for Strategic Research
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11
Wang, J., Karlsson, A., Sohl, J. & Liu, D. (2012). Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory. In: Digital System Design (DSD), 2012: . Paper presented at 15th Euromicro Conference on Digital System Design (DSD 2012), 5-8 September 2012, Izmir, Turkey (pp. 529-532). IEEE
Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory
2012 (English) In: Digital System Design (DSD), 2012, IEEE, 2012, p. 529-532. Conference paper, Published paper (Refereed)
Abstract [en]

The Single Instruction Multiple Data (SIMD) architecture has proved to be a suitable parallel processor architecture for media and communication signal processing. However, computing overheads such as memory access latency and vector data permutation limit the performance of conventional SIMD processors. Solutions such as combined VLIW and SIMD architectures come at the price of increased complexity in compiler design and assembly programming. This paper introduces the SIMD processor in the ePUMA1 platform, which uses a deep execution pipeline and a flexible parallel memory to achieve high computing performance. Its deep pipeline can execute combined operations in one cycle, and the parallel memory architecture supports conflict-free parallel data access. It solves the problem of large vector permutations in a short-vector SIMD machine more efficiently than conventional vector permutation instructions. We evaluate the architecture by implementing the soft-decision Viterbi algorithm for convolutional decoding. The result is compared with other architectures, including TI C54x, CEVA TeakLite III, and PowerPC AltiVec, to show ePUMA’s computing efficiency advantage.
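For readers unfamiliar with the kernel being benchmarked, the following plain Python reference model of a soft-decision Viterbi decoder (rate-1/2, constraint length 3, generators 7 and 5 octal, all chosen here only as a small example) shows the add-compare-select recursion that such a decoder performs; it contains none of the parallel-memory or pipelining techniques evaluated on ePUMA:

    # Minimal soft-decision Viterbi decoder for a rate-1/2, constraint-length-3
    # convolutional code (generators 7 and 5 octal). Plain reference model only.
    G = (0b111, 0b101)          # generator polynomials
    N_STATES = 4                # 2^(K-1) with K = 3

    def parity(x):
        return bin(x).count("1") & 1

    def encode(bits):
        s, out = 0, []
        for u in bits:
            reg = (u << 2) | s
            out += [1 - 2 * parity(reg & G[0]), 1 - 2 * parity(reg & G[1])]  # bit 0 -> +1, bit 1 -> -1
            s = (u << 1) | (s >> 1)
        return out

    def viterbi_decode(soft, n_bits):
        metric = [0.0] + [-1e9] * (N_STATES - 1)       # encoder starts in state 0
        paths = [[] for _ in range(N_STATES)]
        for t in range(n_bits):
            r = soft[2 * t: 2 * t + 2]
            new_metric = [-1e18] * N_STATES
            new_paths = [None] * N_STATES
            for s in range(N_STATES):
                for u in (0, 1):
                    reg = (u << 2) | s
                    exp = [1 - 2 * parity(reg & G[0]), 1 - 2 * parity(reg & G[1])]
                    m = metric[s] + r[0] * exp[0] + r[1] * exp[1]   # soft correlation metric
                    ns = (u << 1) | (s >> 1)
                    if m > new_metric[ns]:                          # add-compare-select
                        new_metric[ns] = m
                        new_paths[ns] = paths[s] + [u]
            metric, paths = new_metric, new_paths
        best = max(range(N_STATES), key=lambda s: metric[s])
        return paths[best]

    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    rx = [x + 0.1 for x in encode(bits)]            # mildly perturbed channel symbols
    print(viterbi_decode(rx, len(bits)) == bits)    # True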

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
convolutional codes;decoding;parallel processing;random-access storage;CEVA TeakLike III;PowerPC AltiVec;TI C54x;VLIW architecture;assembly programming;communication signal processing;compiler design;conflict free parallel data access;convolutional decoding;deep-pipelined SIMD processor;ePUMA1 platform;flexible parallel memory;media signal processing;memory access latency;parallel processor architecture;pipeline parallel memory;short vector SIMD machine;single instruction multiple data architecture;soft decision Viterbi algorithm;vector data permutation limit;vector permutation instruction;Computer architecture;Convolution;Convolutional codes;Decoding;Measurement;Vectors;Viterbi algorithm;Convolutional decoding;Parallel memory;SIMD;Viterbi;ePUMA
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100380 (URN), 10.1109/DSD.2012.34 (DOI), 978-1-4673-2498-4 (ISBN)
Conference
15th Euromicro Conference on Digital System Design (DSD 2012), 5-8 September 2012, Izmir, Turkey
Projects
ePUMA
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11. Bibliographically approved
Wang, J., Karlsson, A., Sohl, J., Pettersson, M. & Liu, D. (2011). A multi-level arbitration and topology free streaming network for chip multiprocessor. In: ASIC (ASICON), 2011: . Paper presented at IEEE 9th International Conference on ASIC (ASICON 2011), 25-28 October 2011, Xiamen, China (pp. 153-158). IEEE
A multi-level arbitration and topology free streaming network for chip multiprocessor
2011 (English) In: ASIC (ASICON), 2011, IEEE, 2011, p. 153-158. Conference paper, Published paper (Refereed)
Abstract [en]

Predictable computing is common in embedded signal processing, whose communication is characterized by data-independent memory access and long streaming data transfers. This paper presents StreamNet, a streaming network-on-chip (NoC) for chip multiprocessor (CMP) platforms targeting predictable signal processing. The network is based on circuit switching and uses a two-level arbitration scheme: the first level uses fast hardware arbitration, and the second level is programmable software arbitration. Its communication protocol is designed to support a free choice of network topology. Together with its scheduling tool, the network can achieve high communication efficiency and improve parallel computing performance. This NoC architecture is used to design the Ring network in the ePUMA1 multiprocessor DSP. An evaluation with a multi-user signal processing application for an LTE base station shows the low parallel computing overhead of the ePUMA multiprocessor platform.
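The two-level arbitration idea can be pictured with a small behavioural model (illustrative Python; the class, the slot-table form of the software schedule and the round-robin fallback are assumptions, not the StreamNet hardware): a programmable schedule reserves link slots for known streams, and a fast round-robin arbiter resolves whatever the schedule leaves open:

    # Behavioural toy model of a two-level arbiter; names and slot scheme are assumptions.
    class TwoLevelArbiter:
        def __init__(self, n_requesters, schedule):
            self.n = n_requesters
            self.schedule = schedule        # level 2: programmable slot -> reserved owner
            self.rr = 0                     # level 1: fast round-robin pointer

        def grant(self, slot, requests):
            owner = self.schedule.get(slot)
            if owner is not None and owner in requests:
                return owner                # software-programmed reservation wins
            for i in range(self.n):         # otherwise fast round-robin arbitration
                cand = (self.rr + i) % self.n
                if cand in requests:
                    self.rr = (cand + 1) % self.n
                    return cand
            return None

    arb = TwoLevelArbiter(4, schedule={0: 2, 1: 2})   # slots 0 and 1 reserved for core 2
    print(arb.grant(0, {1, 2, 3}))   # 2  (reservation honoured)
    print(arb.grant(2, {1, 3}))      # 1  (round-robin picks next requester)
    print(arb.grant(3, {1, 3}))      # 3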

Place, publisher, year, edition, pages
IEEE, 2011
Keywords
multiprocessing systems;network-on-chip;parallel processing;protocols;scheduling;signal processing;LTE base-station;StreamNet;chip multiprocessor;communication characteristics;communication protocol;data independent memory access;ePUMA multiprocessor platform;ePUMA1 multiprocessor DSP;embedded signal processing;long streaming data transfer;multilevel arbitration;multiuser signal processing;parallel computing performance;programmable software arbitration;ring network;scheduling tool;streaming network-on-chip;topology free streaming network;Bandwidth;Communications technology;Computer architecture;Digital signal processing;Manuals;Software;Switches;Multiprocessor;Network-on-chip;Predictable computing;Streaming network
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100381 (URN), 10.1109/ASICON.2011.6157145 (DOI), 978-1-61284-192-2 (ISBN), 978-1-61284-191-5 (ISBN)
Conference
IEEE 9th International Conference on ASIC (ASICON 2011), 25-28 October 2011, Xiamen, China
Projects
ePUMA
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11. Bibliographically approved
Wang, J., Sohl, J., Karlsson, A. & Liu, D. (2011). An Efficient Streaming Star Network for Multi-core Parallel DSP Processor. In: Networking and Computing (ICNC), 2011: . Paper presented at Second International Conference on Networking and Computing (ICNC 2011), 30 November - 2 December 2011, Osaka, Japan (pp. 332-336). IEEE
An Efficient Streaming Star Network for Multi-core Parallel DSP Processor
2011 (English) In: Networking and Computing (ICNC), 2011, IEEE, 2011, p. 332-336. Conference paper, Published paper (Refereed)
Abstract [en]

As more and more computing components are integrated into one digital signal processing (DSP) system to achieve high computing power by executing tasks in parallel, inter-processor and processor-to-memory communication overheads soon become the performance bottleneck and limit the scalability of a multi-processor platform. For chip multiprocessor (CMP) DSP systems targeting predictable computing, an understanding of the communication characteristics is essential to design an efficient interconnection architecture and improve performance. This paper presents a Star network designed for the ePUMA multi-core DSP processor based on an analysis of its network communication models. As part of ePUMA’s multi-layer interconnection network, the Star network handles core-to-off-chip-memory communication for kernel computing on the slave processors. The network has short setup latency, easy multiprocessor synchronization, rich memory addressing patterns, and power-efficient streaming data transfer. The improved network efficiency is evaluated in comparison with a previous study.

Place, publisher, year, edition, pages
IEEE, 2011
Keywords
digital signal processing chips;multiprocessor interconnection networks;chip multiprocessor;digital signal processing;ePUMA processor;kernel computing;memory addressing pattern;multicore parallel DSP processor;multilayer interconnection network;multiprocessor synchronization;off-chip memory communication;slave processor;streaming data transfer;streaming star network;Computational modeling;Digital signal processing;Kernel;Memory management;Process control;Vectors;DMA;DSP;multi-core;network-on-chip;streaming signal processing
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-100383 (URN), 10.1109/ICNC.2011.64 (DOI), 978-1-4577-1796-3 (ISBN)
Conference
Second International Conference on Networking and Computing (ICNC 2011), 30 November - 2 December 2011, Osaka, Japan
Projects
ePUMA
Available from: 2013-11-04 Created: 2013-11-04 Last updated: 2018-01-11. Bibliographically approved
Wang, J., Sohl, J. & Liu, D. (2010). Architectural Support for Reducing Parallel Processing Overhead in an Embedded Multiprocessor. In: EUC '10 Proceedings of the 2010 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing. Paper presented at 8th International Conference on Embedded and Ubiquitous Computing (EUC), 2010 IEEE/IFIP, 11-13 December, Hong Kong, China (pp. 47-52). Washington, DC, USA: IEEE Computer Society
Architectural Support for Reducing Parallel Processing Overhead in an Embedded Multiprocessor
2010 (English) In: EUC '10 Proceedings of the 2010 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, Washington, DC, USA: IEEE Computer Society, 2010, p. 47-52. Conference paper, Published paper (Refereed)
Abstract [en]

The host-multi-SIMD chip multiprocessor (CMP) architecture has proved to be an efficient architecture for high-performance signal processing, exploiting both task-level parallelism through multi-core processing and data-level parallelism through SIMD processors. Unlike the cache-based memory subsystem of most general-purpose processors, this architecture uses on-chip scratchpad memory (SPM) as the processor-local data buffer and allows software to explicitly control data movements in the memory hierarchy. This SPM-based solution is more efficient for predictable signal processing in embedded systems, where data access patterns are known at design time; the predictable performance is especially important for real-time signal processing. According to Amdahl’s law, the non-parallelizable part of an algorithm has a critical impact on the overall performance, and implementing an algorithm on a parallel platform usually produces control and communication overhead that is not parallelizable. This paper presents the architectural support in an embedded multiprocessor platform to maximally reduce this parallel processing overhead. The effectiveness of these architecture designs in boosting parallel performance is evaluated with an implementation example of 64x64 complex matrix multiplication. The result shows that the parallel processing overhead is reduced from 369% to 28%.
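A back-of-the-envelope calculation shows why this overhead reduction matters. Assuming, for illustration only, eight parallel cores and a simple model in which the non-parallelizable overhead adds a fraction on top of the ideal parallel compute time (the 369% and 28% figures come from the abstract above; everything else is an assumption):

    # Simple overhead-vs-speedup model (assumed): with P cores and an overhead fraction
    # 'ovh' added on top of the ideal parallel compute time, speedup = P / (1 + ovh).
    P = 8
    for ovh in (3.69, 0.28):                       # 369% vs. 28% overhead
        print(f"overhead {ovh:4.0%} -> speedup {P / (1 + ovh):.1f}x of ideal {P}x")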

Place, publisher, year, edition, pages
Washington, DC, USA: IEEE Computer Society, 2010
Keywords
Parallel DSP, Multiprocessor, Control overhead, Communication overhead, Matrix multiplication
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-66220 (URN), 10.1109/EUC.2010.17 (DOI), 978-0-7695-4322-2 (ISBN)
Conference
8th International Conference on Embedded and Ubiquitous Computing (EUC), 2010 IEEE/IFIP, 11-13 December, Hong Kong, China
Projects
ePUMA
Available from: 2011-03-17 Created: 2011-03-08 Last updated: 2011-04-12. Bibliographically approved
Wang, J., Sohl, J., Kraigher, O. & Liu, D. (2010). ePUMA: a novel embedded parallel DSP platform for predictable computing. Paper presented at the International Conference on Information and Electronics Engineering. Chengdu, China: Institute of Electrical and Electronics Engineers, Inc.
ePUMA: a novel embedded parallel DSP platform for predictable computing
2010 (English) Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, a novel parallel DSP platform based on a master-multi-SIMD architecture is introduced. The platform is named ePUMA [1]. The essential technique is to use separate data access kernels and algorithm kernels, minimizing the communication overhead of parallel processing by running the two types of kernels in parallel. The ePUMA platform is optimized for predictable computing; according to benchmarking results, the memory subsystem design, which relies on regular and predictable memory accesses, can dramatically improve performance. As the platform is scalable, the chip area is estimated for different numbers of co-processors. The aim of the ePUMA parallel platform is to achieve low-power, high-performance embedded parallel computing at low silicon cost for communications and similar signal processing applications.

Place, publisher, year, edition, pages
Chengdu, China: Institute of Electrical and Electronics Engineers, Inc., 2010
Keywords
chip multi-processor; SIMD architecture; multibank memory; conflict-free memory access; data permutation
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:liu:diva-58573 (URN), 10.1109/ICETC.2010.5529952 (DOI), 978-1-4244-6367-1 (ISBN)
Conference
International Conference on Information and Electronics Engineering
Projects
ePUMA
Note
©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Jian Wang, Joar Sohl, Olof Kraigher and Dake Liu, ePUMA: a novel embedded parallel DSP platform for predictable computing, 2010, International Conference on Information and Electronics Engineering, (5), 32-35. http://dx.doi.org/10.1109/ICETC.2010.5529952
Available from: 2010-08-16 Created: 2010-08-16 Last updated: 2010-08-16. Bibliographically approved
Liu, D., Sohl, J. & Wang, J. (2010). Parallel Programming and its architectures Based on data access separated algorithm Kernels. International Journal of Embedded and Real-Time Communication Systems, 1(1), 65-85
Parallel Programming and its architectures Based on data access separated algorithm Kernels
2010 (English) In: International Journal of Embedded and Real-Time Communication Systems, ISSN 1947-3176, E-ISSN 1947-3184, Vol. 1, no 1, p. 65-85. Article in journal (Refereed) Published
Abstract [en]

A novel master-multi-SIMD architecture and its kernel (template) based parallel programming flow are introduced as a parallel signal processing platform. The platform is named ePUMA (embedded Parallel DSP processor architecture with Unique Memory Access). The essential technique is to separate data accessing kernels from arithmetic computing kernels so that the run-time cost of data access can be minimized by running it in parallel with the algorithm computation. The SIMD memory subsystem architecture based on the proposed flow dramatically improves the total computing performance. The hardware system and programming flow introduced in this article primarily aim at low-power, high-performance embedded parallel computing with low silicon cost for communications and similar real-time signal processing. Copyright © 2010, IGI Global.

Place, publisher, year, edition, pages
IGI Global, 2010
Keywords
Conflict-Free Memory Access; EPUMA; Low Power; Memory Subsystem; Parallel DSP; Parallel Programming; Permutation; SIMD; Template-Based Programming; Vector Memory
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-100693 (URN), 10.4018/jertcs.2010103004 (DOI)
Available from: 2013-11-11 Created: 2013-11-11 Last updated: 2017-12-06
Wang, J., Sohl, J., Kraigher, O. & Liu, D. (2010). Software programmable data allocation in multi-bank memory of SIMD processors. In: Sebastian Lopez (Ed.), Proceedings of the 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools. Paper presented at 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, 1-3 September, Lille, France (pp. 28-33). Washington, DC, USA: IEEE Computer Society
Software programmable data allocation in multi-bank memory of SIMD processors
2010 (English) In: Proceedings of the 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools / [ed] Sebastian Lopez, Washington, DC, USA: IEEE Computer Society, 2010, p. 28-33. Conference paper, Published paper (Refereed)
Abstract [en]

The host-SIMD style heterogeneous multi-processor architecture offers high computing performance and user-friendly programmability. It exploits both task-level parallelism and data-level parallelism through multiple on-chip SIMD coprocessors. For embedded DSP applications with predictable computing behavior, this architecture can be further optimized for performance, implementation cost, and power consumption by improving SIMD processing efficiency and reducing redundant memory accesses and data shuffle operations. This paper introduces one effective approach: a software-programmable multi-bank memory system for SIMD processors. Both the hardware architecture and the software programming model are described, with an implementation example of the BLAS syrk routine. The proposed memory system offers high SIMD data access flexibility by using lookup-table based address generators and applying data permutations on both the DMA controller interface and SIMD data accesses. The evaluation results show that a SIMD processor with this memory system can achieve high execution efficiency, with only 10% to 30% overhead. The proposed memory system also reduces the implementation cost of the SIMD local registers; in our system, each SIMD core has only 8 128-bit vector registers.
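As a reminder of what the benchmark kernel computes, a plain NumPy reference model of the BLAS syrk routine (lower-triangular variant, chosen arbitrarily for illustration; it has no connection to the ePUMA multi-bank implementation) is:

    # Plain reference model of the BLAS syrk routine (lower-triangular variant):
    # C := alpha * A @ A.T + beta * C, updating only the lower triangle of C.
    # Purely illustrative; the paper implements the routine on the multi-bank
    # vector memory described above.
    import numpy as np

    def syrk_lower(alpha, A, beta, C):
        full = alpha * (A @ A.T) + beta * C
        lower = np.tril(full)                # keep the updated lower triangle
        upper = np.triu(C, 1)                # leave the strict upper triangle untouched
        return lower + upper

    A = np.arange(12.0).reshape(4, 3)
    C = np.ones((4, 4))
    print(syrk_lower(2.0, A, 1.0, C))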

Place, publisher, year, edition, pages
Washington, DC, USA: IEEE Computer Society, 2010
Keywords
SIMD processor, multi-bank memory, conflict-free memory access, data allocation
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-66219 (URN), 10.1109/DSD.2010.26 (DOI), 978-0-7695-4171-6 (ISBN)
Conference
13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, 1-3 September, Lille, France
Projects
ePUMA
Available from: 2011-03-17 Created: 2011-03-08 Last updated: 2011-04-12. Bibliographically approved