liu.se: Search for publications in DiVA
1 - 15 of 15
  • 1.
    Gebrewahid, Essayas
    et al.
    Högskolan i Halmstad, Akademin för informationsteknologi, Halmstad Embedded and Intelligent Systems Research (EIS), Centrum för forskning om inbyggda system (CERES), Sweden.
    Ali Arslan, Mehmet
    Lund University, Sweden.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Ul-Abdin, Zain
    Högskolan i Halmstad, Akademin för informationsteknologi, Halmstad Embedded and Intelligent Systems Research (EIS), Centrum för forskning om inbyggda system (CERES), Sweden.
    Support for Data Parallelism in the CAL Actor Language. 2016. In: Proceedings of the 2016 3rd Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2016), New York, NY: Association for Computing Machinery (ACM), 2016, pp. 1-8. Conference paper (Refereed)
    Abstract [en]

    With the arrival of heterogeneous manycores comprising various features to support task, data and instruction-level parallelism, developing applications that take full advantage of the hardware parallel features has become a major challenge. In this paper, we present an extension to our CAL compilation framework (CAL2Many) that supports data parallelism in the CAL Actor Language. Our compilation framework makes it possible to program architectures with SIMD support using a high-level language and provides efficient code generation. We support general SIMD instructions but the code generation backend is currently implemented for two custom architectures, namely ePUMA and EIT. Our experiments were carried out for two custom SIMD processor architectures using two applications.

    The experiments show the possibility of achieving performance comparable to hand-written machine code with much less programming effort.

  • 2.
    Karlsson, Andreas
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Cost-efficient Mapping of 3- and 5-point DFTs to General Baseband Processors. 2015. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 780-784. Conference paper (Refereed)
    Abstract [en]

    Discrete Fourier transforms of 3 and 5 points are essential building blocks in FFT implementations for standards such as 3GPP-LTE. In addition to being more complex than 2- and 4-point DFTs, these DFTs also cause problems with data access in SDR-DSPs, since the data access width, in general, is a power of 2. This work derives mappings of these DFTs to a 4-way SIMD datapath that has been designed with 2- and 4-point DFTs in mind. Our instruction set proposals, based on a modified Winograd DFT, achieve single-cycle execution of 3-point DFTs and 2.25-cycle average execution of 5-point DFTs in a cost-effective manner by reutilizing the already available arithmetic units. This represents an approximate speed-up of 3 times compared to an SDR-DSP with only MAC support. In contrast to our more general design, we also demonstrate that a typical single-purpose FFT-specialized 5-way architecture only delivers 9% to 25% extra performance on average, while requiring 85% more arithmetic units and a more expensive memory subsystem.
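
    For readers unfamiliar with the Winograd approach mentioned in the abstract, the following is a minimal sketch of the textbook 3-point Winograd DFT. It is not the modified variant or the instruction mapping from the paper; the function names `dft3_winograd` and `dft_naive` are illustrative assumptions. The point is that the three outputs need only two constant multiplications, with the rest being additions.

```python
import cmath
import math


def dft3_winograd(x0, x1, x2):
    """Textbook 3-point Winograd DFT (forward transform, e^{-j2*pi*jk/3}).

    Illustrative sketch only; not the paper's modified mapping.
    """
    u = 2.0 * math.pi / 3.0
    c = math.cos(u) - 1.0      # real constant, equals -1.5
    s = 1j * math.sin(u)       # purely imaginary constant

    t1 = x1 + x2               # shared sum
    m0 = x0 + t1               # DC term
    m1 = c * t1                # first constant multiplication
    m2 = s * (x2 - x1)         # second constant multiplication
    s1 = m0 + m1
    return m0, s1 + m2, s1 - m2


def dft_naive(x):
    """Direct O(n^2) DFT used only as a reference for checking."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


if __name__ == "__main__":
    sample = [1 + 2j, -0.5 + 0.25j, 3 - 1j]
    print(dft3_winograd(*sample))
    print(dft_naive(sample))
```

    Comparing the two printed results on arbitrary inputs is a quick way to confirm that the factored form matches the direct definition.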

  • 3.
    Karlsson, Andreas
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Energy-efficient sorting with the distributed memory architecture ePUMA. 2015. In: IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 116-123. Conference paper (Refereed)
    Abstract [en]

    This paper presents the novel heterogeneous DSP architecture ePUMA and demonstrates its features through an implementation of sorting of larger data sets. We derive a sorting algorithm with fixed-size merging tasks suitable for distributed memory architectures, which allows very simple scheduling and predictable data-independent sorting time. The implementation on ePUMA utilizes the architecture's specialized compute cores and control cores, and local memory parallelism, to separate and overlap sorting with data access and control for close to stall-free sorting. Penalty-free unaligned and out-of-order local memory access is used in combination with proposed application-specific sorting instructions to derive highly efficient local sorting and merging kernels used by the system-level algorithm. Our evaluation shows that the proposed implementation can rival the sorting performance of high-performance commercial CPUs and GPUs, with two orders of magnitude higher energy efficiency, which would allow high-performance sorting on low-power devices.
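
    The idea of building a merge out of fixed-size tasks can be illustrated with a well-known block-wise streaming merge. The sketch below is an assumption about how such tasks might look, not the algorithm derived in the paper: every step merges exactly 2k elements, emits the k smallest, and keeps the k largest for the next step. The name `merge_fixed_blocks` is hypothetical, and Python's `sorted` stands in for a fixed-size hardware merge kernel.

```python
def merge_fixed_blocks(a, b, k):
    """Merge two sorted lists using only fixed-size (2k-element) merge steps.

    Both inputs must be non-empty, sorted, and a multiple of k in length.
    Illustrative sketch; sorted() stands in for a fixed-size merge kernel.
    """
    if not a or not b or len(a) % k or len(b) % k:
        raise ValueError("inputs must be non-empty multiples of k")

    out = []
    ia, ib = k, k                      # next unread block in each input
    merged = sorted(a[:k] + b[:k])     # first fixed-size merge task
    out.extend(merged[:k])
    keep = merged[k:]                  # k largest stay "in hand"

    while ia < len(a) or ib < len(b):
        # Load the next block from the input whose head element is smaller.
        if ib >= len(b) or (ia < len(a) and a[ia] <= b[ib]):
            block = a[ia:ia + k]
            ia += k
        else:
            block = b[ib:ib + k]
            ib += k
        merged = sorted(keep + block)  # another task of identical size
        out.extend(merged[:k])
        keep = merged[k:]

    out.extend(keep)
    return out


if __name__ == "__main__":
    left = [1, 4, 6, 9, 12, 15, 20, 22]
    right = [2, 3, 5, 7]
    print(merge_fixed_blocks(left, right, 4))
```

    Because every step handles the same number of elements, the number of steps depends only on the input lengths, which is one way to obtain the data-independent timing the abstract describes.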

  • 4.
    Karlsson, Andreas
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    ePUMA: A Processor Architecture for Future DSP. 2015. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015, 2015, pp. 253-257. Conference paper (Refereed)
    Abstract [en]

    Since the breakdown of Dennard scaling, the primary design goal for processor designs has shifted from increasing performance to increasing performance per watt. The ePUMA platform is a flexible and configurable DSP platform that tries to address many of the problems with traditional DSP designs, to increase performance but use less power. We trade the flexibility of traditional VLIW DSP designs for a simpler single instruction issue scheme and instead make sure that each instruction can perform more work. Multi-cycle instructions can operate directly on vectors and matrices in memory and the datapaths implement common DSP subgraphs directly in hardware, for high compute throughput. Memory bottlenecks, which are common in other architectures, are handled with flexible LUT-based multi-bank memory addressing and memory parallelism. A major contributor to energy consumption, data movement, is reduced by using heterogeneous interconnect and clustering compute resources around local memories for simple data sharing. To evaluate ePUMA we have implemented the majority of the kernel library from a commercial VLIW DSP manufacturer for comparison. Our results not only show good performance, but also an order of magnitude increase in energy and area efficiency. In addition, the kernel code size is reduced by 91% on average compared to the VLIW DSP. These benefits make ePUMA an attractive solution for future DSP.

  • 5.
    Karlsson, Andreas
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Software-based QPP Interleaving for Baseband DSPs with LUT-accelerated Addressing. 2015. In: International Conference on Digital Signal Processing (DSP), Singapore, 21-24 July, 2015, Institute of Electrical and Electronics Engineers (IEEE), 2015. Conference paper (Refereed)
    Abstract [en]

    This paper demonstrates how QPP interleaving and de-interleaving for Turbo decoding in 3GPP-LTE can be implemented efficiently on baseband processors with lookup-table (LUT) based addressing support of multi-bank memory. We introduce a LUT-compression technique that reduces LUT size to 1% of what would otherwise be needed to store the full data access patterns for all LTE block sizes. By reusing the already existing program memory of a baseband processor to store LUTs and using our proposed general address generator, our 8-way data access path can reach the same throughput as a dedicated 8-way interleaving ASIC implementation. This avoids the addition of a dedicated interleaving address generator to a processor which, according to ASIC synthesis, would be 75% larger than our proposed address generator. Since our software implementation only involves the address generator, the processor's datapaths are free to perform the other operations of Turbo decoding in parallel with interleaving. Our software implementation ensures programmability and flexibility and is the fastest software-based implementation of QPP interleaving known to us.
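
    As background for the abstract above, the quadratic permutation polynomial (QPP) interleaver used in LTE Turbo coding maps index i to (f1·i + f2·i²) mod K. The sketch below computes that address sequence directly and with an add-only recurrence. It says nothing about the paper's LUT compression or its address generator; the function names are hypothetical, and the parameters K = 40, f1 = 3, f2 = 10 are used here on the assumption that they match the smallest LTE block size.

```python
def qpp_direct(k, f1, f2):
    """QPP interleaver addresses from the defining polynomial."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]


def qpp_incremental(k, f1, f2):
    """Same address sequence using only modular additions.

    Uses pi(i+1) - pi(i) = f1 + f2 * (2*i + 1) (mod k), so the step
    itself grows by 2*f2 on every iteration.
    """
    addr = 0
    step = (f1 + f2) % k
    out = []
    for _ in range(k):
        out.append(addr)
        addr = (addr + step) % k
        step = (step + 2 * f2) % k
    return out


if __name__ == "__main__":
    pattern = qpp_direct(40, 3, 10)
    print(pattern[:8])
    print(pattern == qpp_incremental(40, 3, 10))
    print(sorted(pattern) == list(range(40)))
```

    The recurrence form shows why the pattern is cheap to generate on the fly; storing it in a table instead trades that computation for memory, which is the cost a LUT-based scheme has to manage.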

  • 6.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska fakulteten.
    Design of Energy-Efficient High-Performance ASIP-DSP Platforms. 2016. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    In the last ten years, limited clock frequency scaling and increasing power density have shifted IC design focus towards parallelism, heterogeneity and energy efficiency. Improving energy efficiency is by no means simple and it calls for a reevaluation of old design choices in processor architecture, and perhaps more importantly, development of new programming methodologies that exploit the features of modern architectures.

    This thesis discusses the design of energy-efficient digital signal processors with application-specific instruction sets, so-called ASIP-DSPs, and their programming tools. Target applications for such processors include, but are not limited to, communications, multimedia, image processing, intelligent vision and radar. These applications are often implemented by a limited set of kernel algorithms, whose performance and efficiency are critical to the application's success. At the same time, the extreme non-recurring engineering cost of system-on-chip designs means that product life-time must be kept as long as possible. Neither general-purpose processors nor non-programmable ASICs can meet both the flexibility and efficiency requirements, and ASIPs may instead be the best trade-off between all the conflicting goals.

    Traditional superscalar- and VLIW processor design focus has been to improve the throughput of fine-grained instructions, which results in high flexibility, but also high energy consumption. SIMD architectures, on the other hand, are often restricted by inefficient data access. The result is architectures which spend more energy and/or time on supporting operations rather than actual computing.

    This thesis defines the performance limit of an architecture with an N-way parallel datapath as consuming 2N elements of compute data per clock cycle. To approach this performance, this work proposes coarse-grained higher-order functional (HOF) instructions, which encode the most frequently executed compute, data access and control sequences into single many-cycle instructions, to reduce the overheads of instruction delivery, while at the same time maintaining orthogonality. The work further investigates opportunities for operation fusion to improve computing performance, and proposes a flexible memory subsystem for conflict-free parallel memory access with permutation and lookup-table-based addressing, to ensure that high computing throughput can be sustained even in the presence of irregular data access patterns. These concepts are extensively studied by implementing a large kernel algorithm library with typical DSP kernels, to prove their effectiveness and adequacy. Compared to contemporary VLIW DSP solutions, our solution can practically eliminate instruction fetching energy in many scenarios, significantly reduce control path switching, simplify the implementation of kernels and reduce code size, sometimes by as much as 30 times.

    The techniques proposed in this thesis have been implemented in the DSP platform ePUMA (embedded Parallel DSP processor with Unique Memory Access), a configurable control-compute heterogeneous platform with distributed memory, optimized for low-power predictable DSP computing. Hardware evaluation has been done with FPGA prototypes. In addition, several VLSI layouts have been created for energy and area estimations. This includes smaller designs, as well as a large design with 73 cores, capable of 1280 integer GOPS or 256 GFLOPS at 500 MHz and which measures 45 mm² in 28 nm FD-SOI technology.

    In addition to the hardware design, this thesis also discusses parallel programming flow for distributed memory architectures and ePUMA application implementation. A DSP kernel programming language and its compiler are presented. This effectively demonstrates how kernels written in a high-level language can be translated into HOF instructions for very high processing efficiency.

  • 7.
    Karlsson, Andréas
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Wang, Jian
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    ePUMA: A unique memory access based parallel DSP processor for SDR and CR. 2013. In: Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, IEEE, 2013, pp. 1234-1237. Conference paper (Refereed)
    Abstract [en]

    This paper presents ePUMA, a master-slave heterogeneous DSP processor for communications and multimedia. We introduce the ePUMA VPE, a vector processing slave-core designed for heavy DSP workloads, and demonstrate how its features can be used to implement DSP kernels that efficiently overlap computing, data access and control to achieve maximum datapath utilization. The efficiency is evaluated by implementing a basic set of kernels commonly used in SDR. The experiments show that all kernels asymptotically reach above 90% effective datapath utilization, while many approach 100%; thus the design effectively overlaps computing, data access and control. Compared to popular VLIW solutions, the need for a large register file with many ports is eliminated, thus saving power and chip area. When compared to a commercial VLIW solution, our solution also achieves code size reductions of up to 30 times and a significantly simplified kernel implementation.

  • 8.
    Liu, Dake
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Wang, Jian
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Petersson, Magnus
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Zhou, Wenbiao
    Beijing Institute of Technologies, China.
    ePUMA embedded parallel DSP processor with Unique Memory Access. 2011. In: Information, Communications and Signal Processing (ICICS), 2011, IEEE, 2011, pp. 1-5. Conference paper (Refereed)
    Abstract [en]

    Computing up to 100 GOPS without cooling is essential for high-end embedded systems and much required by markets. A novel master-slave multi-SIMD architecture and its kernel (template) based parallel programming flow are thus introduced as a parallel signal processing platform, ePUMA, embedded Parallel DSP processor with Unique Memory Access. It is an on-chip multi-DSP-processor (CMP) targeting predictable signal processing for communications and multimedia. The essential technologies are to separate the processing of the control stream from parallel computing, and to separate parallel data access from parallel arithmetic computing kernels. Through these separations, the computation and data access can be orthogonal both in hardware and in programs. Orthogonal operations can therefore be executed in parallel and the run-time cost of data access can be minimized. Benchmarks show that the computing performance therefore reaches about 80% of the hardware limit. Less than 40% of the hardware limit can be reached by normal processors. The unique SIMD memory subsystem architecture offers programmable conflict-free parallel data accesses. Programming flow and tools are also developed to support coding on the unique hardware architecture. A prototype on FPGA shows especially high performance over silicon cost.

  • 9.
    Ogniewski, Jens
    et al.
    Linköpings universitet, Institutionen för systemteknik, Informationskodning. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Ragnemalm, Ingemar
    Linköpings universitet, Institutionen för systemteknik, Informationskodning. Linköpings universitet, Tekniska högskolan.
    Texture Compression in Memory and Performance-Constrained Embedded Systems. 2011. In: Computer Graphics, Visualization, Computer Vision and Image Processing 2011 / [ed] Yingcai Xiao, 2011, pp. 19-26. Conference paper (Refereed)
    Abstract [en]

    More embedded systems gain increasing multimedia capabilities, including computer graphics. Although this is mainly due to their increasing computational capability, optimizations of algorithms and data structures are important as well, since these systems have to fulfill a variety of constraints and cannot be geared solely towards performance. In this paper, the two most popular texture compression methods (DXT1 and PVRTC) are compared in both image quality and decoding performance aspects. For this, both have been ported to the ePUMA platform which is used as an example of energy consumption optimized embedded systems. Furthermore, a new DXT1 encoder has been developed which reaches higher image quality than existing encoders.

  • 10.
    Ragnemalm, Ingemar
    et al.
    Linköpings universitet, Institutionen för systemteknik, Informationskodning. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Computing The Euclidean Distance Transform on the ePUMA Parallel Hardware. 2011. In: Computer Graphics, Visualization, Computer Vision and Image Processing 2011 / [ed] Yingcai Xiao, 2011, pp. 228-232. Conference paper (Refereed)
    Abstract [en]

    The ePUMA architecture is a novel parallel architecture being developed as a platform for low-power computing, typically for embedded or hand-held devices. As part of the exploration of the platform, we have implemented the Euclidean Distance Transform. We outline the ePUMA architecture and describe how the algorithm was implemented.

  • 11.
    Sohl, Joar
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Conflict-free data access for multi-bank memory architectures using padding. 2013. In: High Performance Computing (HiPC), 2013, IEEE, 2013, pp. 425-432. Conference paper (Refereed)
    Abstract [en]

    For high-performance computation, memory access is a major issue. Whether it is a supercomputer, a GPGPU device, or an Application Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on the memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that from a computational point of view can be implemented efficiently on parallel architectures fail to achieve significant speed-ups. The reason is very often that the available execution units are poorly utilized due to inefficient data access. This paper shows a method for improving the access time for sequences of data that are completely static, at the cost of extra memory. This is done by resolving memory conflicts by using padding. The method can be automatically applied and it is shown to significantly reduce the data access time for sorting and FFTs. The execution time for the FFT is improved by a factor of up to 3.4 and for sorting by a factor of up to 8.
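
    The effect of padding on bank conflicts can be seen in a small model. The sketch below is the classic row-padding example, not the automatic method from the paper. It assumes low-order interleaving (bank = address mod number of banks) and that each bank serves one address per cycle; the helper names `bank_conflict_cycles` and `column_addresses` are hypothetical.

```python
from collections import Counter


def bank_conflict_cycles(addresses, num_banks):
    """Cycles to serve one parallel access when each bank handles one
    address per cycle (assumed model: bank = address mod num_banks)."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    return max(per_bank.values())


def column_addresses(col, rows, row_pitch):
    """Addresses of one matrix column stored row-major with a given pitch."""
    return [row * row_pitch + col for row in range(rows)]


if __name__ == "__main__":
    banks = 8
    # 8x8 matrix, no padding: a whole column lands in a single bank.
    print(bank_conflict_cycles(column_addresses(0, 8, 8), banks))
    # One padding word per row: the column spreads over all banks.
    print(bank_conflict_cycles(column_addresses(0, 8, 9), banks))
    # Row access stays conflict-free with the padded pitch.
    print(bank_conflict_cycles([3 * 9 + c for c in range(8)], banks))
```

    In this model the unpadded column access serializes completely, while a pitch of 9 words makes both row and column accesses conflict-free, at the cost of one unused word per row.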

  • 12.
    Sohl, Joar
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Wang, Jian
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Automatic Permutation for Arbitrary Static Access Patterns. 2012. In: Parallel and Distributed Processing with Applications (ISPA), 2012, IEEE, 2012, pp. 215-222. Conference paper (Refereed)
    Abstract [en]

    A significant portion of the execution time on current SIMD and VLIW processors is spent on data access rather than instructions that perform actual computations. The ePUMA architecture provides features that allow arbitrary data elements to be accessed in parallel as long as the elements reside in different memory banks. Using permutation to move data elements that are accessed in parallel, the overhead from memory access can be greatly reduced and, in many cases, completely removed. This paper presents a practical method for automatic permutation based on Integer Linear Programming (ILP). No assumptions are made about the structure of the access patterns other than their static nature. Methods for speeding up the solution time for periodic access patterns and reusing existing solutions are also presented. Benchmarks for e.g. FFTs show speedups of up to 3.4 when using permutation compared to regular implementations.

  • 13.
    Wang, Jian
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Convolutional Decoding on Deep-pipelined SIMD Processor with Flexible Parallel Memory. 2012. In: Digital System Design (DSD), 2012, IEEE, 2012, pp. 529-532. Conference paper (Refereed)
    Abstract [en]

    Single Instruction Multiple Data (SIMD) architecture has been proven to be a suitable parallel processor architecture for media and communication signal processing. However, computing overheads such as memory access latency and vector data permutation limit the performance of conventional SIMD processors. Solutions such as combined VLIW and SIMD architectures come with increased complexity for compiler design and assembly programming. This paper introduces the SIMD processor in the ePUMA platform, which uses a deep execution pipeline and flexible parallel memory to achieve high computing performance. Its deep pipeline can execute combined operations in one cycle, and the parallel memory architecture supports conflict-free parallel data access. It solves the problem of large vector permutation in a short-vector SIMD machine in a more efficient way than conventional vector permutation instructions. We evaluate the architecture by implementing the soft-decision Viterbi algorithm for convolutional decoding. The result is compared with other architectures, including TI C54x, CEVA TeakLite III, and PowerPC AltiVec, to show ePUMA's computing efficiency advantage.

  • 14.
    Wang, Jian
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Pettersson, Magnus
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    A multi-level arbitration and topology free streaming network for chip multiprocessor. 2011. In: ASIC (ASICON), 2011, IEEE, 2011, pp. 153-158. Conference paper (Refereed)
    Abstract [en]

    Predictable computing is common in embedded signal processing, which has communication characteristics of data-independent memory access and long streaming data transfer. This paper presents a streaming network-on-chip (NoC), StreamNet, for chip multiprocessor (CMP) platforms targeting predictable signal processing. The network is based on circuit switching and uses a two-level arbitration scheme. The first level uses fast hardware arbitration, and the second level is programmable software arbitration. Its communication protocol is designed to support free choice of network topology. Together with its scheduling tool, the network can achieve high communication efficiency and improve parallel computing performance. This NoC architecture is used to design the Ring network in the ePUMA multiprocessor DSP. The evaluation with a multi-user signal processing application at the LTE base station shows the low parallel computing overhead on the ePUMA multiprocessor platform.

  • 15.
    Wang, Jian
    et al.
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Sohl, Joar
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Karlsson, Andréas
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    Liu, Dake
    Linköpings universitet, Institutionen för systemteknik, Datorteknik. Linköpings universitet, Tekniska högskolan.
    An Efficient Streaming Star Network for Multi-core Parallel DSP Processor. 2011. In: Networking and Computing (ICNC), 2011, IEEE, 2011, pp. 332-336. Conference paper (Refereed)
    Abstract [en]

    As more and more computing components are integrated into one digital signal processing (DSP) system to achieve high computing power by executing tasks in parallel, it is soon observed that the inter-processor and processor-to-memory communication overheads become the performance bottleneck and limit the scalability of a multi-processor platform. For chip multiprocessor (CMP) DSP systems targeting predictable computing, an appreciation of the communication characteristics is essential to design an efficient interconnection architecture and improve performance. This paper presents a Star network designed for the ePUMA multi-core DSP processor based on analysis of the network communication models. As part of ePUMA's multi-layer interconnection network, the Star network handles core to off-chip memory communications for kernel computing on slave processors. The network has short setup latency, easy multiprocessor synchronization, rich memory addressing patterns, and power-efficient streaming data transfer. The improved network efficiency is evaluated in comparison with a previous study.
