Performance and Energy Efficient Building Blocks for Network-on-Chip Architectures
Linköping University, Department of Electrical Engineering. Linköping University, The Institute of Technology.
2006 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The ever-shrinking size of MOS transistors brings the promise of scalable Network-on-Chip (NoC) architectures containing hundreds of processing elements with on-chip communication, all integrated into a single die. Such a computational fabric will provide high levels of performance in an energy-efficient manner. To mitigate the emerging wire-delay problem and to address the need for substantial interconnect bandwidth, packet-switched routers are fast replacing shared buses and dedicated wires as the interconnect fabric of choice. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as 3D graphics and signal processing, creates stronger demand for self-contained, low-latency floating-point processors with increased throughput. Therefore, this work focuses on two key building blocks critical to the success of NoC design: high-performance, area- and energy-efficient router and floating-point processor architectures.

This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current packet destination. This enables a 45% reduction in crossbar channel area, a 23% reduction in overall router area, up to a 3.8X reduction in peak channel power, and a 7.2% improvement in average channel power, with no performance penalty over a published design. In a 150 nm six-metal CMOS process, the 12.2 mm² router contains 1.9 million transistors and operates at 1 GHz at 1.2 V. We next present a new pipelined single-precision floating-point multiply-accumulator core (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. Combined algorithmic, logic and circuit techniques enable multiply-accumulates at speeds exceeding 3 GHz, with single-cycle throughput. Unlike existing FPMAC architectures, the design eliminates scheduling restrictions between consecutive FPMAC instructions. The optimizations allow removal of the costly normalization step from the critical accumulate loop. The normalization logic is conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading zero anticipator (LZA) and overflow detection logic applicable to carry-save format are presented. In a 90 nm seven-metal dual-VT CMOS process, the 2 mm² custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFLOPS of performance while dissipating 1.2 W at 3.1 GHz, 1.3 V supply.

It is clear that the realization of successful NoC designs requires well-balanced decisions at all levels: architecture, logic, circuit and physical design. Our results from key building blocks demonstrate the feasibility of pushing the performance limits of compute cores and communication routers, while keeping active and leakage power, and area, under control.
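The FPMAC figures quoted in the abstract (6.2 GFLOPS, 3.1 GHz, 1.2 W) can be related to one another with a little arithmetic. The sketch below is only an illustration built from those three numbers; the function names are hypothetical, and reading the result as "one multiply plus one add per cycle" is an assumption based on the common convention of counting a multiply-accumulate as two floating-point operations, not a statement taken from the thesis.

```python
# Figures quoted in the abstract for the FPMAC first silicon.
PEAK_FLOPS = 6.2e9   # 6.2 GFLOPS
CLOCK_HZ = 3.1e9     # 3.1 GHz
POWER_W = 1.2        # 1.2 W at 1.3 V supply


def flops_per_cycle(flops: float, clock_hz: float) -> float:
    """Floating-point operations completed per clock cycle."""
    return flops / clock_hz


def energy_per_flop_pj(power_w: float, flops: float) -> float:
    """Average energy per floating-point operation, in picojoules."""
    return power_w / flops * 1e12


def gflops_per_watt(flops: float, power_w: float) -> float:
    """Throughput per unit power, in GFLOPS/W."""
    return flops / power_w / 1e9


if __name__ == "__main__":
    print(flops_per_cycle(PEAK_FLOPS, CLOCK_HZ))
    print(energy_per_flop_pj(PEAK_FLOPS and POWER_W, PEAK_FLOPS))
    print(gflops_per_watt(PEAK_FLOPS, POWER_W))
```

Under these assumptions the quoted numbers work out to two operations per cycle, which is consistent with single-cycle multiply-accumulate throughput, at roughly 194 pJ per operation or about 5.2 GFLOPS/W.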

Place, publisher, year, edition, pages
Institutionen för systemteknik, 2006. 51 p.
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 1255
Keyword [en]
Network-on-Chip Architectures, floating-point units, tiled-architectures, crossbar routers, multi-processor interconnection
National Category
Computer Engineering
Identifiers
URN: urn:nbn:se:liu:diva-7845; ISBN: 91-85523-54-2 (print); OAI: oai:DiVA.org:liu-7845; DiVA: diva2:22785
Presentation
2006-05-30, Glashuset, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Note
Report code: LiU-TEK-LIC-2006:36. Available from: 2007-01-31 Created: 2007-01-31 Last updated: 2009-06-08
List of papers
1. A Six-Port 57GB/s Double-Pumped Non-blocking Router Core
2005 (English). In: Symposium on VLSI Circuits, Digest of Technical Papers, June 16-18, 2005, 268-269 p. Conference paper, Published paper (Refereed)
Abstract [en]

A six-port four-lane 57 GB/s router core features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current flit destination. This enables a 45% reduction in channel area, a 23% reduction in overall chip area, and up to a 3.8× reduction in peak channel power, depending on router traffic patterns. In a 150 nm six-metal process, the 12.2 mm² core contains 1.9 million transistors and operates at 1 GHz at 1.2 V.

Keyword
crossbar, router, low-power driver
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13108 (URN); 10.1109/VLSIC.2005.1469383 (DOI)
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2009-06-08
2. A 6.2 GFLOPS Floating Point Multiply-Accumulator with Conditional Normalization
2006 (English). In: IEEE Journal of Solid-State Circuits, ISSN 0018-9200, Vol. 41, no 10, 2314-2323 p. Article in journal (Refereed) Published
Abstract [en]

A pipelined single-precision floating-point multiply-accumulator (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic with delayed addition is described. A combination of algorithmic, logic, and circuit techniques enables multiply-accumulate operations at speeds exceeding 3 GHz with single-cycle throughput. The optimizations allow removal of the costly normalization step from the critical accumulate loop. This logic is conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading-zero anticipator (LZA) and overflow prediction logic applicable to carry-save format are presented. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm² custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFLOPS of performance while dissipating 1.2 W at 3.1 GHz, 1.3-V supply.
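The general idea behind carry-save accumulation with deferred carry propagation can be shown on plain unsigned integers. The sketch below is not the FPMAC datapath described in the paper: it ignores floating-point exponents, the base-32 representation, normalization and overflow handling, and the function names are hypothetical. It only shows that a running total can be kept as a separate sum word and carry word, so each accumulate step needs bitwise logic rather than a full carry-propagate addition, which is performed once at the end.

```python
from typing import Iterable, Tuple


def carry_save_add(s: int, c: int, x: int) -> Tuple[int, int]:
    """3:2 compression: fold x into the (sum, carry) pair.

    Invariant: new_s + new_c == s + c + x. No carry ripples across
    bit positions here; each output bit depends on three input bits.
    """
    new_s = s ^ c ^ x                            # bitwise sum without carries
    new_c = ((s & c) | (s & x) | (c & x)) << 1   # majority bits, shifted left
    return new_s, new_c


def accumulate(values: Iterable[int]) -> int:
    """Accumulate non-negative integers in carry-save form."""
    s, c = 0, 0
    for x in values:
        s, c = carry_save_add(s, c, x)
    # The only carry-propagate addition, deferred until the loop is done.
    return s + c


if __name__ == "__main__":
    print(accumulate([3, 5, 7, 11]))
```

In hardware terms, the loop body corresponds to a row of full adders with no carry chain, which is the property that keeps such an accumulate loop short; the final `s + c` stands in for the slower conventional adder that is kept off the critical loop.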

Keyword
CMOS logic circuits, adders, floating point arithmetic, multiplying circuits, 1.2 W, 1.3 V, 90 nm, CMOS digital integrated circuits, algorithmic technique, carry-save arithmetic, circuit technique, conditional normalization, delayed addition, floating-point multiply-accumulator, leading-zero anticipator, logic technique, overflow prediction logic
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13110 (URN); 10.1109/JSSC.2006.881557 (DOI)
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2009-06-08
3. A 5 GHz floating point multiply-accumulator in 90 nm dual VT CMOS
2003 (English). In: IEEE International Solid-State Circuits Conference, Digest of Technical Papers, IEEE, 2003, Vol. 1, 334-335 p. Conference paper, Published paper (Refereed)
Abstract [en]

A 32 b single-cycle floating point accumulator that uses base 32 and carry-save format with delayed addition is described. Combined algorithmic, logic and circuit techniques enable multiply-accumulate operation at 5 GHz. In a 90 nm 7M dual-VT CMOS process, the 2 mm² prototype contains 230K transistors and dissipates 1.2 W at 5 GHz, 1.2 V and 25°C.

Place, publisher, year, edition, pages
IEEE, 2003
Series
Digest of Technical Papers, ISSN 0193-6530 ; Vol. 1
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13109 (URN); 10.1109/ISSCC.2003.1234322 (DOI); 0-7803-7707-9 (ISBN)
Conference
2003 IEEE International Solid-State Circuits Conference, Digest of Technical Papers. ISSCC
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2011-03-02 Bibliographically approved
