Performance and Energy Efficient Network-on-Chip Architectures
Linköping University, Department of Electrical Engineering. Linköping University, The Institute of Technology.
2007 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The scaling of MOS transistors into the nanometer regime opens the possibility of creating large Network-on-Chip (NoC) architectures containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. NoCs mitigate the emerging wire-delay problem and address the need for substantial interconnect bandwidth by replacing today's shared buses with packet-switched router networks. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. This work demonstrates that a computational fabric built using optimized building blocks can provide high levels of performance in an energy-efficient manner. The thesis details an integrated 80-tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0 TFLOPS of performance while dissipating less than 100 W.

This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current packet destination. This enables a 45% reduction in crossbar channel area, a 23% reduction in overall router area, up to a 3.8× reduction in peak channel power, and a 7.2% improvement in average channel power. In a 150-nm six-metal CMOS process, the 12.2 mm² router contains 1.9 million transistors and operates at 1 GHz at a 1.2 V supply.

We next describe a new pipelined single-precision floating-point multiply-accumulator core (FPMAC) featuring a single-cycle accumulation loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. A combination of algorithmic, logic and circuit techniques enables multiply-accumulate operations at speeds exceeding 3 GHz, with single-cycle throughput. This approach reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPs) every cycle. The optimizations allow removal of the costly normalization step from the critical accumulation loop. This logic is conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm² custom design contains 230K transistors. Silicon achieves 6.2 GFLOPS of performance while dissipating 1.2 W at 3.1 GHz and a 1.3 V supply.
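As a rough illustration of carry-save accumulation with delayed addition, the sketch below keeps the running total as a (sum, carry) pair and performs only one carry-propagating addition at the very end. It works on plain unsigned integers; the function names and the 64-bit width are assumptions made for this example, and it does not model the base-32 exponent handling, floating-point alignment, or conditional normalization of the FPMAC itself.

```python
def carry_save_step(s, c, x, width=64):
    """Fold x into the (sum, carry) pair with a bitwise 3:2 compressor.

    No carry ripples across bit positions here, which is what keeps the
    accumulation loop short in hardware.
    """
    mask = (1 << width) - 1
    new_s = (s ^ c ^ x) & mask                              # per-bit sum
    new_c = (((s & c) | (s & x) | (c & x)) << 1) & mask     # per-bit carry, shifted left
    return new_s, new_c


def accumulate(values, width=64):
    """Accumulate values in carry-save form, then resolve once (delayed addition)."""
    s, c = 0, 0
    for x in values:
        s, c = carry_save_step(s, c, x, width)
    # The only carry-propagating addition happens outside the loop.
    return (s + c) & ((1 << width) - 1)


total = accumulate([3, 5, 7, 11])
```

The invariant is that `s + c` always equals the running total modulo `2**width`, so the final addition can be postponed until the result is actually needed.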

We finally present the industry's first single-chip programmable teraFLOPS processor. The NoC architecture contains 80 tiles arranged as an 8×10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision FPMAC units which feature a single-cycle accumulation loop for high throughput. The five-port router combines 100 GB/s of raw bandwidth with a low fall-through latency of under 1 ns. The on-chip 2D mesh network provides a bisection bandwidth of 2 terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 million transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and a 1.07 V supply.
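The headline throughput figures can be cross-checked with simple arithmetic. The sketch below assumes that peak throughput is the product of tile count, FPMAC units per tile, 2 FLOPs per multiply-accumulate, and clock frequency; the function name is hypothetical, and the efficiency value is only a lower bound derived from the quoted "over 1.0 TFLOPS" at 97 W.

```python
def peak_flops(tiles, fpmacs_per_tile, flops_per_cycle, freq_hz):
    """Peak throughput if every FPMAC retires one multiply-accumulate per cycle."""
    return tiles * fpmacs_per_tile * flops_per_cycle * freq_hz


# Single FPMAC core: 2 FLOPs per cycle at 3.1 GHz.
fpmac_peak = peak_flops(1, 1, 2, 3.1e9)       # about 6.2e9 FLOPS

# 80-tile array at the 4 GHz design point, two FPMACs per tile.
array_peak = peak_flops(80, 2, 2, 4.0e9)      # about 1.28e12 FLOPS

# Lower bound on measured energy efficiency: at least 1.0 TFLOPS at 97 W.
gflops_per_watt = 1.0e12 / 97 / 1e9           # roughly 10.3 GFLOPS/W
```

Under these assumptions the single-core result matches the 6.2 GFLOPS quoted for the FPMAC, and the array result matches the 1.28 TFLOPS in the title of paper 4.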

It is clear that the realization of successful NoC designs requires well-balanced decisions at all levels: architecture, logic, circuit and physical design. Our results demonstrate that the NoC architecture successfully delivers on its promise of greater integration, high performance, good scalability and high energy efficiency.

Place, publisher, year, edition, pages
Institutionen för systemteknik, 2007. 93 p.
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1130
Keyword [en]
Chips, MOS transistors, Network-on-Chip (NoC), process technology, FPMAC
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:liu:diva-11439
ISBN: 978-91-85895-91-5 (print)
OAI: oai:DiVA.org:liu-11439
DiVA: diva2:17857
Public defence
2007-10-30, Visionen, Hus B, Campus US, Linköpings universitet, Linköping, 10:15 (English)
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2009-05-18
List of papers
1. A Six-Port 57GB/s Double-Pumped Non-blocking Router Core
2005 (English). In: Symposium on VLSI Circuits, Digest of Technical Papers, June 16-18, 2005, 268-269 p. Conference paper, Published paper (Refereed)
Abstract [en]

A six-port four-lane 57 GB/s router core features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current flit destination. This enables a 45% reduction in channel area, a 23% reduction in overall chip area, and up to a 3.8× reduction in peak channel power, depending on router traffic patterns. In a 150 nm six-metal process, the 12.2 mm² core contains 1.9 million transistors and operates at 1 GHz at 1.2 V.

Keyword
crossbar, router, low-power driver
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13108 (URN), 10.1109/VLSIC.2005.1469383 (DOI)
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2009-06-08
2. A 5 GHz floating point multiply-accumulator in 90 nm dual VT CMOS
2003 (English). In: IEEE International Solid-State Circuits Conference, Digest of Technical Papers, IEEE, 2003, Vol. 1, 334-335 p. Conference paper, Published paper (Refereed)
Abstract [en]

A 32 b single-cycle floating point accumulator that uses base 32 and carry-save format with delayed addition is described. Combined algorithmic, logic and circuit techniques enable multiply-accumulate operation at 5 GHz. In a 90 nm 7M dual-VT CMOS process, the 2 mm² prototype contains 230K transistors and dissipates 1.2 W at 5 GHz, 1.2 V and 25°C.

Place, publisher, year, edition, pages
IEEE, 2003
Series
Digest of Technical Papers, ISSN 0193-6530 ; Vol. 1
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13109 (URN), 10.1109/ISSCC.2003.1234322 (DOI), 0-7803-7707-9 (ISBN)
Conference
2003 IEEE International Solid-State Circuits Conference, Digest of Technical Papers. ISSCC
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2011-03-02. Bibliographically approved
3. A 6.2 GFLOPS Floating Point Multiply-Accumulator with Conditional Normalization
2006 (English). In: IEEE Journal of Solid-State Circuits, ISSN 0018-9200, Vol. 41, no 10, 2314-2323 p. Article in journal (Refereed), Published
Abstract [en]

A pipelined single-precision floating-point multiply-accumulator (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic with delayed addition is described. A combination of algorithmic, logic, and circuit techniques enables multiply-accumulate operations at speeds exceeding 3 GHz with single-cycle throughput. The optimizations allow removal of the costly normalization step from the critical accumulate loop. This logic is conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading-zero anticipator (LZA) and overflow prediction logic applicable to carry-save format are presented. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm² custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFLOPS of performance while dissipating 1.2 W at 3.1 GHz and a 1.3-V supply.

Keyword
CMOS logic circuits, adders, floating point arithmetic, multiplying circuits, 1.2 W, 1.3 V, 90 nm, CMOS digital integrated circuits, algorithmic technique, carry-save arithmetic, circuit technique, conditional normalization, delayed addition, floating-point multiply-accumulator, leading-zero anticipator, logic technique, overflow prediction logic
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13110 (URN), 10.1109/JSSC.2006.881557 (DOI)
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2009-06-08
4. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS
2007 (English). In: IEEE International Solid-State Circuits Conference, San Francisco, USA, 2007, IEEE, 2007, 98-99 p. Conference paper, Published paper (Refereed)
Abstract [en]

A 275 mm² network-on-chip architecture contains 80 tiles arranged as a 10 × 8 2D array of floating-point cores and packet-switched routers, operating at 4 GHz. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. The 65 nm 100M-transistor die is designed to achieve a peak performance of 1.0 TFLOPS at 1 V while dissipating 98 W.

Place, publisher, year, edition, pages
IEEE, 2007
Series
Digest of Technical Papers, ISSN 0193-6530
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13111 (URN), 10.1109/ISSCC.2007.373606 (DOI), 1-4244-0853-9 (ISBN)
Conference
IEEE International Solid-State Circuits Conference, ISSCC 2007, Digest of Technical Papers, San Francisco, CA, USA
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2011-03-02. Bibliographically approved
5. A 5.1GHz 0.34mm2 Router for Network-on-Chip Applications
2007 (English). In: 2007 IEEE Symposium on VLSI Circuits, IEEE, 2007, 42-43 p. Conference paper, Published paper (Refereed)
Abstract [en]

A five-port two-lane pipelined packet-switched router core with phase-tolerant mesochronous links forms the key communication fabric for an 80-tile network-on-chip (NoC) architecture. The 15-FO4 design combines 102 GB/s of raw bandwidth with a low fall-through latency of 980 ps. A shared crossbar architecture with a double-pumped crossbar switch enables a compact 0.34 mm² router layout. In a 65 nm eight-metal CMOS process, the router contains 210K transistors and operates at 5.1 GHz at 1.2 V, while dissipating 945 mW.
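The router figures above can be related to each other with a small calculation. The sketch below assumes that the raw bandwidth is the sum over all five ports and that each port transfers once per clock cycle; neither assumption is stated in the abstract, and the helper names are hypothetical.

```python
def bytes_per_port_per_cycle(total_bytes_per_s, ports, freq_hz):
    """Per-port transfer size implied by an aggregate raw bandwidth figure."""
    return total_bytes_per_s / (ports * freq_hz)


def latency_in_cycles(latency_s, freq_hz):
    """Convert a latency in seconds into clock cycles at the given frequency."""
    return latency_s * freq_hz


# 102 GB/s across five ports at 5.1 GHz.
per_port = bytes_per_port_per_cycle(102e9, 5, 5.1e9)   # about 4 bytes per cycle

# 980 ps fall-through latency at 5.1 GHz.
cycles = latency_in_cycles(980e-12, 5.1e9)             # just under 5 cycles
```

Under these assumptions the quoted bandwidth corresponds to roughly 4 bytes per port per cycle, and the 980 ps latency to about five clock cycles at 5.1 GHz.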

Place, publisher, year, edition, pages
IEEE, 2007
National Category
Engineering and Technology
Identifiers
urn:nbn:se:liu:diva-13112 (URN), 10.1109/VLSIC.2007.4342758 (DOI), 978-4-900784-04-8 (ISBN), 978-4-900784-05-5 (ISBN)
Conference
20th Symposium on VLSI Circuits
Available from: 2008-04-01 Created: 2008-04-01 Last updated: 2013-09-10
