Efficient High-Speed On-Chip Global Interconnects

Peter Caputa
ISBN 91-85457-87-6

Copyright ©Peter Caputa, 2006

Linköping Studies in Science and Technology
Dissertation No. 992
ISSN 0345-7524

Electronic Devices
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping
Sweden

Cover Image
Microphotograph of a test chip fabricated in 0.18 μm CMOS. The chip carries a velocity-of-light-limited 5.4 mm long global bus and a receiver based on the Synchronous Latency Insensitive Design scheme.

Printed by LiU-Tryck, Linköping University
Linköping, Sweden, 2006
Abstract

The continuous miniaturization of integrated circuits has opened the path towards System-on-Chip realizations. Process shrinking into the nanometer regime improves transistor performance, while the delay of global interconnects, connecting circuit blocks separated by a long distance, increases significantly. In fact, global interconnects extending across a full chip can have a delay corresponding to multiple clock cycles. At the same time, global clock skew constraints, not only between blocks but also along the pipelined interconnects, become even tighter. On-chip interconnects have always been considered $RC$-like, that is, exhibiting long $RC$-delays. This has motivated large efforts on alternatives such as on-chip optical interconnects, which have not yet been demonstrated, or complex schemes utilizing on-chip RF-transmission or pulsed current-mode signaling.

In this thesis, we show that well-designed electrical global interconnects, behaving as transmission lines, have the capacity of very high data rates (higher than can be delivered by the actual process) and support near velocity-of-light delay for single-ended voltage-mode signaling, thus mitigating the $RC$-problem. We critically explore key interconnect performance measures such as data delay, maximum data rate, crosstalk, edge rates and power dissipation. To experimentally demonstrate the feasibility and superior properties of on-chip transmission line interconnects, we have designed and fabricated a test chip carrying a 5 mm long global communication link. Measurements show that we can achieve 3 Gb/s/wire over the 5 mm long, repeaterless on-chip bus implemented in a standard 0.18 $\mu$m CMOS process, achieving a signal velocity of 1/3 of the velocity of light in vacuum.

To manage the problems due to global wire delays, we describe and implement a Synchronous Latency Insensitive Design (SLID) scheme, based on source-synchronous data transfer between blocks and data re-timing at the receiving block. The SLID technique not only mitigates unknown global wire delays, but also removes the need for controlling global clock skew. The high performance and robustness of the SLID method are demonstrated in practice through a successful implementation of a SLID-based, 5.4 mm long, on-chip global bus, supporting 3 Gb/s/wire and dynamically accepting $\pm 2$ clock cycles of data-clock skew, in a standard 0.18 μm CMOS process.

In the context of technology scaling, there is a tendency for interconnects to dominate chip power dissipation due to their large total capacitance. In this thesis we address the problem of interconnect power dissipation by proposing and analyzing a transition-energy cost model aimed at efficient power estimation of performance-critical buses. The model, which includes properties that closely capture effects present in high-performance VLSI buses, can be used to more accurately determine the energy benefits of, e.g., transition coding of bus topologies. We further show a power optimization scheme based on an appropriate choice of reduced voltage swing of the interconnect and scaling of the receiver amplifier. Finally, the power saving impact of swing reduction in combination with a sense-amplifying flip-flop receiver is shown on a microprocessor cache bus architecture used in industry.
Preface

This Ph.D. thesis presents the results of my research during the period from April 2001 to December 2005 at the Electronic Devices group, Department of Electrical Engineering, Linköping University, Sweden. The following papers are included in the thesis:


I have also been involved in research work, which has generated the following papers falling outside the scope of this thesis:


Contributions

The main contributions of this dissertation are as follows:

- A comprehensive analysis showing that the intrinsic limitations of electrical on-chip interconnects can be overcome by utilization of transmission line-style wires.

- A successful CMOS implementation of a global communication link showing the feasibility of transmission line-style interconnects achieving near velocity-of-light delay and high data rates.

- A motivation for the use of a Synchronous Latency Insensitive Design (SLID) scheme for integrated circuits, aimed at managing the timing problems caused by unknown on-chip global clock skew and wire delays. This includes validation of the technique by measurements on fabricated silicon.

- A bus transition-energy cost model including capacitances related to interconnect inter-layer coupling and the internal nodes of a realistic multi-stage transmitter - properties which were not treated in previous models.
## Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>AC</td>
<td>Alternating Current</td>
</tr>
<tr>
<td>AR</td>
<td>Aspect Ratio</td>
</tr>
<tr>
<td>ASIC</td>
<td>Application-Specific Integrated Circuit</td>
</tr>
<tr>
<td>CAD</td>
<td>Computer-Aided Design</td>
</tr>
<tr>
<td>CMOS</td>
<td>Complementary Metal-Oxide-Semiconductor</td>
</tr>
<tr>
<td>DC</td>
<td>Direct Current</td>
</tr>
<tr>
<td>DSM</td>
<td>Deep SubMicron</td>
</tr>
<tr>
<td>FIFO</td>
<td>First In First Out</td>
</tr>
<tr>
<td>GALS</td>
<td>Globally Asynchronous Locally Synchronous</td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
</tr>
<tr>
<td>IEEE</td>
<td>Institute of Electrical and Electronics Engineers</td>
</tr>
<tr>
<td>ILD</td>
<td>Inter-Layer Dielectric</td>
</tr>
<tr>
<td>ISI</td>
<td>InterSymbol Interference</td>
</tr>
<tr>
<td>ITRS</td>
<td>International Technology Roadmap for Semiconductors</td>
</tr>
<tr>
<td>LID</td>
<td>Latency Insensitive Design</td>
</tr>
<tr>
<td>MOSFET</td>
<td>Metal-Oxide-Semiconductor Field-Effect Transistor</td>
</tr>
<tr>
<td>NMOS</td>
<td>N-channel Metal-Oxide-Semiconductor</td>
</tr>
<tr>
<td>NoC</td>
<td>Network-on-Chip</td>
</tr>
<tr>
<td>PCB</td>
<td>Printed Circuit Board</td>
</tr>
<tr>
<td>PMOS</td>
<td>P-channel Metal-Oxide-Semiconductor</td>
</tr>
<tr>
<td>RC</td>
<td>Resistance-Capacitance</td>
</tr>
<tr>
<td>RF</td>
<td>Radio-Frequency</td>
</tr>
<tr>
<td>RLC</td>
<td>Resistance-Inductance-Capacitance</td>
</tr>
<tr>
<td>Rx</td>
<td>Receiver</td>
</tr>
<tr>
<td>SLID</td>
<td>Synchronous Latency Insensitive Design</td>
</tr>
<tr>
<td>SoC</td>
<td>System-on-Chip</td>
</tr>
<tr>
<td>Tx</td>
<td>Transmitter</td>
</tr>
<tr>
<td>VLSI</td>
<td>Very Large Scale Integration</td>
</tr>
</tbody>
</table>
Acknowledgments

Welcome to the page that will be read by most people! On my bookshelf, I have 42 licentiate and PhD theses and unfortunately, I don’t think I have actually read any of them from cover to cover. So if you are one of those who will not read this document carefully, I’m not at all disappointed in you - I probably would not have read this book either. The first time I sat in the audience of a PhD-dissertation, my thoughts were something like: “Look at that person up there! I would never be able to do that!”. Sometimes, life has a few twists in store for you - all of a sudden I find myself being the person up there, defending my own PhD-thesis. How on earth was I able to complete this work!? A couple of weeks ago, I saw a quote in the newspaper. It said:

“Always remember that you can do anything you want to do. If there is something you think you cannot do - it is simply because you don't want to.”

There is definitely a lot of truth in that one! Many people have supported and encouraged me during my years as a PhD-student, and they deserve my warmest thanks!

- My advisor and supervisor Prof. Christer Svensson, for giving me the opportunity to work in his inspirational, encouraging, and professional environment.

- Prof. Atila Alvandpour for all the long and loud (yes, we sometimes had to close the door due to all the noise...) work related debates, and non-work related discussions.

- Prof. Per Larsson-Edefors for his excellent skill of convincing me to join the Electronic Devices group. Had it not been for his patient, persistent, and never-ending persuasion, this thesis would never have existed.

- Lic. Eng. Stefan Andersson for keeping me company during endless late-evening undergraduate labs, the off-topic discussions, and work related advice (especially during critical tape-out nights).
- Henrik Fredriksson for the chip design summer of 2004. We spent the whole summer at the office designing a chip instead of having vacation like normal people. Thank God the chip worked!

- Dr. Daniel Wiklund for setting up the LaTeX template file for Lic. Eng. Stefan Andersson so that I could use his template file to create this document.

- All past and present members of the Electronic Devices group, especially Dr. Henrik Eriksson, Dr. Daniel Eckerbert, Dr. Kalle Folkesson, Lic. Eng. Mindaugas Drazdziulis, Dr. Ulf Nordquist, Dr. Tomas Henriksson, M.Sc. Martin Hansson, Dr. Håkan Bengtsson, M.Sc. Joacim Olsson, M.Sc. Behzad Mesgarzadeh, M.Sc. Rashad Ramzan, Dr. Darius Jakonis, Dr. Ingemar Söderqvist, Dr. Mattias Duppils, Dr. Annika Rantzer, Prof. Dake Liu, Ass. Prof. Jerzy Dabrovski, Naveed Ahsan, Adj. Prof. Aziz Ouacha, Isabel Ferrer, Rahman Aljasmi, Sriram Vangal, and Sreedhar Natarajan. Thanks for creating such a great research group.

- Our Research Engineer Arta Alvandpour for smoothly fixing hardware related computer problems, tool issues, and other practical headaches.

- Our secretary Anna Folkesson for help with administrative issues.

- All the people at the Intel Circuit Research Lab in Hillsboro, Oregon, USA, especially Dr. Ram Krishnamurthy, M.Sc. Mark Anders, Dr. Sanu Mathew, M.Sc. Steven Hsu, M.Sc. Matthew Haycock, M.Sc.EE Shekhar Borkar, and Karie Mawer. Thanks for making my internship at Intel such a great experience.

- My fantastic parents Jana and Milan Caputa who always support me regardless of what I decide to do in life.

- My outstanding friends for all the laughs and crazy things we have done together! What would life be without you?! I wonder if the really stupid things have been done or if they are waiting for us somewhere in the future? One thing I do know - there are countless laughs remaining before it’s over!

- Thanks to all the people I have forgotten to thank! You know my memory is not the best - I have to write notes to myself about everything.

Peter Caputa
Linköping, December 2005
## Contents

Abstract
Preface
Contributions
Abbreviations
Acknowledgments

### I Background

1 Introduction to On-Chip Interconnects
   1.1 A Short History of Integrated Circuits
   1.2 Devices
      1.2.1 MOSFET Performance
      1.2.2 Power Dissipation of CMOS Circuits
   1.3 Interconnect Parameters
      1.3.1 Interconnect Capacitance
      1.3.2 Interconnect Resistance
      1.3.3 Interconnect Inductance
   1.4 First-Order Wire Delay
   1.5 Scaling Trends and Future Challenges
   1.6 Outline and Scope of Thesis

2 Well-behaved Global Interconnects
   2.1 General Wire Modeling
      2.1.1 Signal Propagation on Transmission Lines
      2.1.2 Characteristic Impedance
      2.1.3 Transmission Line Transfer Function
      2.1.4 Signal Attenuation
      2.1.5 RC-Interconnect Delay
      2.1.6 Transmission Line Delay
      2.1.7 Signal Reflections
      2.1.8 Transmission Line Termination
      2.1.9 RC-domain and RLC-domain
      2.1.10 Frequency Response for a General Signaling Link
      2.1.11 Simulations of Wire Capacity
   2.2 Delay Measurements

3 Interconnect Models
   3.1 Problematic Frequency Components
   3.2 Distributed Interconnect Models
   3.3 Field Solvers
   3.4 From Field Solver to Simulation Model

4 Crosstalk
   4.1 Crosstalk Mechanisms
   4.2 Line Parameter Variations
   4.3 Measured Crosstalk-Induced Delay Variations
   4.4 Crosstalk Effects on Latencies and Data Rates

5 Synchronization
   5.1 Synchronous Clocking
   5.2 Mesochronous Clocking
   5.3 Plesiochronous Clocking
   5.4 Synchronous Latency Insensitive Design
      5.4.1 Recent Synchronization Approaches
      5.4.2 The SLID Design Flow
      5.4.3 SLID Synchronizer Implementation

6 Power-Efficient Interconnect Design
   6.1 RC-Interconnect Power Dissipation
   6.2 Transmission Line Power Dissipation
   6.3 Low-Swing Signaling
      6.3.1 Optimum-Voltage Swing Interconnect
      6.3.2 Investigated Optimum-Swing Signaling Link
   6.4 A Power-Efficient Cache Bus Technique
      6.4.1 Dynamic Buses
      6.4.2 Conventional Cache Bus Architecture
      6.4.3 Proposed Cache Bus Architecture
   6.5 Transition-Energy Cost Modeling
      6.5.1 Bus Coding
      6.5.2 Proposed Transition-Energy Cost Model
      6.5.3 Accuracy of Proposed Transition-Energy Model

7 Conclusions

### II Papers

8 Paper 1
   8.1 Introduction
   8.2 The Microstrip
   8.3 The Driver
   8.4 The Receiver
   8.5 On-Chip Power Optimization
   8.6 Signaling Link Latency
   8.7 Conclusion

9 Paper 2
   9.1 Introduction
   9.2 Basics of Wires
   9.3 A New Scheme for Global Interconnect
   9.4 A NoC Example
   9.5 Conclusion

10 Paper 3
   10.1 Introduction
   10.2 3D Field Solver RLCK Extraction
   10.3 Low-swing L1 Cache Bus
   10.4 Performance Comparison
   10.5 Conclusion

11 Paper 4
   11.1 Introduction
   11.2 Proposed DSM Bus Model
      11.2.1 Wire Model
      11.2.2 Driver Model
      11.2.3 Model parameters
   11.3 Transition Table Derivation
   11.4 Simulation vs. Proposed Model
      11.4.1 Spectre Simulation Model
      11.4.2 Proposed Model Performance
   11.5 Proposed Model vs. Previous Work
   11.6 Conclusion

12 Paper 5
   12.1 Introduction
   12.2 Interconnect Modeling
   12.3 Simulation Setup
   12.4 Performance and Robustness Comparison
   12.5 Conclusion

13 Paper 6
   13.1 Introduction
   13.2 Wire Performance
   13.3 Theory
      13.3.1 Modeling
      13.3.2 Performance
      13.3.3 Utilizing the Wire for Self Pre-emphasis
   13.4 Design Example
      13.4.1 Single Isolated Wire
      13.4.2 Wire Spacing for a Shielded and Non-shielded Bus
      13.4.3 Effects of Terminating the Line
   13.5 Conclusion

14 Paper 7
   14.1 Introduction
   14.2 Limitations of Classical Interconnect Design
   14.3 Well-Behaved Velocity-of-Light Limited Interconnects
      14.3.1 RC-domain and RLC-domain Borderline
      14.3.2 RLC-domain Wire Delay
      14.3.3 Return Paths
   14.4 Test Chip Design
      14.4.1 Interconnect Topology
      14.4.2 Test Circuit Functionality
   14.5 Measured Performance
      14.5.1 Interconnect Delay
      14.5.2 Crosstalk Induced Delay Variations
   14.6 Discussion
   14.7 Conclusion

15 Paper 8
   15.1 Introduction
   15.2 Simulation Model
   15.3 Analytical Model
   15.4 Interconnect Performance Evaluation
      15.4.1 Propagation Delay
      15.4.2 Data-Rate
      15.4.3 Power and Energy
   15.5 Conclusion

16 Paper 9

### III Appendix

A Transmission Line Equations
   A.1 Characteristic Impedance
   A.2 The Propagation Constant
   A.3 The Telegrapher’s Equation
   A.4 Frequency Response for a General Signaling Link
Part I

Background
Chapter 1

Introduction to On-Chip Interconnects

Integrated Circuits (ICs) have laid the foundation of today’s computerized society. To meet future performance and technology goals, not only the devices, but also the interconnects must scale accordingly. For each new technology node, the lateral and vertical geometries are shrunk by approximately 30%. However, technology scaling affects the properties of the devices and interconnects differently, which makes the interconnects the bottleneck in many digital systems. This chapter starts with a short history of ICs and then takes a deeper look at interconnect parameters and some transistor properties to explain their scaling behavior, which provides the motivation for this thesis.

1.1 A Short History of Integrated Circuits

Silicon technology has been the basis of microelectronics for a long time. One of the earliest steps towards the IC was taken in 1947 when Bardeen and Brattain demonstrated the first working point-contact solid-state amplifier. The name “transistor” was suggested several months after the first successful demonstration of the device. The original point-contact transistor structure, shown in Figure 1.1a, comprised a plate of n-type germanium and two line contacts of gold supported on a plastic wedge [1]. In 1958, Kilby demonstrated a miniaturized electronic circuit implementation [2] where he utilized germanium with etched mesa structures to separate the components, electrically connected by bonded gold wires. A year later, Robert Noyce fabricated the first IC with planar interconnects utilizing photolithography and etching techniques, methods which are still in use today. These first ICs were based on bipolar transistors and it would take almost 10 more years to come up with techniques which permitted the first stable Metal-Oxide-Semiconductor Field-Effect Transistor (MOSFET) IC. NMOS, PMOS, and CMOS technologies soon followed.

Figure 1.1: The evolution of miniaturization is remarkable. a) A single transistor in 1947 b) 230 million transistors on the Intel Pentium Extreme Edition 840 dual-core microprocessor in 2005.

Static CMOS technology uses a combination of PMOS and NMOS transistors to form logic gates. These logic gates process the “1s” and “0s” which are the information-carrying bits in a digital system. In 1965, Moore stated his famous law (Moore’s law) saying that the number of transistors on an IC would double every 12 months [3]. During the 1970s, the concept of device scaling was introduced [4] and the time frame for Moore’s law had to be revised to 24 months. It turns out that CMOS has very attractive scaling properties along with low stand-by power, low cost, and fast development, which has made it the number one technology choice for digital circuits. One important advantage of CMOS is the possibility of integrating both analog and digital circuits on the same die, which makes CMOS an excellent technology for future System-on-Chip (SoC) implementations.

In my view, Moore’s law is perhaps not much of a law, but rather a very powerful driving force which has strongly motivated companies in the microelectronics area to continuously develop and improve ICs throughout the decades. And surely, without having that famous law as inspiration, the 230-million transistor microprocessor from Intel shown in Figure 1.1b would not yet have been available to customers.
1.2 Devices

1.2.1 MOSFET Performance

Figure 1.2 shows an NMOS transistor in schematic and cross-section views. The transistor is used to control the amount of current flowing between the drain and source terminals by means of a voltage applied to the polysilicon gate terminal. The gate is resting on a thin layer of insulating SiO$_2$. When the gate voltage is increased above a certain threshold voltage, $V_T$, a conducting channel of electrons is formed in the positively (p) doped silicon substrate between the heavily negatively (n+) doped source and drain. For proper operation, a voltage also has to be applied to the silicon substrate (bulk). The described transistor behavior makes it possible to use the device as a switch in digital circuits, and either as an amplifying device or a voltage controlled resistor in analog circuits.

![NMOS Schematic Symbol](image1)

![NMOS Cross Section](image2)

Figure 1.2: An NMOS transistor in schematic and cross-section views.

The MOSFET high-frequency performance is often described by the cutoff frequency, $f_T$, which is the frequency at which the AC-signal short-circuit current gain is unity. In deep submicron technologies, transistors have very short channel lengths, $L_n$, causing the carriers to reach velocity saturation, $v_{sat}$, which decreases the electron mobility, $\mu_n$ [5]. For these short channel devices, $f_T$ can be expressed as [6]:

$$f_T = \frac{v_{sat}}{2\pi L_n} \quad (1.1)$$

In Eq. 1.1 we see that $L_n$ appears in the denominator of the $f_T$ expression. For each new generation of CMOS technology, $L_n$ is scaled down by approximately 30%, which rapidly improves the transistor high-frequency performance.
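To get a feeling for the numbers behind Eq. 1.1, the following Python sketch evaluates $f_T$ over a few technology generations. The saturation velocity of $10^5$ m/s is an assumed, typical textbook figure for electrons in silicon and is not taken from this thesis. The starting channel length of 0.18 μm and the number of generations are likewise only illustrative choices.

```python
import math

# Assumed typical electron saturation velocity in silicon (m/s); not from the thesis.
V_SAT = 1.0e5


def cutoff_frequency(l_n, v_sat=V_SAT):
    """Eq. 1.1: f_T = v_sat / (2 * pi * L_n), returned in Hz."""
    return v_sat / (2.0 * math.pi * l_n)


def scaled_lengths(l_start, generations, shrink=0.7):
    """Channel lengths for successive nodes, each about 30% shorter than the last."""
    return [l_start * shrink**k for k in range(generations + 1)]


if __name__ == "__main__":
    # Hypothetical starting point: a 0.18 um channel followed by three shrinks.
    for l_n in scaled_lengths(0.18e-6, 3):
        f_t = cutoff_frequency(l_n)
        print(f"L_n = {l_n * 1e9:6.1f} nm  ->  f_T = {f_t / 1e9:6.1f} GHz")
```

Since $f_T$ is inversely proportional to $L_n$, each 30% shrink raises the estimate by a factor of 1/0.7, or roughly 43%, regardless of the assumed $v_{sat}$.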
1.2.2 Power Dissipation of CMOS Circuits

Figure 1.3 shows a CMOS inverter loaded with an output capacitance $C$ (consisting of parasitic capacitances from the gate itself, the connecting wires, and the gate capacitances of succeeding logic blocks).

The power consumption of CMOS gates has three main sources. First of all, the full-swing charging and discharging of the output node results in switching power, $P_{sw}$, given by:

$$P_{sw} = \alpha f_{clk} CV_{dd}^2 \quad (1.2)$$

where $\alpha$ is the switching activity of the node, $f_{clk}$ the clock frequency, and $V_{dd}$ the supply voltage [7]. Secondly, there is short-circuit power caused by the non-zero rise and fall times of the input signals. Thus, for a short period of time during a transition, both NMOS and PMOS transistors are turned on, causing a short-circuit path between $V_{dd}$ and ground. Eq. 1.3 gives a simplified expression for the short-circuit power $P_{sc}$:

$$P_{sc} = \frac{\beta}{12} (V_{dd} - 2V_T)^3 \frac{\tau}{T} \quad (1.3)$$

where $\beta$ is the transistor gain factor (assumed to be the same for both NMOS and PMOS), $\tau$ is the signal rise (or fall) time, and $T$ the signal period [8]. $P_{sw}$ and $P_{sc}$ represent the dynamic power consumption of the device. Thirdly, leakage power, $P_{leak}$, mainly through sub-threshold leakage, gate leakage, and reverse-biased diode junction leakage is gaining importance [9]. When geometries are down-scaled, we get more transistors per area switching at higher frequencies, which increases the dynamic power consumption. To combat this dynamic power increase, $V_{dd}$ is scaled down, which reduces $P_{sw}$ and $P_{sc}$ according to Eq. 1.2 and Eq. 1.3. Lower $V_{dd}$ reduces the ability of the gate to control the channel, thus the gate insulator thickness and $V_T$ have to be reduced. This reduction causes $P_{\text{leak}}$ to dramatically increase and seriously threaten the power budget for large and complex VLSI circuits. Therefore, to address the trade-off between dynamic power and leakage power while maintaining the maximum drive current of the transistor, $V_{dd}$ and $V_T$ cannot be scaled at the same rate as the device geometries.
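The effect of supply scaling on the two dynamic power terms can be illustrated with a short Python sketch of Eq. 1.2 and Eq. 1.3. All numerical values below (switching activity, load capacitance, gain factor, threshold voltage, rise time) are hypothetical and chosen only to give plausible orders of magnitude. The clamp to zero for $V_{dd} \le 2V_T$ is an assumption of this sketch, reflecting that in the simplified model the NMOS and PMOS transistors are then never on at the same time.

```python
def switching_power(alpha, f_clk, c_load, v_dd):
    """Eq. 1.2: P_sw = alpha * f_clk * C * V_dd^2, in watts."""
    return alpha * f_clk * c_load * v_dd**2


def short_circuit_power(beta, v_dd, v_t, tau, period):
    """Eq. 1.3: P_sc = (beta / 12) * (V_dd - 2 * V_T)^3 * tau / T, in watts.

    Returns 0 when V_dd <= 2 * V_T (assumption of this sketch): with equal
    thresholds, both transistors cannot conduct simultaneously in that case.
    """
    overdrive = v_dd - 2.0 * v_t
    if overdrive <= 0.0:
        return 0.0
    return beta / 12.0 * overdrive**3 * tau / period


if __name__ == "__main__":
    # Hypothetical gate: 100 fF load, 10% activity, 1 GHz clock,
    # beta = 100 uA/V^2, V_T = 0.45 V, 100 ps edges.
    for v_dd in (1.8, 1.2):
        p_sw = switching_power(0.1, 1e9, 100e-15, v_dd)
        p_sc = short_circuit_power(1e-4, v_dd, 0.45, 100e-12, 1e-9)
        print(f"V_dd = {v_dd:.1f} V: P_sw = {p_sw * 1e6:.2f} uW, P_sc = {p_sc * 1e6:.3f} uW")
```

Because $P_{sw}$ depends quadratically on $V_{dd}$, lowering the supply from 1.8 V to 1.2 V in this hypothetical example cuts the switching power to $(1.2/1.8)^2 \approx 44\%$ of its original value, while the cubic dependence in Eq. 1.3 makes $P_{sc}$ fall even faster.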

1.3 Interconnect Parameters

An IC would be non-functional without wires connecting all devices on the die. When we connect two circuit nodes in a circuit schematic, we tend to think of the connection as an ideal wire without any delay or attenuation. However, real interconnects have resistance, capacitance, and inductance per unit length, making the wire an unintended parasitic circuit element. Early IC implementations ran at low frequencies, and the parasitic capacitances associated with transistors dominated over those associated with interconnects. These early processes typically had two metal layers and one polysilicon layer available for interconnect routing [10]. Increased integration and chip complexity have led to more interconnect layers. Future state-of-the-art processes are expected to have over ten metal layers, where low, thin, tight layers are used for local routing and high, thick, sparse layers are utilized for global interconnect and power [11]. Wire lengths tend to increase in today’s multi-GHz ICs. Signals are transmitted with fast rise times across global low-resistive copper interconnects with large cross-sectional area, surrounded by insulators with low dielectric constant. For these wires, inductive effects that were ignored in the past must be considered.

1.3.1 Interconnect Capacitance

The interconnects studied and implemented throughout this thesis are in the form of microstrips. A microstrip is a strip of metal over a return ground plane, as shown in Figure 1.4. $w$, $h$, and $d$ are the wire width, height, and length, while $t_{ox}$ is the distance to the underlying ground plane. Electric and magnetic fields are created around the microstrip in Figure 1.4 if a driving circuit injects a voltage and current signal, respectively, onto it. When two conducting objects are charged to different electric potentials, an electric field is created between them and a capacitance, $C$, arises. It always takes some non-zero time to build up a voltage between two objects. The capacitance can be seen as the reluctance of voltage to instantaneously increase or decrease in response to an input signal. The capacitance for the single isolated microstrip wire in Figure 1.4 can be approximated by:

\[ C = C_{pp} + C_{fringe} = \frac{w \epsilon_{ox}}{t_{ox}} d + \frac{2\pi \epsilon_{ox}}{\ln(2 + 4t_{ox}/h)} d \quad (1.4) \]

where \( C_{pp} \) is the “parallel-plate” (bottom area-to-substrate) capacitance, \( C_{fringe} \) is the fringing (side-wall-to-substrate) capacitance, and \( \epsilon_{ox} \) the permittivity of the insulator. Eq. 1.4 is a corrected version of Eq. 4.2 in [7]. This simplification is only useful for estimating rough capacitance values. In reality, a wire is surrounded by a large number of other wires on the same layer and adjacent layers of the multi-level interconnect hierarchies offered in today’s processes. Each wire is coupled not only to the grounded substrate, but also to neighboring wires, as shown in Figure 1.5. Modeling the capacitance in such a complex environment is a non-trivial task and still a topic of active research [12] [13] [14] [15]. Eq. 1.4 is not a good model for the capacitance of a wire in such a complicated three-dimensional interconnect structure. In fact, as technology is scaled, the denser inter-layer and intra-layer routing in modern processes makes inter-wire capacitances equally or more important than parallel-plate capacitances [16]. This effect is more notable in higher metal layers since the interconnect is routed farther away from the substrate. In practice, field solver extraction tools are utilized to numerically calculate the parasitic capacitance values of sophisticated interconnect geometries.

Figure 1.4: Single microstrip wire.

Figure 1.5: Multi-level interconnect capacitance.
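As a rough numerical illustration of Eq. 1.4, the Python sketch below evaluates the parallel-plate and fringing terms separately for a single isolated microstrip. The wire dimensions are hypothetical, and the relative permittivity of 3.9 is an assumed typical value for SiO$_2$ rather than a parameter of the process used in this thesis.

```python
import math

EPS_0 = 8.854e-12      # vacuum permittivity in F/m
EPS_OX = 3.9 * EPS_0   # assumed SiO2 permittivity (relative permittivity 3.9)


def microstrip_capacitance(w, h, t_ox, d, eps_ox=EPS_OX):
    """Eq. 1.4: return (C_pp, C_fringe) in farads for an isolated microstrip.

    w, h, d are wire width, height, and length; t_ox is the distance to the
    ground plane. All lengths are in meters.
    """
    c_pp = w * eps_ox / t_ox * d
    c_fringe = 2.0 * math.pi * eps_ox / math.log(2.0 + 4.0 * t_ox / h) * d
    return c_pp, c_fringe


if __name__ == "__main__":
    # Hypothetical wire: 1 um wide, 1 um thick, 1 um above ground, 5 mm long.
    c_pp, c_fringe = microstrip_capacitance(1e-6, 1e-6, 1e-6, 5e-3)
    print(f"C_pp     = {c_pp * 1e15:.1f} fF")
    print(f"C_fringe = {c_fringe * 1e15:.1f} fF")
    print(f"C_total  = {(c_pp + c_fringe) * 1e15:.1f} fF")
```

For these assumed dimensions, where the wire is as thick as it is wide, the fringing term is larger than the parallel-plate term, which illustrates why a pure parallel-plate estimate would underestimate the capacitance of narrow wires.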

1.3.2 Interconnect Resistance

The DC-resistance, \( R_{dc} \), for the microstrip shown in Figure 1.4 is given by:

\[ R_{dc} = \frac{\rho d}{A} = \frac{\rho d}{hw} = R_{sq} \frac{d}{w} \quad (1.5) \]

where \( \rho \) is the metal resistivity and \( A=wh \) is the wire cross section. The sheet resistance, \( R_{sq}=\rho/h \), which gives the resistance per square of interconnect, is normally tabulated for semiconductor processes. Eq. 1.5 is sufficient at low signal frequencies when the entire cross section of the wire carries the current. However, as the signal frequency increases, the current density starts to fall off exponentially into the conductor. This phenomenon is called skin effect since most of the current is now flowing through the “skin” of the conductor. The skin depth, \( \delta \), is the depth at which the current density has decreased to \( e^{-1} \) of its value at the surface and is given by:

\[
\delta = \sqrt{\frac{\rho}{\pi f \mu}}
\]  

(1.6)

where \( f \) is the signal frequency and \( \mu \) is the permeability [17]. The onset of the skin effect occurs for frequencies above \( f_s \), the skin frequency. For microstrip interconnects, \( f_s \) is the frequency at which \( \delta \) equals the conductor thickness and can be solved for by setting \( \delta=h \) and \( f=f_s \) in Eq. 1.6, which gives:

\[
f_s = \frac{\rho}{\pi h^2 \mu}
\]  

(1.7)

Skin effect decreases the effective cross sectional area that carries the current, which causes resistance to increase. Around 63 % of the total current flows within one skin depth, but one usually makes the approximation that all current flows uniformly within this outer shell of thickness \( \delta \), as shown in Figure 1.6. Thus, at high frequencies, the AC-resistance for a microstrip is given by:

\[
R_{ac,signal} = \frac{\rho}{\delta w} d = \frac{\sqrt{\rho \pi f \mu}}{w} d
\]  

(1.8)

Throughout this thesis, we have used a 0.18 \( \mu \)m CMOS process [18], which carries six metal layers. Metal 1 (M1) up to Metal 4 (M4) have the same thickness, while the top Metal 5 (M5) and Metal 6 (M6) are twice as thick. In this process, the skin frequency of a microstrip wire is calculated to 9.6 GHz for M5, M6 and 36.5 GHz for M1-M4. Figure 1.7 plots the skin depth versus frequency for pure aluminum (\( \rho_{Al}=2.65 \times 10^{-8} \Omega m \)), pure copper (\( \rho_{Cu}=1.67 \times 10^{-8} \Omega m \)), and each of the six metal layers in the target 0.18 \( \mu \)m CMOS process. Note that the skin depth variation for the various metal layers in the utilized process is caused by the difference in material resistivity.
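Eq. 1.6 and Eq. 1.7 are easy to evaluate numerically. The following sketch assumes a non-magnetic conductor ($\mu=\mu_0$) and a hypothetical metal thickness of 1 μm; the actual layer thicknesses and resistivities of the process are not reproduced here.

```python
import math

MU_0 = 4e-7 * math.pi  # vacuum permeability [H/m]
RHO_AL = 2.65e-8       # resistivity of pure aluminum [Ohm m], as quoted above


def skin_depth(rho, f, mu=MU_0):
    """Skin depth [m] at frequency f [Hz], Eq. 1.6."""
    return math.sqrt(rho / (math.pi * f * mu))


def skin_frequency(rho, h, mu=MU_0):
    """Frequency [Hz] at which the skin depth equals the thickness h, Eq. 1.7."""
    return rho / (math.pi * h**2 * mu)


h_example = 1e-6  # hypothetical conductor thickness [m]
f_s = skin_frequency(RHO_AL, h_example)
print(f"f_s = {f_s / 1e9:.2f} GHz")
print(f"delta(f_s) = {skin_depth(RHO_AL, f_s) * 1e6:.3f} um")
```

Since $f_s \propto 1/h^2$, doubling the metal thickness at constant resistivity lowers the skin frequency by a factor of four.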
When simulating interconnects, one must take into account both conductor and return path resistance. For a microstrip, the return current flows in the ground plane beneath the signal wire. A model for the distribution of current density in the ground plane is [19]:

\[
I(w_c) \approx \frac{I_0}{\pi t_{ox}} \cdot \frac{1}{1 + (w_c/t_{ox})^2}
\]

(1.9)

where \(I_0\) is the total signal current and \(w_c\) is the distance from the center line of the signal wire, as shown in Figure 1.6. According to Eq. 1.9, 80% of the total current flows in the return plane within a distance of $w_c = \pm 3t_{ox}$. One approach is now to model the return path as a wire of width $6t_{ox}$ that carries the current within one skin depth, i.e. with cross sectional area $A_{ret} = 6t_{ox}\delta$, which gives the following AC-resistance for the return:

$$R_{ac,return} = \frac{\rho}{A_{ret}} d = \frac{\rho}{6t_{ox}\delta} d = \frac{\sqrt{\rho \pi f \mu}}{6t_{ox}} d$$  \hspace{1cm} (1.10)

The total AC-resistance is then the sum of the contributions from the signal trace and the return path:

$$R_{ac} = R_{ac,signal} + R_{ac,return}$$  \hspace{1cm} (1.11)

To achieve a causal time-domain behavior of conductors with skin-effect, Arabi [20] showed that the high-frequency resistance, $R_{ac,tot}$ must be complex:

$$R_{ac,tot} = R_{dc} + R_{ac}(1 + j)$$  \hspace{1cm} (1.12)

where the imaginary term describes the inductive part of the skin effect.
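The resistance model of Eq. 1.5 and Eq. 1.8 to Eq. 1.12 can be collected in a few lines. The sketch below rests on two assumptions that should be checked against the original derivation: the DC-resistance is taken as $\rho d/(wh)$, and the return path is treated like the signal trace in Eq. 1.8 with the width $w$ replaced by $6t_{ox}$. The helper `return_current_fraction` integrates the Lorentzian current profile of the ground plane to check the 80 % figure. All names and the example geometry are hypothetical.

```python
import math

MU_0 = 4e-7 * math.pi  # vacuum permeability [H/m]


def return_current_fraction(k):
    """Fraction of the return current within |w_c| <= k * t_ox.

    Obtained by integrating I0 / (pi t_ox) / (1 + (w_c / t_ox)^2)
    from -k t_ox to +k t_ox, which gives (2 / pi) * atan(k).
    """
    return 2 / math.pi * math.atan(k)


def total_resistance(rho, w, h, d, t_ox, f, mu=MU_0):
    """Complex high-frequency resistance [Ohm] following Eq. 1.12."""
    r_dc = rho * d / (w * h)                     # DC-resistance (assumed form)
    r_surface = math.sqrt(rho * math.pi * f * mu)  # rho / delta [Ohm/square]
    r_signal = r_surface * d / w                 # Eq. 1.8
    r_return = r_surface * d / (6 * t_ox)        # Eq. 1.10 (assumed form)
    r_ac = r_signal + r_return                   # Eq. 1.11
    return r_dc + r_ac * (1 + 1j)                # Eq. 1.12


print(f"current within +/-3 t_ox: {return_current_fraction(3):.3f}")
z = total_resistance(rho=2.65e-8, w=1e-6, h=1e-6, d=1e-3, t_ox=1e-6, f=1e9)
print(f"R_ac,tot at 1 GHz: {z.real:.1f} + j{z.imag:.1f} Ohm")
```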

Highly resistive interconnects cause large signal attenuation. As chip complexity and device density increase, wires have to be made narrower, which increases wire resistance according to the basic Eq. 1.5. By making interconnects taller, the cross sectional area of the conductor grows, which helps to lower the resistance. For each new technology node, the wire Aspect Ratio ($AR = h/w$) has gradually changed from thin and wide to tall and narrow, as illustrated in Figure 1.8. In advanced processes, the top metal layer $AR$ is typically close to 2 [21]. Copper has recently replaced aluminum as interconnect material in top metal layers to further reduce wire resistance. Since copper, unlike aluminum, diffuses into most dielectrics, it must be encapsulated by a suitable metal (such as Ta, TaN) or dielectric (such as SiN, SiC) barrier. The technique of encapsulating copper interconnects is called dual-damascene processing, which becomes increasingly difficult as the thickness of the barrier scales with metal width [22].

### 1.3.3 Interconnect Inductance

As already mentioned, whenever a driving circuit forces a voltage and current signal onto a conductor, an electric and magnetic field is induced around it. The process of building up the current flow is not instantaneous but rather takes some finite amount of time. The unwillingness of the current to ramp up or down straightaway is called inductance, $L$. Inductance is only defined for current loops. Therefore, the inductance of a line is the self-inductance of the loop formed by the signal wire and its return. Any current injected into a system must somehow return to the source. Thus, when a current $I$ is injected into a signal conductor, there must be a net current of $-I$ flowing in a return path. Current can return through the substrate or through nearby DC-paths [23]. Some return current is in the form of non-negligible displacement current through interconnect capacitances [24]. Since inductance has a long range effect, the return paths are not known beforehand. In general, current will always return through the path of least effective impedance. Therefore, low-speed current follows the path of least resistance, while high-speed current flows through the path of least inductance located as close to the signal line as possible. This high-frequency behavior is called the proximity effect [24] [25]. Total inductance is the sum of external inductance (due to current flowing on the conductor surface) and internal inductance (due to current flowing inside the conductor). At very high frequencies, the current tends to crowd at the conductor surface due to the skin effect (described in Section 1.3.2). Thus, as frequency increases, the total inductance falls asymptotically towards the external inductance value [26]. One way to gain control over the wire behavior is to provide a dominant current return path close to the signal wire. Such a return path can be either a ground (or supply) plane below the signal wire, or in the form of coplanar returns, i.e. neighboring ground (or supply) conductors on the same level. If there is any change in termination of nearby wires or if any discontinuity occurs in the return path, the returning current must find a different way. This enlarges the loop area, which increases inductance and effective resistance. This in turn affects propagation delay [27].

Assume a signal loop $A$. The most basic definition of inductance originates from a fundamental relation between the voltage, $V$, and the current, $I$ associated with the loop. A voltage drop is created when the current flow through the loop changes:

\[ V_{A,\text{self}} = L \frac{dI_A}{dt} \]  

(1.13)

In cases when a conductor is completely surrounded by a homogeneous uniform dielectric, the capacitance, \( c \), and inductance, \( l \), per unit length are related by:

\[ lc = \epsilon \mu \]  

(1.14)

where \( \epsilon = \epsilon_r \epsilon_0 \) is the dielectric constant and \( \mu = \mu_r \mu_0 \) is the permeability [17]. For lossless lines, inductance can also be calculated from capacitance through Eq. 1.15, which describes the speed, \( \nu \), at which an electromagnetic wave travels through a medium [7]:

\[ \nu = \frac{1}{\sqrt{lc}} = \frac{1}{\sqrt{\epsilon \mu}} = \frac{c_0}{\sqrt{\epsilon_r \mu_r}} \]  

(1.15)

Thus, the maximum effective velocity for on-chip signals is around two times slower than in vacuum since \( \epsilon_r = 3.9 \) for SiO\(_2\), typically used as insulator. However, real wires are not lossless and a process stack typically includes insulators with different dielectric constants on adjacent levels.

Ruehli [28] introduced the concept of partial inductance to determine return current loops. In this method, the return path of a conductor segment is assumed to close at infinity. These infinity return paths cancel out in a final subtraction. For a rectangular microstrip conductor, as the one shown in Figure 1.4, assuming uniform current distribution, the closed form expression of partial self inductance is given by:

\[ L = \frac{\mu_0 d}{2\pi} \left( \ln \left( \frac{2d}{w+h} \right) + \frac{1}{2} + \frac{0.2235(w+h)}{d} \right) \]  

(1.16)

where \( d, w, h \) are the wire length, width, and thickness, respectively.

Moreover, whenever there are two loops of current (loop \( A \) and \( B \)), which exist close to each other, the change in current flow of loop \( B \) creates a magnetic flux, which passes through loop \( A \) and induces a voltage \( V_{A,\text{mut}} \) in it:

\[ V_{A,\text{mut}} = M \frac{dI_B}{dt} \]  

(1.17)

The amount of magnetic field coupling between the loops is the mutual inductance, \( M \). As for partial inductance, the partial mutual inductance between two parallel conductors of equal length is given by:

\[ M = \frac{\mu_0 d}{2\pi} \left( \ln \left( \frac{2d}{s} \right) - 1 + \frac{s}{d} \right) \]  

(1.18)

where \( s \) is the conductor separation [29]. An excellent comparison of several other partial self and mutual inductance formulas can be found in [30].
1.4 First-Order Wire Delay

One of the most important parameters describing the performance of a wire or group of wires (bus) is delay (or latency). The attenuation for most integrated circuit wires is very large, causing $RC$-charging to dominate the wire delay behavior. For this case, the conductor can be described by a $\pi$-circuit consisting of a series resistor, $R_w$ (the wire DC-resistance), and two capacitors having half the wire capacitance, $C_w$, each. Integrated circuit interconnects typically have an open far end, making the wire load, $C_L$, purely capacitive. Figure 1.9 shows such a wire connected to a driver, where $R_S$ and $C_S$ are the driver source resistance and capacitance, respectively.

![Driver model](image)

Figure 1.9: Driver, $\pi$-model wire, and load.

If we assume that $RC$-charging is dominating the behavior of the driver-wire-termination structure in Figure 1.9, the delay can be described by the Elmore delay expression [7]:

$$t_{d,Elmore} = (R_S(C_S + C_w + C_L) + R_w(C_w/2 + C_L))\ln(2)$$

(1.19)

where $R_wC_w$ is a dominating factor. From Eq. 1.4 and Eq. 1.5, we know that $C_w \propto dw/t_{ox}$ and $R_w \propto d/(wh)$, which makes $R_wC_w \propto d^2/(ht_{ox})$. When we look at how interconnects have been scaled throughout the evolution of some state-of-the-art modern processes, we can make the reasonable assumption that for each new technology node, the wire dimensions $(w, h, t_{ox})$ are scaled by the transistor scale factor $S_{tr}=0.7$ (30% down-sizing) [21] [31] [32] [33]. When the lateral and vertical dimensions are shrunk by approximately 30%, one would expect chip size to decrease by 50% for each new generation. However, as new designs add more transistors to further exploit integration, the average die size tends to increase by approximately 7% each year [34]. Therefore, the relative length of global interconnects is scaled by the factor $S_d=1.07$. Using these assumptions, $R_wC_w$ is scaled by $S_d^2/S_{tr}^2=2.34$, roughly doubling the wire $RC$-delay.
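The Elmore expression and the scaling argument can be checked with a short sketch. It assumes the $\pi$-model of Figure 1.9, so that the wire resistance charges half of the wire capacitance plus the load, and it uses the scale factors $S_d=1.07$ and $S_{tr}=0.7$ from the text. The function names and the example values are hypothetical.

```python
import math


def elmore_delay(r_s, c_s, r_w, c_w, c_l):
    """50 % delay [s] of the driver, pi-model wire and load in Figure 1.9."""
    return (r_s * (c_s + c_w + c_l) + r_w * (c_w / 2 + c_l)) * math.log(2)


def rc_scaling_per_node(s_d=1.07, s_tr=0.7):
    """Scale factor of R_w * C_w, which is proportional to d^2 / (h * t_ox)."""
    return s_d**2 / s_tr**2


# Hypothetical wire: 100 Ohm and 1 pF in total, ideal driver, no load.
t_1 = elmore_delay(r_s=0.0, c_s=0.0, r_w=100.0, c_w=1e-12, c_l=0.0)
# Doubling the length doubles both R_w and C_w.
t_2 = elmore_delay(r_s=0.0, c_s=0.0, r_w=200.0, c_w=2e-12, c_l=0.0)

print(f"delay ratio for 2x length: {t_2 / t_1:.1f}")
print(f"R_w C_w scaling per node:  {rc_scaling_per_node():.2f}")
```

The first number illustrates the quadratic dependence of the wire term on length, the second reproduces the factor 2.34 quoted above.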

This is the traditional view of integrated circuit wires, characterized by large delays (much larger than velocity of light delays). In fact, the industry standard is...
1.5 Scaling Trends and Future Challenges

The most obvious result of technology scaling is its impact on transistor compactness. Over the last forty years, we have seen a spectacular increase in integration density as well as computational complexity and performance. In the near future, device scaling will continue and most probably lead to billion transistor VLSI designs. The enormous complexity and countless degrees of freedom in these designs will present interesting challenges for the manufacturing community, circuit designers, and CAD-tools. Table 1.1 shows some predicted scaling trends from the 2004 International Technology Roadmap for Semiconductors (ITRS) [11].

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology node [nm]</td>
<td>90</td>
<td>65</td>
<td>45</td>
<td>32</td>
<td>22</td>
</tr>
<tr>
<td>Nominal $V_{dd}$ [V]</td>
<td>1.2</td>
<td>1.1</td>
<td>1.0</td>
<td>0.9</td>
<td>0.8</td>
</tr>
<tr>
<td>Saturation $V_T$ [V]</td>
<td>0.2</td>
<td>0.18</td>
<td>0.15</td>
<td>0.11</td>
<td>0.1</td>
</tr>
<tr>
<td>Gate leakage [A/cm$^2$]</td>
<td>4.5 x 10$^2$</td>
<td>9.3 x 10$^2$</td>
<td>1.9 x 10$^3$</td>
<td>7.7 x 10$^3$</td>
<td>1.9 x 10$^4$</td>
</tr>
<tr>
<td>Subthreshold leakage [$\mu$A/$\mu$m]</td>
<td>0.05</td>
<td>0.07</td>
<td>0.1</td>
<td>0.3</td>
<td>0.5</td>
</tr>
<tr>
<td>Peak $f_T$ [GHz]</td>
<td>120</td>
<td>200</td>
<td>280</td>
<td>400</td>
<td>700</td>
</tr>
<tr>
<td>NAND2 gate delay (Fan-out 3)[ps]</td>
<td>23.9</td>
<td>16.2</td>
<td>9.9</td>
<td>6.5</td>
<td>3.7</td>
</tr>
<tr>
<td>Number of metal levels</td>
<td>10</td>
<td>11</td>
<td>12</td>
<td>12</td>
<td>14</td>
</tr>
<tr>
<td>Metal1 $AR$ (for Cu)</td>
<td>1.7</td>
<td>1.7</td>
<td>1.8</td>
<td>1.9</td>
<td>2</td>
</tr>
<tr>
<td>$RC$-delay[ps], 1 mm Metal1</td>
<td>224</td>
<td>384</td>
<td>616</td>
<td>970</td>
<td>2008</td>
</tr>
<tr>
<td>Global metal $AR$ (for Cu)</td>
<td>2.1</td>
<td>2.2</td>
<td>2.3</td>
<td>2.4</td>
<td>2.5</td>
</tr>
<tr>
<td>$RC$-delay[ps], 1 mm Global metal</td>
<td>55</td>
<td>92</td>
<td>143</td>
<td>248</td>
<td>452</td>
</tr>
<tr>
<td>ILD effective $\kappa$</td>
<td>3.1-3.6</td>
<td>2.7-3.0</td>
<td>2.3-2.6</td>
<td>2.0-2.4</td>
<td>&lt;2.0</td>
</tr>
</tbody>
</table>

Table 1.1: Predicted scaling trends according to ITRS 2004.

For each new technology generation, the device packaging density and maximum operating frequency are expected to increase. Although supply and threshold voltages are progressively scaled down, the serious problem of leakage currents is projected to become even worse. Table 1.1 also clearly shows a catastrophic $RC$-delay trend, not only for low-level interconnects but also for globally routed wires. A general concern is that low-$\kappa$ dielectrics are not being introduced at the pace required by the roadmap due to reliability and yield issues associated with the integration of these new materials with dual-damascene copper. In the long term, the increase of copper effective resistivity due to electron surface scattering effects is expected to become an important factor. As frequency of operation increases, inductive effects come into the picture and additional ground planes (increasing the total number of metal layers) may be required for inductive shielding. An interesting near term solution projected by the ITRS is 3D-interconnects, which are stacked layers of either devices or separate dies connected through the package by bond pads or through-wafer contacts.

Since the dawn of semiconductor technology, there has been a discussion about the physical limit of how small the transistors can be. Today’s prediction is a smallest possible gate length of around 10 nm, based on derivations of the minimum energy that must be transferred in each switching event. In these small devices, effects such as direct tunneling between source and drain are also taken into account when forecasting the fundamental limit [36]. In a perfect world, the ultimate processing technology could have these extremely tiny 10 nm transistors integrated with room-temperature, superconductive interconnects surrounded by a vacuum insulator. Even for such a process, Meindl [37] calculated that an interconnect longer than 30 \( \mu \)m would have a latency exceeding that of a minimum-sized 10 nm transistor. With that scenario in mind, interconnects will have a massive impact on circuit performance in the future, and researchers in this field will most probably not be out of work for some time to come.

1.6 Outline and Scope of Thesis

As discussed in Section 1.4 and Section 1.5, future process scaling will dramatically affect the properties of long global on-chip interconnections. Classically, on-chip wires and buses have been engineered in a way that makes \( RC \)-behavior dominate. The traditional approach of dealing with \( RC \)-limited interconnects is to insert repeaters, which in the best case makes wire delay proportional to wire length. Repeater inserted \( RC \)-interconnects are unfortunately characterized by limited bandwidth, large delays and high power consumption. This undesirable situation provides motivation for research aimed at improving interconnect capacity.

We have made a critical analysis of the intrinsic limitations of electrical on-chip interconnects and found that the limitations can be overcome. By utilizing two upper-level metals, one for the wires and one as a ground return plane, a signal conductor will behave as a microwave-style transmission line, which allows for velocity-of-light delay if properly dimensioned. In Paper 1, Paper 2, and Paper 6, we present our analysis of wire properties together with design constraints for well-behaved global interconnects. To demonstrate the feasibility of transmission line-style interconnects, we present a chip implementation of such a structure in Paper 7. Our measurements verify that it is possible to design electrical interconnects with velocity-of-light delay and high bandwidth properties in a standard 0.18 μm CMOS process. To successfully implement transmission line-style interconnects, it is first necessary to utilize a good wire model in the simulation phase of a chip design. In Paper 5 and Paper 8, we show how the choice of interconnect model affects the observed performance in terms of latency, data rate, and power dissipation. We show that the classical simple $RC$-wire model is insufficient and strongly underestimates critical signal integrity issues such as overshoot, ground noise, crosstalk, and edge rates.

Synchronous clocking of integrated circuits is the dominating timing approach used today. The success of this method relies on low-skew clocks and control of global wire delays. However, process scaling not only allows for higher clock frequencies and increased circuit complexity, but also results in longer wire delays, which all together makes it more difficult to meet the required timing constraints. In Paper 9, we describe and practically demonstrate a Synchronous Latency Insensitive Design (SLID) scheme to resolve the timing closure problems due to unknown global wire delays, clock skew and other timing uncertainties in integrated circuits.

Interconnects tend to dissipate an increasingly larger portion of total chip power. One method to address the power problem is to utilize transition coding on global buses, i.e. encoding power-hungry data patterns into more power efficient transitions. To make a correct decision on which transitions would benefit from coding, it is relevant to start from an accurate transition cost model. In Paper 4, we propose and analyze a transition-energy cost model, which includes a multi-stage transmitter and wire properties that more closely capture effects present in high-performance buses. Also, in Paper 1, we show a power optimization scheme based on proper choice of reduced voltage swing of the interconnect and scaling of the receiver amplifier. Finally, in Paper 3, the power benefit of swing reduction in combination with a sense-amplifying flip-flop receiver is compared with the dynamic L1 cache bus architecture employed in the Intel Pentium 4 microprocessor.

References


Physical Dimensions”, in IEEE Journal of Solid-State Circuits, vol. 9,


sign”, in Proceedings of the International Symposium on Quality Electronic


Effectively Reduce Total Standby Leakage in Nano-Scale CMOS Circuits”,

Two Level Metal Process With Planarization Techniques”, in IEEE VLSI


Based on Results of Conformal Mapping Method”, in IEEE Transactions on

Delay, and Crosstalk in VLSI”, in IEEE Transactions on Semiconductor

Extraction of Interconnect Capacitances for Multilayer VLSI Circuits”, in
IEEE Transactions on Computer-Aided Design of Integrated Circuits and

pacitance Characterization using Charge-based Capacitance Measurement
(CBCM) Technique and 3-D Simulation”, in Proceedings of the Custom In-


[31] S. Thompson, et al., “A 90 nm Logic Technology Featuring 50 nm Strained Silicon Channel Transistors, 7 layers of Cu Interconnects, low k ILD, and $1 \mu m^2$ SRAM Cell”, in 2002 IEDM Technical Digest, pp. 61-64.


Chapter 2

Well-behaved Global Interconnects

This chapter presents design principles aimed at overcoming the intrinsic limitations of on-chip global interconnects. We first start with a discussion on general wire properties. The chapter ends with a summary of measured results of a fabricated test chip carrying a velocity-of-light limited global bus.

2.1 General Wire Modeling

2.1.1 Signal Propagation on Transmission Lines

The most general view of a wire is that of a transmission line, which is only valid for interconnects with a well-defined return path, such as a microstrip. When a signal propagates across a microstrip, an electric and magnetic field is induced around the conductor. The energy stored in the magnetic field for an infinitesimal section, $dx$, of the wire can be represented by a series inductance, $l dx$. Similarly, a shunt capacitor, $c dx$, represents the energy stored in the electric field between the signal conductor and the underlying return path. However, real wires are not ideal so loss mechanisms must be added to the model. A series resistor, $r dx$, captures the finite wire conductance and, since the surrounding dielectric is not a perfect insulator, a shunt conductance, $g dx$, to ground is inserted to capture dielectric loss. Figure 2.1a shows an infinitesimal section, $dx$, of such a transmission line where $r$ includes skin effect and all line parameters are given per unit length.

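The change in voltage, $V$, along a transmission line is the drop across the series elements, while the change in current, $I$, is the current through the parallel elements:

$$\frac{\partial V}{\partial x} = -\left(rI + l \frac{\partial I}{\partial t}\right), \qquad \frac{\partial I}{\partial x} = -\left(gV + c \frac{\partial V}{\partial t}\right)$$  \hspace{1cm} (2.1)

By differentiating the first relation with respect to $x$ and inserting the second relation into the result, we get: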

$$\frac{\partial^2 V}{\partial x^2} = rgV + (rc + lg) \frac{\partial V}{\partial t} + lc \frac{\partial^2 V}{\partial t^2}$$  \hspace{1cm} (2.2)

Eq. 2.2 is a general description of signal propagation across a transmission line. Later on, we describe how this expression gives the basic understanding of the mechanisms behind signal propagation on transmission lines with various properties.

### 2.1.2 Characteristic Impedance

The transmission line characteristic impedance, $Z_c$, is the relation between voltage and current at any point along the line. $Z_c$ is the same looking into an arbitrary infinitesimal section of the wire. In Appendix A.1, we set $zdx=(r+j\omega l)dx$ (impedance), and $ydx=(g+j\omega c)dx$ (admittance) and use the distributed wire representation in Figure 2.1b to derive the general expression for the line characteristic impedance:

$$Z_c = \sqrt{\frac{r + j\omega l}{g + j\omega c}}$$  \hspace{1cm} (2.3)

Hence, $Z_c$ of an infinitely long transmission line is a complex and frequency-dependent value. At high frequencies ($\omega = 2\pi f \to \infty$), $Z_c$ approaches the value $Z_0 = \sqrt{l / c}$. This relation is also obtained for a lossless line, $r=0=g$, and for the special case when $rc=gl$.
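The frequency dependence of Eq. 2.3 is illustrated by the sketch below, which takes the high-frequency limit to be $Z_0=\sqrt{l/c}$. The line parameters are hypothetical: $l$ and $c$ are chosen to give $Z_0 = 50\ \Omega$ and a velocity of $1.5\cdot10^8$ m/s, and $r = 5\ \Omega$/mm is assumed to be frequency independent, i.e. skin effect is ignored.

```python
import cmath
import math

Z0 = 50.0            # assumed high-frequency impedance [Ohm]
V = 1.5e8            # assumed propagation velocity [m/s]
L_PER_M = Z0 / V     # inductance per unit length [H/m]
C_PER_M = 1 / (Z0 * V)  # capacitance per unit length [F/m]
R_PER_M = 5000.0     # hypothetical series resistance [Ohm/m]


def characteristic_impedance(r, l, c, f, g=0.0):
    """Complex characteristic impedance Z_c of Eq. 2.3 at frequency f [Hz]."""
    omega = 2 * math.pi * f
    return cmath.sqrt((r + 1j * omega * l) / (g + 1j * omega * c))


for f in (1e7, 1e9, 1e11):
    z_c = characteristic_impedance(R_PER_M, L_PER_M, C_PER_M, f)
    print(f"f = {f:.0e} Hz: |Z_c| = {abs(z_c):7.1f} Ohm, "
          f"phase = {math.degrees(cmath.phase(z_c)):6.1f} deg")
```

At low frequencies the resistive term dominates and $|Z_c|$ is far above $Z_0$ with a phase close to $-45°$. Toward high frequencies $Z_c$ becomes real and approaches $Z_0$.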

2.1.3 Transmission Line Transfer Function

The transmission line transfer function, $H$, is obtained by solving for a voltage $V(x, \omega)$, as a function of position $x$. The voltage drop across the incremental resistor and inductor is:

$$\frac{\partial V(x, \omega)}{\partial x} = -r I(x, \omega) - j \omega l I(x, \omega) = -(r + j \omega l) I(x, \omega) \quad (2.4)$$

Inserting $I(x, \omega) = V(x, \omega) / Z_c$, where $Z_c$ is given by Eq. 2.3 gives:

$$\frac{\partial V(x, \omega)}{\partial x} = -\sqrt{(r + j \omega l)(g + j \omega c)} V(x, \omega) \quad (2.5)$$

The solution to this first-order differential equation is the transmission line transfer function:

$$H(x, \omega) = \frac{V(x, \omega)}{V(0, \omega)} = e^{-\sqrt{(r + j \omega l)(g + j \omega c)} x} \quad (2.6)$$

2.1.4 Signal Attenuation

The magnitude, $V(x, \omega)$, of a traveling wave at any point along a transmission line for a given frequency is related to the initial magnitude, $V(0, \omega)$, through Eq. 2.6, which contains the so called propagation constant $\gamma$:

$$\gamma = \sqrt{(r + j \omega l)(g + j \omega c)} \quad (2.7)$$

When the losses ($r$ and $g$) are small, $\gamma$ can be simplified (derived in Appendix A.2) to:

$$\gamma \approx j \omega \sqrt{lc} + \frac{r}{2 Z_0} + \frac{g Z_0}{2} = j \omega \sqrt{lc} + \alpha_r + \alpha_g \quad (2.8)$$

where $\alpha_r = r / 2 Z_0$ and $\alpha_g = g Z_0 / 2$ are the attenuation factors due to resistive and dielectric loss, respectively. In general, dielectric loss is described by complex and frequency dependent expressions [1], but for most on-chip insulating materials, we can assume a leakage conductance of $g=0$, since conductor losses are dominant [2] [3] [4]. Thus, if $\alpha_g$ is ignored, the wire transfer function for lossy on-chip conductors is simplified to:

$$H(x, \omega) = \frac{V(x, \omega)}{V(0, \omega)} = e^{-j \omega \sqrt{lc} x} e^{-\frac{r}{2 Z_0} x} \quad (2.9)$$

Also, $g=0$ simplifies Eq. 2.2 to:

$$\frac{\partial^2 V}{\partial x^2} = rc \frac{\partial V}{\partial t} + lc \frac{\partial^2 V}{\partial t^2}$$

(2.10)
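To get a feeling for how good the low-loss approximation is, the sketch below compares the exact propagation constant of Eq. 2.7 with the approximation of Eq. 2.8, taking the phase term as $j\omega\sqrt{lc}$ and $Z_0=\sqrt{l/c}$. The line parameters are hypothetical ($Z_0=50\ \Omega$, velocity $1.5\cdot10^8$ m/s, $r=5\ \Omega$/mm, $g=0$) and skin effect is ignored.

```python
import cmath
import math

Z0 = 50.0               # assumed high-frequency impedance [Ohm]
V = 1.5e8               # assumed propagation velocity [m/s]
L_PER_M = Z0 / V        # [H/m]
C_PER_M = 1 / (Z0 * V)  # [F/m]
R_PER_M = 5000.0        # hypothetical series resistance [Ohm/m]


def propagation_constant(r, l, c, f, g=0.0):
    """Exact propagation constant gamma [1/m], Eq. 2.7."""
    omega = 2 * math.pi * f
    return cmath.sqrt((r + 1j * omega * l) * (g + 1j * omega * c))


def low_loss_gamma(r, l, c, f, g=0.0):
    """Low-loss approximation of gamma [1/m], Eq. 2.8."""
    omega = 2 * math.pi * f
    z0 = math.sqrt(l / c)
    return 1j * omega * math.sqrt(l * c) + r / (2 * z0) + g * z0 / 2


def wire_transfer(x, gamma):
    """Transfer function H(x, omega) = exp(-gamma x), Eq. 2.6."""
    return cmath.exp(-gamma * x)


exact = propagation_constant(R_PER_M, L_PER_M, C_PER_M, 10e9)
approx = low_loss_gamma(R_PER_M, L_PER_M, C_PER_M, 10e9)
print(f"alpha exact = {exact.real:.2f} Np/m, approx = {approx.real:.2f} Np/m")
print(f"|H| over 5 mm (approx) = {abs(wire_transfer(5e-3, approx)):.3f}")
```

For these values $r$ is about a quarter of $\omega l$ at 10 GHz, and the two attenuation constants agree to within about one percent.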

### 2.1.5 RC-Interconnect Delay

Most integrated circuit wires are designed in a way that makes the resistive attenuation very large. As a result, the $rc$-term on the right hand side of Eq. 2.10 dominates and signal propagation is in principle described by a diffusion equation:

$$\frac{\partial^2 V}{\partial x^2} = rc \frac{\partial V}{\partial t}$$

(2.11)

Thus, the signal diffuses slowly down the line, and the edges are widely dispersed with distance. These wires can be described by simplified $RC$-chains. The delay for a signaling link which includes such an $RC$-domain wire can be approximated by the Elmore delay formula already presented in Eq. 1.19. This is the classical view of an integrated circuit wire, characterized by large delays (much larger than velocity-of-light delays). A typical solution to improve the latency of $RC$-wires is to split the interconnect into an optimum number of equal-length segments, and to insert an inverter (repeater) between each such segment. By inserting an optimum number of repeaters, one can make the total wire delay proportional to $d$ (wire length) instead of $d^2$ as without repeaters [2] [5].

### 2.1.6 Transmission Line Delay

If the series resistance of a conductor is sufficiently reduced, the $lc$-term on the right hand side of Eq. 2.10 will dominate over the $rc$-term and signal propagation will in principle be described by a wave equation:

$$\frac{\partial^2 V}{\partial x^2} = lc \frac{\partial^2 V}{\partial t^2}$$

(2.12)

The interconnect behaves as a transmission line and the signal mainly travels across it as a wave (with a diffusive component), experiencing only limited waveform distortion. From Eq. 2.12, which represents an ideal lossless line, the effective propagation velocity is given by $\nu=1/\sqrt{lc}$. According to Eq. 1.14, this is equivalent to $\nu=1/\sqrt{\varepsilon \mu}$. Thus, the wire delay for such a completely lossless line is given by:

$$t_d = \frac{d}{\nu} = d \sqrt{\varepsilon \mu} = \frac{d \sqrt{\varepsilon_r \mu_r}}{c_0}$$

(2.13)

where $d$ is the wire length, $\epsilon_r$ is the relative dielectric constant, $\mu_r$ is the relative permeability, and $c_0$ is the velocity of light in vacuum. With SiO$_2$ as insulating dielectric ($\epsilon_r=3.9$), the maximum effective velocity is $0.5c_0=1.5\cdot10^8$ m/s. Eq. 2.13 represents the lowest possible delay, and adding loss will increase it. Repeaters do not improve the latency of these interconnects as their delay is already related to the velocity of light [6].

![Diagram of a driver connected to a lossless terminated transmission line.](image)
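As a quick numerical check of Eq. 2.13, the sketch below evaluates the lossless delay for a 5 mm wire in SiO$_2$. The length is taken from the bus length quoted in the abstract purely as an illustration, and the function name is hypothetical.

```python
import math

C_0 = 299_792_458.0  # velocity of light in vacuum [m/s]


def lossless_delay(d, eps_r=3.9, mu_r=1.0):
    """Velocity-of-light limited delay [s] of a lossless line, Eq. 2.13."""
    return d * math.sqrt(eps_r * mu_r) / C_0


d = 5e-3  # wire length [m]
t_d = lossless_delay(d)
print(f"velocity = {d / t_d / C_0:.3f} c0")
print(f"delay over 5 mm = {t_d * 1e12:.1f} ps")
```

The result, roughly 33 ps, is the lower bound for a 5 mm wire in this dielectric. Any loss on the line adds to it.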

2.1.7 Signal Reflections

Since voltages and currents travel as waves in high-speed signaling, they generate reflections when passing changes in impedance. Figure 2.2 shows a lossless, finite-length transmission line with characteristic impedance $Z_0$ connected to a driver with output impedance $Z_S$ and loaded by an impedance $Z_L$. The driver transmits a voltage step, $V_0$, which is initially divided between the driver output impedance and the line characteristic impedance. Hence, the initial voltage, $V_i$, entering the source end of the line is given by:

$$V_i = V_0 \frac{Z_0}{Z_0 + Z_S} \quad (2.14)$$

The initial wave travels down the line, and a reflection occurs when it reaches the load impedance. As waves can be superposed, the effective voltage (or current) seen at the load is the sum of the incident and reflected voltages (or currents). The magnitude of the reflected wave is determined by the load reflection coefficient, $\Gamma_L$, calculated through the Telegrapher’s equation (derived in Appendix A.3):

$$\Gamma_L = \frac{Z_L - Z_0}{Z_L + Z_0} \quad (2.15)$$
2.1.8 Transmission Line Termination

For an open termination (\( Z_L = \infty \) making \( \Gamma_L = 1 \)) in Figure 2.2, the reflected and incident waves have identical amplitudes. In this case, the effective far-end voltage equals twice the incident voltage amplitude. If the far end is shorted (\( Z_L=0 \) making \( \Gamma_L=-1 \)), the reflected wave is a negative copy of the incident one, which makes them cancel at the far end of the line.

Most loads are capacitive in digital ICs due to termination in transistor gates. Initially, a load capacitance, \( C_L \), looks like a short circuit with \( \Gamma_L = -1 \) when the incident wave reaches it. Thus, the incident and reflected waves initially cancel at the load. \( C_L \) is then charged at a rate dependent on the time constant \( \tau = Z_0 C_L \) for an \( RC \)-circuit, where \( R = Z_0 \) and \( C = C_L \). Once \( C_L \) has been fully charged, it will behave as an open circuit with \( \Gamma_L = 1 \). This reflection behavior only has to be considered when \( C_L \) is comparable to the total capacitance of the transmission line [7], which has not been the case for the long on-chip global buses implemented in this thesis.

Whenever a wave reflects off the far end of a transmission line, it eventually reaches the source and a second reflection occurs, this time determined by the source reflection coefficient, \( \Gamma_S = (Z_S-Z_0)/(Z_S+Z_0) \). To avoid uncontrolled wave bouncing and long propagation delays, the line must be terminated either at the source (series termination) or the destination (parallel termination), with an impedance matched to \( Z_0 \). A wave is fully absorbed in a matched termination (\( Z_{S,L}=Z_0 \) making \( \Gamma_{S,L}=0 \)) and no succeeding reflections can occur. Parallel termination results in stand-by current [2]. Series termination, with a driver output impedance matched to the line, is preferred in CMOS designs since the load is typically a pure gate capacitance, which can be approximated by an open termination according to the discussion above. Using microwave theory, Eq. 2.15 can be generalized to:

\[
\Gamma(x) = \frac{Z_{in}(x) - Z_0}{Z_{in}(x) + Z_0}
\]  

(2.16)

where \( \Gamma(x) \) and \( Z_{in}(x) \) are the reflection coefficient and input impedance at any distance \( x \) (defining \( x=0 \) at the load, and \( x=d \) at the driving source) from the load, respectively [8].
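The termination cases discussed above (open, shorted, matched) can be checked numerically. The following is a minimal sketch; the function name and the 50 Ω line impedance are illustrative assumptions, not values taken from the test chip.

```python
import math


def reflection_coefficient(z_term, z0):
    """Voltage reflection coefficient (Z - Z0) / (Z + Z0) for a termination Z."""
    if math.isinf(z_term):
        # Open circuit: limit of (Z - Z0) / (Z + Z0) as Z goes to infinity.
        return 1.0
    return (z_term - z0) / (z_term + z0)


Z0 = 50.0  # assumed characteristic impedance in ohms (illustrative only)

for name, z in [("open", math.inf), ("short", 0.0), ("matched", Z0)]:
    print(f"{name:8s} Gamma = {reflection_coefficient(z, Z0):+.1f}")
```

The same function applies to the source side by passing the driver output impedance as `z_term`.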

2.1.9 RC-domain and RLC-domain

In Section 2.1.5 and Section 2.1.6, we distinguish between the delay across an \( RC \)-domain wire and \( RLC \)-domain wire, respectively. Depending on the line parameters, the interconnect behavior may be \( RLC \)-line or \( RC \)-line dominated. The border between the two cases lies at a resistive attenuation of 50 %: a wire whose attenuation is smaller than this is \( RLC \)-dominated. This assumes that we measure the delay from the step launch time to the time at which the signal reaches 50 % of its final value at the far end. Therefore, a wire in $RLC$-domain must satisfy the following constraint from Eq. 2.9:

$$e^{-\frac{rd}{2Z_0}} > \frac{1}{2}$$

$$-\frac{rd}{2Z_0} > -\ln(2)$$

$$\frac{rd}{Z_0} < 2\ln(2)$$

(2.17)

where $rd=R$ is the total wire resistance and $Z_0$ the high-frequency characteristic impedance. Similarly, a wire in $RC$-domain is characterized by:

$$\frac{rd}{Z_0} > 2\ln(2) \approx 1.39$$

(2.18)
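As a quick check of Eq. 2.17 and Eq. 2.18, the sketch below classifies a wire from its total resistance and high-frequency characteristic impedance. The function names are hypothetical; the example values $Z_0=55\,\Omega$ and $R \approx 1.2Z_0$ are those quoted for the implemented bus in Section 2.2.

```python
import math

RLC_LIMIT = 2.0 * math.log(2.0)  # about 1.39, the border in Eq. 2.17 / Eq. 2.18


def wire_domain(r_total, z0):
    """Return "RLC" if R/Z0 < 2 ln(2), otherwise "RC"."""
    return "RLC" if r_total / z0 < RLC_LIMIT else "RC"


def first_wave_fraction(r_total, z0):
    """Fraction exp(-R / (2 Z0)) of the launched wave that survives the line."""
    return math.exp(-r_total / (2.0 * z0))


z0 = 55.0        # ohms
r = 1.2 * z0     # total wire resistance
print(wire_domain(r, z0), round(first_wave_fraction(r, z0), 3))
```

At the border itself, $R/Z_0 = 2\ln(2)$, the surviving fraction is exactly one half, which is the 50 % attenuation criterion stated above.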

2.1.10 Frequency Response for a General Signaling Link

A general signaling link consists of a source driver, a lossy transmission line, and a far-end termination, as shown in Figure 2.2. Since the transmission line has loss, it must now be described by its complex characteristic impedance, $Z_c$, and transfer function, $H$. The transmission line carries two waves, one in each direction. In Appendix A.4, we use circuit equations combined with the two waves on the line to derive the frequency response for a general driver-wire-termination link:

$$G = \frac{2Z_L H}{Z_L(1+H^2 + \frac{Z_s}{Z_c}(1-H^2)) + Z_c(1-H^2 + \frac{Z_s}{Z_c}(1+H^2))}$$

(2.19)
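Eq. 2.19 can be evaluated directly once $H$ and $Z_c$ are known at a given frequency. The sketch below is one possible implementation, not the code used in the thesis: it divides numerator and denominator by $Z_L$ so that an open far end can be passed as infinity. The function name and the test values are assumptions for illustration.

```python
import cmath
import math


def link_gain(h, z_c, z_s, z_l):
    """Eq. 2.19 with numerator and denominator divided by Z_L.

    h   : wire transfer function H at one frequency (complex)
    z_c : complex characteristic impedance
    z_s : source (driver) impedance
    z_l : load impedance; use math.inf for an open far end (must be non-zero)
    """
    a = 1 + h**2 + (z_s / z_c) * (1 - h**2)
    b = 1 - h**2 + (z_s / z_c) * (1 + h**2)
    return 2 * h / (a + (z_c / z_l) * b)


# Lossless line with an arbitrary phase delay, matched driver, open far end.
h = cmath.exp(-1j * 0.7)
print(link_gain(h, 50.0, 50.0, math.inf))  # equals h: the full step arrives
print(link_gain(h, 50.0, 50.0, 50.0))      # equals h / 2: matched at both ends
```

For a matched driver and an open far end on a lossless line the expression reduces to $G=H$, while matching both ends gives $G=H/2$, which agrees with the reflection discussion in the previous sections.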

The time domain step response, $s(t)$, for this general signaling link can be found through an inverse Fourier transform, $\text{ifft}(G \cdot P)$, where $P$ is the Fourier transform of a well-behaved input step function, $p(t)$, defined as in [1]. In Figure 2.3, we plot $s(t)$ together with the step response of the wire itself, $h(t) = \text{ifft}(H \cdot P)$, for the case of 50% loss with $Z_S=Z_0=50 \, \Omega$ and $Z_L=\infty$. $s(t)$ is markedly better behaved than $h(t)$, which shows a long tail originating from resistive loss (including skin effect). Such a long tail gives rise to intersymbol interference (ISI), i.e. a transmitted symbol interferes with subsequent symbols. In order to better understand this behavior, we plot the frequency response of $H$, $1/(1+H^2+(1-H^2)Z_S/Z_c)$, and $Z_c/Z_0$ in Figure 2.4. In the low-frequency band, $1/(1+H^2+(1-H^2)Z_S/Z_c)$ rises with frequency due to the large value of $Z_c/Z_0$. Also, at high frequencies $1/(1+H^2+(1-H^2)Z_S/Z_c)$ falls much more slowly than $H$. Thus, the long tail in $h(t)$ is effectively mitigated by an enhancement of the voltage at the input end of the line due to the difference between the actual characteristic impedance of the wire, $Z_c$, and its high frequency value, $Z_0$. This is a very important effect, which we utilize in this work, since the transfer function, $G$, of the full signaling link predicts better wire delays and data rates (for a given eye-opening) compared to the simple theory based on $H$ only.

Figure 2.3: Theoretical step response, $s(t)$, of the full-equation transfer function, $G$, for an open wire driven by a matched driver, compared to the step response of the wire transfer function, $H$, alone.

Figure 2.4: Magnitudes of $H$, $1/(1+H^2+(1-H^2)Z_S/Z_c)$, and $Z_c/Z_0$ versus frequency.

Figure 2.5: Worst-case eye opening set to 64%.

2.1.11 Simulations of Wire Capacity

To explore wire performance, we define wire delay, $t_d$, as the time from the launch of an input step to the time at which $s(t) = 0.5s(t)_{\text{max}}$, where $s(t)_{\text{max}}$ is the asymptotic value of the step response. $s(t)_{\text{max}}$ equals 1 for an open far-end wire and $Z_0/(2Z_0+R)$, where $R$ is the total wire resistance, for a terminated case. The maximum bit rate achievable on a single wire is limited by the step response. If bits are transmitted too close to each other, the finite rise (or fall) time can make it impossible to detect a clean “one” or “zero” at the far end of the line. An eye-diagram is an overlay of all possible cycle waveforms aligned to a common timing reference. The size of the eye-opening at the center of the eye-diagram indicates the margin available to safely detect the signal. The vertical eye-opening, $EO_v$, is given by $EO_v=2s(T)-1$, where $T$ is the symbol time. A worst-case eye-opening is produced when a single “one” follows a long burst of “zeros”, and a single “zero” is transmitted after many “ones”, as shown in Figure 2.5. Previous publications have used a 64 % eye-opening as the criterion to ensure clean data detection, which corresponds to $s(T)=0.82$ [9][10]. When $s(T)=0.5$, the eye is completely closed ($EO_v=0$). In an environment where the eye-opening tends to close in the vertical dimension, the data rate is classically estimated from the time difference, $t_{64v}$, taken from $s(T)=0$ until $s(T)=0.82$. This results in a maximum data rate of $B_{64v}=1/t_{64v}$. Note that this approach only points out the maximum achievable bit rate of a wire under the assumption of no interaction between neighboring wires. On the other hand, the coupling between adjacent wires can lead to a situation...
As a demonstration of key-concepts, we investigate the properties of a copper microstrip ($\rho_C=1.67\cdot10^{-8}\Omega\text{m}$) with a driver matched to the line ($Z_S=Z_0$) and an open far end ($Z_L=\infty$). A homogeneous silicon dioxide dielectric with $\varepsilon_r=3.9$ surrounds the conductor. Eq. 1.4 is utilized to calculate the wire capacitance, $c$. The relation $Z_0=\sqrt{\mu\varepsilon}/c$ together with the definition $Z_0=\sqrt{l/c}$ gives an approximation of the wire inductance, $l$. In Figure 2.6a we plot the signal velocity extracted from the step response for a series of wires with length $d=2-10$ mm and width $w=0.5-6$ μm with an open far end. The wire thickness and dielectric thickness are $h=1$ μm and $t_{ox}=2$ μm, respectively. Figure 2.6b plots the $R/Z_0$-ratio, indicating $RLC$-domain for a value smaller than 1.39 and $RC$-domain for a value larger than 1.39, according to Eq. 2.18. For $RLC$-wires, the velocity approaches the velocity of light ($v_{max}=1.5\cdot10^8$ m/s) for the chosen dielectric. The velocity rapidly decreases (loss increases) for wires entering the $RC$-domain. On the borderline between $RLC$- and $RC$-domain, we especially note that the effective velocity is reduced to around half of $v_{max}$. Figure 2.7 shows the corresponding achievable data rates for 64% eye-opening. The maximum data rate starts at very large values for short wires and decreases with wire length. A 1 cm long microstrip of width 0.5 μm and 6 μm can still carry a data rate of 3.2 Gb/s and 12.2 Gb/s, respectively.
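Two of the numbers used in this section follow from one-line relations: the vertical eye-opening $EO_v=2s(T)-1$ and the velocity limit $c_0/\sqrt{\varepsilon_r}$ in a homogeneous dielectric. The sketch below reproduces them; the function names are illustrative assumptions.

```python
import math

C0 = 299_792_458.0  # velocity of light in vacuum, m/s


def vertical_eye_opening(s_at_T):
    """EO_v = 2 s(T) - 1 for a normalized step response sampled at the symbol time T."""
    return 2.0 * s_at_T - 1.0


def max_velocity(eps_r):
    """Velocity limit c0 / sqrt(eps_r) for a homogeneous dielectric."""
    return C0 / math.sqrt(eps_r)


print(round(vertical_eye_opening(0.82), 2))  # the 64 % criterion
print(vertical_eye_opening(0.5))             # completely closed eye
print(f"{max_velocity(3.9):.2e} m/s")        # silicon dioxide, eps_r = 3.9
```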
2.2 Delay Measurements

In Paper 7, we implement a velocity-of-light limited, 5 mm long, repeaterless global on-chip bus in a standard 0.18 \( \mu \text{m} \) CMOS process [11]. Figure 2.8 shows the cross section for the bus where transmission line-style interconnects are achieved by routing the signal wires in the thicker top metal M6 layer, and by utilizing metal M4 as ground return plane. Grounded shield wires are inserted between the signal wires to minimize their mutual capacitance and limit the worst-case victim crosstalk amplitude to 175 mV. The grounded return plane under the bus also acts as a shielding layer, which reduces the inter-wire capacitance between interconnect layers. In this configuration, the signal wires are designed for a characteristic impedance of \( Z_0 = 55 \Omega \) and \( R \approx 1.2Z_0 \), just enough to push the interconnect into transmission line domain.

Measurements of this interconnect geometry resulted in a nominal wire delay of 52.8 ps, having both neighbors quiet. This latency corresponds to a signal velocity of $0.95 \times 10^8$ m/s or $0.32c_0$ ($c_0=$velocity of light in vacuum), which is 64% of the maximum possible effective velocity $v=c_0\varepsilon_r^{-0.5} \approx 0.5c_0$. We thus conclude that on-chip transmission line global interconnects, achieving near velocity-of-light delay, are feasible and can be implemented with reasonable wire dimensions. Table 2.1 shows the latency benefits of the implemented interconnect compared to other published experiments.

<table>
<thead>
<tr>
<th>Technology</th>
<th>Signal layer</th>
<th>Cross area</th>
<th>Length</th>
<th>Delay</th>
<th>Velocity</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.18 $\mu$m (this work)</td>
<td>M6 over M4</td>
<td>$1.04 \cdot 2 \ \mu m^2$</td>
<td>5 mm</td>
<td>52.8 ps</td>
<td>$0.32c_0$</td>
</tr>
<tr>
<td>0.25 $\mu$m [12]</td>
<td>M4 over M3</td>
<td>$0.78 \cdot 1.33 \ \mu m^2$</td>
<td>10 mm</td>
<td>260 ps</td>
<td>$0.13c_0$</td>
</tr>
<tr>
<td>0.25 $\mu$m [12]</td>
<td>M1 over poly</td>
<td>$0.32 \cdot 0.64 \ \mu m^2$</td>
<td>10 mm</td>
<td>2300 ps</td>
<td>$0.015c_0$</td>
</tr>
</tbody>
</table>

Table 2.1: Interconnect delay comparison.
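The velocity column of Table 2.1 follows from length divided by delay, normalized to $c_0$. A minimal sketch that recomputes it from the length and delay columns (the row labels and function name are written here for illustration):

```python
C0 = 299_792_458.0  # velocity of light in vacuum, m/s


def velocity(length_m, delay_s):
    """Effective signal velocity for a wire of given length and measured delay."""
    return length_m / delay_s


rows = [
    ("0.18 um, M6 over M4 (this work)", 5e-3, 52.8e-12),
    ("0.25 um, M4 over M3 [12]", 10e-3, 260e-12),
    ("0.25 um, M1 over poly [12]", 10e-3, 2300e-12),
]

for label, length, delay in rows:
    v = velocity(length, delay)
    print(f"{label:34s} v = {v:.2e} m/s = {v / C0:.3f} c0")
```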

References



Chapter 3

Interconnect Models

3.1 Problematic Frequency Components

The “ones” and “zeros” in a digital signal are carried by electrical waveforms. A “one” is safely detected if the signal is above some high threshold level, while a “zero” must be below some low threshold. A digital signal can thus be approximated by a trapezoidal waveform, where $T$ is the signal period and $t_r$, $t_f$ are the 10% to 90% rise and fall times, respectively. Fast rise and fall times reduce the timing uncertainty between a “one” and “zero”. The quickest switching on-chip signal is usually the clock signal, which in an ideal case would be represented by a square waveform with 50% duty cycle. Such an ideal digital signal, with clean and sharp edges, is composed of an infinite number of sinusoidal components. In general, the Fourier expansion of an ideal periodic square wave, $v(t)$, with 50% duty cycle is given by:

$$v(t) = \frac{2}{\pi} \sum_{n=1,3,5,\ldots} \frac{1}{n} \sin(2\pi n f t)$$

(3.1)

where $f$ is the signal frequency and $t$ is time. Hence, a 1 GHz clock signal is composed of not only a fundamental harmonic at 1 GHz, but also harmonics at 3 GHz, 5 GHz, 7 GHz, etc. Even harmonics (at 2 GHz, 4 GHz, etc. for the 1 GHz clock) are found in the signal spectrum if the duty cycle deviates from 50%. The magnitude of a harmonic decreases as its frequency increases. Each of the harmonics in a digital waveform is attenuated by the skin effect, which thus degrades the edge rate and reduces the signal amplitude. There are several “schools” with different approaches when it comes to estimating the critical frequency below which the dominating spectral energy is located. One rule of thumb considers the spectral components up to 5 times the fundamental frequency. However, this rule of thumb is pessimistic since it does not consider the edge rate at all. Figure 3.1 shows an example of the spectral components in a 1 GHz ideal square wave with rise/fall times of $t_{r,f}=60$ ps. A closer look at the signal spectrum shows that the magnitude of the frequency components rolls off by around -20 dB/decade up to some critical frequency, $f_{\text{crit}}$. Beyond $f_{\text{crit}}$, the frequency components are attenuated faster, typically -40 dB/decade. In [1], $f_{\text{crit}}$ is estimated as $f_{\text{crit},1}=0.5/t_{r,f}$. On the other hand, for a perfect step function driven through a network with a time constant $\tau$, [2] arrives at the critical frequency $f_{\text{crit},2}=0.35/t_{r,f}$. We have used a standard 0.18 $\mu$m CMOS process for the test chip implementations in this thesis [3]. In this process, simulations of the on-chip clock signals never have rise and fall times faster than 60 ps. The fastest clock signal we have utilized in an implementation was running at 3 GHz. According to the pessimistic rule of thumb, we should then consider frequency components up to $f_{\text{crit},0}=15$ GHz. If we instead consider a fastest rise time of $t_r=60$ ps, we arrive at $f_{\text{crit},1}=8.3$ GHz and $f_{\text{crit},2}=5.8$ GHz, respectively. According to Eq. 1.6, these frequency components correspond to a skin depth of $\delta(f_{\text{crit},0})=0.74$ $\mu$m, $\delta(f_{\text{crit},1})=0.99$ $\mu$m, and $\delta(f_{\text{crit},2})=1.18$ $\mu$m, respectively, in the utilized process. However, these skin depths are only valid for the weakest frequency component and since the thickest metal layer is 0.92 $\mu$m thick (metal5 and metal6), we consider the skin effect to have only minor impact in the utilized technology.
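The three critical-frequency estimates and the corresponding skin depths can be reproduced with the sketch below. It uses the standard skin-depth expression $\delta=\sqrt{\rho/(\pi f \mu_0)}$. The metal resistivity is not stated in this section, so the value $3.2\cdot10^{-8}\,\Omega\text{m}$ is an assumption chosen here because it gives results close to the quoted skin depths; it should not be read as the process value.

```python
import math

MU0 = 4.0e-7 * math.pi   # permeability of free space, H/m
RHO_ASSUMED = 3.2e-8     # ohm*m, ASSUMED resistivity (not given in the text)


def f_crit_rule_of_thumb(f_clock):
    """Pessimistic estimate: 5 times the fundamental frequency."""
    return 5.0 * f_clock


def f_crit_half(t_edge):
    """Estimate 0.5 / t_r, as attributed to [1]."""
    return 0.5 / t_edge


def f_crit_035(t_edge):
    """Estimate 0.35 / t_r, as attributed to [2]."""
    return 0.35 / t_edge


def skin_depth(freq, resistivity):
    """Skin depth sqrt(rho / (pi * f * mu0)) for a non-magnetic conductor."""
    return math.sqrt(resistivity / (math.pi * freq * MU0))


t_r = 60e-12  # fastest simulated rise time, s
for f in (f_crit_rule_of_thumb(3e9), f_crit_half(t_r), f_crit_035(t_r)):
    delta_um = skin_depth(f, RHO_ASSUMED) * 1e6
    print(f"f_crit = {f / 1e9:5.1f} GHz  skin depth = {delta_um:.2f} um")
```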
3.2 Distributed Interconnect Models

The choice of interconnect model depends on the conductor properties, wire length, and signal rise times. Short interconnects with a small resistive component driven at low frequencies can be approximated by a lumped capacitance. These conductors are physically small enough so that all points along the wire react together in response to an input signal. Delay is not captured in this type of simple wire model, which effectively behaves as an equipotential region. As the interconnect becomes longer, one cannot neglect the resistance and a lumped $RC$-model is a more suitable wire description. A ladder network of multiple $RC$-sections is a better approximation compared to the pessimistic results obtained by using just one $RC$-section. For long low-resistive interconnects switching at high frequencies with fast rise times, inductance starts to dominate and transmission line effects must be considered. Eq. 2.17 is not the only requirement for transmission line behavior. A rule of thumb is that transmission line effects must be considered when the input signal rise time, $t_r$, is smaller than 2.5 times the propagation delay across the wire [4]. In [1] it is recommended that if the wiring is longer than one sixth of the effective length of the rising edges, distributed transmission line models must be utilized. According to [2], the number of $RLCG$-segments in a distributed transmission line model should be:

$$\text{segments} \geq 10 \left( \frac{d}{\nu t_r} \right)$$

(3.2)

where $d$ is the wire length, $\nu$ is the propagation velocity, and $t_r$ is the signal rise time. In Paper 7 we describe a chip implementation of a $d=5$ mm long velocity-of-light limited on-chip bus with a measured propagation velocity of $\nu=0.95\times10^8$ m/s. According to Eq. 3.2, this bus must be described by at least 9 transmission line segments considering a fastest rise time of $t_r=60$ ps. The simulation model for that implementation was designed using one transmission line section for every 50 $\mu$m interconnect, which thus sums up to a total of 100 sections in the model, fulfilling the constraint in Eq. 3.2.
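The segment count quoted above can be verified with a short calculation. The sketch below applies Eq. 3.2 to the bus parameters given in the text; the function name is an illustrative assumption.

```python
import math


def min_segments(length_m, velocity, t_rise):
    """Smallest integer number of segments satisfying Eq. 3.2."""
    return math.ceil(10.0 * length_m / (velocity * t_rise))


d = 5e-3        # wire length, m
nu = 0.95e8     # measured propagation velocity, m/s
t_r = 60e-12    # fastest rise time, s

required = min_segments(d, nu, t_r)
used = round(d / 50e-6)  # one section per 50 um of interconnect
print(required, used, used >= required)
```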

3.3 Field Solvers

The simplified formulas for calculating wire resistance, inductance, and capacitance, discussed in Chapter 1, will produce inaccurate results for densely-packed multi-layer interconnect structures. In a practical VLSI-implementation, there is a need for an interconnect model which can be utilized in a circuit simulator. Field solvers have become available for parasitic extraction of interconnect geometries. A field solver calculates all self and mutual wire parasitics by modeling the electromagnetic interaction between all conductors in an interconnect hierarchy. There are mainly two types of field solvers available. The most advanced ones are so-called 3D or “full-wave” field solvers, which account for nearly all electromagnetic phenomena. These tools solve Maxwell’s equations directly for any type of complex three-dimensional geometry, including vias and connectors. However, the 3D field solvers are typically complicated to use, take a very long time to complete an extraction, and usually produce an output in the form of S-parameters, which is not so useful in circuit simulators for digital circuits.

A second category of field solvers is the so-called 2D field solvers. This type of field solver takes an interconnect cross sectional geometry as input and calculates matrices that describe not only the resistance (including skin-effect) for each line, but also the inductive and capacitive coupling between every pair of lines. The extracted RLC-values are given per unit length. An important advantage of 2D field solvers is that they are easy to use and perform fast calculations.

3.4 From Field Solver to Simulation Model

We have used the 2D electromagnetic field solver available in the HSPICE circuit simulator (version 2003.3) [5]. This field solver is based on an improved version of the boundary-element method [6], and the filament method also implemented in the Raphael extraction tool [7]. In the filament method, the original conductor is divided into a number of parallel filaments, each with a smaller cross section than the original wire. In this way, high-frequency current redistribution can be accounted for by assuming that the current flow is approximately uniform in a filament. The HSPICE field solver extracts transmission line matrices for conductor resistance ($R$), skin-effect resistance ($R_s$), inductance ($L$), and capacitance ($C$), from the geometry and dimensions of the interconnects and dielectrics.

HSPICE includes a frequency dependent, lossy multi-conductor transmission
line model through the so called W-element. The properties for the W-element are specified through the built-in field solver. Internally, the W-element uses all matrices \((R, R_s, L, C)\) of field solver extracted parasitics. In Paper 5, the behavior of the W-element is compared to two alternative interconnect representations. The first alternative is a distributed interconnect model of discrete \(RLC\)-sections as shown in Figure 3.2. Each conductor has not only a capacitance to the substrate, but also an inter-wire capacitance and mutual inductance to each of the other conductors present in the structure. The second alternative is a cascaded number of discrete \(RC\)-sections (\(\pi\)-configuration) as shown in Figure 3.3. In the \(RC\)-model, the coupling capacitance to each of the other conductors in the structure remains, while the self- and mutual inductances are excluded. The values of \(R, L, C\) for the two latter models were obtained from the same subset of extracted parasitic matrices directly utilized in the W-element model. None of the alternative representations in Figure 3.2 and Figure 3.3 made use of the \(R_s\) matrix.

For the comparison in Paper 5, we investigate 5-bit buses of length 0.5-3 mm in a 0.18 \(\mu\)m CMOS-process. We insert one \(RLC\)- and \(RC\)-section per 50 \(\mu\)m interconnect into the \(RLC\)- and \(RC\)-model, respectively. Simulations show that the W-element and \(RLC\)-network never diverge by more than 2% in overshoot, 12.4% in ground noise, 8.9% in crosstalk, and 5.6% in edge rate predictions. A corresponding comparison between the W-element and the \(RC\)-representation displays much larger model discrepancies. Propagation delay is bit-pattern dependent. Worst-case delay simulations for the center wire in the 5-bit bus show negligible difference between all three models. The results for best-case delay match perfectly in a comparison between the W-element and \(RLC\)-model, while the \(RC\)-model gives 8.5% longer delay. In view of these results, we have relied on the design flow shown in Figure 3.4 to create interconnect circuit models which can be used for test chip design in the Cadence circuit simulator. The process documentation is utilized to create a custom technology file, which models the target 0.18 μm CMOS process, for the HSPICE field solver. The technology file contains information about the metal layers (vertical position, thickness, resistivity) and isolating dielectric layers (vertical position, thickness, dielectric constant). The HSPICE field solver takes the technology file together with a geometrical description of the interconnect structure (metal layer, width, and horizontal position for each wire) and calculates $R$, $R_s$, $L$, and $C$-matrices of extracted wire parasitics. Neglecting the $R_s$ matrix, a custom Perl-script utilizes the field solver extracted wire properties to automatically create an $RLC$-structured VerilogA interconnect model. This VerilogA-model can be simulated in the Cadence circuit simulator together with the transistor models provided by the chip vendor.

Figure 3.4: Design flow to create interconnect model.

References


Chapter 4

Crosstalk

When a signal travels down a transmission line, the electric and magnetic field energy can couple to other nearby conductors. This crosstalk coupling occurs directly through the mutual capacitance and mutual inductance between wires, and indirectly through the impedance of shared return paths. Crosstalk acts as a noise source, which reduces noise margins and can cause signal integrity problems. On-chip, one has traditionally only considered capacitive crosstalk through the Miller-effect on inter-wire capacitances. In this chapter, we experimentally show that inductive coupling effects are not negligible and will become an issue as switching frequencies increase to multi-GHz rates. To get a full understanding of all crosstalk interactions is a complicated task even for simple interconnect structures. This chapter starts with a description of basic crosstalk mechanisms and how they affect the transmission line properties.

4.1 Crosstalk Mechanisms

Consider the two symmetrical and coupled transmission lines (A and B) in Figure 4.1. Line A (aggressor) is driven at the near-end and terminated in its characteristic impedance, $Z_0$, at the far-end. Line B (victim), is terminated in its characteristic impedance at both ends. When a voltage step is driven onto line A, it starts to move from its near end, $P_1$, to its far end, $P_2$. At each point along the line, a fraction of the signal is coupled from the aggressor to the victim line and starts to move towards both victim ends, $P_3$ and $P_4$. The mutual capacitance couples the time derivative of voltage:

$$\frac{\partial V_B(x,t)}{\partial t} = k_{cx} \frac{\partial V_A(x,t)}{\partial t}$$

$$k_{cx} = \frac{c_m}{c_0 + c_m}$$

(4.1)

where $k_{cx}$ is the capacitive coupling coefficient including the mutual capacitance between the wires, $c_m$, and the wire capacitance to ground, $c_0$, per unit length. Thus, a positive time derivative of voltage on Line A, induces a positive forward traveling wave, and a positive reverse traveling wave on Line B at the point of capacitive coupling. Similarly, mutual inductance couples the spatial derivative of voltage:

$$\frac{\partial V_B(x, t)}{\partial x} = k_{lx} \frac{\partial V_A(x, t)}{\partial x}$$

$$k_{lx} = \frac{m}{l}$$

(4.2)

where $k_{lx}$ is the inductive coupling coefficient including the mutual inductance between the wires, $m$, and the wire self inductance, $l$, per unit length. The relation between spatial and time derivatives of waves is given by:

$$\frac{\partial V_f(x, t)}{\partial t} = -\nu \frac{\partial V_f(x, t)}{\partial x}$$

$$\frac{\partial V_r(x, t)}{\partial t} = \nu \frac{\partial V_r(x, t)}{\partial x}$$

(4.3)

where $\nu$ is the propagation velocity and $V_f$, $V_r$ are a forward and reverse traveling wave, respectively. Therefore, a positive spatial derivative of voltage on Line A induces a negative forward traveling wave and a positive reverse traveling wave on Line B at the point of inductive coupling, according to Eq. 4.2 and Eq. 4.3. A total forward wave (far-end crosstalk) and total reverse wave (near-end crosstalk) are obtained by superposition of the forward and reverse waves induced by capacitive and inductive coupling. In [1], it is shown that the total forward coupling coefficient, $k_{fx}$, and total reverse coupling coefficient, $k_{rx}$, are given by:

$$
k_{fx} = \frac{k_{cx} - k_{lx}}{2}
$$

$$
k_{rx} = \frac{k_{cx} + k_{lx}}{4}
$$

(4.4)

As an edge with rise time $t_r$ propagates towards the far-end on the aggressor line, it
continuously couples energy into the victim wire. The forward (far-end) crosstalk
wave moves side-by-side with the edge on the aggressor line and reaches its end
after a line delay, $t_d$, where it is absorbed in the termination impedance during
a time of $\approx t_r$. The reverse (near-end) crosstalk wave starts at the aggressor
edge and moves towards the near-end of the victim line during a total time of
$2t_d$ (round trip from $P_3$ to $P_4$ and back again)[2]. For interconnects surrounded
by a homogeneous dielectric, $k_{cx} = k_{lx}$, which cancels the far-end crosstalk, since
$k_{fx} = 0$. However, if the near-end has an unmatched termination, reflections of
near-end crosstalk can still become a far-end problem. In a non-homogeneous
environment (most VLSI implementations), there is always both near- and far-
end crosstalk.
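The coupling coefficients of Eq. 4.1, Eq. 4.2 and the forward and reverse coefficients above can be combined in a few lines. The sketch below uses made-up per-unit-length values (chosen only so that $k_{cx}=k_{lx}$, as in a homogeneous dielectric) to show the far-end cancellation; none of the numbers come from the implemented bus.

```python
def coupling_coefficients(c_m, c_0, m, l):
    """Return (k_cx, k_lx, k_fx, k_rx) from per-unit-length line parameters.

    c_m : mutual capacitance, c_0 : capacitance to ground
    m   : mutual inductance,  l   : self inductance
    """
    k_cx = c_m / (c_0 + c_m)      # capacitive coupling coefficient
    k_lx = m / l                  # inductive coupling coefficient
    k_fx = (k_cx - k_lx) / 2.0    # forward (far-end) coupling
    k_rx = (k_cx + k_lx) / 4.0    # reverse (near-end) coupling
    return k_cx, k_lx, k_fx, k_rx


# Hypothetical normalized values giving k_cx = k_lx = 0.25.
k_cx, k_lx, k_fx, k_rx = coupling_coefficients(c_m=1.0, c_0=3.0, m=1.0, l=4.0)
print(k_cx, k_lx, k_fx, k_rx)
```

With equal capacitive and inductive coefficients the forward term vanishes while the reverse term remains, which is the homogeneous-dielectric case described above.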

4.2 Line Parameter Variations

Crosstalk causes variations in the transmission line parameters, which in turn affect the effective characteristic impedance and signal propagation delay. The simplified circuit model of two coupled transmission lines (A and B) in Figure 4.2 can be used to derive first order equations describing the principle of data-dependent inductance and capacitance. Kirchhoff's voltage law gives:

$$
V_{A1} = L \frac{dI_A}{dt} + M \frac{dI_B}{dt}
$$

$$
V_{B1} = L \frac{dI_B}{dt} + M \frac{dI_A}{dt}
$$

(4.5)

Similarly, Kirchhoff's current law yields:

$$
I_A = C_0 \frac{dV_{A2}}{dt} + C_m \frac{d(V_{A2} - V_{B2})}{dt}
$$

$$
I_B = C_0 \frac{dV_{B2}}{dt} + C_m \frac{d(V_{B2} - V_{A2})}{dt}
$$

(4.6)

For “odd-mode” propagation, the signal injected into line A transitions in a direction opposite to the signal injected into line B making $I_A = -I_B$ and $V_{A2} = -V_{B2}$. Using this in Eq. 4.5 and Eq. 4.6 results in:

$$V_{A1,B1} = (L - M) \frac{dI_{A,B}}{dt}$$
$$I_{A,B} = (C_0 + 2C_m) \frac{dV_{A2,B2}}{dt}$$

Thus, the effective odd-mode inductance and capacitance on each line is $L_{odd} = (L - M)$ and $C_{odd} = (C_0 + 2C_m)$, respectively. Similarly, for “even-mode” propagation, the signal injected into line A transitions in the same direction as the signal injected into line B, which makes $I_A = I_B$ and $V_{A2} = V_{B2}$. Inserting this into Eq. 4.5 and Eq. 4.6 gives:

$$V_{A1,B1} = (L + M) \frac{dI_{A,B}}{dt}$$
$$I_{A,B} = C_0 \frac{dV_{A2,B2}}{dt}$$

Thus, the effective even-mode inductance and capacitance on each line is $L_{even} = (L + M)$ and $C_{even} = C_0$, respectively. This results in not only data-dependent characteristic impedance, $Z_0$, but also variations in signal velocity $\nu$. Using the high-frequency version of Eq. 2.3 for $Z_0$, and Eq. 2.12 for $\nu$, Table 4.1 summarizes the data-dependent characteristic impedance and propagation velocity trends for two coupled transmission lines.
<table>
<thead>
<tr>
<th>Data Pattern</th>
<th>\(Z_0\)</th>
<th>\(\nu\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>(↓↑) or (↑↓)</td>
<td>\(\sqrt{\frac{L-M}{C_0+2C_m}}\)</td>
<td>\(\frac{1}{\sqrt{(L-M)(C_0+2C_m)}}\)</td>
</tr>
<tr>
<td>(↓↓) or (↑↑)</td>
<td>\(\sqrt{\frac{L+M}{C_0}}\)</td>
<td>\(\frac{1}{\sqrt{(L+M)C_0}}\)</td>
</tr>
</tbody>
</table>

Table 4.1: Trends for data-dependent characteristic impedance, \(Z_0\), and propagation velocity, \(\nu\), for two coupled transmission lines.
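The trends in Table 4.1 can be explored numerically. The sketch below evaluates the odd- and even-mode expressions for two hypothetical cases, one with only inductive coupling and one with only capacitive coupling, to show that the sign of the velocity change depends on which coupling dominates. All parameter values are illustrative assumptions, not extracted parasitics.

```python
import math


def mode_parameters(L, M, C0, Cm):
    """Return ((Z_odd, v_odd), (Z_even, v_even)) following Table 4.1."""
    l_odd, c_odd = L - M, C0 + 2.0 * Cm
    l_even, c_even = L + M, C0
    odd = (math.sqrt(l_odd / c_odd), 1.0 / math.sqrt(l_odd * c_odd))
    even = (math.sqrt(l_even / c_even), 1.0 / math.sqrt(l_even * c_even))
    return odd, even


L, C0 = 4.0e-7, 1.2e-10  # hypothetical H/m and F/m

# Inductive coupling only: the odd mode (opposite switching) is faster.
(_, v_odd), (_, v_even) = mode_parameters(L, 0.2 * L, C0, 0.0)
print("inductive :", v_odd > v_even)

# Capacitive coupling only: the odd mode is slower (classical Miller effect).
(_, v_odd), (_, v_even) = mode_parameters(L, 0.0, C0, 0.2 * C0)
print("capacitive:", v_odd < v_even)
```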

When the electric and magnetic fields around the conductors are contained within a single homogeneous dielectric, the \(LC\)-product remains constant, as already described in Eq. 1.14. Also, in this special case, no forward-traveling crosstalk wave is generated as described in Section 4.1. Even though these interconnects suffer from impedance variations, their velocity variation is negligible. However, transmission lines in a non-homogeneous environment have electric and magnetic fields which penetrate through materials with different dielectric constants. The effective dielectric constant, experienced by the conductors, changes depending on the field densities, which makes the \(LC\)-product data-dependent. This gives variations in both characteristic impedance and propagation velocity, which can cause timing- and signal integrity problems.

4.3 Measured Crosstalk-Induced Delay Variations

In Paper 7, we implement a velocity-of-light limited, 5 mm long, repeaterless global on-chip bus in a standard 0.18 \(\mu\)m CMOS process [3]. The cross-section of the implemented bus has already been shown in Figure 2.8 where transmission line-style interconnects are achieved by routing the signal wires in the thicker top metal M6 layer and utilizing metal M4 as ground return plane. The 2D electromagnetic field solver available in HSPICE is employed to extract wire parasitics according to the method discussed in Chapter 3. Grounded shields are inserted between the signal lines to minimize their mutual capacitance. This turns the inter-wire capacitance between adjacent conductors on the same level into a capacitance to ground. Classically, this type of full shielding is employed to eliminate delay variations and to reduce the worst-case crosstalk amplitude. The grounded return plane routed in metal M4 under the bus effectively acts as a shielding layer, which reduces the inter-wire capacitance between interconnect layers. In this configuration, the signal wires are designed for a characteristic impedance of \(Z_0=55\Omega\) and \(R \approx 1.2Z_0\), just enough to push the interconnect into transmission line domain.

The cross-section of the 0.18 \(\mu\)m process has different dielectric constants for various interconnect levels. Also, the top metal M6 is located close to the
passivation oxide and nitride layers, which have dielectric constants that further contribute to a non-homogeneous environment for the signal wires. Since the shields and ground return plane minimize the mutual capacitance between the wires, the effective capacitance on each signal line should be approximately independent of data-pattern. However, long-range inductive coupling remains and dominates over any remaining capacitive coupling. Therefore, the velocity should increase for an odd-mode switching pattern and decrease for an even-mode pattern, according to the propagation velocity trends in Table 4.1. Thus, any impact of dominating inductive coupling should result in delay variations which have a sign opposite to what would be obtained through classical Miller-effect, triggered by dominating capacitive coupling. Figure 4.3 and Figure 4.4 show the measured and simulated wire delay vs. switching pattern, respectively. The simulations of
the field solver extracted interconnect, including layout-extracted parasitics for the circuits, resulted in a nominal wire delay of 40.1 ps, having both neighbors quiet. The corresponding measured delay is 52.8 ps, which is 32 % longer than predicted by the simulator. Inductive coupling is dominating and 22 % shorter (14 % longer) delays, compared to the nominal case, were measured when both neighbors switched in the opposite (same) direction as the victim. This trend is also observed in the simulated results, which show 11 % shorter and 13 % longer delays than nominal.

The ground plane acts as an excellent capacitive and inductive return path and also reduces the effective current loop for the signal lines, which lowers the inductance. However, despite full shielding and the ground return plane, the measurements show that inductive coupling effects may become non-negligible. These crosstalk-induced delay variations through inductive coupling can cause timing issues as on-chip global buses become longer and switching frequencies reach multi-GHz rates. Either capacitive or inductive coupling dominates and the two have opposite sign. Similar delay variation effects have previously been observed in simulation by [4] for a purely coplanar bus structure.

### 4.4 Crosstalk Effects on Latencies and Data Rates

If no equalization techniques are utilized, the maximum achievable data rate of a wire is limited by the size of the eye-opening of an eye-diagram taken at the input of a receiver. In Section 2.1.11, we required the wire step response, \( s(T) \), to be equal to some value, \( s_1 \), for a certain symbol time, \( T \), to achieve a certain data rate, \( B = 1/T \). In [5] [6], this method is utilized to derive expressions for the bit rate capacity of coaxial cables and is further extended to strip line interconnect in [7]. The result of these analyses is that the bit rate limit for an interconnect of length \( d \) with a cross sectional area \( A \) is given by:

\[
B = B_0 \frac{A}{d^2}
\]

where \( B_0 \) is a material constant, which is different for RC- and RLC-lines and also depends on the size of the eye-opening. Hence, one large conductor with cross sectional area \( A \) has the same capacity as several small wires with the same total cross sectional area. This is because each of the smaller wires must operate at a lower data rate as the step response has a proportionally slower rise time. However, these models assume that the eye closes only in the vertical dimension, since delay variations (due to crosstalk) are not taken into account. Figure 4.5a shows the eye-diagram for a wire in a general bus structure. The eye-opening closes not only in the vertical dimension, due to AC-voltage noise, but also in the horizontal dimension as a result of AC-timing noise or jitter. When specifying an eye-opening, the amount of voltage margin available to detect the signal can be traded against available timing margin. Figure 4.5b shows a simplified eye-diagram where $t_\delta$ is the time difference between the early and the late waveform at the input of the receiver. The horizontal eye-opening, $EO_h$, is given by $EO_h = T - t_\delta$.

In Paper 8, we investigate the effects of capacitive crosstalk on latencies, data rate, and power dissipation for a 6 mm long repeater-inserted 3-bit global bus in 0.18 μm CMOS [3]. The signal wires are $RC$-dominated and routed in metal1, metal2, or metal6. Each wire has minimum width and minimum separation to its nearest neighbor, which makes arrival time the dominating mechanism that closes the eye. Just as with vertical eye-opening, treated in Section 2.1.11, we can require the horizontal eye-opening to have a certain size. For 50% horizontal eye-opening ($EO_h=0.5T$), the symbol time can be expressed as:

$$EO_h = T - t_\delta = 0.5T$$

$$T = 2t_\delta$$

In this particular case, the achievable bit rate, $B_{50h}$, is estimated as:

$$B_{50h} = \frac{1}{T} = \frac{1}{2t_\delta}$$
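As a minimal sketch of the relation above, the following snippet computes the symbol time and bit rate for a required horizontal eye-opening. The generalization to an arbitrary opening fraction $k$ (solving $T - t_\delta = kT$ for $T$) and the 100 ps arrival-time spread are assumptions made for illustration; only the 50% case is taken from the text.

```python
def symbol_time(t_delta, eye_fraction=0.5):
    """Symbol time T for which EO_h = T - t_delta equals eye_fraction * T."""
    if not 0.0 <= eye_fraction < 1.0:
        raise ValueError("eye_fraction must be in [0, 1)")
    return t_delta / (1.0 - eye_fraction)


def bit_rate(t_delta, eye_fraction=0.5):
    """Achievable bit rate B = 1/T for the requested horizontal eye-opening."""
    return 1.0 / symbol_time(t_delta, eye_fraction)


t_delta = 100e-12  # hypothetical early/late arrival-time spread of 100 ps
print(f"T     = {symbol_time(t_delta) * 1e12:.0f} ps")
print(f"B_50h = {bit_rate(t_delta) / 1e9:.1f} Gb/s")
```

For the 50% case the function reduces to $T = 2t_\delta$, so halving the crosstalk-induced arrival-time spread doubles the estimated bit rate.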

Figure 4.6: Simulated maximum delay and symbol time vs. number of repeater sections in metal1, metal2, and metal6.

In Paper 8, the top metal6 achieves shorter delay, higher data rate, and less power dissipation compared to the lower metal layers. This is because metal6 requires fewer repeaters to reach a certain performance target. The trade-offs discussed in Paper 8 can be summarized as follows. Fewer repeaters than the optimum result in:

- Lower Maximum Data Rate
- Increased Crosstalk
- Reduced Power Dissipation

On the other hand, more than optimum number of repeaters results in:

- Larger Maximum Data Rate (through wave pipelining)
- Less Crosstalk Sensitivity
- Increased Power Dissipation
- Increased Latency

The possibility for wave pipelining is particularly interesting. In traditional chip-design, the clock rate is boosted through pipelining with latches [8]. Classical single-cycle full-chip communication only allows one wave of data to be present between two latches. This is not the case for wave pipelining, which was originally proposed by Cotten [9]. He realized that the data rate is limited by variations in path length, signal rise (fall) times, clock skew, and latch setup time rather than the latency of the slowest path. Thus, wave pipelining allows multiple waves of data, related to different clock cycles, between two storage elements. This is achieved by inputting new bits faster than the longest delay through either a logic block or across an interconnect section. For maximum performance, all path delays from every transmitting to every receiving node must be balanced. In some cases this requires the insertion of intentional delay elements to avoid signal race.
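The idea of several data waves being in flight at once can be illustrated with a small calculation. This is a sketch under the simplifying assumption that a new symbol is launched every symbol time and stays on the path for exactly one path delay; the numbers are hypothetical.

```python
import math


def waves_in_flight(path_delay, symbol_time):
    """Maximum number of data waves simultaneously between two storage elements."""
    return math.ceil(path_delay / symbol_time)


def is_wave_pipelined(path_delay, symbol_time):
    """Wave pipelining sets in when new symbols are launched faster than the path delay."""
    return symbol_time < path_delay


path_delay = 1.0e-9  # hypothetical 1 ns delay across the bus
for symbol_time in (2.0e-9, 1.0e-9, 0.4e-9):
    print(
        f"T = {symbol_time * 1e9:.1f} ns: "
        f"{waves_in_flight(path_delay, symbol_time)} wave(s), "
        f"wave pipelined = {is_wave_pipelined(path_delay, symbol_time)}"
    )
```

With a symbol time longer than or equal to the path delay, only one wave is present, which corresponds to classical single-cycle communication.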

From the setup in Paper 8, Figure 4.6 shows the simulated maximum delay and symbol time (for 50% horizontal and vertical eye-opening) vs. number of repeater sections in metal1, metal2, and metal6, respectively. The optimum number of repeaters is 5 for metal1 and metal2, while 2 repeaters are enough to reach the optimum in metal6. The symbol time becomes shorter than the delay across the bus when optimum number of repeaters (or more) are inserted in each metal layer. This is the onset of wave pipelining, which boosts the data rate beyond the classical limit as illustrated in Figure 4.7. With aggressive repeater insertion, one can see that data rates increase up to a point where the performance of a wire section is dominated by the parasitic capacitances of the repeater itself rather than by wire capacitances. The benefits of wave pipelines compared to conventional pipelines have also been explored by other authors [10] [11] [12]. In recent years, the concept of wave pipelining has been utilized in several successful VLSI-implementations [13] [14] [15].
References


Chapter 5

Synchronization

The terminology related to clocking of digital systems depends on how data transitions occur relative to a reference clock. In an asynchronous system, the signals are allowed to transition at any time. Systems that restrict transitions to predetermined instances in time are called synchronous, mesochronous, or plesiochronous. This chapter starts with a short review of the latter clocking schemes. We then describe and practically demonstrate a Synchronous Latency Insensitive Design (SLID) scheme that can manage unknown global wire delays and clock skew.

### 5.1 Synchronous Clocking

The majority of digital VLSI implementations are designed around a traditional synchronous clocking scheme. In this classical scheme, a single global clock is generated and distributed across the whole chip. Figure 5.1 shows the principle for synchronous clocking, where combinational logic blocks are placed between registers, which we can assume to be positively edge-triggered. The timing parameters related to the registers and logic can be summarized as follows. $t_{cq,max}$ and $t_{cq,min}$ are the register maximum and minimum delay, respectively. $t_{su}$ and $t_{hld}$ are the register setup and hold times, while $t_{l,max}$ and $t_{l,min}$ refer to the maximum and minimum delay for the logic. We use $t_{clk1}$ and $t_{clk2}$ to define the position of the rising clock edge at register R1 and R2 relative to the global clock. When a positive clock edge reaches register R1, new data appears at its output and propagates through the logic until it reaches the destination register R2. Ideally, the clock distribution paths are balanced and every register in the system is clocked simultaneously. What matters in clock distribution is not the absolute delay through the clock network, but rather the relative arrival time at the registers. In reality, a clock signal is not perfectly periodic and two registers are seldom clocked simultaneously. Jitter, as a result of supply voltage variations and random coupling from surrounding circuits, causes cycle-to-cycle changes in the clock edge position and thus variations in the clock period. Physical mismatch in clock paths and differences in clock load cause clock skew $\delta = t_{clk2} - t_{clk1} \neq 0$. In contrast to jitter, the skew is constant from cycle to cycle and does not cause clock period variations, only a phase shift.

Figure 5.2 shows an example of the clock signals $clk_1$ and $clk_2$ at registers R1 and R2, respectively. The clocks experience a jitter of $t_{jitt}$ and there is also a static skew, $\delta$, between $clk_1$ and $clk_2$. The shortest achievable clock period, $T_{clk}$, is limited by the slowest propagation delays and the register setup time. The worst case occurs when the positive edge of $clk_1$ happens late and the positive edge of the next cycle of $clk_2$ happens early:

\[
T_{clk} + \delta - 2t_{jitt} > t_{cq,\text{max}} + t_{l,\text{max}} + t_{su}
\]

\[
T_{clk} > t_{cq,\text{max}} + t_{l,\text{max}} + t_{su} - \delta + 2t_{jitt} \qquad (5.1)
\]

Also, data is not allowed to propagate too fast. To guarantee that no data is unintentionally overwritten, the register hold time must be shorter than the minimum propagation delay through the network. The worst situation is when the positive edge of \( clk_1 \) arrives early and the positive edge of \( clk_2 \) (in the same clock cycle) arrives late. The edge separation must be smaller than the minimum delay through the network:

\[
\delta + 2t_{jitt} < t_{cq,\text{min}} + t_{l,\text{min}} - t_{hld} \qquad (5.2)
\]

Thus, from Eq. 5.1, it’s easy to see that jitter is always bad for performance. On the other hand, positive skew (clock and data are routed in the same direction) can increase the clock frequency since the clock period can be decreased by \( \delta \). However, with positive skew, there is a risk for race between the clock and data, so the hold time constraint must be strictly enforced. Negative skew (clock and data are routed in opposite directions) lowers the clock frequency since the clock period must be increased by \( \delta \). The major advantage of negative skew is that it eliminates the risk of race.
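The influence of skew and jitter on the clock period in Eq. 5.1 can be made concrete with a short sketch. The timing values below are hypothetical and chosen only to show the trend; the function name is an assumption.

```python
def min_clock_period(t_cq_max, t_l_max, t_su, skew, t_jitt):
    """Lower bound on T_clk from Eq. 5.1 (all arguments in the same time unit)."""
    return t_cq_max + t_l_max + t_su - skew + 2.0 * t_jitt


# Hypothetical timing numbers in picoseconds.
base = dict(t_cq_max=100.0, t_l_max=700.0, t_su=50.0, t_jitt=20.0)

for skew in (-50.0, 0.0, 50.0):
    period = min_clock_period(skew=skew, **base)
    print(f"skew = {skew:+5.0f} ps -> T_clk > {period:.0f} ps")
```

Positive skew shortens the bound by $\delta$ and negative skew lengthens it by the same amount, while jitter always adds $2t_{jitt}$.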

To summarize, a synchronous local clock signal has exactly the same frequency as the global reference clock and maintains a known fixed phase offset to that clock. The dominating timing method used today is the above described synchronous approach where a low skew global clock, timing all transactions and events, is distributed over the whole chip. Clock skews of about 10% are normally managed in a design [1]. Synchronous clocking will eventually become unmanageable for high-end chips as the delay of global signals tends to exceed the clock period, thereby ruling out single cycle full-chip communication. Circuit and layout efforts aimed at lowering the skew rely on improvements in materials, which have physical limitations.

### 5.2 Mesochronous Clocking

If a global clock is generated and distributed without any control of the delays in the clock network, the clock signal at various locations on a chip will have the same frequency but unknown phase. In Figure 5.3a, data is transmitted from a block clocked by the local clock \( clk_1 \), across an interconnect with unknown delay, and received in a block clocked by a different local clock \( clk_2 \). Locally,
each clock domain uses a traditional synchronous clocking scheme as in Section 5.1. The received data is said to be mesochronous with $clk_2$ since its transitions occur at the local clock frequency but with an arbitrary phase offset, as shown in Figure 5.3b. This phase offset originates from the unknown phase difference between $clk_1$ and $clk_2$ (due to the different clock distribution delays $\tau_1 \neq \tau_2$), and the unknown interconnect delay, $\tau_3$, between the transmitting and receiving block. To correctly sample the received signal, a synchronizer is needed to adjust the phase offset so that the signal transitions are kept away from the unsafe regions of $clk_2$. Many mesochronous synchronizer techniques have been proposed [2] [3]. A simple technique is described in [4] where the phase difference between the received signal and the receiver clock is measured to properly delay the data through a variable delay line. A solution based on measuring the delay between communicating modules in larger systems was proposed in [5]. This method is quite complex and requires bidirectional buses along with circuitry to adjust the delays in the clock distribution network.

### 5.3 Plesiochronous Clocking

The transitions of a plesiochronous signal occur at a frequency that is almost the same as that of the local clock. This situation arises when the clock signal of two communicating modules is generated independently from separate crystal oscillators, as illustrated in Figure 5.4. The frequency of $clk_1$ and $clk_2$ is slightly different, but nominally the same. This causes a slow drift in the phase difference between the transmitted signal and the local receiver clock. On-chip or board-level electronics typically generate local clock signals from a common clock root so plesiochronous communication happens mostly between isolated systems that are separated by a long distance. In [6], phase-interpolator based clock recovery circuits are utilized in a plesiochronous link implementation to compensate for
skin-effect loss and wiring skew across 20 m long cable connections. Safe reception of a plesiochronous signal requires a buffering scheme. Data may have to be dropped or duplicated depending on whether the transmitter clock is faster or slower than the receiver clock, respectively. The receiver circuitry can utilize the periodic behavior of the varying phase offset to detect hazardous sampling instances, which occur if the signal changes state during the unsafe parts of the receiver clock phase. As an example, a recently published high-performance CMOS implementation of a plesiochronous clock and data recovery circuit with analog phase interpolators achieves 10 Gb/s [7].
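How slowly the phase drifts can be estimated from the frequency offset between the two oscillators. The sketch below assumes two ideal, constant-frequency clocks; the 1 GHz receiver clock and the 100 ppm offset are hypothetical example values.

```python
def cycles_per_slip(f_tx, f_rx):
    """Receiver clock cycles until the phase offset has drifted by one full period."""
    if f_tx == f_rx:
        return float("inf")
    return f_rx / abs(f_tx - f_rx)


def slip_action(f_tx, f_rx):
    """Buffer action required at each slip, following the rule stated in the text."""
    if f_tx > f_rx:
        return "drop"       # transmitter produces data faster than it is consumed
    if f_tx < f_rx:
        return "duplicate"  # receiver samples more often than new data arrives
    return "none"


f_rx = 1_000_000_000.0   # hypothetical 1 GHz receiver clock
f_tx = 1_000_100_000.0   # transmitter clock 100 ppm faster
print(f"one slip every {cycles_per_slip(f_tx, f_rx):.0f} receiver cycles")
print(f"required action: {slip_action(f_tx, f_rx)}")
```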

### 5.4 Synchronous Latency Insensitive Design

As has previously been discussed, the increased complexity, higher clock frequencies, and longer wire delays of scaled integrated circuits make it more difficult to meet the timing constraints [8]. Simple repeater insertion on critical interconnects is no longer sufficient and pipelining through flip-flops (also in logic blocks) has become a necessity [9]. In Paper 9 we describe and practically demonstrate a Synchronous Latency Insensitive Design (SLID) scheme to resolve the timing closure problems due to unknown global wire delays, clock skew and other timing uncertainties in integrated circuits.

### 5.4.1 Recent Synchronization Approaches

![Figure 5.4: Plesiochronous clocking.](image-url)

Many techniques to solve the synchronization problem have been proposed, such as fully asynchronous or Globally Asynchronous Locally Synchronous (GALS) methods [10]. GALS-schemes rely on handshaking or special timing blocks to maintain synchronization. A major problem is the lack of design tools that support the asynchronous design flow. In [8], a Latency Insensitive Design (LID) method is suggested for synthesized circuits, where the functionality of a block only depends on the order of events that reach it. This makes the system insensitive to the delays of long wires from a functional point of view. The number of clock cycles to complete a certain operation is however not known until after back-end. Other mesochronous synchronizers, which represent the LID-method, are designed around First In First Out (FIFO) synchronizers [11]. A FIFO-synchronizer uses a small ring-buffer to decouple the transmitter and receiver timing. In [12], a FIFO is described that behaves as a latch clocked from both the transmitter and receiver clock domains. However, their solution requires generation of an intermediate clock and also an initialization scheme with a training period. The previously proposed solutions for mesochronous clocking often require challenging modifications of the traditional design flow, for instance the use of non-standard library cells in synthesized implementations or tricky initialization procedures. If a new methodology is to be accepted and used, it is advantageous if it avoids these troublesome changes in the established design flow.

### 5.4.2 The SLID Design Flow

The SLID scheme described in Paper 9 was originally proposed by Edman and Svensson [13]. The idea is to utilize a FIFO re-timer at the receiving block in a mesochronous system. The FIFO thus replaces standard wire pipelining through clocked repeaters. What is fundamentally new with the SLID scheme, compared to other schemes, is that all data is aligned to the correct receiver clock cycle, independent of clock skew and data delays (within certain limits). This is an important benefit, which can be utilized with an absolute minimum of influence on the established synchronous design flow for synthesized circuits.

As a first step in the early high-level system design phase, the circuit is divided into isochronous regions within which the clock is considered synchronous. Inside such a region, the clock skew is kept small enough to avoid races at the target clock frequency. Each of the blocks in a functional diagram is then placed into one of the available isochronous regions. After this, a fixed latency of \( n > 0 \) clock cycles is inserted at the interface between each pair of communicating isochronous regions, to guarantee that the overall system will not suffer from timing problems. The value of \( n \) (chosen latency) can be different for each link, but should be selected from estimations of the maximum delay+skew for the longest path of the full chip. Thus, this decision is equivalent to inserting \( n \) clocked repeaters into each bus. Floorplan changes are avoided at this stage if the same \( n \)-value can be used for all links in the design. This view of the system is utilized up to clock-cycle true verification. Back-end design is guaranteed not to introduce any design changes, despite the fact that all link delays have been specified at an early stage.

In the synthesis phase, the \( n \)-cycle latency inserted between each communicating pair of isochronous blocks is implemented as a two-port \( m \)-word FIFO-synchronizer, according to Figure 5.5 where \( m=4 \). The synchronizer aligns incoming data to the local clock phase utilizing a single strobe signal routed along the communication link. The strobe signal is simply equal to the clock of the transmitting block, and will be distorted in the same way as the signals transferred on the data wires. Incoming data is written into the FIFO at an address given by an input counter clocked by the strobe. Data is read from the FIFO at an address given by an output counter clocked by the local clock. Overall global synchronization and clock alignment is achieved by relating each transmitted word to a strobe and local clock edge with the same enumeration at each block, thus guaranteeing exactly $n$ clock-cycles of latency for each link. This is accomplished by resetting the input and output counters to 0 and $(m-n)$, respectively, during a global asynchronous reset with no clock running. The clock is then started and clock period No.0 is propagated to each isochronous region. The input counter begins to count up from 0 when the first strobe edge reaches the write port of the receiver. Similarly, the output counter starts to count up from $(m-n)$ when the initial clock pulse arrives at the receiver read port. This procedure labels all of the strobe and local clock cycles in the same way. Metastability is avoided since the read and write pointers, clocked by different clocks, never collide and since the global clock and reset cannot collide during reset.
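The counter initialization described above can be illustrated with a small behavioral model. This is a sketch, not the circuit from Paper 9: it assumes single data rate, ideal edges, and an event-ordered simulation in which `skew` is the arrival time of strobe edge 0 relative to local clock edge 0. The function name and parameters are hypothetical.

```python
def simulate_slid(m, n, skew, period=1.0, n_words=12):
    """Return the words seen at the read port on successive local clock edges.

    None marks a cell that has not been written yet when it is read.
    """
    fifo = [None] * m
    in_ptr = 0              # input counter, reset to 0
    out_ptr = (m - n) % m   # output counter, reset to m - n

    # Word k is written on strobe edge k; one cell is read on each local edge j.
    events = [(skew + k * period, "write", k) for k in range(n_words)]
    events += [(j * period, "read", j) for j in range(n_words + n)]
    events.sort()

    received = []
    for _time, kind, index in events:
        if kind == "write":
            fifo[in_ptr] = index
            in_ptr = (in_ptr + 1) % m
        else:
            received.append(fifo[out_ptr])
            out_ptr = (out_ptr + 1) % m
    return received


# With m = 4 and n = 2, word k should appear on local clock edge k + 2
# for a range of skews, including skews longer than one clock period.
for skew in (-1.5, 0.3, 1.7):
    words = simulate_slid(m=4, n=2, skew=skew)
    print(f"skew = {skew:+.1f} T: first valid words {words[2:6]}")
```

In this model the latency stays at exactly $n$ cycles while the skew varies over several clock periods. A skew beyond $nT$ makes the read pointer reach a cell before it has been written, so the sequence is no longer reproduced.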

### 5.4.3 SLID Synchronizer Implementation

In Paper 9, we show the high performance and high robustness capability of the SLID technique through measurement results of a test chip fabricated in 0.18 μm CMOS [14]. The SLID-method is implemented for communication over a fully shielded, 3-bit, 5.4 mm long, global bus structured in exactly the same way as described in Paper 7. Test data is transmitted at double data rate from one isochronous region to another across the global bus. The bus wires are routed in metal6 (with a grounded metal4 plane as return path) to create velocity-of-light limited transmission line interconnects according to the principles discussed in Chapter 2. The re-timing circuitry inserted on the receiving side of the global communication link is an \(m=4\) word FIFO-buffer with separate write and read ports as described in Section 5.4.2.

The implemented write and read port structures, which can handle double data rate signals, are shown in Figure 5.6 and Figure 5.8, respectively. For high-speed operation, the input counter in the write port is realized as a ring-buffer with double-edge triggered D-flip-flops. The input to the flip-flop which generates the “enable\(0\)” signal, clocking the flip-flops in FIFO CELL\(0\), is reset to “1”. The inputs of all other flip-flops in the input counter are reset to “0”. As a result, “enable\(0\)” always goes high when the first strobe edge (rising or falling) reaches the input counter. Figure 5.7 shows simulated waveforms of enable signals that consecutively write data into the FIFO starting with CELL\(0\) and continuing with CELL\(1\), CELL\(2\), etc. Note that the in-pointer (enable signals) time scale is referenced to the strobe. Similarly, the output counter in the read port is also structured as a double-edge triggered ring-buffer. Based on the selected \(n\)-value (desired latency in the link), the input to one of the flip-flops in the output counter is reset to “1” while the remaining flip-flop inputs are reset to “0”. Thus, depending on the \(n\)-value, one of the “read\(m-n\)” signals, reading the positive edge triggered flip-flop in the corresponding FIFO CELL\(m-n\), always goes high when the first local clock edge (rising or falling) reaches the output counter. Note that the “readneg\(m-n\)” signals read the negative edge triggered flip-flop in the corresponding FIFO CELL\(m-n\). Figure 5.9 shows simulated “read\(m-n\)” and “readneg\(m-n\)” waveforms for the implemented FIFO-synchronizer \((m=4)\) when the latency is set to \(n=2\). The out-pointer (read signals) time scale is referenced to the local receiver clock (clkRx).

A major advantage of the SLID scheme is that it allows clock and link delays to vary dynamically during system operation, without any risk of communication failure. Errorless communication is achieved as long as \(n\) and \(m\) are chosen so that write and read instances never collide, taking the maximum total delay+skew (which could be multiple clock cycles long) into account. The allowable total skew \((t_{skew,\text{tot}}=t_{tx}+t_{d}-t_{rx})\) resides in the interval:

\[
(n - m)T + t_{dwr} < t_{skew,\text{tot}} < nT - t_{dwr} \qquad (5.3)
\]

where \(t_{tx}\) and \(t_{rx}\) are the delays from the clock root to the transmitter and receiver, respectively, \(t_{d}\) is the bus delay (same for data and strobe), \(t_{dwr}\) is the minimum time between write and read of a FIFO-cell, and \(T\) is the clock period. The upper skew limit implies that a cell cannot be read before \(t_{dwr}\) after a write process \( (t_{tx}+t_d+t_{dwr} < t_{rx}+nT) \), while the lower bound indicates that the next write process into the same cell must not begin until \( t_{dwr} \) after the read process \( (t_{rx}+nT+t_{dwr} < t_{tx}+t_d+mT) \). Thus, the maximum total delay+skew for a certain setting of \( n \) has a tuning range of \( mT-2t_{dwr} \).

Figure 5.6: Block diagram of the FIFO-synchronizer write port.

Figure 5.7: Simulated waveforms for the FIFO-synchronizer write port when $n=2$.

Figure 5.8: Block diagram of the FIFO-synchronizer read port.

Figure 5.9: Simulated waveforms for the FIFO-synchronizer read port when $n=2$.
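The skew interval and its tuning range can be evaluated directly from the two bounds. The sketch below is an illustration with hypothetical numbers (times normalized to the clock period); the helper `min_latency`, which picks the smallest $n$ that satisfies the upper bound for a given worst-case delay+skew, is an assumption added here and not a procedure taken from Paper 9.

```python
import math


def skew_window(n, m, period, t_dwr):
    """Allowed interval (lower, upper) for the total skew t_tx + t_d - t_rx."""
    lower = (n - m) * period + t_dwr
    upper = n * period - t_dwr
    return lower, upper


def min_latency(max_skew, period, t_dwr):
    """Smallest integer n >= 1 with max_skew < n * period - t_dwr."""
    return max(1, math.floor((max_skew + t_dwr) / period) + 1)


period, t_dwr, m = 1.0, 0.1, 4   # hypothetical values, in units of T
n = min_latency(max_skew=1.5, period=period, t_dwr=t_dwr)
lower, upper = skew_window(n, m, period, t_dwr)

print(f"chosen latency n = {n}")
print(f"allowed skew: {lower:+.1f} T < t_skew,tot < {upper:+.1f} T")
print(f"tuning range: {upper - lower:.1f} T")
```

The width of the window equals $mT - 2t_{dwr}$ regardless of $n$, so a deeper FIFO widens the tolerated skew range while $n$ only shifts it.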

Traditional pipelining changes the cycle level behavior of a circuit, which thus requires manual changes in the RTL-code, updates of test vectors, and re-verification of the entire system. As an example, Intel had to manually insert thousands of flip-flops in the global wiring of the Itanium processor [15]. In comparison, a circuit designed with the SLID method can be viewed as a fully synchronous system with pipelined interconnects. The scheme has only minor influence on the established fully synchronous design flow. A significant benefit of the SLID-technique is that the requirement of synchronous clock distribution along the buses and low clock skew between blocks is completely removed. The scheme is robust to late changes in RTL-code or floorplan. It can be expanded to accommodate any number of links, bidirectional links, any number of isochronous regions, and multi-port isochronous regions, each with its own synchronizer. Further extensions to the scheme can make it accommodate communication between different clock domains, which may have different clock frequencies, different clock phases and unknown communication latency as described in [16].

References


Chapter 6

Power-Efficient Interconnect Design

As previously discussed, the performance of ICs is becoming interconnect limited when feature sizes are scaled down. More and more functionality is being added on-chip, which tends to increase the die size in spite of the reduced geometries. Hence, the number of global lines and their length increases with technology scaling [1]. Traditionally, design efforts have been focused on minimizing interconnect latency and maximizing wire throughput. However, the power dissipation attributed to interconnects is an aspect of wires that should not be forgotten. Interconnect power dissipation is dynamic power resulting from the switching of wiring capacitances. In a recent paper by Magen [2], it is shown that global interconnects consume over 50% of the overall dynamic power of a high-performance microprocessor (77 million transistors), fabricated in 0.13 μm CMOS. In comparison, the microprocessor gates dissipate 34% of the dynamic power and the rest is attributed to diffusions. This power situation has arisen mainly due to the relative scaling of cell capacitances and increased wire aspect ratios, aimed at reducing wire resistance in state-of-the-art processes. In this chapter, we provide the background to power dissipation on $RC$- and $RLC$-interconnects and then address the interconnect power problem in two areas: low-swing signaling and transition-energy cost modeling.

### 6.1 RC-Interconnect Power Dissipation

A wire designed so that $RC$-charging dominates can be described by an $RC$-chain. When a symbol is forced onto such a conductor it charges (or discharges) a total capacitance, $C_{tot}$, including not only the wire capacitance, but also the capacitance from driver and load. If the driver signal swing, $V_0$, is created from the global supply voltage, $V_{dd}$, e.g. through a series regulator, the dynamic power
consumption related to such an $RC$-interconnect is:

$$P_w = \alpha f C_{tot} V_0 V_{dd} \qquad (6.1)$$

where $\alpha$ is the signal activity and $f$ is the switching frequency [3]. On the other hand, if the driver is supplied by a separate external source of voltage $V_0$, e.g. using a DC-DC converter, Eq. 6.1 becomes:

$$P_w = \alpha f C_{tot} V_0^2 \qquad (6.2)$$

For transitions with opposite polarity on neighboring wires, the Miller-effect [4] can lead to four times larger effective $C_{tot}$, further degrading interconnect related power numbers.
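The difference between Eq. 6.1 and Eq. 6.2 is easiest to see with numbers. The following sketch evaluates both expressions; the activity, frequency, capacitance, and swing values are hypothetical and the function names are assumptions.

```python
def rc_power_from_vdd(alpha, f, c_tot, v0, vdd):
    """Eq. 6.1: swing v0 derived from the global supply vdd (e.g. series regulator)."""
    return alpha * f * c_tot * v0 * vdd


def rc_power_separate_supply(alpha, f, c_tot, v0):
    """Eq. 6.2: driver fed from a separate supply of voltage v0 (e.g. DC-DC converter)."""
    return alpha * f * c_tot * v0 ** 2


# Hypothetical operating point: 1 pF total capacitance switched at 1 GHz.
alpha, f, c_tot, vdd, v0 = 0.5, 1e9, 1e-12, 1.8, 0.3

p_full = rc_power_separate_supply(alpha, f, c_tot, vdd)   # full swing, V0 = Vdd
p_reg = rc_power_from_vdd(alpha, f, c_tot, v0, vdd)
p_sep = rc_power_separate_supply(alpha, f, c_tot, v0)

print(f"full swing:              {p_full * 1e3:.3f} mW")
print(f"low swing, from Vdd:     {p_reg * 1e3:.3f} mW")
print(f"low swing, separate V0:  {p_sep * 1e3:.3f} mW")
```

Reducing the swing by a factor $V_{dd}/V_0$ saves that same factor in power when the swing is derived from $V_{dd}$, and the square of it when a separate supply is used.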

### 6.2 Transmission Line Power Dissipation

There are mainly two power dissipation cases to consider for a wire behaving as a transmission line with characteristic impedance $Z_0$ and delay $t_d$, driven by an inverter with output impedance $Z_S$, and terminated in an impedance $Z_L$. The driver is typically matched to the interconnect ($Z_S=Z_0$), but the far end can either be terminated in a matched impedance ($Z_L=Z_0$), or be left open ($Z_L \sim \infty$). Piguet [5] calculated the power dissipation for the case when the far end is terminated in a matched impedance, and arrived at:

$$P_w = \frac{V_{dd} V_0}{4 Z_0} \qquad (6.3)$$

Eq. 6.3 assumes equal probabilities for high-to-low and low-to-high transitions. For the case with an open far end, Svensson [6] assumed a random data stream and showed that the power dissipation is given by:

$$P_w = \frac{V_{dd} V_0}{4 Z_0} \cdot \frac{t_d}{T} = \frac{1}{4} f C_{tot} V_{dd} V_0, \qquad 2t_d < T \qquad (6.4)$$

$$P_w = \frac{V_{dd} V_0}{8 Z_0}, \qquad 2t_d > T \qquad (6.5)$$

where $T$ is the symbol time. For electrically short lines ($2t_d<T$), the forward and reflected waves overlap for the same transmitted symbol making the power dissipation the same as that for a capacitor with capacitance $C_{tot}$. On the other hand, for sufficiently long lines ($2t_d>T$), there is no overlap between forward and reflected waves during the same symbol, which makes the power dissipation resemble that of a matched terminated wire. When comparing Eq. 6.3 and Eq. 6.5, one should remember that the far-end voltage swing in the terminated case is $V_0/2$ compared to $V_0$ in the open end case. Hence, for the same far-end swing, the open wire is four times more power efficient than the terminated one.
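The factor of four can be verified numerically from Eq. 6.3 and Eq. 6.5. The sketch below is an illustration only: the impedance, delay, symbol time, and target swing are hypothetical example values and the function names are assumptions.

```python
def power_matched_end(vdd, v0, z0):
    """Eq. 6.3: far end terminated in the characteristic impedance."""
    return vdd * v0 / (4.0 * z0)


def power_open_end(vdd, v0, z0, t_d, symbol_time):
    """Eq. 6.4 for electrically short lines (2*t_d < T), Eq. 6.5 otherwise."""
    if 2.0 * t_d < symbol_time:
        return vdd * v0 / (4.0 * z0) * t_d / symbol_time
    return vdd * v0 / (8.0 * z0)


VDD, Z0 = 1.8, 40.0               # volts, ohms (example values)
T_D, T_SYM = 100e-12, 150e-12     # long line: 2*t_d > T
far_end_swing = 0.2               # volts required at the receiver

# The terminated line delivers only V0/2 at the far end, so it needs twice the swing.
p_term = power_matched_end(VDD, 2.0 * far_end_swing, Z0)
p_open = power_open_end(VDD, far_end_swing, Z0, T_D, T_SYM)

print(f"terminated: {p_term * 1e3:.3f} mW")
print(f"open end:   {p_open * 1e3:.3f} mW")
print(f"ratio:      {p_term / p_open:.1f}")
```

One factor of two comes from the doubled driver swing needed in the terminated case and the other from the ratio between Eq. 6.3 and Eq. 6.5.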
### 6.3 Low-Swing Signaling

### 6.3.1 Optimum-Voltage Swing Interconnect

The normal case for on-chip wires is full-swing signaling \(V_0 = V_{dd}\), which makes interconnect power dissipation proportional to \(V_{dd}^2\) as seen in Eq. 6.2 through Eq. 6.5. A straightforward and efficient opportunity to save significant amounts of power is to reduce the signal swing below \(V_{dd}\) and utilize an amplifying receiver to restore the signal back to full-swing. The question then arises if there exists a power-optimum voltage swing? To some extent, previous publications have investigated optimum signal swings under simplified conditions [6] [7].

The analysis in [6] examines a driver-interconnect-receiver chain (in 0.18 \(\mu m\) CMOS) having the structure shown in Figure 6.1, where \(V_0\) is the signal swing, \(C_w\) is the wire capacitance, and \(C_L\) is the receiver load capacitance. Both the transmitter and receiver amplifier are modeled as symmetrical inverters. For this setup, the current required to produce a certain voltage swing on the interconnect decreases linearly with decreasing swing. At the same time, the current needed to amplify the signal back to full swing increases superlinearly (contains a quadratic term) with decreasing swing. It is demonstrated that the sum of these two trends gives an opportunity for a power-optimum voltage swing. The optimum occurs at a point where the power used to drive the interconnect balances the power of the receiver. A lower signal swing saves interconnect power but also requires larger receiver gain. On the other hand, receiver power dissipation and latency increase with gain. The number of amplifier stages and transistor sizing in each stage must correspond to the needed gain and latency. For on-chip wires, a swing-optimum exists for a wide range of frequencies and switching activities, and also depends on the mechanism for generating the reduced voltage. The initial estimations in
[6] predict power savings in the range 8.5x to 3.4x, for an optimum voltage swing of 60 mV to 130 mV, using an ordinary supply voltage for the transmitter as described by Eq. 6.1. A similar optimum-swing scheme for a 0.5 μm process, with \( V_{dd} = 2 \) V, is proposed in [8] where 3x power savings are achieved at a reduced swing of \( V_{dd}/3 \).

There are several simplifying assumptions made to the setup in [6]. The results are based on simplified hand calculations where the traditional quadratic long channel saturation model is utilized to describe the transistor characteristics. However, this long channel model is not an accurate description of a submicron transistor. At high frequencies, both the driver and receiver have to be scaled up. A stronger driver increases the transmitter input capacitance and affects the pre-driver power consumption. A larger receiver affects the value of \( C_L \). Also, for lower swings, the driver size can be reduced for a given speed, which saves power. None of these effects were considered in [6], since the goal there was just to investigate the opportunity for a swing optimum and estimate the value of possible power savings.
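Why a linear interconnect term combined with a superlinear receiver term produces an interior power minimum can be shown with a toy model. This sketch is not the analysis in [6]: the receiver model (a linear plus a quadratic term in the required gain $V_{dd}/V_0$) and all coefficients are arbitrary placeholders, so the printed numbers carry no quantitative meaning.

```python
VDD = 1.8  # supply voltage in volts


def wire_power(v0, alpha=0.5, f=1e9, c_tot=1e-12, vdd=VDD):
    """Interconnect power following Eq. 6.1; linear in the swing v0."""
    return alpha * f * c_tot * v0 * vdd


def receiver_power(v0, a=1e-6, b=1e-7, vdd=VDD):
    """Toy receiver power a*g + b*g**2 with required gain g = vdd / v0.

    The coefficients a and b are placeholders, not fitted to any circuit.
    """
    gain = vdd / v0
    return a * gain + b * gain ** 2


def total_power(v0):
    """Sum of the interconnect and receiver contributions."""
    return wire_power(v0) + receiver_power(v0)


# Scan candidate swings from 20 mV up to full swing in 1 mV steps.
candidates = [mv / 1000.0 for mv in range(20, 1801)]
v_opt = min(candidates, key=total_power)

print(f"toy-model optimum swing: {v_opt * 1000:.0f} mV")
print(f"power relative to full swing: {total_power(v_opt) / total_power(VDD):.2f}")
```

Below the optimum the receiver term dominates and above it the interconnect term dominates, which is the balance described in Section 6.3.1.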

### 6.3.2 Investigated Optimum-Swing Signaling Link

In Paper 1, which could be seen as a follow-up to the ideas presented in [6], we utilize analog simulation, instead of relying on simplified hand calculations, for a more realistic and detailed analysis of optimum voltage swings. Figure 6.2 shows the architecture of the investigated signaling link, where four minimum-sized inverters form the receiver load. We utilize a 2 μm wide, 4 mm long, microstrip interconnect routed in metal6, the top layer. To achieve a transmission line-style interconnect with well-behaved properties according to the principles discussed in Chapter 2, we place a ground plane in metal5 along the whole length of the wire. This provides a well-defined current return path and makes the line characteristic impedance \( Z_0 = 40 \, \Omega \).

![Figure 6.3: Total power versus signal swing at on-chip signaling over a 4 mm long and 2 μm wide microstrip. Topmost curve: 6 Gb/s, Middle curve: 5 Gb/s, Lowest curve: 4 Gb/s.](image)

The chosen low-swing transmitter, shown in Figure 6.2, consists of an NMOS-inverter where the signal voltages, \( V_{HI} = V_{DC} + 0.5V_0 \) and \( V_{LO} = V_{DC} - 0.5V_0 \), are fed through transistors sized to have an output impedance matched to the line characteristic impedance. The inverter chains controlling the top and bottom NMOS transistors, respectively, are designed to achieve equal total propagation delay [9]. The receiver amplifier needs \( V_{DC} = 0.5 \, V \) for proper biasing, which would degrade the overdrive voltage on a PMOS transistor too much for transmitting \( V_{HI} \) at very low signal swing and high data rates. This motivates the chosen driver topology.

As receiver, we use the two-stage approach shown in Figure 6.2. The first stage is a differential amplifier with current mirror load, while the second stage is a simple inverter. For an interconnect signal swing of \( V_0 = 0.125 - 1 \, V \), the receiver needs a gain of at least 14.4. The differential stage output is adjusted to match the threshold voltage of the subsequent inverter. A valid receiver output signal is defined to reside outside the range \([0.3, 1.5]\) \, V, where the supply voltage is \( V_{dd} = 1.8 \, V \). To meet these constraints, for a fixed \( V_0 \) at a certain data rate, the differential stage and subsequent inverter are scaled separately.

Figure 6.3 shows the simulated total power dissipation versus signal swing for the investigated signaling link, where the receiver stages are individually scaled in each case. For the highest simulated data rates (6 Gb/s and 5 Gb/s), we are able to find an optimum voltage swing, for which the total power is minimum. As
an example, at 5 Gb/s the minimum power consumption is 0.8 mW at a voltage swing of 200 mV. The power saving compared to the maximum possible signal swing of 1 V is 2.7x at a delay penalty of 164 ps through the swing-restoring receiver amplifier. For the lowest simulated data rate (4 Gb/s), the optimum voltage swing becomes very small and will instead be limited by noise and the maximum gain of the chosen amplifier. The detailed analog simulations of the investigated low-swing link, designed in a 0.18 μm process available in industry, show the feasibility and power benefits of an optimum voltage swing for minimum power under very realistic conditions.

### 6.4 A Power-Efficient Cache Bus Technique

### 6.4.1 Dynamic Buses

Dynamic circuit techniques can be used to reduce the delay of performance-critical buses. In bus architectures based on these techniques, all bus wires are initially pre-charged to $V_{dd}$. Depending on the bus input pattern, each driver then conditionally evaluates its domino (output) node. A transmitter output can either stay quiescent at $V_{dd}$, or discharge to ground. An advantage of this scheme is that any near aggressor Miller capacitance is reduced by 2x, while the orthogonal capacitive component remains unchanged. The reason for this is that signals on neighboring wires cannot transition in opposite directions. This property results in the speed-up. Also, by moving the switching threshold of the receiver inverter closer to $V_{dd}$, an even faster response is achieved. However, pre-charged nodes always have a large activity (typically around 0.5) independent of the data signal activity $\alpha$. Thus, even low input data switching activities can result in pre-charging/evaluation of the entire bus every cycle. This is reflected in excessive peak currents and large power dissipation even at modest data transition activities. A pre-charged bus also suffers from all the disadvantages associated with dynamic circuit techniques such as charge sharing, leakage, and lost charge due to crosstalk and noise [3].

### 6.4.2 Conventional Cache Bus Architecture

The Unified Level1 (L1) cache memory on the 32-bit Intel Pentium 4 microprocessor is the 2nd level cache, which stores both data and raw instructions. The cache size is 1 MB, split into a “higher” (512 kB) and a “lower” (512 kB) cache. Figure 6.4 shows an overview of the Pentium 4 full chip plan, where the L1 cache occupies a significant portion of the die area. Figure 6.5 shows the organization of the dynamic, fully shielded, 3000 μm long L1 cache output bus, routed in the metal-6 layer of a 90 nm dual-$V_T$ CMOS technology [10]. A four-way input domino driver (D0-D3, one per cache bank) is utilized to complete evaluation of the repeater-inserted bus within half a cycle of a 3.3 GHz clock. The interconnect is initially pre-charged to $V_{dd}$. Depending on its input, each driver’s domino output then conditionally transitions to $V_{ss}$. The main disadvantage of this scheme is its high power dissipation, especially at certain low input data activities. For a fixed “1” input, the power dissipation is 7.39 mW/bit (2240 fJ) at 3.3 GHz, 1.2 V, 110 °C due to the constant pre-charging and discharging of every bit line each cycle. The power dissipation decreases to 0.70 mW/bit (212 fJ) for a constant “0” input. A realistic L1 cache bus activity is 10 %, which corresponds to 1.49 mW/bit (450 fJ) at the target clock frequency.
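The energy figures quoted in parentheses follow from dividing the average power per bit by the clock frequency. The short sketch below reproduces that arithmetic; the function name `energy_per_cycle_fj` is our own choice for illustration, and the small differences against the quoted values are due to rounding.

```python
# Energy per bit and clock cycle, E = P / f, for the power figures quoted in the text.
CLOCK_HZ = 3.3e9  # 3.3 GHz target clock frequency


def energy_per_cycle_fj(power_mw: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert average power per bit (mW) into energy per clock cycle (fJ)."""
    return power_mw * 1e-3 / clock_hz * 1e15


for label, power_mw in [
    ("fixed 1 input", 7.39),
    ("fixed 0 input", 0.70),
    ("10 % activity", 1.49),
]:
    print(f"{label}: {energy_per_cycle_fj(power_mw):.1f} fJ per cycle")
```

The three results land close to the quoted 2240 fJ, 212 fJ and 450 fJ.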
### 6.4.3 Proposed Cache Bus Architecture

As an alternative to the described conventional L1 cache bus, we propose a more power-efficient architecture in Paper 3. Figure 6.6 shows the low-swing inverter transmitter and sense-amplifying flip-flop receiver employed in the proposed design. We use a swing reduction of 25% by lowering the driver supply voltage from $V_{dd}=1.2\, \text{V}$ to $V_{HI}=0.9\, \text{V}$. Also, the double inverter repeater is removed since it is possible to size the transistors in the chosen receiver architecture so that the total delay from driver input to receiver flip-flop setup is matched to the reference design. In both topologies, the pre-driver, interconnect model (field solver extracted), and load instances are identical with constant fanin/fanout loads. Without any delay penalty, the proposed alternative cache bus architecture reaches 3.3 GHz, 2.24 mW/bit (worst-case) operation at 110°C. This corresponds to a worst-case power reduction of 70%. Even at realistic activities of 10%, the proposed design dissipates 0.70 mW/bit reducing power by 53%. Further benefits are reflected in worst-case peak-current reductions from 19.6 mA (conventional bus) to 9.0 mA (proposed bus). Energy reduction is attributed to utilization of lower swing and limiting pre-charging to only the internal nodes of the sense amplifier. Also, the proposed low-swing cache bus demonstrates a 27% total transistor width active-area reduction and limited worst-case supply voltage DC-robustness (evaluated as total delay pushout from driver input to receiver flip-flop setup) compared to the reference design.
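The percentage reductions stated for the proposed bus can be checked against the reference figures of the conventional bus (7.39 mW/bit worst case, 1.49 mW/bit at 10 % activity). The following sketch only repeats that arithmetic; the helper `reduction_percent` is a name introduced here for illustration.

```python
def reduction_percent(reference: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (1.0 - proposed / reference)


# Values taken from the text (power in mW/bit, current in mA, voltage in V).
worst_case_power = reduction_percent(7.39, 2.24)
realistic_power = reduction_percent(1.49, 0.70)
peak_current = reduction_percent(19.6, 9.0)
swing = reduction_percent(1.2, 0.9)

print(f"worst-case power reduction: {worst_case_power:.1f} %")
print(f"power reduction at 10 % activity: {realistic_power:.1f} %")
print(f"peak-current reduction: {peak_current:.1f} %")
print(f"swing reduction: {swing:.1f} %")
```

Rounded to whole percent, the results agree with the 70 %, 53 % and 25 % quoted in this section, and the peak-current reduction comes out at about 54 %.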

Figure 6.6: Proposed low-swing transmitter and sense-amplifier flip-flop receiver.
### 6.5 Transition-Energy Cost Modeling

### 6.5.1 Bus Coding

Interconnect power dissipation is tightly coupled to switching activity. An attractive way of saving power is therefore to find methods which reduce the transition activity of on-chip buses. Bus coding has recently received considerable attention. The idea is to avoid transitions which are expensive from a power dissipation point of view by employing encoder and decoder circuits on the bus. The simplest and most straightforward encoding scheme for random data is the so-called “bus-invert” technique. This method senses the number of transitions that would occur on the bus if the data were to be sent uncoded. If a majority of the bits are about to make a transition, the data is inverted before it is transmitted over the bus. In case of bus-inversion, a control signal is set high and sent along with the data to indicate this to the receiver, which then inverts the received data back to its original state. This approach results in energy and peak-current reductions of 50 % in the worst case and up to 25 % in the average case [11]. The drawback of the bus-invert scheme is increased latency through the encoding and decoding circuits and also the need for an extra control bit, which increases routing area. Many analyses of transition coding treat the interconnect as isolated. Nevertheless, in the context of technology scaling, the distributed nature of the lines is non-negligible and wires tend to be placed physically closer to each other. This makes the coupling capacitance between wires routed in the same layer just as important as the capacitances to the substrate. Sotiriadis proposed new coding techniques for an updated electrical bus model, showing that transition reduction encoding is not necessarily the best power-saving method when inter-wire capacitances are considered [12].
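As an illustration of the bus-invert idea, the sketch below encodes one data word against the previously transmitted bus state. It is a minimal behavioral model under our own assumptions: the function names are hypothetical, the majority test counts only the data wires, and no circuit-level delay or energy of the encoder is modeled.

```python
def bus_invert_encode(prev_bus, data):
    """Return (bus, invert) for one transfer.

    prev_bus: tuple of bits currently on the data wires.
    data:     tuple of bits to be sent.
    The word is inverted when more than half of the wires would toggle.
    """
    toggles = sum(p != d for p, d in zip(prev_bus, data))
    if toggles > len(data) / 2:
        return tuple(1 - d for d in data), 1
    return tuple(data), 0


def bus_invert_decode(bus, invert):
    """Recover the original data word at the receiver."""
    return tuple(1 - b for b in bus) if invert else tuple(bus)


prev_bus = (0, 0, 0, 0)
data = (1, 1, 1, 0)                      # three of four wires would toggle
bus, invert = bus_invert_encode(prev_bus, data)
toggles_sent = sum(p != b for p, b in zip(prev_bus, bus))
print(bus, invert, toggles_sent)         # inverted word, flag set, one data wire toggles
print(bus_invert_decode(bus, invert))    # original data word
```

In this example, only one data wire toggles instead of three, at the price of the extra control wire that carries the invert flag.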

### 6.5.2 Proposed Transition-Energy Cost Model

All proposed bus encoding techniques strive to reduce or avoid transitions that are expensive from a power dissipation point of view [13] [14] [15]. In [16], a 9 mm long spatially encoded static bus outperforms a corresponding repeater-inserted bus in terms of peak energy, with delay and energy overhead of the encoding included. In [17], an encoder circuit is suggested that reduces wire delay and switching energy simultaneously by ensuring that neighboring wires never transition in opposite directions. However, the delay and energy of the encoder and decoder are not included in their analysis. To determine which transitions are expensive and would benefit from encoding, we need an accurate transition-energy cost model. In Paper 4, we propose such a model, which compared to previous models, includes properties that more closely capture effects present in submicron global buses.

The new features included in the proposed transition-energy cost model can be summarized as follows. Firstly, the model includes not only inter-wire capacitances on the same interconnect layer as in [12] [18], but also inter-layer capacitances. Interconnects in standard logic circuits are driven by transistors to either $V_{dd}$ or ground. In our proposed model, capacitive coupling between a bus wire and any interconnect on adjacent layers is therefore statistically modeled through the equally sized capacitances $A$ and $B$ to $V_{dd}$ and ground, respectively, as shown in Figure 6.7a. These capacitances also include the drain capacitance for the bus driver and a compensation capacitance to capture the effect of driver short-circuit current during switching. By adding the capacitance between $V_{dd}$ and the conductor, we add a cost of discharging the wires, which has not been the case in previous models. The inter-wire capacitance between adjacent wires on the same layer is captured by capacitance $C$ in Figure 6.7a.

Secondly, in most previous work, the bus driver has been a single CMOS inverter and for transition energy considerations, only the total inverter output capacitance to ground has been included. Instead, we propose a more realistic multi-stage driver model that considers the input node capacitance to both ground and $V_{dd}$ (via capacitances $E_1$, $E_2$ and $D_1$, $D_2$) as shown in Figure 6.7b. The capacitances associated with any buffers inserted prior to the two main driver inverters can be added to $D$ and $E$. When multiple inverters are used, the discharging of a wire causes charging of intermediate nodes in the driver chain, thus increasing the total cost.

Our proposed transition-energy cost model, which can be extended to buses of arbitrary bit width, is derived from the capacitance parameters shown in Figure 6.7. The model captures the total amount of charge drawn from the supply rail when going from any initial bus state to any final bus state.
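To make the charge-based view concrete, the sketch below computes the supply energy for a transition using only the wire capacitances $A$ (to $V_{dd}$), $B$ (to ground) and $C$ (between adjacent wires). It is a simplified illustration and not the full model of Paper 4: the driver capacitances $D$ and $E$ are left out, the drivers are assumed ideal, and every wire that ends high is assumed to be tied to the supply rail during the transition. The function name and the capacitance values are hypothetical.

```python
def transition_energy(start, end, cap_a, cap_b, cap_c, vdd):
    """Energy drawn from the supply for one bus transition (simplified sketch).

    start, end: tuples of 0/1 wire states; neighbors are adjacent in the tuple.
    cap_a: wire-to-Vdd capacitance, cap_b: wire-to-ground capacitance,
    cap_c: capacitance between adjacent wires (all in farads).
    """
    n = len(start)
    charge = 0.0
    for k in range(n):
        v_i, v_f = start[k] * vdd, end[k] * vdd
        # Rail-side plate of A holds A*(Vdd - v); its increase comes from the rail.
        charge += cap_a * (v_i - v_f)
        if end[k] == 1:
            # The wire ends tied to Vdd, so the supply delivers the charge
            # change on every capacitor plate attached to this wire.
            charge += (cap_a + cap_b) * (v_f - v_i)
            for j in (k - 1, k + 1):
                if 0 <= j < n:
                    v_ji, v_jf = start[j] * vdd, end[j] * vdd
                    charge += cap_c * ((v_f - v_jf) - (v_i - v_ji))
    return vdd * charge


# Hypothetical, normalized values: A = B = 1, C = 2, Vdd = 1.
print(transition_energy((0,), (1,), 1.0, 1.0, 2.0, 1.0))      # rising wire charges B
print(transition_energy((1,), (0,), 1.0, 1.0, 2.0, 1.0))      # falling wire charges A
print(transition_energy((0, 1), (1, 0), 1.0, 1.0, 2.0, 1.0))  # opposite transitions
```

With these normalized values, a falling wire costs as much as a rising one because of the capacitance to $V_{dd}$, and two neighbors switching in opposite directions pay twice the coupling capacitance in addition.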
### 6.5.3 Accuracy of Proposed Transition-Energy Model

In Paper 4, the accuracy of the proposed transition-energy cost model is compared to Spectre simulations of a circuit level model describing a 4-bit, 3000 μm long, global bus including drivers. We use a 0.18 μm CMOS process available in industry and model the bus as a distributed RLC-network (as described in Section 3.4) with wire parameters calculated by the field solver available in HSPICE. The bus is routed in metal6. To mimic routing over a mesh of synthesized logic, 20% of the bus length is drawn over a metal4 plane tied to $V_{dd}$, while another 20% of the bus length runs over a metal4 plane connected to $V_{ss}$. Each signal wire is driven by a chain of 4 inverters, progressively up-scaled by a tapering factor of 3. The last inverter stage has a PMOS to NMOS width ratio of $W_p/W_n = 60 \mu m/24 \mu m$, and an output impedance matched to the line characteristic impedance to achieve signal rise times around 65 ps.

The parameter values for the proposed transition-energy cost model are derived from the above described reference circuit level model. These values make it possible to calculate the energy cost for every possible transition on the 4-bit bus using our proposed energy model. Figure 6.8 graphically shows a comparison between the calculated energy, through the proposed model, and the values obtained from the Spectre simulation, when making a transition from a 4-bit start code (0-15) to a 4-bit end code (0-15). In Figure 6.8, the average and maximum deviations between the models are 21.1% and 42.4%, respectively. This difference is mainly caused by two neglected effects. Firstly, the proposed transition energy model does not include the fringing capacitance on the outermost bus wires, which have only one neighbor. Compared to Figure 6.8, Figure 6.9 shows the difference between calculated energy, through the proposed model, and the values obtained from the Spectre simulation when the fringing capacitance for the outermost wires has been excluded. In Figure 6.9, the average and maximum deviations between the models have decreased to 4.5% and 19.5%, respectively. Note that Figure 6.9 is a graphical representation of the results presented in Table 11.4 of Paper 4. Secondly, Figure 6.10 shows the difference between the compared models when all mutual inductances and all mutual capacitances between non-neighboring wires have been excluded from the Spectre model. For this case, the average and maximum model discrepancies decrease to 1.3% and 9.4%, respectively.

![Figure 6.8: Relative difference between calculated transition energy values (through proposed model), and simulated values from the full Spectre model.](image-url)

Figure 6.9: Relative difference between calculated transition energy values (through proposed model), and simulated values from the Spectre model. Fringing capacitances on the outermost bus wires are excluded in the Spectre model.

Figure 6.10: Relative difference between calculated transition energy values (through proposed model), and simulated values from the Spectre model. The Spectre model excludes fringing capacitances on the outermost bus wires, mutual inductances, and mutual capacitances between non-neighboring wires.

Figure 6.8 through Figure 6.10 are shown here as an attempt to clarify the tabulated results presented in Paper 4. In Paper 4, the proposed transition-energy cost model is further compared to previously suggested models [11] [18]. To summarize the results, the proposed energy model is closer to physical reality and shows fewer low-cost and fewer high-cost transitions. Instead, it has a higher concentration around the average cost, caused by large drain capacitances from up-sized drivers. Thus, our proposed energy model implies only small energy savings from bus transition coding for cases when other models would predict the opposite.

References


Chapter 7

Conclusions

The requirements on on-chip global communication become increasingly difficult to fulfill as deep submicron effects continue to make the interconnects slower than the logic. The computational capability of a logic block increases for each introduced technology node, while the communication between two such blocks, separated by a long distance, suffers due to the degraded properties of the interconnects. This global communication bottleneck is the focus of this thesis.

Through an analysis of the intrinsic limitations of electrical global on-chip interconnects, we have found that the limitations can be overcome. By utilizing two upper-level metals, one for the wires and one as a ground return plane, a signal conductor will behave as a microwave-style transmission line, which allows for velocity-of-light delay if properly dimensioned. The feasibility and high-performance properties of these on-chip transmission line interconnects have been verified through measurements of fabricated silicon.

Process scaling allows for higher clock frequencies, but also makes it more difficult to meet the required timing constraints due to the long wire delays. This prohibits communication across a full chip within a single clock cycle. We have successfully shown a Synchronous Latency Insensitive Design (SLID) scheme to resolve the timing closure problems due to unknown global wire delays, clock skew and other timing uncertainties in integrated circuits. The high throughput capability of a transmission line-based global bus, combined with the SLID scheme, has been practically shown through measurements of an implemented test chip.

Interconnects tend to dissipate a dominating portion of an integrated circuit’s dynamic power. One method to address the power problem is to utilize transition coding on global buses, i.e. encoding power-hungry data patterns into more power efficient transitions. To make a correct decision on which transitions would benefit from coding, it is relevant to start from an accurate transition-energy cost model. We have shown that a proposed transition-energy cost model, which includes a multi-stage transmitter and wire properties which are closer to physical reality, shows a concentration of transition costs around the average value. Compared to previously suggested transition-energy models, the proposed energy model implies only small energy savings from bus transition coding for cases when other models would predict the opposite.

We have further shown the feasibility and power benefits of an optimum voltage swing, to reach minimum power, for a low-swing global communication link in 0.18 $\mu$m CMOS. Finally, this thesis has shown a 3.3 GHz low-swing single-ended L1 cache bus technique in 90 nm CMOS. In terms of power, the proposed technique achieves 70% worst-case energy and 54% peak-current reduction over a conventional dynamic cache bus scheme in a high-performance microprocessor.
Appendix A

Transmission Line Equations

This appendix contains derivations of the most important transmission line equations used in Chapter 2 of this thesis.

### A.1 Characteristic Impedance

Figure A.1: A ladder network of infinitesimal impedance-admittance sections.

Figure A.1 shows a ladder network of infinitesimal impedance-admittance sections, where \( z=r+j\omega l \) (impedance), and \( y=g+j\omega c \) (admittance). The line characteristic impedance \( Z_c \) can be written as:

\[
Z_c = zdx + \left( \frac{1}{ydx} \parallel Z_c \right) = zdx + \frac{Z_c/ydx}{Z_c + 1/ydx} = zdx + \frac{Z_c}{Z_c ydx + 1} \quad (A.1)
\]
When \( dx \to 0 \), we can use the Taylor expansion \((1+x)^{-\alpha} \approx 1-\alpha x\) with \( \alpha = 1 \) and \( x = Z_c ydx \):

\[
Z_c \approx zdx + Z_c(1-Z_c ydx) = zdx + Z_c - Z_c^2 ydx
\]

\[
Z_c^2 ydx \approx zdx
\]

\[
Z_c^2 \approx \frac{z}{y}
\]

\[
Z_c \approx \sqrt{\frac{z}{y}} = \sqrt{\frac{r+j\omega l}{g+j\omega c}} \quad (A.2)
\]

### A.2 The Propagation Constant

The propagation constant for a wire having a resistance, \( r \), inductance, \( l \), capacitance, \( c \), and conductance, \( g \), per unit length is given by:

\[
\gamma = \sqrt{(r+j\omega l)(g+j\omega c)} = \sqrt{rg + j\omega rc + j\omega lg - \omega^2 lc} \quad (A.3)
\]

If \( r \) and \( g \) are sufficiently small, the \( rg \)-term can be neglected:

\[
\gamma \approx \sqrt{j\omega (rc + gl) - \omega^2 lc} = j\omega \sqrt{lc} \sqrt{1 - \frac{j\omega (rc + gl)}{\omega^2 lc}}
\]

\[
\gamma \approx j\omega \sqrt{lc} \sqrt{1 - \frac{j}{\omega} \left( \frac{r}{l} + \frac{g}{c} \right)} \quad (A.4)
\]

Since \( r \) and \( g \) are sufficiently small, we can use the Taylor expansion \((1+x)^{-\alpha} \approx 1-\alpha x\) with \( \alpha = -1/2 \):

\[
\gamma \approx j\omega \sqrt{lc} \left( 1 - \frac{j}{2\omega} \left( \frac{r}{l} + \frac{g}{c} \right) \right) = j\omega \sqrt{lc} + \frac{1}{2} \left( \frac{r}{l} \sqrt{lc} + \frac{g}{c} \sqrt{lc} \right)
\]

\[
\gamma \approx j\omega \sqrt{lc} + \frac{1}{2} \left( \frac{r}{l} \sqrt{lc} + g \sqrt{\frac{l}{c}} \right) \quad (A.5)
\]

Now identify the high-frequency characteristic impedance \( Z_0 = \sqrt{l/c} \) to obtain:

\[
\gamma \approx j\omega \sqrt{lc} + \frac{r}{2Z_0} + \frac{gZ_0}{2} \quad (A.6)
\]
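The quality of the low-loss approximation in Eq. A.6 can be checked numerically against the exact expression in Eq. A.3. The per-unit-length values below are hypothetical and only chosen so that \( Z_0 = 40 \, \Omega \) and the signal velocity is one third of the velocity of light in vacuum; they are not extracted from the test chip.

```python
import cmath
import math


def gamma_exact(r, l, g, c, omega):
    """Propagation constant according to Eq. A.3."""
    return cmath.sqrt((r + 1j * omega * l) * (g + 1j * omega * c))


def gamma_low_loss(r, l, g, c, omega):
    """Low-loss approximation according to Eq. A.6."""
    z0 = math.sqrt(l / c)
    return 1j * omega * math.sqrt(l * c) + r / (2 * z0) + g * z0 / 2


# Hypothetical per-unit-length parameters (SI units, per meter).
R = 5e3        # ohm/m
L = 4e-7       # H/m
C = 2.5e-10    # F/m, gives Z0 = sqrt(L/C) = 40 ohm
G = 0.0        # S/m
OMEGA = 2 * math.pi * 10e9  # 10 GHz

exact = gamma_exact(R, L, G, C, OMEGA)
approx = gamma_low_loss(R, L, G, C, OMEGA)
rel_error = abs(exact - approx) / abs(exact)
print(f"exact:  {exact:.2f} 1/m")
print(f"approx: {approx:.2f} 1/m")
print(f"relative error: {rel_error:.2%}")
```

For these values the ratio \( r/\omega l \) is about 0.2, and the approximation stays within one percent of the exact propagation constant. The real part of the approximation, \( r/2Z_0 \), is the attenuation constant.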
### A.3 The Telegrapher’s Equation


Figure A.2: A lossless transmission line with characteristic impedance \( Z_0 \) terminated with a load impedance \( Z_L \).

Figure A.2 shows a lossless transmission line with characteristic impedance \( Z_0 \) terminated with a load impedance \( Z_L \), where \( V_{Li} \) (\( V_{Lr} \)) and \( I_{Li} \) (\( I_{Lr} \)) are the incident (reflected) voltage and current waves, respectively. The voltage over \( Z_L \) is \( V_L = V_{Li} + V_{Lr} \), while the current through \( Z_L \) is \( I_L = I_{Li} - I_{Lr} \). The reflection coefficient is defined as:

\[ \Gamma_L = \frac{V_{Lr}}{V_{Li}} = \frac{I_{Lr}}{I_{Li}} \quad (A.7) \]

The transmission line characteristic impedance \( Z_0 \) is given by:

\[ Z_0 = \frac{V_{Li}}{I_{Li}} = \frac{V_{Lr}}{I_{Lr}} \quad (A.8) \]

We derive the telegrapher’s equation by simplifying an expression for \( Z_L \):

\[ \begin{align*}
Z_L & = \frac{V_L}{I_L} = \frac{V_{Li} + V_{Lr}}{I_{Li} - I_{Lr}} = \frac{V_{Li}}{I_{Li}} \cdot \frac{1 + \Gamma_L}{1 - \Gamma_L} = Z_0 \, \frac{1 + \Gamma_L}{1 - \Gamma_L} \\
Z_L - Z_L \Gamma_L & = Z_0 + Z_0 \Gamma_L \\
Z_L - Z_0 & = \Gamma_L (Z_L + Z_0) \\
\Gamma_L & = \frac{Z_L - Z_0}{Z_L + Z_0} \quad (A.9)
\end{align*} \]
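A few standard terminations illustrate Eq. A.9. The sketch below is restricted to real impedances and uses the \( 40 \, \Omega \) line impedance quoted in Chapter 6 as an example value; the function name is our own.

```python
import math


def reflection_coefficient(z_load: float, z0: float) -> float:
    """Eq. A.9 for real impedances; an open end is passed as math.inf."""
    if math.isinf(z_load):
        return 1.0  # limit of (Z_L - Z_0)/(Z_L + Z_0) for a growing Z_L
    return (z_load - z0) / (z_load + z0)


Z0 = 40.0  # ohm
for name, z_load in [
    ("matched", 40.0),
    ("short", 0.0),
    ("open", math.inf),
    ("120 ohm", 120.0),
]:
    print(f"{name}: {reflection_coefficient(z_load, Z0)}")
```

A matched load absorbs the incident wave completely, a short circuit reflects it with inverted sign, and an open end reflects it with unchanged sign.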
### A.4 Frequency Response for a General Signaling Link

Figure A.3 shows a general signaling link consisting of a lossy transmission line with characteristic impedance $Z_c$ and transfer function $H$, driven by a transmitter with output impedance $Z_S$, and terminated with a load impedance $Z_L$. $V_1$ is the near-end voltage, which is the superposition of an incident, $V_{1i}$, and reflected, $V_{1r}$, traveling voltage wave. Similarly, $V_2=V_{2i}+V_{2r}$ is the far-end voltage. $\Gamma_1$ and $\Gamma_2$ are the reflection coefficients looking into the near and far end, respectively. The following relation is valid for the far end:

$$\Gamma_2 = \frac{V_{2r}}{V_{2i}} = \frac{Z_L - Z_c}{Z_L + Z_c} \quad \text{(A.10)}$$

Wave reflections together with Eq. A.10 give:

\[
\begin{align*}
V_{2i} &= HV_{1i} \\
V_{2r} &= \Gamma_2 V_{2i} = \Gamma_2 HV_{1i} \\
V_{1r} &= HV_{2r} = \Gamma_2 H^2 V_{1i} \\
\Gamma_1 &= \frac{V_{1r}}{V_{1i}} = \Gamma_2 H^2 = \frac{H^2 (Z_L - Z_c)}{Z_L + Z_c} \quad \text{(A.11)}
\end{align*}
\]

Using the generalized Telegrapher’s equation with $x=0$ at the far end gives:

\[
\begin{align*}
\Gamma(x) &= \frac{Z_{in}(x) - Z_c}{Z_{in}(x) + Z_c} \\
\Gamma_1 &= \frac{Z_1 - Z_c}{Z_1 + Z_c} \\
\Gamma_1 Z_1 + \Gamma_1 Z_c &= Z_1 - Z_c \\
Z_c (1 + \Gamma_1) &= Z_1 (1 - \Gamma_1) \\
Z_1 &= Z_c \, \frac{1 + \Gamma_1}{1 - \Gamma_1} \quad \text{(A.12)}
\end{align*}
\]

Express the near-end voltage in two ways:

\[ V_1 = V_{1i} + V_{1r} = V_{1i}(1 + \frac{V_{1r}}{V_{1i}}) = V_{1i}(1 + \Gamma_1) \]  \hspace{1cm} (A.13)

\[ V_1 = V_0 \frac{Z_1}{Z_S + Z_1} \]  \hspace{1cm} (A.14)

Set Eq. A.13 equal to Eq. A.14 to obtain:

\[ V_{1i} = \frac{V_0 Z_1}{(Z_S + Z_1)(1 + \Gamma_1)} \]

Insert Eq. A.12 and simplify:

\[ V_{1i} = \frac{V_0 Z_c}{Z_S(1 - \Gamma_1) + Z_c(1 + \Gamma_1)} \]  \hspace{1cm} (A.15)

Study the far-end voltage:

\[ V_2 = V_{2i} + V_{2r} = V_{2i}(1 + \frac{V_{2r}}{V_{2i}}) = V_{2i}(1 + \Gamma_2) = HV_{1i}(1 + \Gamma_2) \]

Insert Eq. A.15 to obtain:

\[ V_2 = \frac{HV_0 Z_c(1 + \Gamma_2)}{Z_S(1 - \Gamma_1) + Z_c(1 + \Gamma_1)} \]

Insert Eq. A.10 and Eq. A.11 and simplify to obtain \( G = V_2/V_0 \), the transfer function from driver voltage to voltage over the load:

\[ V_2 = \frac{HV_0 Z_c \left( \frac{Z_L - Z_c}{Z_L + Z_c} + 1 \right)}{Z_S \left( 1 - H^2 \frac{Z_L - Z_c}{Z_L + Z_c} \right) + Z_c \left( 1 + H^2 \frac{Z_L - Z_c}{Z_L + Z_c} \right)} \]

\[ V_2 = \frac{2Z_L Z_c HV_0}{(Z_L + Z_c) \left[ Z_S \left( 1 - H^2 \frac{Z_L - Z_c}{Z_L + Z_c} \right) + Z_c \left( 1 + H^2 \frac{Z_L - Z_c}{Z_L + Z_c} \right) \right]} \]

\[ V_2 = \frac{2Z_L Z_c HV_0}{Z_S \left( Z_L + Z_c - H^2(Z_L - Z_c) \right) + Z_c \left( Z_L + Z_c + H^2(Z_L - Z_c) \right)} \]

\[ V_2 = \frac{2Z_L Z_c HV_0}{Z_S \left( Z_L + Z_c - H^2Z_L + H^2Z_c \right) + Z_c \left( Z_L + Z_c + H^2Z_L - H^2Z_c \right)} \]

\[ V_2 = \frac{2Z_L Z_c HV_0}{Z_S \left( Z_L(1 - H^2) + Z_c(1 + H^2) \right) + Z_c \left( Z_L(1 + H^2) + Z_c(1 - H^2) \right)} \]

\[ G = \frac{2Z_L H}{Z_L(1 + H^2 + \frac{Z_S}{Z_c}(1 - H^2)) + Z_c(1 - H^2 + \frac{Z_S}{Z_c}(1 + H^2))} \]  \hspace{1cm} (A.16)
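As a numerical cross-check of the derivation, the sketch below evaluates the link gain in two ways: step by step from the wave relations (Eq. A.10, A.11, A.13 and A.14 with the near-end input impedance obtained by solving the \( \Gamma_1 \) relation for \( Z_1 \)), and through a closed form in which the impedance ratio of Eq. A.16 is taken to be \( Z_S/Z_c \). All numerical values and function names are hypothetical and serve only as an example.

```python
import cmath


def link_gain_stepwise(h, z_s, z_c, z_l):
    """G = V2/V0 evaluated from the individual wave relations (V0 = 1)."""
    gamma2 = (z_l - z_c) / (z_l + z_c)        # far-end reflection, Eq. A.10
    gamma1 = gamma2 * h**2                    # near-end reflection, Eq. A.11
    z1 = z_c * (1 + gamma1) / (1 - gamma1)    # near-end input impedance
    v1 = z1 / (z_s + z1)                      # voltage divider, Eq. A.14
    v1i = v1 / (1 + gamma1)                   # incident wave, Eq. A.13
    return h * v1i * (1 + gamma2)             # far-end voltage


def link_gain_closed_form(h, z_s, z_c, z_l):
    """Closed-form G, assuming the ratio Z_S/Z_c in the denominator."""
    k = z_s / z_c
    return 2 * z_l * h / (
        z_l * (1 + h**2 + k * (1 - h**2)) + z_c * (1 - h**2 + k * (1 + h**2))
    )


# Hypothetical example: attenuating, phase-shifting line with a high-ohmic load.
h = 0.8 * cmath.exp(-1j * 1.0)
z_s, z_c, z_l = 40.0, 40.0 + 5.0j, 1e4

g_step = link_gain_stepwise(h, z_s, z_c, z_l)
g_closed = link_gain_closed_form(h, z_s, z_c, z_l)
print(abs(g_step - g_closed))

# Sanity check: a line with H = 1 reduces to a resistive voltage divider.
g_divider = link_gain_closed_form(1.0, 40.0, 40.0, 120.0)
print(g_divider)
```

Both evaluations agree to numerical precision, and for \( H = 1 \) the closed form reduces to the voltage divider \( Z_L/(Z_S + Z_L) \), as expected for a line of zero electrical length.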