Design of High-Speed Time-Interleaved Delta-Sigma D/A Converters
Ameya Bhide

Division of Integrated Circuits and Systems
Department of Electrical Engineering (ISY)
Linköping University
SE-581 83 Linköping, Sweden
Linköping 2015
Copyright © 2015 Ameya Bhide
ISBN 978-91-7519-017-4
ISSN 0345-7524
Printed by LiU-Tryck, Linköping, Sweden, 2015
Abstract

Digital-to-analog (D/A) converters (or DACs) are one of the fundamental building blocks of wireless transmitters. To support the increasing demand for high-data-rate communication, the DAC must provide a large bandwidth. With the advances in CMOS scaling, there is a growing trend of moving a large part of the transceiver functionality to the digital domain. This reduces the analog complexity and allows easy reconfiguration for multiple radio standards. ΔΣ DACs fit very well into this trend towards digital architectures, as they contain a large digital signal processing component. They also offer two advantages over the traditionally used Nyquist DACs. Firstly, the number of DAC unit current cells is reduced, which relaxes their matching and output impedance requirements. Secondly, the reconstruction filter order is reduced.

Achieving a large bandwidth from ΔΣ DACs requires the digital blocks to operate at very high frequencies of several GHz because of the oversampling involved. This can be very challenging to achieve with conventional ΔΣ DAC architectures, even in nanometer CMOS processes. Time-interleaved ΔΣ (TIDSM) DACs have the potential to improve the bandwidth and sampling rate by relaxing the speed requirement of the individual channels. However, they have received limited attention over the past decade, and very few works have been reported on this topic. Hence, the aim of this dissertation is to investigate architectural and circuit techniques that can further enhance the bandwidth and sampling rate of TIDSM DACs.

The first work is an 8-GS/s interleaved ΔΣ DAC prototype IC with 200-MHz bandwidth implemented in 65-nm CMOS. The high sampling rate is achieved by a two-channel interleaved MASH 1-1 digital ΔΣ modulator with a 3-bit output, resulting in a highly digital DAC with only seven current cells. Two-channel interleaving allows a single clock to be used for both the logic and the final multiplexing. This requires each channel to operate at half the sampling rate, i.e., 4 GHz, which is enabled by a high-speed pipelined MASH structure with robust static logic. Measurement results from the prototype show that the DAC achieves 200-MHz bandwidth, −57-dBc IM3 and 26-dB SNDR, with a power consumption of 68 mW at 1-V digital and 1.2-V analog supplies. This architecture shows good potential for use in the transmitter baseband. While good linearity is obtained from this DAC, the SNDR is found to be limited by the test setup used to send high-speed digital data into the prototype.

The performance of a two-channel interleaved ∆Σ DAC is found to be very sensitive to the duty-cycle of the half-rate clock. The second work analyzes this effect mathematically and presents a new closed-form expression for the SNDR loss of two-channel DACs due to the duty cycle error (DCE) for a noise transfer function (NTF) of \((1 - z^{-1})^n\). It is shown that a low-order FIR filter after the modulator helps to mitigate this problem. A closed-form expression for the SNDR loss in the presence of this filter is also developed. These expressions are useful for choosing a suitable modulator and filter order for an interleaved ∆Σ DAC in the early stage of the design process. A comparison between the FIR filter and compensation techniques for DCE mitigation is also presented.

The final work is an 11-GS/s, 1.1-GHz bandwidth time-interleaved ∆Σ DAC prototype IC in 65-nm CMOS for the 60-GHz radio baseband. The high sampling rate is again achieved by a two-channel interleaved MASH 1-1 architecture with a 4-bit output, i.e., only fifteen analog current cells. The single-clock architecture for the logic and the multiplexing requires each channel to operate at 5.5 GHz. To enable this, a new look-ahead technique is proposed that decouples the two channels within the modulator feedback path. This improves the speed compared to conventional loop-unrolling. Full-speed DAC testing is enabled by an on-chip 1-Kb memory whose read path also operates at 5.5 GHz. Measurement results from the prototype show that the ∆Σ DAC achieves >53 dB SFDR, < −49 dBc IM3 and 39 dB SNDR within a 1.1 GHz bandwidth while consuming 117 mW from 1 V digital/1.2 V analog supplies. The proposed ∆Σ DAC can satisfy the spectral mask of the 60-GHz radio IEEE 802.11ad WiGig standard with a second-order reconstruction filter.


CMOS. The high sampling rate is again achieved by interleaving two MASH 1-1 architectures with only fifteen unit current cells. The high speed is achieved through a new look-ahead technique that reduces the critical path of the integrator to only one adder. The proposed $\Delta \Sigma$ D/A converter can satisfy the spectral mask of the 60-GHz radio standard IEEE 802.11ad WiGig with a second-order reconstruction filter.
Preface

This dissertation presents the research work performed during the period March 2010 – June 2015 at the Division of Integrated Circuits & Systems, Department of Electrical Engineering, Linköping University, Sweden. The main contributions of this dissertation are as follows:

• Design and implementation of a 200-MHz bandwidth 8-GS/s time-interleaved MASH 1-1 $\Delta \Sigma$ DAC in 65-nm CMOS. A comparative analysis of different logic styles for achieving a high sampling rate is also performed.

• A mathematical analysis of the effect of duty cycle error on the performance of two-channel time-interleaved $\Delta \Sigma$ DACs. The effectiveness of different error mitigation techniques is also studied.

• Design and implementation of a 1.1-GHz bandwidth 11-GS/s time-interleaved MASH 1-1 $\Delta \Sigma$ DAC in 65-nm CMOS for 60-GHz radio applications. A fast look-ahead technique is proposed for the interleaved MASH modulator.

• Design and implementation of a 1-Kb memory in 65-nm CMOS with a 5.5 GHz read path to enable the testing of high-speed DACs.

The contents of this dissertation are based on the following publications:


The following paper was also published during this period which is outside the scope of this dissertation:

Acknowledgments

This dissertation would not have been possible without the support, encouragement and guidance of many people. I would like to express my deepest gratitude to them.

• I would like to thank my supervisor, Professor Atila Alvandpour, for giving me the opportunity to pursue PhD studies and for his guidance and support. He really knows how to inspire and motivate his students.

• The senior members at ICS and EK for sharing their technical knowledge: Prof. Emer. Christer Svensson, Asst. Prof. Behzad Mesgarzadeh, Adj. Prof. Ted Johansson, Universitetslektor Dr. J. Jacob Wikner, Assoc. Prof. Jerzy Dabrowski and Dr. Christer Jansson.

• Arta Alvandpour, Research Engineer at ICS for all his help with the equipment and hardware issues.

• All the former and current administrators at ICS for their help: Anna Folkesson, Jenny Stendahl, Maria Hamnér and Gunnel Hässler.

• The former and current PhD students at ICS for providing a very friendly and collaborative environment: Daniel Svärd (LISP/SKILL expert), Dr. Dai Zhang (always cool), Dr. Amin Ojani (always in office), Dr. Jonas Fritzin (impedance matching expert), Dr. Ali Fazli Yeknami, Dr. Fahad Qazi, Martin Nielsen-Lönn (makes PCBs at home and translator in the group), Omid E. Najari (always smiling), Tai Quoc Duong (always looking worried), Keirang Chen, Dr. Timmy Sundström, Vishnu Unnikrishnan, Tekn. Lic. Prakash Harikumar, Dr. Nadeem Afzal and Dr. Anu K.M. Pillai.

• TUS Team for all the support with the computing environment.

• Bo Ygfors for lending the Tektronix pattern generator without which measurements of the first chip would not have been possible.

• My wife Priyanka, for all the encouragement, patience and understanding, without which this dissertation would not have been possible. My daughter Tanaya, for always bringing a smile with her antics.

• Aai and Baba for their unconditional love and support due to which I have been able to reach this far. Also, my sister Ashlesha for all the affection.

• My in-laws for always wishing me the best in my endeavours.

• My friends for their support. My deepest thanks to the gang for always being there. A big thanks to the “CDTG-SoC” lunch table friends for the good times.

Ameya Bhide
Linköping, July 2015
## Contents

1 Introduction ........................................ 1
   1.1 Characteristics of Nyquist DACs ................. 1
   1.2 Characteristics of $\Delta\Sigma$ DACs ............... 5
   1.3 $\Delta\Sigma$ DAC Based Transmitters ............... 6
   1.4 Organization and Scope of Dissertation .......... 8

2 TIDSM DAC Design Considerations .................. 11
   2.1 Conventional DSMs ................................ 11
      2.1.1 High-speed Conventional DSMs: Previous Works .... 13
      2.1.2 Speed Limitation of a First Order EFB DSM .......... 18
   2.2 Time-Interleaved DSM and Previous Works ....... 20
   2.3 Choice of Multiplexing Strategy .................. 25
   2.4 DAC Current Cell Design .......................... 31
      2.4.1 Switch Driver Design .......................... 34
   2.5 Aspects of Clock Distribution ..................... 35
   2.6 DAC Testing Challenges ........................... 36
   2.7 Summary ......................................... 38

3 An 8-GS/s 200-MHz BW Interleaved $\Delta\Sigma$ DAC in 65-nm CMOS ....... 39
   3.1 Introduction ..................................... 39
   3.2 DSM Design ....................................... 41
      3.2.1 Analysis of Critical Path ..................... 41
      3.2.2 Comparison With Alternative Logic Styles .......... 44
      3.2.3 Clock Distribution ........................... 47
      3.2.4 Final Multiplexer and Current Cell Design ........ 48
   3.3 Measurement Results ............................... 50
   3.4 Summary ......................................... 54

4 Effect of Clock Duty Cycle Error on Two-channel Interleaved $\Delta\Sigma$ DACs ...... 55
   4.1 Introduction ..................................... 55
   4.2 Mathematical Formulation of the SNDR Loss .......... 57
List of Figures

1-1 DAC in a conventional transmitter. .................................. 2
1-2 Filtering of the nearest DAC image. ................................. 2
1-3 An oversampled DAC reduces the filter order. .................... 3
1-4 A k-bit current steering DAC. ................................. 3
1-5 Effect of truncating the DAC input word length. ............. 5
1-6 Noise shaping of the quantization noise by a DSM. ............ 6
1-7 The $\Delta\Sigma$ DAC architecture. ............................... 6
1-8 DAC in a “digital” transmitter. ................................. 7
1-9 All-digital transmitter used in [25] (4 GS/s DAC and 1 GHz carrier). 7
1-10 Config. used in [8] for a 3.6 GS/s $\Delta\Sigma$ DAC for 2.4/5 GHz carriers. 7
1-11 Config. used in [27] (5.4 GS/s DAC/2.7 GHz carrier) and [26] (2.6 GS/s DAC/5.4 GHz carrier). ............................. 8
1-12 Config. used in [26] (2.6 GS/s DAC/650 MHz IF) and [32] (250 MS/s DAC/62 MHz IF). ................................. 8
1-13 The aim of this dissertation is to extend the BW of $\Delta\Sigma$ DACs. The source data in this plot is derived from [40]. ....................... 9

2-1 A first-order EFB DSM. ................................................ 11
2-2 An $n^{th}$-order EFB DSM. ............................................. 12
2-3 A 20$n$ dB/decade noise shaping for a DSM with $NTF(z) = (1-z^{-1})^n$. 13
2-4 A second-order EFB DSM. ............................................. 14
2-5 A third-order EFB DSM with a four-adder critical path. .......... 14
2-6 A second-order CIFB DSM topology used in [8]. ................ 15
2-7 A third-order 2 GS/s DSM proposed in [25]. ..................... 15
2-8 Three phase clocking with dynamic full adders used in [25]. 16
2-9 A MASH architecture with a cascade of individual modulators. 16
2-10 A MASH 1-1 DSM. ................................................... 17
2-11 A pipelined MASH 1-1 DSM with only a one-adder critical path. 17
2-12 A 1-bit pipeline of the first-order EFB using a 1-bit carry select adder. 18
2-13 Transfer function $H(z)$ at a sample rate $f_s$. ................. 20
2-14 M-channel time-interleaved implementation of $H(z)$. ........ 20
<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2-15</td>
<td>TI using an M×M block filtering approach.</td>
<td>21</td>
</tr>
<tr>
<td>2-16</td>
<td>A 2-channel TI-EBF implementation for a transfer function $NTF(z) = (1 - z^{-1})^2$.</td>
<td>22</td>
</tr>
<tr>
<td>2-17</td>
<td>Two-channel TI implementation of a delay element/FF</td>
<td>23</td>
</tr>
<tr>
<td>2-18</td>
<td>Two-channel TI implementation of a second-order CIFB DSM.</td>
<td>23</td>
</tr>
<tr>
<td>2-19</td>
<td>A first-order EFB with a delayed integrator.</td>
<td>24</td>
</tr>
<tr>
<td>2-20</td>
<td>TI implementation of a first-order EFB by decomposing the integrator transfer function.</td>
<td>24</td>
</tr>
<tr>
<td>2-21</td>
<td>TI decomposition of a first-order EFB by decomposing the FF only.</td>
<td>25</td>
</tr>
<tr>
<td>2-22</td>
<td>Timing diagram for the multiplexer/serializer.</td>
<td>25</td>
</tr>
<tr>
<td>2-23</td>
<td>The classical MUX scheme used in [34].</td>
<td>26</td>
</tr>
<tr>
<td>2-24</td>
<td>CML Phase Rotator based calibration scheme for MUX used in [40].</td>
<td>27</td>
</tr>
<tr>
<td>2-25</td>
<td>Phase/Delay Calibration based DAC used in [48].</td>
<td>28</td>
</tr>
<tr>
<td>2-26</td>
<td>Two-channel MUX with a single $f_s/2$ clock.</td>
<td>28</td>
</tr>
<tr>
<td>2-27</td>
<td>Timing Diagram for a two-channel MUX with an $f_s/2$ clock.</td>
<td>29</td>
</tr>
<tr>
<td>2-28</td>
<td>Two-channel Analog MUX based on IIR pre-filtering used in [51].</td>
<td>29</td>
</tr>
<tr>
<td>2-29</td>
<td>Frequency response of the IIR filter with the transfer function $G(z) = 1/(1 + z^{-1})$.</td>
<td>30</td>
</tr>
<tr>
<td>2-30</td>
<td>DACs with FIR response.</td>
<td>30</td>
</tr>
<tr>
<td>2-31</td>
<td>Commonly used DAC current cell types.</td>
<td>32</td>
</tr>
<tr>
<td>2-32</td>
<td>Dual current cells with embedded multiplexing [48].</td>
<td>33</td>
</tr>
<tr>
<td>2-33</td>
<td>Effect of switch crossover point on the common node potential [50].</td>
<td>34</td>
</tr>
<tr>
<td>2-34</td>
<td>A fast high-crossing switch driver [16] [21].</td>
<td>34</td>
</tr>
<tr>
<td>2-35</td>
<td>Floorplan of a TIDSM MASH 1-1 with clock distribution.</td>
<td>36</td>
</tr>
<tr>
<td>2-36</td>
<td>Commonly used DAC testing memory types.</td>
<td>37</td>
</tr>
<tr>
<td>3-1</td>
<td>Example of a Nyquist DAC in a traditional transmitter (top) and a ΔΣ DAC in a digital baseband transmitter (bottom). Only the I path is shown.</td>
<td>40</td>
</tr>
<tr>
<td>3-2</td>
<td>Proposed two-channel interleaved second-order MASH ΔΣ DAC with a $2F_s$ sampling rate. Critical path is enclosed by a dashed rectangle.</td>
<td>41</td>
</tr>
<tr>
<td>3-3</td>
<td>N-bit deep Integrator Pipeline. Critical path is from flop FFS$_0$ to flop FFS$_{N-1}$.</td>
<td>42</td>
</tr>
<tr>
<td>3-4</td>
<td>A 2-bit pipeline with optimization. Single dashed box shows complementary inputs. Double box shows reset moved to the flop.</td>
<td>45</td>
</tr>
<tr>
<td>3-5</td>
<td>Ratioed and Dynamic Logic Implementation of the integrator.</td>
<td>46</td>
</tr>
<tr>
<td>3-6</td>
<td>The first order EFB TIDSM instantiated twice to obtain the MASH 1-1.</td>
<td>48</td>
</tr>
<tr>
<td>3-7</td>
<td>Clock Distribution and Layout.</td>
<td>48</td>
</tr>
<tr>
<td>3-8</td>
<td>Final MUX and DAC current cell with the timing diagram. The switch driver circuit is the same as the local clock driver of Fig. 3-7(a).</td>
<td>49</td>
</tr>
<tr>
<td>3-9</td>
<td>Simulated output impedance ($Z_0$) profile of the current cell.</td>
<td>50</td>
</tr>
<tr>
<td>3-10</td>
<td>Chip photograph of the implemented ΔΣ DAC.</td>
<td>51</td>
</tr>
</tbody>
</table>
3-11 Measurement setup for the $\Delta\Sigma$ DAC with the expected spectrum at output of every block. An up-sampling filter is not used to simplify testing. Up-sampling images in the output are out of the band of interest.  

3-12 Measured single-ended spectrum showing 8 GS/s operation with $F_s=4$ GHz, $F_{bb}=800$ MHz and input frequency $F_{in}=200$ MHz. The noise shaping and the nine out-of-band images can be seen.  

3-13 Measured $-57$ dBc IMD3 with two $-6$ dBFS tones near 200 MHz spaced 2 MHz apart.  

3-14 Output spectrum with 42 dB SNDR obtained from post-layout simulation for an 8 GS/s operation.  

4-1 Block diagram of a generic two-channel interleaved $\Delta\Sigma$ DAC implementing a noise transfer function $1-H(z)$.  

4-2 Effect of 1% DCE on SNDR for a 4-bit DAC with $f_s=10$ GHz, OSR=16 (BW=312.5 MHz) and NTF of $(1-z^{-1})^3$.  

4-3 Half-rate sampling clock of frequency $f_s/2$ and DCE $=d_e$%.  

4-4 Folding effect of DCE on time-interleaved Nyquist and DSM DACs.  

4-5 Simulation versus Estimation of SNDR loss for a 10 GS/s TIDSM DAC for (a) second-order ($n=2$) and (b) third-order ($n=3$) modulators.  

4-6 A second-order modulator shows a better SNDR than a third-order one for OSR=16 and $d_e > 0.12\%$, as predicted by Eq. (4.15).  

4-7 Two multi-channel MUX styles.  

4-8 Interleaved $\Delta\Sigma$ DAC with a FIR filter to reduce the effect of the DCE.  

4-9 Frequency response of a 10 GS/s TIDSM DAC noise-shaping with $NTF(z)=(1-z^{-1})^3$ in the presence of the FIR filter.  

4-10 Simulation versus estimation of SNDR loss of a 10 GS/s TIDSM DAC for $n=3$ as a function of filter order, $m$ and OSR from Eq. (4.21).  

4-11 Hold interleaving to introduce a zero at $f = f_s/2$, i.e. implementing a filter $1+z^{-1}$.  

4-12 Analog post-correction of timing skew with an auxiliary DAC.  

4-13 Digital pre-correction of timing skew in the modulator.  

4-14 Simulated variation of SNDR as a function of timing skew error for different OSR and modulator orders. Note that duty cycle error, $d_e = 1/2 \times \Delta t/T_s$.  

4-15 Number of auxiliary DACs required for analog post-correction with a timing error of $\Delta t/T_s$.  

4-16 Simulated DAC mismatch (%) as a function of the timing error for achieving the SNDR of Fig. 4-14.  

5-1 Comparison of different DAC based architectures for 60-GHz radio baseband.
5-2 Filtering with a second-order LPF for a second-order ∆Σ 4-bit DAC at 10.56 GS/s with 16-QAM encoded random data. 79
5-3 A conventional first-order EFB DSM. 80
5-4 A 2-channel TI EFB DSM. 81
5-5 TIDSM versus the LA-TIDSM approach to improve the speed. 81
5-6 Proposed two-channel LA-TIDSM EFB DSM with only one adder critical path. 83
5-7 A three channel LA-TIDSM implementation. 86
5-8 Design space comparison of a conventional and LA TIDSM. 89
5-9 Delay improvement over conventional TIDSM as a function of pipeline depth and number of channels. 89
5-10 An alternative two-channel TIDSM architecture of [40], also having a one-adder critical path but requiring eight adders in total. 90
5-11 A 2-bit pipeline slice of a first-order EFB LA-TIDSM. Grey colour represents the LA part. Thin lines are used for CH0 path and thick ones for CH1 path. 91
5-12 Final 2:1 Multiplexer with high-crossing switch driver. 92
5-13 DAC current cell interfaced with a centre-tapped 2:1 transformer. 93
5-14 Simulated output impedance (Z₀) profile of the current cell. 94
5-15 Chip Photograph. 94
5-16 Memory Architecture for full speed LA-TIDSM DAC testing. 94
5-17 DAC and switch driver floorplan. 96
5-18 Measured wideband spectrum with a 1.1 GHz input tone at 11 GS/s. 97
5-19 Measured 39 dB SNDR with a 1.1 GHz tone at 11 GS/s with no dithering. 97
5-20 Measured IM3 of -49 dBc with two tones at 945 MHz and 1.1 GHz respectively. 98
5-21 Measured 53 dB HD2 and 56 dB HD3 with a 428 MHz input sine tone. 98
5-22 Measured interleaving spur of -36.9 dBc at 2.67 GHz with a 2.83 GHz tone to estimate the DCE. 99
5-23 Measured SFDR (in 0-1.1 GHz band), SNDR (0-inp. freq.) and IM3 (centre freq.) versus frequency at 11 GS/s. 100
5-24 Measured Spectral Mask with 16-QAM encoded random data at 10.56 GS/s. 100
5-25 Effect of DCE on a 2-b modulator at 11 GS/s and input frequency of 601 MHz (OSR=9.15). 103
6-1 A hybrid DAC. 107
6-2 Digital IF with TIDSM DACs. 108
6-3 Digital IF of 2.5 GHz using a 10 GS/s TIDSM DAC. DCE causes a nearby image spur. 108
## List of Tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2-1</td>
<td>Post-layout simulated delay of a 1-bit pipeline at 1 V, 75°C, typical corner using TGFFs and static CMOS logic.</td>
<td>19</td>
</tr>
<tr>
<td>3-1</td>
<td>Maximum achieved effective speed as a function of the pipeline depth in a 2-channel interleaved modulator.</td>
<td>44</td>
</tr>
<tr>
<td>3-2</td>
<td>Delay Comparison with alternative logic style for 2-bit pipelines.</td>
<td>47</td>
</tr>
<tr>
<td>3-3</td>
<td>Comparison with ∆Σ DACs having &gt;2.5-GS/s sampling rate.</td>
<td>54</td>
</tr>
<tr>
<td>5-1</td>
<td>Different modulator options for the 880 MHz bandwidth.</td>
<td>79</td>
</tr>
<tr>
<td>5-2</td>
<td>Truth Table to compute the correct value of carry, $C_1$ from $CF_0,CL_0$ and $CL_1$.</td>
<td>84</td>
</tr>
<tr>
<td>5-3</td>
<td>Carry computation truth table for $C_2$ in a three channel LA-TIDSM.</td>
<td>87</td>
</tr>
<tr>
<td>5-4</td>
<td>Post-layout simulated delay of the integrator (Fig. 5-11) at 1 V, 75°C, typical corner.</td>
<td>92</td>
</tr>
<tr>
<td>5-5</td>
<td>Power and Area Breakdown of the DAC by function.</td>
<td>99</td>
</tr>
<tr>
<td>5-6</td>
<td>Comparison with complete ∆Σ DACs having &gt;2.5-GS/s sampling rate.</td>
<td>101</td>
</tr>
<tr>
<td>5-7</td>
<td>Comparison with other Digital ∆Σ Modulators with &gt; 5 GHz speed.</td>
<td>102</td>
</tr>
<tr>
<td>5-8</td>
<td>Comparison of this work with wideband Nyquist DACs.</td>
<td>102</td>
</tr>
</tbody>
</table>
# List of Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADC</td>
<td>Analog-to-Digital Converter</td>
</tr>
<tr>
<td>BPF</td>
<td>Band-Pass Filter</td>
</tr>
<tr>
<td>BPSK</td>
<td>Binary Phase Shift Keying</td>
</tr>
<tr>
<td>BW</td>
<td>Bandwidth</td>
</tr>
<tr>
<td>CIFB</td>
<td>Cascade of Integrators with distributed FeedBack</td>
</tr>
<tr>
<td>CML</td>
<td>Current Mode Logic</td>
</tr>
<tr>
<td>CMOS</td>
<td>Complementary Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>CS-DAC</td>
<td>Current Steering DAC</td>
</tr>
<tr>
<td>DAC</td>
<td>Digital-to-Analog Converter</td>
</tr>
<tr>
<td>DCE</td>
<td>Duty Cycle Error</td>
</tr>
<tr>
<td>DFF</td>
<td>D-Flip flop</td>
</tr>
<tr>
<td>DFT</td>
<td>Design for Testability</td>
</tr>
<tr>
<td>DLL</td>
<td>Delay Locked Loop</td>
</tr>
<tr>
<td>DNL</td>
<td>Differential Non-linearity</td>
</tr>
<tr>
<td>DSM/DDSM</td>
<td>Digital ΔΣ Modulator</td>
</tr>
<tr>
<td>EFB</td>
<td>Error Feedback</td>
</tr>
<tr>
<td>ENOB</td>
<td>Effective Number of Bits</td>
</tr>
<tr>
<td>FA</td>
<td>Full Adder</td>
</tr>
<tr>
<td>FFT</td>
<td>Fast Fourier Transform</td>
</tr>
<tr>
<td>FF</td>
<td>Flip-flop</td>
</tr>
<tr>
<td>FIR</td>
<td>Finite Impulse Response</td>
</tr>
<tr>
<td>FOM</td>
<td>Figure of Merit</td>
</tr>
<tr>
<td>HD<sub>n</sub></td>
<td>Harmonic Distortion of n<sup>th</sup> order</td>
</tr>
<tr>
<td>IF</td>
<td>Intermediate Frequency</td>
</tr>
<tr>
<td>IIR</td>
<td>Infinite Impulse Response</td>
</tr>
<tr>
<td>IM3/IMD3</td>
<td>Third-order Intermodulation Distortion</td>
</tr>
<tr>
<td>INL</td>
<td>Integral Non-linearity</td>
</tr>
<tr>
<td>JLCC</td>
<td>J-Leaded Chip Carrier</td>
</tr>
<tr>
<td>LA-TIDSM</td>
<td>Look-ahead Time-interleaved ΔΣ Modulator</td>
</tr>
<tr>
<td>LPF</td>
<td>Low-Pass Filter</td>
</tr>
<tr>
<td>LSB</td>
<td>Least Significant Bit</td>
</tr>
<tr>
<td>MASH</td>
<td>Multi stAge noise SHaping</td>
</tr>
<tr>
<td>MSB</td>
<td>Most Significant Bit</td>
</tr>
<tr>
<td>MUX</td>
<td>Multiplexer</td>
</tr>
<tr>
<td>NTF</td>
<td>Noise Transfer Function</td>
</tr>
<tr>
<td>OFDM</td>
<td>Orthogonal Frequency Division Multiplexing</td>
</tr>
<tr>
<td>OSR</td>
<td>Oversampling Ratio</td>
</tr>
<tr>
<td>PA</td>
<td>Power Amplifier</td>
</tr>
<tr>
<td>PRBS</td>
<td>Pseudo Random Bit Sequence</td>
</tr>
<tr>
<td>QAM</td>
<td>Quadrature Amplitude Modulation</td>
</tr>
<tr>
<td>QPSK</td>
<td>Quadrature Phase Shift Keying</td>
</tr>
<tr>
<td>SC</td>
<td>Single Carrier</td>
</tr>
<tr>
<td>SFDR</td>
<td>Spurious-Free Dynamic Range</td>
</tr>
<tr>
<td>SNDR</td>
<td>Signal-to-Noise-and-Distortion Ratio</td>
</tr>
<tr>
<td>SNR</td>
<td>Signal-to-Noise Ratio</td>
</tr>
<tr>
<td>SoC</td>
<td>System-on-Chip</td>
</tr>
<tr>
<td>SQNR</td>
<td>Signal-to-Quantization-Noise Ratio</td>
</tr>
<tr>
<td>STF</td>
<td>Signal Transfer Function</td>
</tr>
<tr>
<td>TG</td>
<td>Transmission Gate</td>
</tr>
<tr>
<td>TIDSM</td>
<td>Time-Interleaved ΔΣ Modulator</td>
</tr>
<tr>
<td>TSPCR</td>
<td>True Single Phase Clocked Register</td>
</tr>
<tr>
<td>Tx</td>
<td>Transmitter</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction

Digital-to-Analog Converters (DACs) are one of the fundamental building blocks of all wireless transmitters (Tx) and form the interface between the analog and the digital world. The increasing demand for high bandwidth and high data rates has led to a multitude of radio standards, with channel bandwidths ranging from a few megahertz to many gigahertz. Standards like WiMAX (IEEE 802.16) [1] and Wi-Fi (IEEE 802.11x standards) [2] have channel bandwidths of up to about 40 MHz. Recent standards for short-range communication, e.g. Ultra Wide Band (UWB) [3] and 60-GHz radio [4–6], have channel bandwidths of 528 MHz and 1.76 GHz, respectively. Consequently, the DAC must also support these increasing bandwidths. If the total channel bandwidth of the radio standard is C, then the DAC is required to support a bandwidth BW greater than C/2.
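The following Python sketch is a quick check of the BW > C/2 rule for the channel bandwidths quoted above. It only applies the stated rule. The dictionary and function names are illustrative and are not part of any design flow used in this work.

```python
# Illustrative only: total channel bandwidths C quoted in the text above.
CHANNEL_BW_HZ = {
    "Wi-Fi/WiMAX (up to ~40 MHz)": 40e6,
    "UWB": 528e6,
    "60-GHz radio": 1.76e9,
}


def min_dac_bw_hz(channel_bw_hz: float) -> float:
    """Lower bound on the DAC bandwidth, BW > C/2, for a channel of total width C."""
    return channel_bw_hz / 2.0


if __name__ == "__main__":
    for name, c in CHANNEL_BW_HZ.items():
        print(f"{name:28s} C = {c / 1e6:7.1f} MHz -> BW > {min_dac_bw_hz(c) / 1e6:6.1f} MHz")
```

For the 60-GHz radio channel, the bound is 880 MHz. This is the bandwidth for which the modulator options of Table 5-1 are listed.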

1.1 Characteristics of Nyquist DACs

Traditionally, Nyquist DACs have been used in transmitters. Figure 1-1 shows the location of the DAC in a conventional Tx chain, operating at a sample rate $f_s$. The DAC can support an input BW of up to $f_s/2$, and the analog reconstruction/anti-aliasing low-pass filter (LPF) filters out the DAC images. As the input BW approaches $f_s/2$, the first DAC image also moves closer to the input signal. Attenuating this nearest image then requires a higher anti-aliasing filter order with a sharper cut-off. Figure 1-2 shows that as the input tone $f_{in}$ moves closer to $f_s/2$, the nearest DAC image located at $f_s - f_{in}$ also moves closer to $f_s/2$, which makes image filtering more difficult. The main drawback of passive on-chip implementations of this LPF is the large area of the required inductors or capacitors and their low quality factor [7] [8]. Recently, a few wideband active filters that occupy less area have also been proposed for receivers [9] [10]. However, they are challenging to design and can limit the transmitter linearity compared to passive filters.
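The image-filtering problem can be made concrete with a small sketch. It assumes an ideal DAC and a single input tone in the first Nyquist zone. It computes the nearest image at $f_s - f_{in}$ and the gap between tone and image, which is all the LPF has for its transition band. The 2 GS/s rate and the tone frequencies are hypothetical.

```python
def nearest_image_hz(f_in_hz: float, fs_hz: float) -> float:
    """Nearest DAC image of a tone f_in in the first Nyquist zone: fs - f_in."""
    if not 0.0 < f_in_hz < fs_hz / 2.0:
        raise ValueError("f_in must lie inside the first Nyquist zone (0, fs/2)")
    return fs_hz - f_in_hz


def transition_band_hz(f_in_hz: float, fs_hz: float) -> float:
    """Gap between the wanted tone and its nearest image, i.e. the band in which
    the reconstruction LPF must go from passband to stopband."""
    return nearest_image_hz(f_in_hz, fs_hz) - f_in_hz


if __name__ == "__main__":
    FS = 2e9  # hypothetical 2 GS/s Nyquist DAC
    for f_in in (200e6, 600e6, 900e6):
        print(f"f_in = {f_in / 1e6:5.0f} MHz  image = {nearest_image_hz(f_in, FS) / 1e6:6.0f} MHz  "
              f"gap = {transition_band_hz(f_in, FS) / 1e6:6.0f} MHz")
```

As $f_{in}$ moves from 200 MHz to 900 MHz, the gap shrinks from 1.6 GHz to 200 MHz. This shrinking gap is what forces the steeper filter.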

The filter order can be relaxed by oversampling the input signal, i.e. operating the DAC at a higher sampling frequency $f_s'$, so that the nearest DAC image moves farther away (see Fig. 1-3). In addition, the natural sinc response $\sin(\pi f/f_s')/(\pi f/f_s')$ resulting from the zero-order hold function of the DAC provides some additional attenuation of the images. The ratio between the new sampling frequency $f_s'$ and the original Nyquist sampling frequency $f_s$ is called the oversampling ratio (OSR). The filter order thus shows a trade-off with the DAC sampling frequency: a relaxed filter requires the DAC to operate at a much higher clock frequency than the Nyquist rate. Oversampling also improves the in-band SNR. If $f_s/2$ is the BW of the input signal, then the SNR of a k-bit DAC with a full-scale sinusoidal input is given by

$$SNR = 6.02k + 1.76 + 10\log_{10}(\text{OSR}) \tag{1.1}$$

where $\text{OSR}=f_s'/f_s$. Equation (1.1) shows that doubling the sampling frequency improves the SNR by $\sim 3$ dB [11].
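Eq. (1.1) and the zero-order-hold sinc response can be evaluated directly. The Python sketch below assumes an ideal k-bit DAC and a full-scale sine. It also assumes a hypothetical 500 MS/s Nyquist rate with a 200 MHz tone, and prints the ideal SNR and the sinc attenuation at the nearest image for OSR = 1, 2 and 4.

```python
import math


def ideal_snr_db(k_bits: int, osr: float = 1.0) -> float:
    """Eq. (1.1): ideal SNR of a k-bit DAC for a full-scale sine wave,
    with the noise integrated over the signal band only (OSR = fs'/fs)."""
    return 6.02 * k_bits + 1.76 + 10.0 * math.log10(osr)


def zoh_gain_db(f_hz: float, fs_hz: float) -> float:
    """Zero-order-hold response |sin(pi f/fs)/(pi f/fs)| in dB (requires f_hz > 0)."""
    x = math.pi * f_hz / fs_hz
    return 20.0 * math.log10(abs(math.sin(x) / x))


if __name__ == "__main__":
    f_in = 200e6  # hypothetical input tone
    for osr in (1, 2, 4):
        fs = osr * 500e6  # hypothetical 500 MS/s Nyquist rate, oversampled by osr
        image = fs - f_in  # nearest image of the tone
        print(f"OSR={osr}: SNR(8 bit) = {ideal_snr_db(8, osr):5.2f} dB, "
              f"sinc at image {image / 1e6:.0f} MHz = {zoh_gain_db(image, fs):6.2f} dB")
```

Each doubling of the OSR adds about 3 dB of SNR. At the same time, the image moves towards $f_s'$, where the sinc response has a null, so the hold response attenuates it more.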

![Figure 1-3: An oversampled DAC reduces the filter order.](image)

The current steering DAC architecture is the most popular choice for achieving a high operating frequency. A simplified view of a k-bit unary-decoded current steering DAC is shown in Fig. 1-4. The DAC consists of $N = 2^k - 1$ unit current cells that drive a load $R_L$. The digital decoder controls the switches that steer the unit current $I_{unit}$ towards one of the two differential outputs. As the number of DAC bits increases, the number of unit cells required increases exponentially. Due to process variations, the current cells do not all produce the same current. The static linearity of the DAC is limited by the output resistance of the DAC cell and the matching between the unit currents of the cells [12]. Consider a target yield \( Y \), i.e. the fraction of DACs having an INL error of less than \( \frac{1}{2} \) LSB. To achieve it, the standard deviation of the error in the current of each unit cell, \( \Delta I_{\text{unit}} \), must satisfy

\[
\sigma \left( \frac{\Delta I_{\text{unit}}}{I_{\text{unit}}} \right) \leq \frac{1}{2C\sqrt{2^k}} \quad \text{with} \quad C = \text{inv}_\text{norm}(0.5 + \frac{Y}{2}) \tag{1.2}
\]

where \( \text{inv}_\text{norm} \) is the inverse cumulative distribution function of the standard normal distribution [12]. The relationship between the INL and the output resistance \( R_0 \) of the cell [12] is given by

\[
\text{INL} = \frac{I_{\text{unit}}R_L^2N^2}{4R_0} \tag{1.3}
\]

where \( I_{\text{unit}} \) is the current per cell and \( N (= 2^k - 1) \) is the total number of unit cells. As the DAC resolution (i.e. \( k \)) increases, the requirements on the unit current matching and the output resistance become stricter for the same INL. The relationship between the current cell mismatch and the area of the current cell is given by the well-known Pelgrom model [13]

\[
\sigma^2 \left( \frac{\Delta I_{\text{unit}}}{I_{\text{unit}}} \right) = \frac{1}{WL} \left( A^2_\beta + \frac{4A^2_{VT}}{(V_{gs} - V_T)^2} \right) \tag{1.4}
\]

where \( A_\beta \) and \( A_{VT} \) are technology-dependent matching parameters, \( WL \) is the area of the current source transistor and \( (V_{gs} - V_T) \) is its overdrive voltage. Thus, a stricter matching requirement results in a larger current source area. The overdrive voltage of the current source can be increased to improve the matching, but this is ultimately limited by the available headroom. The output resistance can be increased by techniques such as cascoding. However, the supply voltage, the required output swing and hence the available headroom limit the achievable output impedance as the number of DAC bits increases.
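The sketch below illustrates how Eqs. (1.2) and (1.4) scale with resolution. It computes the largest allowed unit-cell mismatch for a given yield and the resulting unit-cell area relative to a 3-bit DAC. It makes the following assumptions:

- The overdrive voltage is fixed, so $WL \propto 1/\sigma^2$ and the constants $A_\beta$ and $A_{VT}$ cancel in the ratio.
- Each DAC meets 0.5 LSB INL at its own resolution.
- The 99.7 % yield target is hypothetical.

Python's standard `statistics.NormalDist.inv_cdf` plays the role of $\text{inv}_\text{norm}$.

```python
import math
from statistics import NormalDist


def max_unit_sigma(k_bits: int, yield_frac: float) -> float:
    """Eq. (1.2): upper bound on sigma(dI/I) of one unit cell so that a fraction
    `yield_frac` of k-bit unary DACs has |INL| < 0.5 LSB."""
    c = NormalDist().inv_cdf(0.5 + yield_frac / 2.0)  # C = inv_norm(0.5 + Y/2)
    return 1.0 / (2.0 * c * math.sqrt(2 ** k_bits))


def unit_area_ratio(k_bits: int, k_ref: int, yield_frac: float) -> float:
    """Eq. (1.4) at a fixed overdrive: WL is proportional to 1/sigma^2, so the
    matching constants A_beta and A_VT cancel in the ratio of two unit-cell areas."""
    return (max_unit_sigma(k_ref, yield_frac) / max_unit_sigma(k_bits, yield_frac)) ** 2


if __name__ == "__main__":
    Y = 0.997  # hypothetical yield target
    for k in (3, 4, 10):
        n_cells = 2 ** k - 1
        area_unit = unit_area_ratio(k, 3, Y)
        print(f"k={k:2d}  N={n_cells:4d}  sigma <= {100 * max_unit_sigma(k, Y):.3f} %  "
              f"unit area x{area_unit:.0f}  total area x{area_unit * n_cells / 7:.0f}")
```

Under these assumptions, the unit-cell area grows as $2^k$. A 10-bit unary DAC therefore needs unit cells 128 times larger than a 3-bit DAC, on top of having 1023 instead of 7 cells.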

For wireless applications, dynamic performance metrics such as the SFDR, harmonic distortion (HD) and intermodulation distortion (IM) must also be taken into account, in addition to static metrics such as INL and DNL. The dynamic performance of the DAC is limited by five main factors. Firstly, the output impedance of the current cell is a function of its capacitance, which is non-linear and dominates at high frequencies [14–17]. The harmonic distortion of the DAC is given by

\[
HD_n = \left( \frac{NZ_L}{Z_{\text{cell}}} \right)^{n-1} \tag{1.5}
\]

where \( Z_L \) and \( Z_{\text{cell}} \) are the load and cell output impedances, respectively (both dependent on the input frequency). Secondly, the DAC linearity is limited by switch timing errors: variations in the switching instants of the different cells result in distortion that severely limits the DAC linearity (see Fig. 1-4) [18]. Thirdly, the voltage at the common node of the current cell must be kept as stable as possible when the current switches from one side to the other [15], which requires fast transitions on the signals controlling the switches [16]. Fourthly, sampling clock jitter causes a cycle-to-cycle variation of the sampling instant, which introduces distortion [19] [20]. Lastly, an IR drop along the DAC supply lines causes a variation in the gate-to-source voltage of the current sources, leading to DNL errors [16]. Similarly, an IR drop in the switch driver supply results in timing errors in the DAC switches [21]. In summary, as the number of cells \( N \) increases, it becomes increasingly difficult to maintain the same DAC performance: a larger \( Z_{\text{cell}} \) is required, which may be difficult to achieve; the timing errors between the larger number of switch drivers are expected to increase; and the clock jitter is also likely to increase due to the heavier load on the clock distribution.
1.2 Characteristics of \( \Delta \Sigma \) DACs

Referring again to Fig. 1-3, operating the DAC at a higher (oversampled) frequency relaxes the reconstruction filter order. If, in addition to oversampling, the total number of DAC unit cells (\( N \)) could be reduced, the cell requirements in Equations (1.2), (1.3), (1.4) and (1.5) would be relaxed. Fewer cells also require fewer switch drivers and occupy a smaller area. The clock distribution to the DAC would then be smaller, leading to lower clock jitter and smaller timing errors between the switch drivers. The IR drop on the supply lines would also be lower, owing to the smaller number of current sources and switch drivers and the smaller area they occupy.

To achieve this reduction in the number of current cells, a few of the LSBs could be eliminated, i.e. the k-bit digital DAC input could simply be truncated to m bits. However, this raises the quantization noise floor of the signal and thereby reduces the SNR. Since the DAC requires a minimum SNR to achieve a given bit error rate (BER) for a given application, simple truncation is not appropriate, as shown in Fig. 1-5. A \( \Delta \Sigma \) modulator (DSM) [22] can digitally shape the increased quantization noise caused by the truncation (noise shaping), pushing it out of the signal band and improving the SNR within a BW of \( f_{in} \), as shown in Fig. 1-6. The SNR achieved depends on the order of the noise shaping, the OSR used and the amount of bit reduction desired in the DAC. A DAC that uses oversampling and a reduced word length while still achieving the desired SNR through noise shaping is referred to as a \( \Delta \Sigma \) DAC (shown in Fig. 1-7).

The \( \Delta \Sigma \) DAC has three degrees of freedom for achieving the desired SNR (also referred to as the SQNR). The desired SQNR could be achieved with only a single-bit DAC (which is inherently linear), but this would require a high OSR. Alternatively, using a few more bits in the DAC or increasing the order of the noise-shaping filter can relax the OSR. However, increasing the number of bits beyond a certain limit may not be beneficial due to the increasing DAC complexity. Similarly, the maximum noise-shaping order may be limited by the spectral mask of the standard, and the achievable OSR is ultimately limited by the technology. The choice of the \( \Delta \Sigma \) modulator architecture should therefore take all these factors into account.
1.3 $\Delta \Sigma$ DAC Based Transmitters

If the SQNR specification can be met, a $\Delta \Sigma$ DAC digitally relaxes the requirements on both the analog DAC unit cells and the order of the analog reconstruction filter. This makes the $\Delta \Sigma$ DAC a possible alternative to Nyquist DACs in transmitters. The $\Delta \Sigma$ DAC has traditionally been used in low-bandwidth, high-resolution applications, e.g. audio DACs [23]. Over the last decade, however, there has been increasing interest in $\Delta \Sigma$ DACs for moderate-resolution, higher-bandwidth wireless transmitters [8, 24–27].

The use of $\Delta \Sigma$ DACs in these transmitters has been driven by the concept of software-defined digital radios (SDR/DR). The aim of a software-defined digital radio is to provide easy reconfigurability of the hardware for multi-standard support. In a digital radio Tx, the bulk of the signal processing is performed in the digital domain to relax the analog processing, and the DAC is placed as close to the antenna as possible [28]. An “ideal” DR Tx is shown in Fig. 1-8, wherein even the mixing with the carrier and the power/gain control are performed digitally. The DAC must then operate at a frequency of at least twice the carrier while still providing a high dynamic range, which is challenging at gigahertz sampling rates [29]. A $\Delta \Sigma$ DAC in this situation can exploit the oversampling already present in the architecture to relax the DAC requirements.

An example of such an all-digital Tx, shown in Fig. 1-9, is [25], which uses a 4 GS/s $\Delta \Sigma$ modulator with a 1-bit DAC for a 1 GHz carrier frequency and a 50 MHz BW. However, such a truly digital Tx is challenging to design for higher carrier frequencies, e.g. WiMAX, WiFi (2.4 GHz band), the U-NII band (5 GHz band) and UWB (3-10 GHz), due to the very high DAC sampling rate required. Instead, nearly “digital” $\Delta\Sigma$ solutions have been proposed wherein the baseband is up-sampled and digitally processed at a higher frequency [8, 26, 27, 30, 31] while the mixing is performed in the analog domain, as shown in Figs. 1-10 and 1-11. Alternatively, a digital IF is used while the final mixing is done in the analog domain, as shown in Fig. 1-12 [26, 32]. All these configurations use $\Delta\Sigma$ DACs to relax the DAC design. At the start of this dissertation work, the largest bandwidth reported in the literature for a low-pass $\Delta\Sigma$ DAC was 100 MHz [26], while the highest reported sampling rate in CMOS technology was 5.4 GS/s [27].
1.4 Organization and Scope of Dissertation

In order to use these “digital” architectures for more wideband applications, e.g. UWB or 60-GHz radio standards, a higher sampling rate and BW are required from the ΔΣ DACs. The speed achievable by a conventional DSM becomes a bottleneck when aiming for a high BW because of the high sampling rate required. Hence, this dissertation focuses on time-interleaved ΔΣ (TIDSM) DACs, which can overcome the limitations of conventional implementations. TIDSM DACs have received attention only recently compared to TIDSM ADCs, and very few TIDSM DACs had been reported at the start of this dissertation work [33, 34]. This dissertation therefore aims to improve the performance of TIDSM DACs through architectural and circuit-level techniques (Papers I & III) [35, 36].

The performance limitations of TIDSM DACs are also investigated (Papers II, IV-V) [37–39]. Very recently, another TIDSM-based hybrid DAC has also been reported [40], which indicates a growing interest in this topic. Figure 1-13 shows the achievable linearity of ΔΣ DACs and Nyquist DACs for various bandwidths [40]. This dissertation aims to increase the overlap between ΔΣ DACs and Nyquist DACs by using TIDSMs. The rest of the dissertation is organized as follows:

- **Chapter 2** discusses the need for time-interleaved ΔΣ DACs and their potential
to improve the speed. The design considerations for the digital and the analog parts of the $\Delta \Sigma$ DACs are presented.

- **Chapter 3** presents the design and implementation of an 8 GS/s 200 MHz BW two-channel time-interleaved $\Delta \Sigma$ DAC in 65-nm CMOS. This chapter is based on **Paper I** and **Paper IV**.

- **Chapter 4** discusses the impact of clock duty cycle on the performance of a two-channel time-interleaved $\Delta \Sigma$ DAC. Analytical expressions for the performance degradation of the interleaved $\Delta \Sigma$ DAC due to the duty cycle error are presented. A comparison of different techniques that can be used to mitigate this problem is also presented. This chapter is based on **Paper II** and **Paper V**.

- **Chapter 5** presents an architecture that overcomes the limitations of the one in Chapter 3. A new interleaved $\Delta \Sigma$ DAC architecture based on a look-ahead modulator, achieving 1.1 GHz BW and 11 GS/s in 65-nm CMOS, is presented. It is shown that this DAC is suitable for the 60-GHz radio baseband. This chapter is based on **Paper III**.

- **Chapter 6** concludes the dissertation and outlines the future scope in the area of TIDSM DACs.

Finally, **Appendix A** provides copies of the published papers for quick reference.
Chapter 2

TIDSM DAC Design Considerations

This chapter provides a brief background on conventional DSM based DACs and discusses the previous work on high-speed conventional DSMs. The limitations of these DSMs are identified and the need for time-interleaved DSMs is motivated. Then the basic principles behind TIDSM DACs are introduced and the different factors that affect their performance are explained.

2.1 Conventional DSMs

Figure 2-1: A first-order EFB DSM.

Figure 2-1 shows the $z$-domain representation of a first-order error feedback (EFB) DSM. This DSM is essentially an integrator whose output is quantized. The $p$-bit integrator output is split into $m$ MSBs that are sent forward to the DAC, while the $p - m$ LSBs are fed back into the integrator. This feedback is also referred to as the negative quantization error term, $-er$. The output $Y(z)$ can then be written as

$$Y(z) = X(z) + (1 - z^{-1})E(z) \tag{2.1}$$
where $E(z)$ is the quantization error introduced at the output. The first part of Eq. (2.1) is governed by the Signal Transfer Function (STF), which in this case is $STF(z) = 1$. The coefficient of the second part, which contains the quantization noise term $E(z)$, is called the Noise Transfer Function (NTF). In this case, $NTF(z) = (1 - z^{-1})$, which represents a first-order high-pass filter response. The output thus contains the original input signal and a high-pass filtered quantization noise term. The spectrum at the output $Y$ is similar to the one previously shown in Fig. 1-6. The first-order high-pass filtering (noise shaping) shows a 20 dB/decade slope. The DSM is said to be stable as long as the integrator does not overflow; being a first-order system, this DSM is stable. However, a first-order EFB by itself is rarely used because it does not provide sufficient noise shaping to achieve a high SQNR and suffers from limit cycles or idle tones in the output spectrum [22]. To improve the SQNR, a higher-order NTF is used, i.e. for an $n^{th}$-order modulator, Eq. (2.1) can be rewritten as
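The noise-shaping effect of Eq. (2.1) can be checked with a short behavioral model. The Python sketch below is a minimal bit-true model of the first-order EFB of Fig. 2-1 (integer arithmetic, no overflow modeling) that reduces a 10-bit input to 3 bits by dropping 7 LSBs, and compares its in-band error power with plain truncation using a direct DFT over the bins up to $f_s/(2\,OSR)$. The test tone, OSR = 32, record length and function names are illustrative assumptions.

```python
import cmath
import math


def truncate(x, lsb_bits):
    """Plain truncation: keep the MSBs (at full-scale weight), drop the LSBs."""
    return [(v >> lsb_bits) << lsb_bits for v in x]


def efb1(x, lsb_bits):
    """First-order EFB DSM: the dropped LSBs (-er) are added back one sample later."""
    mask = (1 << lsb_bits) - 1
    fb = 0                      # -er(n-1)
    y = []
    for v in x:
        w = v + fb              # integrator: x(n) + (-er)(n-1)
        fb = w & mask           # new LSBs, i.e. -er(n)
        y.append(w - fb)        # MSBs only: y(n) = x(n) + (1 - z^-1) e(n), with e = -fb
    return y


def inband_error_power(x, y, osr):
    """Power of the error y - x in DFT bins 1 .. N/(2*OSR); DC is excluded."""
    n = len(x)
    err = [b - a for a, b in zip(x, y)]
    power = 0.0
    for k in range(1, n // (2 * osr) + 1):
        s = sum(e * cmath.exp(-2j * math.pi * k * i / n) for i, e in enumerate(err))
        power += abs(s) ** 2 / n ** 2
    return power


if __name__ == "__main__":
    N, OSR, LSB = 2048, 32, 7   # 10-bit -> 3-bit, assumed OSR
    x = [round(400 * math.sin(2 * math.pi * 5 * i / N)) for i in range(N)]
    p_trunc = inband_error_power(x, truncate(x, LSB), OSR)
    p_dsm = inband_error_power(x, efb1(x, LSB), OSR)
    print(f"in-band error reduction: {10 * math.log10(p_trunc / p_dsm):.1f} dB")
```

Both outputs use the same 3-bit levels; the only difference is that the EFB recycles the discarded LSBs, which moves the error power towards high frequencies.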

$$Y(z) = X(z) + (1 - z^{-1})^n E(z) \tag{2.2}$$

The EFB structure of a DSM with an arbitrary NTF is shown in Fig. 2-2. For an $n^{th}$-order modulator, the noise shaping shows a $20n$ dB/decade slope, thereby improving the achieved SQNR as shown in Fig. 2-3. In Eq. (2.2), all the zeroes of the NTF are located at DC (zero frequency). However, the zero locations can be optimized to minimize the noise power in the band of interest [22]. It has been shown in [41] that an $(n - 1)^{th}$-order DSM with $m = n$ output bits can always be made stable, i.e. without integrator overflow. The maximum achievable SQNR is then given by [33]

$$SQNR_{\text{max}} = 10 \log \left[ \frac{3(2n - 1)2^{2n-1}OSR^{2n-1}}{\pi^{2n-2}} \right] \tag{2.3}$$

Equation (2.3) shows three different ways of improving the SQNR. Increasing the OSR is the first option, which, as mentioned previously, increases the clock frequency but at the same time relaxes the reconstruction filter order. The second option is to increase the order $n$, which relaxes the OSR but increases the filter order, since the quantization noise outside the band of interest grows with $n$. Lastly, increasing the number of DAC bits increases the SQNR, but at the cost of increased DAC complexity, as mentioned in Chapter 1. The choice of the DSM transfer function should take all three factors into account. While the required SQNR and DAC linearity are set by the targeted communication standard, the OSR, the number of DAC bits and the filter order are also constrained by the CMOS technology used. The choice of the NTF thus requires extensive optimization that maximizes the SQNR while minimizing the DAC bits, the filter order and the OSR at a reasonable area and power consumption.
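Equation (2.3) is easy to tabulate when exploring this trade-off. The Python sketch below evaluates it and cross-checks it against the exact form of the textbook full-scale-sine expression $10\log(1.5 \cdot 2^{2B}) + 10\log\left((2L+1)/\pi^{2L}\right) + (2L+1)\,10\log OSR$ with order $L = n - 1$ and $B = n$ bits, which is algebraically identical. The function names and the OSR values in the printed table are illustrative assumptions.

```python
import math


def sqnr_max_db(n: int, osr: float) -> float:
    """Eq. (2.3): peak SQNR of an (n-1)th-order DSM with n output bits."""
    num = 3 * (2 * n - 1) * 2 ** (2 * n - 1) * osr ** (2 * n - 1)
    return 10 * math.log10(num / math.pi ** (2 * n - 2))


def sqnr_textbook_db(order: int, bits: int, osr: float) -> float:
    """Full-scale-sine SQNR of an order-L noise shaper with a B-bit quantizer."""
    return (10 * math.log10(1.5 * 4 ** bits)
            + 10 * math.log10((2 * order + 1) / math.pi ** (2 * order))
            + (2 * order + 1) * 10 * math.log10(osr))


if __name__ == "__main__":
    for n in (2, 3, 4):
        row = ", ".join(f"OSR {osr}: {sqnr_max_db(n, osr):5.1f} dB" for osr in (8, 16, 32))
        print(f"order {n - 1}, {n}-bit output -> {row}")
```

The table makes visible how strongly the OSR exponent $2n - 1$ rewards a higher modulator order.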

2.1.1 High-speed Conventional DSMs: Previous Works

As mentioned in the previous section, a first-order DSM is rarely used because it does not yield a sufficient SQNR and requires a very high OSR. Hence, a higher-order DSM is needed. Consider a second-order DSM with $NTF(z) = (1 - z^{-1})^2 = 1 - 2z^{-1} + z^{-2}$. Then, referring to the structure of Fig. 2-2, $H(z) = 1 - NTF(z) = 2z^{-1} - z^{-2}$. Figure 2-4 shows the EFB implementation of this second-order DSM. The critical path of this implementation is two adder delays, since the two multiplications are merely a shift and a bit inversion, respectively.
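A bit-true Python model helps confirm that the EFB structure with $H(z) = 2z^{-1} - z^{-2}$ indeed realizes $NTF(z) = (1 - z^{-1})^2$. The sketch below is a minimal model under stated assumptions: Python integers stand in for the $p$-bit words (no overflow modeling), quantization is plain truncation of the $p - m$ LSBs, and the function name is hypothetical. The two multiplications reduce to a left shift (×2) and a negation.

```python
def efb2(x, lsb_bits):
    """Second-order EFB DSM with H(z) = 2z^-1 - z^-2 acting on the fed-back LSBs.

    Returns the MSB output y (at full-scale weight) and the quantization error e.
    """
    mask = (1 << lsb_bits) - 1
    f1 = f2 = 0                 # -er(n-1) and -er(n-2)
    y, e = [], []
    for v in x:
        w = v + 2 * f1 - f2     # two adders in series; *2 is a left shift
        f = w & mask            # truncated LSBs = -er(n)
        y.append(w - f)         # MSBs sent to the DAC
        e.append(-f)
        f1, f2 = f, f1
    return y, e


if __name__ == "__main__":
    x = [(37 * i * i + 11 * i) % 900 - 450 for i in range(64)]  # arbitrary 10-bit input
    y, e = efb2(x, 7)
    ep = [0, 0] + e             # e(n-2), e(n-1) are zero before the first sample
    ok = all(y[n] == x[n] + ep[n + 2] - 2 * ep[n + 1] + ep[n] for n in range(len(x)))
    print("Y = X + (1 - z^-1)^2 E holds:", ok)
```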

As the order increases, the number of adders in the critical path increases, thus limiting the maximum frequency of operation. If the multiplication coefficients are not powers of 2, there is additional computational overhead, which further limits the speed. For a third-order $NTF(z) = (1 - z^{-1})^3$, $H(z) = 3z^{-1} - 3z^{-2} + z^{-3}$. Since not all the coefficients are powers of 2, they have to be expressed as sums of powers of 2 to avoid multipliers in the design, e.g. $3 = 2^1 + 2^0$. Although the modulator is only third order, the critical path grows to four adders, as shown in Fig. 2-5. The length of the critical path thus depends on the multiplier coefficients and the order of the modulator. If the NTF zeroes are at DC, each multiplication can often be performed with one shift-and-add operation, and the critical path is then \(\leq n + 1\) adders, where \(n\) is the modulator order. With NTF zeroes away from DC, on the other hand, the critical path is longer, as more additions are required to perform the multiplications. Hence, this architecture is not well suited for a high-speed implementation. Note that pipeline registers \((z^{-1})\) cannot be inserted between the adders, as this would alter the transfer function.

Instead, a cascade-of-integrators with distributed feedback (CIFB) architecture, which uses delayed integrators, can be used. A second-order CIFB for \(NTF(z) = (1 - z^{-1})^2\) is shown in Fig. 2-6, where the critical path is only one adder as long as all coefficients of \(H(z)\) are powers of 2. For an \(n\)th-order DSM, a CIFB implementation shortens the critical path by \((n - 1)\) adders compared to a normal EFB. In this case, the input signal \(X(z)\) is delayed by \(z^{-n}\), i.e. \(STF(z) = z^{-n}\). This additional latency is usually not critical in communication systems.

Using this architecture, a 3.6 GHz DSM for the IEEE 802.11n/802.16e standards is presented in [8], implemented in 90 nm CMOS with a 1.3 V supply. A 10-bit to 3-bit reduction is achieved for a 20 MHz BW, corresponding to an OSR of 90 and an ideally achievable SQNR of over 70 dB. A second-order CIFB implementation similar to Fig. 2-6 is used. Since full 11-bit additions are not possible in one clock cycle, 4-bit pipelined mirror adders are used to meet the timing. For such low bit-widths per pipeline stage, look-ahead adders do not offer a speed advantage [42].

A completely different approach is presented in [25] to achieve a 4 GS/s third-order band-pass DSM. Two 2 GS/s low-pass EFB DSMs are used to achieve the 4 GS/s speed and 50 MHz BW. The zero locations are optimized to improve the SNDR, and a 1-bit output is produced, which results in a highly linear 1-bit DAC. This third-order low-pass modulator is shown in Fig. 2-7. The speed in this case is limited by the multiplications in the feedback path, i.e. $2^{-2}$, $2^{-3}$ and $2^{-5}$, which are right-shift operations. Hence, the additions cannot be pipelined as in [8]. Moreover, the critical path consists of three 13-bit adders, which is difficult to fit in one clock period even with a fast look-ahead adder. The implementation of this DSM is shown in Fig. 2-8. Two techniques are used to achieve the high speed. Firstly, the DSM uses a redundant number representation, borrow-save (BS) arithmetic, instead of two's complement, so that the carry processing can be deferred until the end of the loop [43]. A non-exact quantization of all the collected carries is performed at the end to generate the 1-bit output. The BS arithmetic results in a critical path of three full-adder (FA) cells. Secondly, to enable this, a three-phase clocking scheme with pre-charged dynamic-logic FAs is used, with one addition performed per clock phase. The main drawbacks of this design for even higher speeds are the multi-phase clocking, which requires a DLL, and the use of dynamic logic, which has lower noise margins. Also, this implementation may not be suitable for a multi-bit DAC, as the carry processing logic at the end of the BS implementation becomes even more complicated and limits the speed.

A third architecture that has the potential for a high-speed implementation is the Multi stAge noise SHaping (MASH) architecture, which consists of a cascade of individual EFB DSMs. The MASH architecture is shown in Fig. 2-9, wherein the error term $-er$ generated in each stage becomes the input of the next stage. The stage outputs then undergo final processing (error cancellation) to produce the final output. The individual DSMs can be of any order, and the overall DSM order is the sum of the orders of all the stages. To understand the MASH operation, consider an example MASH consisting of two stages, each a simple first-order EFB DSM. The input-output relations for the two stages can then be written as

$$Y_1(z) = X(z) + (1 - z^{-1}) Er_1(z) \tag{2.4}$$

$$Y_2(z) = -Er_1(z) + (1 - z^{-1}) Er_2(z) \tag{2.5}$$

Multiplying (2.5) by $(1 - z^{-1})$ and adding it to (2.4),

$$Y(z) = Y_1(z) + (1 - z^{-1}) Y_2(z) = X(z) + (1 - z^{-1})^2 Er_2(z) \tag{2.6}$$

$Y(z)$ thus equals $X(z)$ plus the quantization error $Er_2(z)$ shaped by a second-order NTF, so the overall response is that of a second-order DSM. The operation $Y_1(z) + (1 - z^{-1}) Y_2(z)$ forms part of the final processing. Expanding on Fig. 2-9, the MASH implementation of a second-order DSM (also called MASH 1-1) is drawn as in Fig. 2-10. It can be seen that the feedback path is confined within each DSM or integrator. The path between the two DSMs is a forward path and hence can optionally be pipelined for a higher speed. Similarly, the final processing consists only of forward paths and can also be pipelined if required. Thus, the critical path of the MASH 1-1 can be restricted to a single adder, as shown in Fig. 2-11.
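The error cancellation of Eq. (2.6) can be verified with a small bit-true model. In the Python sketch below (an illustrative model, not the implementation of Fig. 2-11), each stage is the first-order EFB of Fig. 2-1, the second stage is driven by the first stage's LSBs $-er_1$, both stages drop the same number of LSBs, and the final processing computes $Y_1 + (1 - z^{-1})Y_2$ without the optional pipeline registers. The function names are hypothetical.

```python
def efb1_stage(x, lsb_bits):
    """First-order EFB (Fig. 2-1). Returns the MSB output and the fed-back LSBs (-er)."""
    mask = (1 << lsb_bits) - 1
    prev = 0
    y, f = [], []
    for v in x:
        w = v + prev            # the only adder inside the feedback loop
        prev = w & mask
        y.append(w - prev)
        f.append(prev)
    return y, f


def mash11(x, lsb_bits):
    """MASH 1-1: Y = Y1 + (1 - z^-1) Y2, following Eqs. (2.4)-(2.6)."""
    y1, f1 = efb1_stage(x, lsb_bits)    # Eq. (2.4)
    y2, f2 = efb1_stage(f1, lsb_bits)   # Eq. (2.5): input is -er1 (a forward path)
    y = [y1[n] + y2[n] - (y2[n - 1] if n else 0) for n in range(len(x))]
    er2 = [-v for v in f2]
    return y, er2


if __name__ == "__main__":
    x = [(53 * i * i + 7 * i) % 1000 - 500 for i in range(64)]
    y, er2 = mash11(x, 7)
    ep = [0, 0] + er2
    print(all(y[n] == x[n] + ep[n + 2] - 2 * ep[n + 1] + ep[n] for n in range(len(x))))
```

Only `efb1_stage` contains a feedback loop; the connection between stages and the error-cancellation sum are pure feed-forward, which is what allows the pipelining described above.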

The advantage of this pipelined MASH 1-1 over the second-order EFB DSM of Fig. 2-4 is clear: it has only one adder delay in the critical path. Its critical path is similar to that of the second-order CIFB architecture of Fig. 2-6. However, as the modulator order increases, e.g. to a MASH 1-1-1, the MASH still has a one-adder critical path, whereas the CIFB may have two because of its non-power-of-2 feedback coefficients. This scalability makes the MASH very attractive for high-speed implementations. A MASH built from first-order DSMs also offers practical design advantages. Firstly, only one first-order DSM has to be designed, which can be instantiated multiple times depending on the order. Secondly, the possibility of adding pipeline stages between the DSMs can be beneficial for timing closure [34]. Pipelining between the two integrator adders of the CIFB is not possible without changing the NTF.

These advantages of the MASH DSM have been exploited in [26] and [27]. A 2.625 GS/s speed is reported in [26] for a MASH 1-1 using a 6-bit pipelined adder and a 1.3 V supply in 90 nm CMOS. A 5.4 GS/s speed is reported in [27] for a MASH 1-1-1 using a 1-bit pipelined static adder per stage and a 1.2 V supply in 65 nm CMOS. However, this particular implementation has only a 5-bit input and a 3-bit output, so the power penalty of the 1-bit pipeline stages is not high. A similar implementation for a modulator with a larger number of input bits, as is usually the case, may consume considerably more power: with 1-bit stages the total number of FFs increases, which also increases the power of the clock distribution. This 5.4 GS/s DSM is the fastest conventional implementation reported in the literature.

2.1.2 Speed Limitation of a First Order EFB DSM

It is shown in [42] that for very small bit-width additions, special adders such as look-ahead adders are not the fastest; conventional static, mirror or carry-select (CS) adders lead to faster implementations. Since a 1-bit pipeline stage results in the highest speed, consider the 1-bit pipeline of the integrator shown in Fig. 2-12, which uses a CS adder. $y$ is the integrator output that is added to the current input $x$, $ci$ is the carry input from the previous LSB pipeline stage and $co$ is the carry output going to the next MSB stage. The NOR gate just before FF$_3$ is needed only if the integrator must be synchronously resettable. Consider a standard 65 nm CMOS technology with a 1 V supply in which all the transistors are of the low-Vt general-purpose (GP) type. The critical path starts and ends at FF3 and passes through the 1-bit full adder (FA). Only static CMOS logic is used, and the FFs are standard transmission-gate flip-flops (TGFF). The contributions of the various components at 75°C in the typical process corner, obtained with a maximum-RC post-layout extracted netlist, are shown in Table 2-1.

Table 2-1: Post-layout simulated delay of a 1-bit pipeline at 1 V, 75°C, typical corner using TGFFs and static CMOS logic.

<table>
<thead>
<tr>
<th>Block</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>FF3 Output Delay (clk→q)</td>
<td>30</td>
</tr>
<tr>
<td>CS Adder (input→cout)</td>
<td>41</td>
</tr>
<tr>
<td>Reset NOR gate</td>
<td>22</td>
</tr>
<tr>
<td>FF3 Setup Time</td>
<td>23</td>
</tr>
<tr>
<td><strong>Total Delay</strong></td>
<td><strong>116</strong></td>
</tr>
</tbody>
</table>

The total delay of 116 ps implies a maximum operating frequency of 8.62 GHz. Assume that a DSM speed of 10 GHz is desired in order to support a bandwidth of a few hundred megahertz. This speed cannot be met with this structure and static CMOS logic. FF3 and FF1 could be replicated for a complementary logic style so that the inverters at the adder input can be removed. However, the fan-out of the CS adder then increases, as it must provide complementary outputs. Overall, only an 8-10 ps improvement is possible, pushing the speed up to about 9.2 GHz, at the cost of 50% additional FFs in the integrator and a larger clock distribution power. In an attempt to reach 10 GHz, a faster true single-phase clock register (TSPCR) can be used instead of TGFFs, along with a full adder based on complementary pass-transistor logic (CPL) [7]. However, this option gives only a marginal improvement, reducing the total delay to about 105 ps, i.e. a speed of about 9.5 GHz. Additionally, this logic style has dynamic nodes in the TSPC register and the CPL gates, leading to lower noise margins. A latch-based domino logic style does not help either, because time borrowing yields no benefit in a feedback loop. While current-mode logic (CML) or pseudo-CMOS implementations can reach 10 GHz, their power penalty is very high because of the static current. Table 2-1 also indicates that the FF overhead, i.e. the clk→q (output) delay plus the setup time, accounts for nearly half (53 ps out of 116 ps) of the cycle time.
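The timing arithmetic above can be reproduced in a few lines. The Python sketch below sums the entries of Table 2-1, converts a single-cycle critical-path delay into a maximum clock frequency, and computes the FF overhead share; the 105 ps figure for the TSPC/CPL variant is taken from the text, and the names are illustrative.

```python
# Delay contributions from Table 2-1, in picoseconds
BUDGET_PS = {
    "FF3 clk->q": 30,
    "CS adder (input->cout)": 41,
    "reset NOR": 22,
    "FF3 setup": 23,
}


def f_max_ghz(delay_ps: float) -> float:
    """Maximum clock frequency (GHz) of a single-cycle path with the given delay."""
    return 1000.0 / delay_ps


if __name__ == "__main__":
    total = sum(BUDGET_PS.values())
    ff_share = (BUDGET_PS["FF3 clk->q"] + BUDGET_PS["FF3 setup"]) / total
    print(f"TGFF + static CMOS: {total} ps -> {f_max_ghz(total):.2f} GHz, FF share {ff_share:.0%}")
    print(f"TSPC + CPL (from text): 105 ps -> {f_max_ghz(105):.2f} GHz")
    print(f"10 GHz target needs <= {1000.0 / 10:.0f} ps per cycle")
```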

These simulation results indicate that a 10 GHz speed is beyond the capability of a standard 1 V 65 nm CMOS technology if a reasonable power consumption is targeted. This highlights the need for a different approach and architecture to achieve such high speeds. Time-interleaved DSMs, which relax the speed requirement of the logic, have the potential to achieve this and are introduced in the following section.
2.2 Time-Interleaved DSM and Previous Works

Consider a transfer function $H(z)$ that has to be implemented at a sample rate of $f_s$. If a direct conventional implementation of $H(z)$ is limited by the high sample rate, it can instead be implemented as a combination of $M$ channels, each operating at a rate of $f_s/M$. The $M$ channels are then recombined at the end to the full sample rate $f_s$. Thus, the individual channels operate at a relaxed rate of $f_s/M$, which makes their implementation easier, and only the final combination operates at the full sampling rate. Terms such as poly-phase decomposition or loop unrolling are also often used in the literature instead of time-interleaving (TI) [33, 34, 40, 44]. Figure 2-13 shows the original transfer function to be implemented, while Fig. 2-14 shows the time-interleaved or loop-unrolled implementation. The $M$ individual transfer functions $H_0(z), H_1(z), \ldots, H_{M-2}(z), H_{M-1}(z)$ are referred to as the poly-phase components of $H(z)$ [44].

The process of obtaining these individual transfer functions has been formalized in [44] and [45] using a block filtering approach. First, an $M\times M$ block filter $\overline{H}(z)$ is generated from $H(z)$. Then, the individual poly-phase components are extracted from this block filter. The formal TI representation of $H(z)$ using this block filter is shown in Fig. 2-15. The block filter $\overline{H}(z)$ can be written in the form of an $M\times M$ matrix

$$\overline{H}(z) = \begin{bmatrix} E_0(z) & E_1(z) & \cdots & E_{M-1}(z) \\ z^{-1}E_{M-1}(z) & E_0(z) & \cdots & E_{M-2}(z) \\ \vdots & \vdots & \ddots & \vdots \\ z^{-1}E_1(z) & z^{-1}E_2(z) & \cdots & E_0(z) \end{bmatrix} \tag{2.7}$$

The element $H_{ij}$ in the matrix represents the contribution of the $j^{th}$ input to the $i^{th}$ output. The value of $E_i(z)$ is obtained by rewriting $H(z)$ in the following form.

$$H(z) = \sum_{k=0}^{M-1} z^{-k} E_k(z^M) \tag{2.8}$$

As an example, consider the second-order EFB of Fig. 2-4. In this case, a TI decomposition of $H(z) = 1 - NTF(z) = 2z^{-1} - z^{-2}$ is first required. Using Eq. (2.8) with $M = 2$, $H(z)$ is expanded as

$$H(z) = E_0(z^2) + z^{-1} E_1(z^2) \implies E_0(z) = -z^{-1} \text{ and } E_1(z) = 2 \tag{2.9}$$

The block filter matrix is then written using Eq. (2.7) as

$$\overline{H}(z) = \begin{bmatrix} -z^{-1} & 2 \\ 2z^{-1} & -z^{-1} \end{bmatrix} \tag{2.10}$$
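The decomposition of Eqs. (2.8)-(2.10) is mechanical for an FIR $H(z)$, as the Python sketch below illustrates. Polynomials in $z^{-1}$ are represented as coefficient lists (index $q$ holds the coefficient of $z^{-q}$). The entry rule used for the block matrix, $E_{j-i}$ for $j \geq i$ and $z^{-1}E_{M+j-i}$ for $j < i$ with row $i$ as output $y'_i$ and column $j$ as input $x'_j$, is inferred from Eqs. (2.10) and (2.13); the function names are hypothetical.

```python
def polyphase(h, M):
    """Poly-phase components of an FIR H(z), Eq. (2.8).

    h[k] is the coefficient of z^-k; E_k[q] is the coefficient of z^-q in E_k(z),
    so E_k[q] = h[q*M + k].
    """
    return [list(h[k::M]) for k in range(M)]


def times_z_inv(p):
    """Multiply a polynomial in z^-1 (coefficient list) by z^-1."""
    return [0] + p


def block_matrix(h, M):
    """M x M block filter: entry (i, j) is E_{j-i} if j >= i, else z^-1 E_{M+j-i}.

    Row i is TI output y'_i and column j is TI input x'_j.
    """
    E = polyphase(h, M)
    return [[E[j - i] if j >= i else times_z_inv(E[M + j - i]) for j in range(M)]
            for i in range(M)]


if __name__ == "__main__":
    for row in block_matrix([0, 2, -1], 2):   # H(z) = 2z^-1 - z^-2
        print(row)
```

Here `[0, -1]` denotes $-z^{-1}$ and `[0, 2]` denotes $2z^{-1}$, reproducing Eq. (2.10); calling `block_matrix([0, 1], 2)` gives the delay-element matrix of Eq. (2.13).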

In Eq. (2.10), the rows correspond to the TI outputs and the columns to the TI inputs of $\overline{H}(z)$. Let these TI inputs and outputs of $\overline{H}(z)$ be denoted $(x'_0, x'_1)$ and $(y'_0, y'_1)$, respectively. Then, the following relations hold:

$$y'_0 = -z^{-1} x'_0 + 2x'_1 \tag{2.11}$$

$$y'_1 = 2z^{-1}x'_0 - z^{-1}x'_1 \tag{2.12}$$

Figure 2-16: A 2-channel TI-EFB implementation for a transfer function $NTF(z) = (1 - z^{-1})^2$.

A TI implementation of the second-order EFB DSM of Fig. 2-4, based on the matrix of Eq. (2.10), is shown in Fig. 2-16. The whole TIDSM now operates at half the desired speed, but with a four-adder delay in the critical path, whereas the original second-order EFB of Fig. 2-4 had a two-adder delay. This architecture has been used in [33] for an eight-channel design with a third-order NTF. A simulated effective speed of 2.66 GHz, with the individual channels operating at 330 MHz, was reported using a standard digital design flow with synthesis and automatic place and route.
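To see that the two-channel structure of Eqs. (2.11) and (2.12) is bit-exact with the serial loop, the Python sketch below runs both on the same input. In this model the block filter acts on the fed-back LSBs ($-er$), channel 1 carries the even samples $x(2m)$ and channel 0 the odd samples $x(2m+1)$ (the ordering implied by the undelayed entry in the first row of Eq. (2.10)), and plain truncation is the quantizer; these modeling choices and the function names are assumptions. The comment on the channel-0 update marks where the critical path doubles to four adders.

```python
def efb2_serial(x, lsb_bits):
    """Reference second-order EFB at the full rate, H(z) = 2z^-1 - z^-2."""
    mask = (1 << lsb_bits) - 1
    f1 = f2 = 0
    y = []
    for v in x:
        w = v + 2 * f1 - f2
        f = w & mask
        y.append(w - f)
        f1, f2 = f, f1
    return y


def efb2_ti2(x, lsb_bits):
    """Two-channel TI-EFB: each loop iteration m is one half-rate clock cycle."""
    mask = (1 << lsb_bits) - 1
    f0_d = f1_d = 0                       # each channel's LSBs, delayed by one half-rate cycle
    y = []
    for m in range(len(x) // 2):
        x1, x0 = x[2 * m], x[2 * m + 1]   # x'_1(m) = x(2m), x'_0(m) = x(2m+1)
        w1 = x1 + 2 * f0_d - f1_d         # Eq. (2.12): only delayed terms
        f1 = w1 & mask
        w0 = x0 - f0_d + 2 * f1           # Eq. (2.11): waits for f1 -> 4 adders in series
        f0 = w0 & mask
        y += [w1 - f1, w0 - f0]           # serialize: y(2m), y(2m+1)
        f0_d, f1_d = f0, f1
    return y
```

The model processes input pairs, so records of even length should be compared.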

A TI decomposition of the second-order CIFB of Fig. 2-6 can also be performed, but the approach is different. Because of the distributed nature of the feedback, it is difficult to apply the TI method to the whole NTF. Instead, each delay element is decomposed into its TI form. If \( H(z) = z^{-1} \), then \( E_0(z) = 0 \) and \( E_1(z) = 1 \). The block matrix for \( z^{-1} \) is then given by

\[
\begin{bmatrix}
0 & 1 \\
z^{-1} & 0
\end{bmatrix}
\tag{2.13}
\]

The TI input-output relations are then given by

\[
y'_0 = x'_1 \tag{2.14}
\]

\[
y'_1 = z^{-1}x'_0 \tag{2.15}
\]

The 2-channel TI implementation of a delay element or an FF is shown in Fig. 2-17. Replacing each FF in Fig. 2-6 with its TI equivalent from Fig. 2-17 yields the TI implementation of the CIFB shown in Fig. 2-18. This implementation has a two-adder delay, compared to the four-adder delay of a TI-EFB, which makes it a potential candidate for achieving a high speed. This particular implementation style for a TI-CIFB has been proposed in [45] for a \(\Delta\Sigma\) ADC, but it has not yet been reported for use with a digital modulator for a DAC.

Figure 2-17: Two-channel TI implementation of a delay element/FF.

Figure 2-18: Two-channel TI implementation of a second-order CIFB DSM.

Continuing in this direction, it is of interest to further investigate the capability of an interleaved TI-MASH architecture. Some of the properties of a TI-MASH were studied earlier in [34]. Consider a first-order EFB with the delay in the forward path instead of the feedback path, as shown in Fig. 2-19. The only difference compared to Fig. 2-1 is that the output is delayed by one clock cycle. There are two possibilities for interleaving this structure: the TI decomposition can be applied either to the whole integrator transfer function or, as before, only to the FF. The integrator has the transfer function

\[
H(z) = \frac{z^{-1}}{1 - z^{-1}} \tag{2.16}
\]

The block matrix for this function is

\[
\overline{H}(z) = \frac{1}{1 - z^{-1}} \begin{bmatrix}
    z^{-1} & 1 \\
    z^{-1} & z^{-1}
\end{bmatrix} \tag{2.17}
\]

This results in the TI implementation shown in Fig. 2-20. Although the critical path is equivalent to two adders, each stage requires four adders. For a MASH 1-1, this results in a total of eight adders, double that of Fig. 2-18 or Fig. 2-16, implying a higher power and area. Note that this circuit applies only to the lower $p - m$ feedback bits, as the integrator transfer function can be applied only to these bits. Moreover, the two channel integrators run completely independently of each other. Hence, logic is required after the decomposed integrator to compute the correct MSBs that drive the DAC. This logic requires more adders, which further increases area and power consumption, and it has only recently been studied in [40].
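A quick time-domain check of the two-channel integrator is useful when reading Eq. (2.17). The Python sketch below compares the serial integrator $y(n) = y(n-1) + x(n-1)$ of Eq. (2.16) with a two-channel version whose recursions were derived directly from it, using the same channel ordering as Eq. (2.10) (channel 1 carries the even samples). It shows the four additions per half-rate cycle mentioned above; the names are hypothetical, and the model covers only the integrator, not the MSB-recovery logic.

```python
def integrator_serial(x):
    """H(z) = z^-1 / (1 - z^-1): y(n) = y(n-1) + x(n-1), with y(0) = 0."""
    y, acc, prev = [], 0, 0
    for v in x:
        acc += prev
        y.append(acc)
        prev = v
    return y


def integrator_ti2(x):
    """Two-channel integrator, channel 1 = x(2m), channel 0 = x(2m+1).

    y'_1(m) = y'_1(m-1) + x'_0(m-1) + x'_1(m-1)
    y'_0(m) = y'_0(m-1) + x'_0(m-1) + x'_1(m)
    """
    y0 = y1 = 0
    x0_d = x1_d = 0
    y = []
    for m in range(len(x) // 2):
        x1, x0 = x[2 * m], x[2 * m + 1]
        y1 = y1 + x0_d + x1_d        # two adders
        y0 = y0 + x0_d + x1          # two more adders
        y += [y1, y0]                # y(2m), y(2m+1)
        x0_d, x1_d = x0, x1
    return y
```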

On the other hand, applying the TI decomposition to the FF leads to a much simpler implementation, as shown in Fig. 2-21. It has only two adders per stage and a critical path of two adders that is independent of the modulator order (which is not the case for a TI-CIFB). This advantage, coupled with the benefits of scalability, stability and the possibility of adding pipeline stages between the DSMs mentioned in the previous section, makes the TI-MASH a very attractive candidate for high-speed implementation. An eight-channel 2.5 GS/s TI-MASH DSM using this TI decomposition of the FF has been demonstrated in [34].
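The FF-decomposition approach can be sketched the same way. The Python model below is a minimal illustration, not the circuit of Fig. 2-21: it takes the first-order EFB with the delay in the forward path (Fig. 2-19), replaces its single FF by the two-channel equivalent of Eqs. (2.14) and (2.15), and checks bit-exactness against the serial loop. The channel ordering follows Eq. (2.10); the truncation quantizer and the function names are assumptions. Only one register remains, and the two adders sit in series within one half-rate cycle.

```python
def efb1_fwd_serial(x, lsb_bits):
    """First-order EFB with the delay in the forward path (Fig. 2-19)."""
    mask = (1 << lsb_bits) - 1
    s = 0                        # FF contents
    y = []
    for v in x:
        f = s & mask             # LSBs fed back (-er)
        y.append(s - f)          # MSBs to the DAC
        s = v + f                # adder output clocked into the FF
    return y


def efb1_fwd_ti2(x, lsb_bits):
    """Same loop with the FF replaced by its two-channel form, Eqs. (2.14)-(2.15)."""
    mask = (1 << lsb_bits) - 1
    reg = 0                      # the only FF left: holds channel 0's adder output
    y = []
    for m in range(len(x) // 2):
        x1, x0 = x[2 * m], x[2 * m + 1]  # x'_1(m) = x(2m), x'_0(m) = x(2m+1)
        s1 = reg                 # Eq. (2.15): delayed by one half-rate cycle
        f1 = s1 & mask
        s0 = x1 + f1             # adder 1; Eq. (2.14): passed on without delay
        f0 = s0 & mask
        reg = x0 + f0            # adder 2, in series with adder 1
        y += [s1 - f1, s0 - f0]  # y(2m), y(2m+1)
    return y
```

Cascading such stages into a TI-MASH only adds forward paths between them, which can be pipelined as discussed in Section 2.1.1.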

This dissertation focuses further on the TI-MASH architecture of Fig. 2-21 and exploits its properties to achieve even higher speeds, i.e., 8 GS/s in Paper I and 11 GS/s in Paper III. These are described further in Chapters 3 and 5, respectively. When this dissertation work started, [33] and [34] were the only two high-speed TIDSM implementations reported in the literature. More recently, another TIDSM has been reported in [40]: an eight-channel 8 GS/s TI-MASH 1-1-1 DSM for a Nyquist-$\Delta\Sigma$ hybrid DAC, with an implementation based on Fig. 2-20. Overall, there are relatively few publications on TI $\Delta\Sigma$ DACs, and this area has not received as much attention as TI $\Delta\Sigma$ ADCs.
2.3 Choice of Multiplexing Strategy

The previous section focused only on the use of interleaving to relax the speed of the computation; the role of the final multiplexing (serializer) was not discussed. The multiplexing (MUX) strategy also plays an important part in the choice of the number of channels. The timing diagram for an M-channel MUX is shown in Fig. 2-22. The individual channel outputs are available for a time \( t = M/f_s \) and are routed to the output for a time \( t = 1/f_s \).

Figure 2-23 shows the classical fully-synchronous serializer scheme wherein the system clock \( f_s \) is divided down by \( M \) and sent to the individual channels. Additional clocks with frequencies \( 2f_s/M, 4f_s/M, \ldots, f_s/2 \) are generated, often by a counter, as select signals for the M:1 MUX. The M:1 MUX can be implemented in a single step or be pipelined into individual multiplexing stages of progressively increasing frequency. This traditional scheme has been used in [34] and [46] to serialize 8 channels to frequencies of 2.5 GHz and 3.456 GHz, respectively, using static CMOS logic styles. However, achieving higher frequencies using this scheme is challenging due to the shrinking timing window for the multiplexing. Fast FFs with minimized clock distribution skew and small routing/logic delays are required so that the serializer meets the timing under all process corners. At multi-GHz frequencies, static CMOS logic becomes challenging and is rarely used when many channels are to be serialized. Note also that this MUX is physically located close to the DAC, and a CMOS logic style at high frequencies can result in data-dependent substrate noise coupling into the DAC [47].
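A minimal behavioural sketch of the counter-driven M:1 serializer, assuming M is a power of two: bit b of a counter clocked at \( f_s \) toggles at \( f_s/2^{b+1} \), which yields the select clocks listed above (its slowest bit coincides with the \( f_s/M \) channel clock), and output slot n takes channel n mod M. The function names are illustrative only.

```python
def select_clock_freqs(fs, m):
    """Counter-bit frequencies used as M:1 MUX selects (M a power of two).

    Bit b of a counter clocked at fs toggles every 2**b cycles, i.e. at
    fs / 2**(b+1); the slowest bit (fs/M) coincides with the channel clock.
    """
    assert m > 1 and m & (m - 1) == 0, "M must be a power of two"
    return sorted(fs / 2 ** (b + 1) for b in range(m.bit_length() - 1))

def serialize(channels):
    """M:1 MUX: output slot n takes word n // M of channel n % M."""
    m = len(channels)
    return [channels[n % m][n // m] for n in range(m * len(channels[0]))]

fs, m = 8e9, 8
print(select_clock_freqs(fs, m))
# Each channel word is valid for M/fs, but occupies only a 1/fs output slot.
print(1e12 / fs, "ps output slot;", 1e12 * m / fs, "ps channel word")
```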

In order to meet the timing at a high frequency, clock delay/phase calibration can help to align the clock edges such that the data is captured correctly. The MUX can then, to an extent, become independent of the long routing delays and the clock skew between channels. This type of calibration has been used in [48], [49] and, more recently, in [40]. The 8:1 8 GHz MUX scheme from [40] is shown in Fig. 2-24 as an example. Firstly, the quadrature phases of the \( f_s/2 \) clock are generated by a CML-based divider. The calibration or phase rotation is performed by controlling, with a digital word, the current that is steered into these clocks. This allows the generation of any arbitrary phase of the MUX clocks. This kind of phase calibration is easier in the CML style, and the clocks are converted to CMOS after the calibration. As the data gets closer to the DAC, the last two stages of the MUX again use CML, because CML is more robust to power supply noise and can operate at a higher frequency. Additionally, it results in fewer data-dependent errors and less feedthrough into the DAC [50]. Phase calibration with a resolution of 500 fs has been reported in [40], enabling very high-speed operation. This illustrates the trade-off between the number of channels, i.e., the degree of interleaving, and the complexity of the serializer: increasing the number of channels greatly relaxes the logic in the DSM but complicates the MUX. The use of CML also increases the power consumption because of its static current.
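The phase-rotation idea can be sketched with a simple first-quadrant I/Q interpolator model. This is an assumption for illustration and not the circuit of [40]: the output is the sum of the 0° and 90° clocks weighted by complementary currents (steps − code and code), so its phase is the arctangent of the ratio of the two weights. The sketch converts the largest phase step between adjacent codes into time for a given clock, which shows why sub-picosecond resolution needs many codes or non-uniform current weights.

```python
import math

def rotator_phase_deg(code, steps):
    """Phase of w_i*cos(wt) + w_q*sin(wt) with linear current weights.

    Hypothetical first-quadrant model: w_i = steps - code, w_q = code, so the
    output phase is atan2(w_q, w_i), running from 0 to 90 degrees.
    """
    return math.degrees(math.atan2(code, steps - code))

def worst_step_fs(steps, f_clk):
    """Largest time step between adjacent codes, in femtoseconds."""
    period_fs = 1e15 / f_clk
    phases = [rotator_phase_deg(k, steps) for k in range(steps + 1)]
    worst_deg = max(b - a for a, b in zip(phases, phases[1:]))
    return worst_deg / 360.0 * period_fs

# Example: quadrature phases of a 4 GHz (f_s/2) clock, 64 codes per quadrant.
print(f"{worst_step_fs(64, 4e9):.0f} fs worst-case step")
```

With linear weights the steps are largest near 45°, so the worst-case step exceeds the average quadrant step of \( T/(4 \cdot \text{steps}) \).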

Figures 2-23 and 2-24 illustrate the MUX for a non-return-to-zero (NRZ) DAC, wherein the serializer data is retimed at the end by a FF operating at \( f_s \) before being sent to the DAC. In this case, the jitter on the \( f_s \) clock affects the overall performance of the DAC. In Nyquist DACs this jitter results in distortion components, but in \( \Delta \Sigma \) DACs it folds the high-frequency quantization noise into the signal band, reducing the SNDR. Some analysis of the jitter effects has been performed earlier in [19] and [20].

In many cases, the DAC may instead have to be operated at the rate \( f_s \) by using both edges of an \( f_s/2 \) clock. Reasons for this include:

- A return-to-zero (RZ) DAC is needed to improve the dynamic performance, e.g., when a two-DAC structure is used. The two DACs alternately operate for half the cycle and are driven to zero in the other half. This requires an \( f_s/2 \) clock.
- A full-rate \( f_s \) clock may not be available due to the high frequency.
- The distribution of a high-frequency clock is challenging due to the large wiring capacitance, so the clock is first divided by two and the divided clock is used in the DAC.

For these reasons, a phase-rotator-based 12 GHz 4:1 MUX and a Nyquist RZ-DAC using both edges of a 6 GHz clock is presented in [48], as shown in Fig. 2-25. The phase rotator is used to correctly sample the input data for the first 4:2 MUX, while the final multiplexing is pushed inside the DAC itself. The roles of the multiplexing and the RZ DAC are combined such that each of the two sub-DACs is active for only half the cycle. The DAC is then sensitive to the duty cycle of the \( f_s/2 \) clock.

![Image of CML Phase Rotator based calibration scheme for MUX used in [40].](image-url)

The previous works discussed in this section have used a many-channel strategy with extensive clock phase/delay calibration. In contrast to this approach is a two-channel approach wherein the serializer has a very low complexity and simple clocking, as shown in Fig. 2-26. The multiplexing and the individual channels both operate on a single half-rate clock, \( f_s/2 \). The half-cycle timing path between the two channels and the MUX must be met and requires careful design (see Fig. 2-27). The two channels operate at \( f_s/2 \), which is still a high speed when a sampling rate of at least a few GHz is targeted. Moreover, if static CMOS logic can be used for the entire logic and the MUX, a reasonable overall power consumption can be achieved. This dissertation focuses on this two-channel approach with the simplified MUX scheme. However, the DAC then suffers from a channel skew error caused by a non-50% duty cycle of the \( f_s/2 \) clock. Hence, either duty-cycle correction of the \( f_s/2 \) clock or DAC techniques that maintain the performance in the presence of channel skew are required. Duty-cycle correction has not been studied in this dissertation; instead, DAC techniques that can partially or completely absorb this channel skew error are investigated in Chapter 4.

Instead of the serializer strategy discussed so far, two other types of analog multiplexing, which are used less frequently, have been reported in Nyquist DACs [51, 52]. It is of interest to investigate whether these can also be used in \( \Delta \Sigma \) DACs. Consider the two-channel scheme of Fig. 2-28 used in [51]. In this scheme, the final DAC output is the same as that of a non-interleaved DAC operating at $f_s$, but two half-rate DACs operating with $0^\circ$ and $180^\circ$ clocks are used instead. The sum of these two DAC outputs gives the expected DAC output. This could be an attractive solution because the entire circuit switches at a true half sampling rate while the output still has the characteristic of a full-rate DAC. Since both DACs drive the output all the time, the data fed to the two DACs cannot simply be the original data $x_0$ and $x_1$, i.e., data pre-coding is required. Instead, the new data fed to each DAC is the difference between the expected final output $y$ and the output currently held by the other channel. Fig. 2-28 also shows the timing diagram for the DAC, with $y$ being the expected DAC output, while $y_0$ and $y_1$ form the outputs of the two sub-DACs.

For $M$ channels, this pre-coding is equivalent to the transfer function

$$G(z) = \frac{1}{\sum_{i=0}^{M-1} z^{-i}} \quad (2.18)$$

i.e., $G(z) = 1/(1 + z^{-1})$ for the two-channel case. The drawback of $G(z)$ is that it has an infinite impulse response (IIR) with a pole at $z = -1$, i.e., on the unit circle at $f_s/2$, which gives a very high gain there, as shown in Fig. 2-29. As a result, the bit width of the pre-coded data, $y_0$ and $y_1$, is larger than that of the original DAC data, $x_0$ and $x_1$. As an example, pre-coding a 3-bit MASH 1-1 shaped output requires the two sub-DACs to have 6 bits for a high OSR of 50. For a low OSR such as 5, the sub-DAC bit width increases further, i.e., two 7-bit sub-DACs are needed. Thus, the bit reduction obtained from the DSM is lost due to the pre-coding. Moreover, this IIR filter may also present a speed bottleneck due to its feedback path. Instead, a large-order FIR approximation of this filter may have to be used as a practical implementation. These two reasons make this interleaving style unsuitable for TI-$\Delta\Sigma$ DACs.

Figure 2-29: Frequency response of the IIR filter with the transfer function $G(z) = 1/(1 + z^{-1})$.

Figure 2-30: DACs with FIR response
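The pre-coding of Eq. 2.18 and its bit growth can be checked with a short Python sketch (the helper names are hypothetical). Each sub-DAC updates on every $M$-th slot and holds its value for $M$ periods, so summing the held values reconstructs $y$. A worst-case input with all its energy at $f_s/2$ excites the pole at $z = -1$, and the pre-coded data then grows without bound.

```python
def precode(y, m=2):
    """Pre-coding G(z) = 1 / (1 + z^-1 + ... + z^-(M-1)):
    d[n] = y[n] - (d[n-1] + ... + d[n-M+1])."""
    d = []
    for n, target in enumerate(y):
        held = sum(d[n - i] for i in range(1, m) if n - i >= 0)
        d.append(target - held)
    return d

def hold_and_sum(d, m=2):
    """M sub-DACs: the one updated at slot n holds d[n] for M slots, so the
    summed output at slot n is d[n] + d[n-1] + ... + d[n-M+1]."""
    return [sum(d[n - i] for i in range(m) if n - i >= 0) for n in range(len(d))]

y = [3, 1, 4, 1, 5, 2, 6, 5]                 # arbitrary DAC codes
assert hold_and_sum(precode(y)) == y          # two-channel reconstruction

worst = [(-1) ** n for n in range(10)]        # all energy at f_s/2
print(max(abs(v) for v in precode(worst)))    # 10: grows by one every sample
```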

A second form of analog multiplexing, called parallel-path DACs [52] or hold-interleaving DACs [53], has been reported in Nyquist DACs. This is shown in Fig. 2.30(a). In this scheme, two DACs working on input data $x_0$ and $x_1$ with opposite clock phases are again used and their outputs are added. This results in an FIR response with the transfer function $G(z) = 1 + z^{-1}$. In a Nyquist DAC, this has the advantage of attenuating the closest DAC image [52]. When used in a $\Delta\Sigma$ DAC, this architecture also provides protection against SNDR loss due to phase error between the two DACs [39] (Paper II). This is discussed further in Chapter 4. A digital implementation that achieves the same effect is shown in Fig. 2.30(b). The FIR filtering is performed digitally and a digital serializer (MUX) is used. In this case, some protection is obtained against the SNDR loss caused by duty-cycle error in the $f_s/2$ clock. This digital version of the FIR-response-based DAC fits very well with the MUX strategy of Fig. 2-26, as the FIR filtering is done prior to the MUX. This strategy is found to be very useful in TIDSM DACs and is analyzed further in Chapter 4.
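The magnitude of $G(z) = 1 + z^{-1}$ on the unit circle is $2|\cos(\pi f/f_s)|$: a 6 dB gain at DC and a null at $f_s/2$. The Python sketch below evaluates it, and also shows the digital pre-filter of Fig. 2.30(b), whose output range doubles and therefore needs one extra bit. The helper names are hypothetical.

```python
import cmath
import math

def fir_mag_db(f_norm):
    """|1 + z^-1| in dB at z = exp(j 2 pi f/f_s); f_norm = f / f_s."""
    g = abs(1 + cmath.exp(-2j * math.pi * f_norm))
    return 20 * math.log10(g) if g > 1e-12 else float("-inf")

def digital_fir(x):
    """Digital version: x'[n] = x[n] + x[n-1] (range doubles, one extra bit)."""
    return [a + b for a, b in zip(x, [0] + x[:-1])]

for f in (0.0, 0.1, 0.25, 0.45, 0.5):
    print(f"f/fs = {f:.2f}: {fir_mag_db(f):6.1f} dB")
```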
2.4 DAC Current Cell Design

The output swing required from the DAC is decided by the application and essentially dictates the DAC current cell design. A survey of the published literature shows that DACs for wireline applications such as cable modems or backplanes require a high output swing, e.g., 1.6 V in [48] and 2.5 V in [16], [21]. On the other hand, DACs for wireless applications that connect to an upconversion mixer require a relatively lower swing, e.g., 0.3 V in [8] and 0.6 V in [46]. In a ∆Σ DAC, the number of bits in the DAC is decided by the required SQNR/SNDR, while the output swing is decided by the application. Consider the simple representation of the N-bit DAC shown earlier in Fig. 1-4. The maximum differential output swing is given by

\[
V_{\text{swing}} = V_{\text{out}} - V_{\overline{\text{out}}} = 2NR_LI_{\text{unit}} \quad (2.19)
\]

\[
I_{\text{unit}} = \frac{V_{\text{swing}}}{2NR_L} \quad (2.20)
\]

The effective overdrive \(V_{gt}\) of the NMOS current source generating \(I_{\text{unit}}\) is then given by [54]

\[
V_{gt} = V_{gs} - V_T = \sqrt{\frac{2I_{\text{unit}}}{\mu_n C_{ox}(W/L)}} \quad (2.21)
\]

To achieve the desired SQNR/SNDR, the mismatch between the current cells must also be taken into account. The tolerable mismatch can be obtained through a system-level Monte-Carlo simulation, e.g., in Matlab® or Octave. As mentioned in Chapter 1, the random mismatch between two current cells is given by [13]

\[
\sigma^2\left(\frac{\Delta I_{\text{unit}}}{I_{\text{unit}}}\right) = \frac{1}{WL} \left(A_{\beta}^2 + \frac{4A_{VT}^2}{V_{gt}^2}\right) \quad (2.22)
\]

where \(A_{VT}\) and \(A_{\beta}\) are process-specific matching parameters of the threshold voltage and the current factor, respectively. Equations 2.20–2.22 determine the static parameters, i.e., the overdrive and the dimensions of the current source. However, to arrive at the overall DAC cell, the required linearity of the DAC must also be taken into consideration. The \(n^{th}\) order harmonic distortion \(HD_n\) of a DAC is given by [16, 17]

\[
HD_n = \left(\frac{N Z_L}{4 Z_{\text{cell}}}\right)^{n-1} \quad (2.23)
\]

where \(Z_L\) is the load impedance and \(Z_{cell}\) is the output impedance of the current cell. In many communication applications with multiple carriers, e.g., OFDM, the IM3 can be a more important specification than HD3 [16, 21]. To measure the IM3, two tones of equal amplitude (typically half the available full-scale voltage, i.e., −6 dBFS) are used. It can be shown that, for two −6 dBFS input tones, the relation between IM3 and the HD3 obtained with a full-scale single tone is

\[ IM3_{\text{dB}} = HD3_{\text{dB}} - 2.5 \quad (2.24) \]
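The −2.5 dB offset follows from a weakly nonlinear cell model \( y = x + a_3 x^3 \), which is an assumption made here for illustration. A full-scale single tone gives \( HD3 \approx a_3/4 \), while two tones of amplitude 1/2 give \( IM3 = \tfrac{3}{4} a_3 (1/2)^3 / (1/2) = 3a_3/16 \), so IM3/HD3 = 3/4, i.e., about −2.5 dB. The Python sketch checks this numerically with coherently sampled tones; the coefficient, record length and bin numbers are hypothetical.

```python
import cmath
import math

N = 1024            # record length (coherent sampling)
A3 = 1e-3           # hypothetical cubic coefficient: y = x + A3 * x**3

def nonlinear(x):
    return x + A3 * x ** 3

def tone_amp(sig, k):
    """Amplitude of DFT bin k for a coherently sampled real signal."""
    acc = sum(v * cmath.exp(-2j * math.pi * k * n / N) for n, v in enumerate(sig))
    return 2 * abs(acc) / N

k1, k2 = 37, 41     # tone bins chosen so that no products overlap

# HD3: single full-scale tone (amplitude 1)
single = [nonlinear(math.cos(2 * math.pi * k1 * n / N)) for n in range(N)]
hd3_db = 20 * math.log10(tone_amp(single, 3 * k1) / tone_amp(single, k1))

# IM3: two tones of amplitude 0.5 each (-6 dBFS), product at 2*k1 - k2
two = [nonlinear(0.5 * math.cos(2 * math.pi * k1 * n / N)
                 + 0.5 * math.cos(2 * math.pi * k2 * n / N)) for n in range(N)]
im3_db = 20 * math.log10(tone_amp(two, 2 * k1 - k2) / tone_amp(two, k1))

print(round(im3_db - hd3_db, 2))   # close to 20*log10(3/4)
```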

Depending on the HD3 and IM3 requirements, the desired output impedance of the DAC unit cell is calculated. Achieving this output impedance over the given bandwidth depends on the CMOS technology, the output swing and the available transistor flavours. DAC cell design is thus an iterative process in which all the constraints in Eqs. 2.20–2.24 must be satisfied. Based on these equations, a few commonly used DAC cell types have evolved, as shown in Fig. 2-31.
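One pass of the static part of this iteration can be sketched in Python. It uses Eq. 2.20 for \(I_{\text{unit}}\), Eq. 2.21 solved for \(W/L\), and the Pelgrom-type mismatch relation \(\sigma^2 = (A_\beta^2 + 4A_{VT}^2/V_{gt}^2)/(WL)\) solved for the gate area. We read \(N\) as the number of unit cells in Eq. 2.19, and every numeric value below (swing, load, \(\mu_n C_{ox}\), matching constants, target \(\sigma\)) is a hypothetical placeholder, not data from a specific process.

```python
import math

def size_unit_cell(v_swing, n_cells, r_load, v_gt, un_cox, a_vt, a_beta, sigma):
    """One pass of static unit-cell sizing (hypothetical, first-order models).

    v_swing [V], r_load [ohm], v_gt [V], un_cox [A/V^2],
    a_vt [V*um], a_beta [fraction*um], sigma = target std(dI/I).
    Returns I_unit [A], W/L, W [um], L [um].
    """
    i_unit = v_swing / (2 * n_cells * r_load)                       # Eq. 2.20
    w_over_l = 2 * i_unit / (un_cox * v_gt ** 2)                    # Eq. 2.21 for W/L
    area = (a_beta ** 2 + 4 * a_vt ** 2 / v_gt ** 2) / sigma ** 2   # W*L [um^2]
    w = math.sqrt(area * w_over_l)
    l = math.sqrt(area / w_over_l)
    return i_unit, w_over_l, w, l

# Illustrative numbers only (not from a specific process):
i_u, wl, w, l = size_unit_cell(v_swing=0.6, n_cells=7, r_load=50.0, v_gt=0.3,
                               un_cox=300e-6, a_vt=3.5e-3, a_beta=0.01, sigma=0.01)
print(f"I_unit = {i_u * 1e3:.3f} mA, W/L = {wl:.1f}, W = {w:.1f} um, L = {l:.2f} um")
```

The output impedance and HD3/IM3 checks of Eqs. 2.23–2.24 would then decide whether the cascode arrangement, \(V_{gt}\) or the cell dimensions need another pass.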

Figure 2.31(a) shows the traditionally used DAC current cell [17]. The output impedance is boosted by the cascode \( M_6 \) above the current source. The cascode is usually smaller than the current source so that the capacitance of the common node, \( C \), can be reduced, because a high capacitance on this node limits the output impedance of the cell at high frequencies. The switches \( M_2 \) and \( M_3 \) may operate in the linear or the saturation region depending on the speed, the available headroom and the bandwidth requirement. For fast switching, minimum-length transistors are often used. A drawback of this current cell is that the switch inputs couple directly (feedthrough) to the analog outputs through the switch gate-drain capacitance, causing data-dependent tones at the output [48, 50, 55]. The location of the tones due to this kind of coupling must be evaluated through simulation, since with oversampling these tones can also fall outside the signal bandwidth and hence may not always degrade the performance [48]. Another drawback reported for this type of current cell is the coupling of the outputs back to the sensitive common node $C$ through the drain-source capacitance of the switch, which results in distortion [56].

To alleviate these coupling problems, another commonly used DAC cell is shown in Fig. 2.31(b) [16, 21, 55]. The cascodes are now placed on top of the switches instead. This isolates the output node from the switch input and also the common node from the output. However, this leads to two limitations. Firstly, the capacitances at D and E now have to be charged/switched every time the current cell is activated, which is a source of distortion; the capacitance at these nodes must be reduced or their switching must be limited. Secondly, the common node capacitance is quite large in this case if the current sources are large. Leaker (or bleeder) current sources on nodes D and E are optionally used to reduce the switching of the capacitance at D and E [16]. The bleeder current sources require less than 2% of the unit current.

Hence, the third DAC cell, shown in Fig. 2.31(c), combines the advantages of the earlier two cells. The DAC cell has a higher output impedance due to the double cascoding. Its drawback is that it needs a larger supply voltage than the first two cells due to the two cascodes. A 2.5 V supply has often been used with this kind of current cell, as modern CMOS technologies offer 2.5 V thick-gate transistors [16, 57]. Low-voltage DACs, with supply voltages between 0.8 and 1.2 V in modern CMOS technologies, usually use one of the first two DAC cells [18, 58].

A fourth type of DAC cell, shown in Fig. 2-32, combines the multiplexing and the DAC function using a dual-current cell [48]. This cell achieves very good linearity and operating speed. The cascodes are driven directly by the clock in this case. A return-to-zero functionality is also supported, as the DAC current is routed back into the power supply when not in use. The main drawbacks of this implementation are, firstly, that the matching requirements on the DAC increase due to the doubling of current cells and, secondly, that from a layout perspective three times as many signals have to be routed into the cells. This results in higher routing congestion close to the current cells and requires a careful layout. The focus in this dissertation has been to use digital multiplexing and to reduce the number of DAC analog components. Hence, this dual current cell is not preferred.

In this dissertation, the DAC cell structure of Fig. 2.31(b) is chosen, with the switches operating in the deep-triode (linear) region. The switches are driven by a digital driver and hence have a full rail-to-rail swing at their inputs. Since only 7 (Chapter 3) and 15 (Chapter 5) DAC cells are used and a moderate linearity (8 bits) is targeted, leaker current sources are not required. This DAC cell is found to be sufficient to meet the dynamic requirements.

2.4 Switch Driver Design

In all the DAC current cell types shown previously, the potential on the common node, \( C \), must be kept constant for a good SFDR performance. Any potential that is lost through the parasitic capacitance between the common node and ground results in distortion. Even if it is difficult to maintain exactly the same potential, any variation should be symmetrical around the nominal value. To minimize the variation, the crossing point of the switch must be carefully optimized. Figure 2-33 shows the waveforms on the common node as a function of the switch crossing point [50].

The crossing point for an NMOS switch is usually kept higher than the mid-point if a full-swing CMOS signal is used. CMOS switch drivers have been shown to have the steepest transitions [16, 50]. Figure 2-34 shows such a fast-switching driver used at 5 GHz in 40-nm CMOS [21]. However, this latch-based switch driver cannot be used for the two time-interleaved DACs in this dissertation. Firstly, multiplexing of two channels using a half-rate clock is also involved. Secondly, the two-transistor stack in the pull-down of the latch and the contention between the two cross-coupled inverters result in slow rising transitions, which makes this driver unsuitable for higher speeds. Hence, Paper I uses a pseudo-differential centre-crossing driver for 8 GS/s operation and a 200 MHz data bandwidth at the expense of some loss in SNDR. This driver is described in Chapter 3. However, for a higher sample rate of 11 GS/s and a larger bandwidth of 1.1 GHz in Paper III, a high-crossing driver is required to extract the maximum DAC performance. This pseudo-differential driver is described in Chapter 5.

CML-based low-swing drivers are often used at high speeds and naturally provide a differential swing [55]. Because of their low swing, they also reduce the glitch on the common node potential. Moreover, they show better power supply rejection and less data-dependent ripple on the power supply. However, the reduced swing results in a smaller slope of the data transitions, which may lead to larger timing errors [59]. CML drivers are also found to be less power efficient than CMOS drivers [18]. In this dissertation, CMOS switch drivers have been used, with a view to a full-swing static-logic based design and to exploit their fast transitions.

### 2.5 Aspects of Clock Distribution

Any timing mismatch between the different DAC cells results in distortion. Hence, the final MUX and the switch driver cells require a clock tree that is carefully matched for equal delays between cells. Moreover, the number of clock buffering stages from the clock input to the switch driver should be minimized in order to reduce the additional clock jitter introduced by the distribution. Hence, the input clock must be located close to the final MUX. Figure 2-35 shows a possible floorplan and clock distribution of a second-order two-channel MASH 1-1 DSM DAC based on this constraint.

The short clock path to the MUX implies that this path is not part of the global clock distribution. The global clock distribution delivers the clock to all the remaining blocks in the DDSM and is about 400 µm long for the two DAC prototypes discussed in Chapters 3 and 5. Local clock buffers in every block drive a group of flip-flops. These local clock buffers have to provide both clock phases, i.e., clk and \( \overline{\text{clk}} \), to the static TGFFs, because the FF can be made more compact by avoiding clock inversion inside the cell. The global clock distribution itself can be single-ended or (pseudo-) differential; each option has been used in one of the two DACs. In the single-ended global clock distribution used in Chapter 3, the local clock buffers perform a single-ended to differential conversion, which causes rising and falling delay mismatches between the two phases that degrade the SNDR. Hence, a fully (pseudo-) differential clock is used in Chapter 5, which shows better performance.

A drawback of placing the input clock very close to the MUX is that the farthest FF (800–1000 µm away) also receives the DSM input data. The direction of data flow is thus opposite to that of the clock distribution, i.e., the farthest flop receives a late clock and early data, while the MUX receives an early clock and sits at the end of the data path. If the skew between the clock reaching an FF in the thermometer coding block and the MUX clock is very large (refer to Fig. 2-35), a setup time violation (data transfer error) occurs between these two blocks. In such a case, additional delay stages have to be inserted into the MUX clock tree to meet this timing, which may impact the dynamic performance of the DAC. Hence, careful clock skew simulations have to be performed on this path.
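The setup check on this path can be written as simple timing arithmetic. The sketch below assumes the half-cycle path between the channels and the MUX mentioned in Sec. 2.3, a 4 GHz half-rate clock, and hypothetical clock-to-Q, logic and setup delays; a positive skew means that the launching flop, far from the clock input, sees the later edge. All numbers are placeholders, not extracted from the prototypes.

```python
def setup_slack_ps(t_clk, path_fraction, skew, t_cq, t_logic, t_setup):
    """Setup slack (ps) of a launch-flop -> capture path.

    path_fraction: 1.0 for a full-cycle path, 0.5 for a half-cycle path.
    skew: launch clock arrival minus capture clock arrival (positive when the
          launching flop, far from the clock input, sees the later edge).
    """
    return path_fraction * t_clk - (skew + t_cq + t_logic + t_setup)

T = 250.0   # 4 GHz half-rate clock period in ps
# Hypothetical delays: t_cq = 30 ps, t_logic = 20 ps, t_setup = 15 ps.
for skew in (40.0, 70.0):
    slack = setup_slack_ps(T, 0.5, skew, 30.0, 20.0, 15.0)
    print(f"skew {skew:.0f} ps -> slack {slack:+.0f} ps")

# Delaying the MUX (capture) clock by 20 ps cuts the effective skew from 70 to 50 ps,
# at the cost of extra buffering in the jitter-sensitive MUX clock path.
print(setup_slack_ps(T, 0.5, 70.0 - 20.0, 30.0, 20.0, 15.0))
```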

### 2.6 DAC Testing Challenges

At high sampling rates, DAC testing is also a challenge, as high-speed digital data must be generated to feed the DAC. The traditional method of DAC testing uses a high-speed receiver IO interface on the digital inputs [21, 60]. The high-speed interface can directly transfer data at the full rate [60] or use multiple lower-frequency streams that are serialized on-chip [21]. For multi-GHz speeds, this strategy is very challenging. Moreover, it also requires differential signalling, e.g., LVDS, which increases the pin count. Since DACs are increasingly embedded in complex SoCs, on-chip DFT features are also required to enable faster testing [59]. One way to do this is to integrate digital sine generators (or digital frequency synthesizers) on chip. Two sine generators are required to perform two-tone testing [16]. In addition, PRBS generators may be required to test the spectral mask of a given standard [46].

![Figure 2-35: Floorplan of a TIDSM MASH 1-1 with clock distribution.](image-url)

A drawback of sine generators is that they also contain long feedback paths, similar to integrators. Their critical path may be longer than that of the TIDSM DAC under test [61, 62], so even the sine generators may have to be time-interleaved. An alternative to digital sine generators is on-chip memory that can provide high-speed data for the DAC. Memories facilitate flexible testing, as any test pattern can be loaded into them. High-speed embedded memories enabling full-speed testing have been reported in a few previous works [7, 49, 59]. The depth of the memory is determined by the desired spacing between adjacent frequency bins: for a depth $D$, the bin spacing is $f_s/D$. For two-tone testing using two coherently sampled sinusoidal signals, $2f_s/D$ is the closest obtainable spacing between the two tones. The challenge with using the memory is extracting the digital word from all the available bits in every clock cycle.
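A two-tone memory pattern can be sketched as follows. The sketch assumes a power-of-two depth $D$ and the usual coherent-sampling requirement that each tone's bin index be coprime with $D$ (so that all $D$ phases are distinct); the bin indices must then be odd, which is consistent with the $2f_s/D$ minimum tone spacing stated above. The depth, read rate, bin and word width are hypothetical.

```python
import math

def coherent_two_tone(depth, k1, bits):
    """Two-tone test pattern for a memory of depth D (power of two).

    Coherent sampling with D distinct phases needs each bin index coprime
    with D, i.e. odd, so the closest usable pair is k1 and k1 + 2.
    Each tone is at -6 dBFS; samples are signed integers of `bits` bits.
    """
    assert depth & (depth - 1) == 0 and math.gcd(k1, depth) == 1
    k2 = k1 + 2
    full_scale = 2 ** (bits - 1) - 1
    pattern = [round(full_scale * (0.5 * math.sin(2 * math.pi * k1 * n / depth)
                                   + 0.5 * math.sin(2 * math.pi * k2 * n / depth)))
               for n in range(depth)]
    return pattern, k2

fs, depth = 5.5e9, 256          # hypothetical read rate and depth
pattern, k2 = coherent_two_tone(depth, k1=21, bits=12)
print(f"bin spacing {fs / depth / 1e6:.2f} MHz, tone spacing {2 * fs / depth / 1e6:.2f} MHz")
```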

Figure 2-36 shows the three possible memory architectures for extracting the digital word. In the first option, Fig. 2.36(a), the memory can be static-RAM or flip-flop based, and data does not move around inside the memory. A decoder issues an address that fetches the correct data through multiplexing. The multiplexing forms the critical path that eventually limits the read speed of the memory [49]. The memory itself is refreshed periodically with a low-frequency clock. Each memory cell can be made very compact, as the data remains static during the testing. Moreover, the read and write paths of the memory are independent in this architecture.

In the second option, Fig. 2.36(b), the memory is entirely flip-flop based and operates on the full-speed clock $f_{clk}$. The memory acts like a shift register wherein the digital data is cycled consecutively from one position (address) to the next, so the multiplexing is inherent in the architecture. This implementation is very simple; however, it is unsuitable for very large memory sizes, as it is not very compact and requires an extensive high-speed clock distribution. The final option, Fig. 2.36(c), combines the advantages of the first two. The memory of depth $D$ is divided into $M$ blocks of depth $D/M$ that operate like shift registers at a frequency of $f_{clk}/M$, similar to Fig. 2.36(b). The blocks are then multiplexed by an $M : 1$ MUX using a decoder, similar to Fig. 2.36(a). This style is popular for relatively large memories at high speeds and has been used previously in [7] and [59]. An additional advantage of the styles of Figs. 2.36(b) and 2.36(c) is that the power-supply and substrate noise they introduce is largely data independent, as all the flip-flops are part of a shift register that operates every clock cycle [59]. This property makes them of interest in mixed-signal designs.
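The read-out order of the segmented option can be modelled in a few lines of Python (a behavioural sketch with a hypothetical function name; timing is not modelled). Block $i$ holds words $i, i+M, i+2M, \ldots$; every $M$ fast cycles all blocks shift by one position, and the $M:1$ MUX picks block $n \bmod M$ at fast cycle $n$, so the pattern is replayed in its original order.

```python
from collections import deque

def segmented_readout(pattern, m, cycles):
    """Behavioural model of option (c): D words split into M rotating blocks.

    Block i holds words i, i+M, i+2M, ... and rotates once per f_clk/M cycle;
    an M:1 MUX (counter/decoder driven) reads block n mod M at fast cycle n.
    """
    d = len(pattern)
    assert d % m == 0, "depth must be a multiple of M"
    blocks = [deque(pattern[i::m]) for i in range(m)]
    out = []
    for n in range(cycles):
        sel = n % m                  # M:1 MUX select at f_clk
        out.append(blocks[sel][0])   # read the head of the selected block
        if sel == m - 1:             # every M fast cycles: all blocks shift
            for b in blocks:
                b.rotate(-1)
    return out

print(segmented_readout(list(range(16)), m=4, cycles=20))  # 0..15, then 0, 1, 2, 3
```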

Considering all the aforementioned trade-offs, the 1-Kb test memory operating at 5.5 GHz that is described in Chapter 5 uses the architecture of Fig. 2.36(a). Since only a relatively small and compact memory is required, this style is sufficient, and it does not require a high-speed clock to refresh/hold the data. The critical path of the serializer MUX can be met for this memory size. However, this memory introduces data-dependent substrate noise that can affect the sensitive analog blocks. Hence, the DAC is placed in a deep N-well to increase the substrate isolation.

## 2.7 Summary

This chapter explored the limitations of conventional $\Delta \Sigma$ DACs for wideband implementations. Time-interleaving can be used for improving the speed and bandwidth of $\Delta \Sigma$ DACs, and different trade-offs in TIDSM architectures were presented. The degree of interleaving, i.e., the number of channels, is determined by the amount of timing relaxation required in the logic, the ease of data multiplexing and the complexity of the clocking. Additionally, the design choices for the DAC analog current cells were discussed and the challenges in testing high-speed DACs were presented.
Chapter 3

An 8-GS/s 200-MHz BW Interleaved ∆Σ DAC in 65-nm CMOS

3.1 Introduction

High-speed high-data-rate communication requires increasingly larger bandwidths, such as the recent wideband standards for UWB [63] and 60-GHz [4, 6] radios that have large RF bandwidths of 528 MHz and 1.7 GHz per channel respectively. As discussed previously in Chapter 1, ∆Σ DACs offer the benefit of a reduced reconstruction filter order and DAC unit cell requirement through digital signal processing. However, the bandwidth of ∆Σ DACs reported so far has been limited to 100 MHz (i.e. 200-MHz RF bandwidth when used in I & Q paths of transmitters) [26]. Aiming to improve the bandwidth of ∆Σ DACs for wideband communication, this work presents a ∆Σ DAC with 200-MHz bandwidth in 65-nm CMOS. This DAC is suitable for a transmitter chain similar to that presented in [8] and shown in Fig. 3-1.

OFDM modulation is commonly used for transmission at high data rates. If an OFDM signal is used with a modulation scheme like QPSK, 16-QAM or 64-QAM, an SNR in the range of 25-40 dB is required from the DAC [26], [64]. Achieving such an SNDR for bandwidths over 200 MHz requires DAC sampling rates exceeding 8 GS/s. The speed of the DSM in the DAC then becomes the limiting factor for this sampling rate when a conventional implementation is used.

A MASH-based ∆Σ DAC is suited for high-speed implementation because it is inherently stable, has a short integrator critical path and easily allows pipelining compared to other typical modulator architectures [34]. Matlab® simulations show that a MASH 1-1 DSM DAC with seven thermometer-coded current cells (3-bit output) can provide an SNDR of 45 dB at 200 MHz bandwidth and 8 GS/s (OSR = 20) with a current mismatch ($\sigma$) of 3.6%. A second-order modulator presents only a moderate filtering complexity for suppressing the generated out-of-band quantization noise [8, 26]. For the chosen MASH 1-1 modulator, simulations show that a second-order low-pass filter can satisfy the spectral mask of the UWB standard [63]. With these relaxed analog constraints, the design becomes mainly a digital problem of building a DSM for such a high speed. For a conventional MASH 1-1 DDSM that uses static CMOS logic, 8 GHz is found to be very close to the limit of a standard 65 nm technology node at 1 V, even when extensive pipelining with 1-bit integrators is used [27] (see also Sec. 2.1.2). Hence, a time-interleaved MASH DSM is necessary to relax the critical path and the clock rate. As discussed in Sec. 2.3, a two-channel strategy is chosen, giving higher priority to simplifying the MUX and the current cells. The two-channel approach offers two benefits: first, the logic and the MUX work with a single half-rate clock; second, it allows multiplexing with rail-to-rail swing and the use of traditional current cells (Sec. 2.4). However, this requires the logic in each channel to operate at 4 GHz, which is still a challenging task. This speed has been achieved by a careful integrator design that is tailored to the interleaved structure, as described in the following section.
3.2 DSM Design

Fig. 3-2 shows the detailed block diagram of the proposed 12-bit ΔΣ DAC that uses the two-channel interleaved MASH 1-1 architecture and seven current cells. This figure is based on the first-order TI EFB DSM shown earlier in Fig. 2-21. Inputs $x_0[11:0]$ and $x_1[11:0]$ form the two half-rate channels and are the even and odd sample streams of the original full-rate input. The 10-bit back-to-back integrators/adders, i.e., the feedback path in each of the first-order DDSMs of the MASH, form the critical path in the design (shown with a dashed rectangle). The carry generated by the integrators is added to the remaining two MSB input bits, $x_0[11:10]$ and $x_1[11:10]$. Being in the forward path, this addition and the end processing are not critical. Therefore, the main design problem becomes that of two 10-bit back-to-back adders that must operate at over 4 GHz ($<250$ ps cycle time).
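The dataflow of Fig. 3-2 can be sketched behaviourally in Python (hypothetical function names; this is not the author's RTL). It assumes unsigned 12-bit inputs split into two MSBs and 10 LSBs, first-order accumulator stages whose carries form the stage outputs, and the usual MASH 1-1 combination $y = \text{MSBs} + c_1 + (1 - z^{-1})c_2$. The two-channel version chains adder A (Ch0) and adder B (Ch1) within each $f_s/2$ cycle, and the final check confirms that interleaving the channel outputs reproduces the full-rate modulator.

```python
import random

LSB = 10                          # integrator width (10 LSBs)
MASK = (1 << LSB) - 1

def mash11_full_rate(x):
    """Reference full-rate MASH 1-1 on 12-bit unsigned input words."""
    y, e1, e2, c2_prev = [], 0, 0, 0
    for v in x:
        s1 = (v & MASK) + e1
        c1, e1 = s1 >> LSB, s1 & MASK              # stage 1 accumulator
        s2 = e1 + e2
        c2, e2 = s2 >> LSB, s2 & MASK              # stage 2, fed by stage-1 residue
        y.append((v >> LSB) + c1 + c2 - c2_prev)   # MSBs + c1 + (1 - z^-1) c2
        c2_prev = c2
    return y

def mash11_two_channel(x0, x1):
    """Two-channel TI MASH 1-1 at f_s/2: per stage, adder A (Ch0) feeds adder B
    (Ch1) back to back, and only Ch1's residue is registered for the next cycle.
    Inter-stage pipeline registers are omitted; in hardware they add latency
    that must be matched on the c1 and MSB paths."""
    y0, y1 = [], []
    e1_reg = e2_reg = c2_1_prev = 0
    for a, b in zip(x0, x1):
        s = (a & MASK) + e1_reg
        c1_0, e1_0 = s >> LSB, s & MASK            # stage 1, adder A
        s = (b & MASK) + e1_0
        c1_1, e1_1 = s >> LSB, s & MASK            # stage 1, adder B
        s = e1_0 + e2_reg
        c2_0, e2_0 = s >> LSB, s & MASK            # stage 2, adder A
        s = e1_1 + e2_0
        c2_1, e2_1 = s >> LSB, s & MASK            # stage 2, adder B
        # Forward-path end processing: (1 - z^-1) c2 across the two channels
        y0.append((a >> LSB) + c1_0 + c2_0 - c2_1_prev)
        y1.append((b >> LSB) + c1_1 + c2_1 - c2_0)
        e1_reg, e2_reg, c2_1_prev = e1_1, e2_1, c2_1
    return y0, y1

random.seed(3)
x = [random.randrange(1 << 12) for _ in range(400)]
y0, y1 = mash11_two_channel(x[0::2], x[1::2])
y_ti = [v for pair in zip(y0, y1) for v in pair]
print(y_ti == mash11_full_rate(x))   # True
```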

3.2.1 Analysis of Critical Path

Although this DSM is targeted for an 8 GHz speed, higher speeds are required if even larger bandwidths are targeted. Hence, it is important to take an in-depth look at the critical path of the two-channel TIDSM in order to understand the speed limitations of this architecture. Additionally, the optimum pipeline depth and logic style for the implementation can be determined on the basis of this study.

Figure 3-3: N-bit deep Integrator Pipeline. Critical path is from flop FF$_{S0}$ to flop FF$_{SN-1}$.

Consider Fig. 3-3, which shows one N-bit pipeline stage of the two back-to-back adders A (Ch0) and B (Ch1) of Fig. 3-2 in more detail. Each adder is constructed from 1-bit adders, A$_0$-A$_{N-1}$ and B$_0$-B$_{N-1}$ respectively. Outputs S$_0$ and S$_1$ are the running sums of the integrator for channel 0 and channel 1 respectively. The critical path lies between the flip-flops FF$_{S0}$ (start) and FF$_{SN-1}$ (end). The carry signals of both adders propagate upward, while the sum propagates laterally. From this figure, an important observation about the two adders can be made: adder A must generate its sum and carry outputs with equal delay, whereas for adder B only the carry generation is critical, as is the case in conventional adders. To understand this better, note that the worst case delay can result from three different types of paths. In the first case, it results mainly from the carry chain of A, i.e. the path $FF_{S0} \rightarrow A_0 \rightarrow A_1 \ldots \rightarrow A_{N-1} \rightarrow FF_{SN-1}$. In the second case, the delay comes mainly from the carry chain of B, i.e. $FF_{S0} \rightarrow A_0 \rightarrow B_0 \rightarrow B_1 \ldots \rightarrow B_{N-1} \rightarrow FF_{SN-1}$. In the final case, the delay comes partially from the carry chain of A and partially from that of B, i.e. $FF_{S0} \rightarrow A_0 \rightarrow A_1 \ldots \rightarrow B_{N-2} \rightarrow B_{N-1} \rightarrow FF_{SN-1}$. In the second and third cases, it is especially important that the sums generated by the 1-bit adders $A_0$-$A_{N-1}$ are transferred to adder B as fast as possible. Thus, A has both of its outputs, carry and sum, in the critical path, which makes it inherently slower than B.
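The three path types can be illustrated with a simple arrival-time model. The sketch below assumes, for illustration only, that every 1-bit adder has a single lumped delay (a hypothetical `t_a` for the cells of A and `t_b` for those of B), that all inputs are launched by the start flops at t = 0, that A$_i$ waits for the carry of A$_{i-1}$, and that B$_i$ waits for both the sum of A$_i$ and the carry of B$_{i-1}$. It ignores the sum/carry skew and the fan-out effects discussed next.

```python
def worst_arrival(n_bits, t_a=1.0, t_b=1.0):
    """Latest arrival time at the end flops of one N-bit pipeline stage
    of the back-to-back adders A and B (lumped per-cell delays)."""
    arr_a = []  # output arrival times of A_0 .. A_{N-1}
    for i in range(n_bits):
        carry_in = arr_a[i - 1] if i > 0 else 0.0
        arr_a.append(carry_in + t_a)
    arr_b = []  # output arrival times of B_0 .. B_{N-1}
    for i in range(n_bits):
        carry_in = arr_b[i - 1] if i > 0 else 0.0
        arr_b.append(max(arr_a[i], carry_in) + t_b)  # also needs sum of A_i
    return max(arr_a[-1], arr_b[-1])

# With equal unit delays the worst path crosses N + 1 adders
for n in range(1, 6):
    print(n, worst_arrival(n))

# A slower adder A moves the worst case onto the mixed A/B path
print(worst_arrival(2, t_a=1.2, t_b=1.0))
```

In this simplified model the worst case grows as N + 1 adder delays, which appears consistent with normalizing the total adder delay by D + 1 in Table 3-1.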

It can further be seen that even within adders A and B, the individual 1-b adders are inherently different. In adder A, $A_0$ is the slowest cell since it needs to generate the inverted inputs as well as the complementary sum and carry outputs with equal delay. Hence, the fan-out of $A_0$ is the highest. Adders $A_1$-$A_{N-2}$ have the same fan-out as $A_0$ but are faster by one inverter delay, as their complementary inputs are already available. $A_{N-1}$ is the fastest cell in adder A since it only needs to generate three outputs, i.e. $co$, sum and $\overline{sum}$, and hence has a lower fan-out. A similar analysis of adder B shows that $B_0$-$B_{N-2}$ are of the same type and slower than $B_{N-1}$. Thus, in one N-bit deep integrator pipeline, the 1-bit adders can be listed in order of decreasing delay as follows: $A_0 > A_1 \ldots A_{N-2} > A_{N-1} > B_0 \ldots B_{N-2} > B_{N-1}$.

While these observations form the basis for designing the 1-bit adders, a further delay improvement is possible by noting that the $\overline{sum}$ outputs of $B_0$-$B_{N-2}$ are not in the critical path. Hence, the drive strength on these nodes can be reduced. This improves the delay from the sum outputs of $A_0$-$A_{N-2}$ to the carry outputs of $B_0$-$B_{N-2}$, respectively, since there is now less capacitance on this path. A similar situation exists for the $co$ output of $A_{N-1}$, and hence this node can also be slowed down to speed up its sum output.

Additional overheads in the integrator further contribute to the overall delay. Firstly, integrators are often required to be reset at start-up in order to drive them to a known state. Hence, a NOR gate is required at all outputs of the pipeline to synchronously reset the integrator. This increases the load on the $\overline{sum}$ outputs of all the adders and introduces a skew between the sum and $\overline{sum}$ outputs of $A_0$-$A_{N-1}$. A second overhead is the load of two flip-flops on the $S_1$ outputs (see Fig. 3-3). As can be seen from Fig. 3-2, the integrator output is fed back and also needs to be sent to the next MASH stage, since $\Delta \Sigma$ modulators rarely employ a first order shaping function. Replicating the flops results in a lower delay than having one flop drive both the feedback path and the next MASH stage.

With this understanding of the factors that affect the overall delay, the integrator pipeline of Fig. 3-3 was simulated using combinational static CMOS logic, with the pipeline depth varied from one to five and the worst case delay calculated in each case. The 1-bit adders were implemented as carry select adders, while the flip-flops were standard transmission gate flip-flops (TGFF). The simulations were carried out in a standard 65 nm CMOS technology at 1 V supply and 75°C using low-V$_T$ general purpose devices, with post-layout extracted netlists for the flops and adders. Table 3-1 shows the obtained delays and the overall effective throughput as a function of pipeline depth.

Table 3-1: Maximum achieved effective speed as a function of the pipeline depth in a two-channel interleaved modulator.

<table>
<thead>
<tr>
<th>Pipeline Depth, $D$</th>
<th>Overhead Delay*, $T_f$ (ps)</th>
<th>Total Adder Delay, $T_{ad}$ (ps)</th>
<th>Total Path Delay**, $T_p$ (ps)</th>
<th>Effective Speed, $\frac{2}{T_p}$ (GHz)</th>
<th>Avg. 1-b Adder Delay, $\frac{T_{ad}}{(D+1)}$ (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>78</td>
<td>110</td>
<td>188</td>
<td>10.6</td>
<td>55</td>
</tr>
<tr>
<td>2</td>
<td>88</td>
<td>130</td>
<td>219</td>
<td>9.1</td>
<td>44</td>
</tr>
<tr>
<td>3</td>
<td>88</td>
<td>180</td>
<td>268</td>
<td>7.4</td>
<td>45</td>
</tr>
<tr>
<td>4</td>
<td>88</td>
<td>200</td>
<td>288</td>
<td>6.9</td>
<td>40</td>
</tr>
<tr>
<td>5</td>
<td>88</td>
<td>237</td>
<td>325</td>
<td>6.1</td>
<td>39</td>
</tr>
</tbody>
</table>

* Overhead delays include:
  a) Flop delay = 30 ps for D=1 and 40 ps for D>1.
  b) Flop setup time = 23 ps.
  c) NOR gate delay = 25 ps.

** Simulation with post-layout extracted netlists, 1 V supply at 75°C.

Table 3-1 shows that although an effective speed greater than 10 GHz can be achieved with a 1-bit pipeline, the number of flops required just to pipeline the inputs of a first order 12-bit to 3-bit MASH modulator (Fig. 3-2) is ∼140, which makes clock distribution at the required ∼5 GHz clock rate very challenging. The average 1-b adder delay flattens to an optimal value of about 39 ps for pipeline depths above three, resulting in speeds of up to 6.9 GHz with an optimal flop count. Between these two ends of the solution space, a pipeline depth of two or three can be used for speeds between 7.4 and 9.1 GHz, with a moderate increase in the number of flip-flops (40%) and an average 1-b adder delay that is about 12% above the optimum. It can also be noted that the fixed overhead delays resulting from the flop delay, the setup time and the reset NOR gate account for roughly 27% (D = 5) to 41% (D = 1) of the overall path delay.
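The derived columns of Table 3-1 follow directly from the listed delays. The Python sketch below only restates the table's relations (overhead from the footnote, effective speed $2/T_p$, average adder delay $T_{ad}/(D+1)$) and adds the half-rate clock frequency and the overhead share of the path delay; small differences from the printed values come from rounding in the table. The dictionary and function names are hypothetical.

```python
# Delay entries of Table 3-1 in ps: depth D -> (T_f, T_ad, T_p)
TABLE_3_1 = {
    1: (78, 110, 188),
    2: (88, 130, 219),
    3: (88, 180, 268),
    4: (88, 200, 288),
    5: (88, 237, 325),
}

def overhead_ps(depth):
    """Footnote of Table 3-1: flop delay + setup time + reset NOR gate."""
    flop = 30 if depth == 1 else 40
    return flop + 23 + 25

def clock_ghz(t_p_ps):
    """Half-rate clock frequency allowed by a path delay of t_p_ps."""
    return 1e3 / t_p_ps

def effective_speed_ghz(t_p_ps):
    """Two output samples (one per channel) per half-rate clock cycle."""
    return 2 * clock_ghz(t_p_ps)

for d, (t_f, t_ad, t_p) in TABLE_3_1.items():
    assert overhead_ps(d) == t_f
    print(f"D={d}: clock {clock_ghz(t_p):.2f} GHz, "
          f"effective {effective_speed_ghz(t_p):.2f} GHz, "
          f"avg adder {t_ad / (d + 1):.1f} ps, "
          f"overhead share {100 * t_f / t_p:.0f}%")
```

For D = 1 the required half-rate clock is about 5.3 GHz, which is the clock distribution problem mentioned above.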

3.2.2 Comparison With Alternative Logic Styles

Since static CMOS logic is extremely robust with very good noise immunity, the simulation results presented in the previous section serve as a baseline against which further delay optimizations or alternative logic styles can be compared. This section presents some alternatives for delay improvement and discusses the trade-offs involved, using the 2-bit pipeline shown in Fig. 3-4 as the reference for all discussions.

Firstly, $FF_{S0}$ of Fig. 3-3 can be replicated so that both complementary inputs are provided directly to adder $A_0$ in Fig. 3-4 (single dashed square). This gives back one inverter delay of 12 ps; however, the fan-out of the sum output of $B_0$ is now higher than in Fig. 3-3, which reduces this gain to about 4 ps. Thus, providing complementary inputs to the first adder is only moderately beneficial due to the feedback nature of the critical path.

A second optimization strategy moves the reset logic inside the flop, making it an asynchronous reset instead (Fig. 3-4, double dashed square). This gives back the 25 ps of the NOR gate (Table 3-1) but increases the flop delay by 15 ps (7 ps in setup time and 8 ps in flop output delay). The result is an overall delay improvement of 10 ps which, combined with the complementary inputs described above, yields a total improvement of 14 ps at the cost of a 25% increase in flop area.

Static TGFFs are not the fastest flops; a True Single Phase Clocked Register (TSPCR) FF instead offers a reduced setup time and a single clock phase [65]. The TSPCRFF shows a 10 ps improvement for the same clock load as a TGFF and, combined with providing the inverted inputs [7] to $A_0$, an overall 17 ps improvement is achieved. However, being a dynamic flop, the TSPCRFF suffers from lower noise margins.

Alternatively, every other CMOS gate in the path can be replaced with ratioed (pseudo-NMOS) logic to reduce the adder load, as shown in Fig. 3.5(a). Full swing is restored after every second gate. This yields a 20 ps overall improvement, but at the cost of 80% higher power due to the static current of this logic style.

Lastly, a pre-charged domino logic style was also simulated using a two-phase clock system, as shown in Fig. 3.5(b). Adder A operations are performed in the first clock phase, while adder B works in the second. Since the number of adders in the critical path is $D+1$ for a pipeline depth $D$, a pipeline depth of two (an odd number of additions) is inefficient, as the additions cannot be spread equally over both clock phases.

![Figure 3-4: A 2-bit pipeline with optimization. Single dashed box shows complementary inputs. Double box shows reset moved to the flop.](image-url)
An 8-GS/s 200-MHz BW Interleaved ΔΣ DAC in 65-nm CMOS

Figure 3-5: Ratioed and Dynamic Logic Implementation of the integrator.

Table 3-2 shows a relative comparison of the delays of all the discussed styles. It shows that pure combinational static logic with complementary inputs and a TGFF with an asynchronous reset yields a speed very close to 10 GHz (Option 2). Achieving speeds above 10 GHz requires reduced-swing or dynamic logic (Options 3 and 5), which comes with a high power penalty and/or reduced noise immunity (Options 3, 4 and 5). Moreover, the speed improvement of these logic styles is only between 5–11% compared with the delays obtained using static logic (Options 1 and 2). Hence, these styles are less attractive for implementing very high speed two-channel interleaved ΔΣ modulators.
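As a cross-check, the effective speeds in Table 3-2 and the delay savings quoted in the preceding paragraphs can be recomputed from the listed delays. The sketch below is only arithmetic on those numbers; the dictionary names are hypothetical.

```python
# Worst-case delays of Table 3-2 in ps (2-bit pipeline): option -> delay
TABLE_3_2 = {1: 219, 2: 204, 3: 198, 4: 201, 5: 195, 6: 213}

# Delay savings quoted in the text relative to Option 1, in ps
TEXT_SAVINGS = {
    2: 4 + 10,  # complementary inputs (~4 ps) + reset moved into the flop (10 ps)
    3: 20,      # ratioed (pseudo-NMOS) logic
    4: 17,      # TSPCRFF with complementary inputs
}

baseline = TABLE_3_2[1]
for option, delay in TABLE_3_2.items():
    speed = 2e3 / delay                          # effective speed in GHz (2/T_p)
    reduction = 100 * (baseline - delay) / baseline
    print(f"Option {option}: {speed:.2f} GHz, {reduction:.1f}% less delay")

for option, saving in TEXT_SAVINGS.items():
    # Text estimates and table entries agree to within 1 ps of rounding
    assert abs((baseline - saving) - TABLE_3_2[option]) <= 1
```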

Based on the results in Table 3-2, Option 1, a TGFF with static combinational logic and synchronous reset in a 2-bit pipeline, is found to be sufficient for the targeted 8 GS/s in this work. Figure 3-6 shows the structure of the implemented first order EFB TIDSM that is instantiated twice to obtain the MASH 1-1 structure.

Table 3-2: Delay comparison of alternative logic styles for 2-bit pipelines.

<table>
<thead>
<tr>
<th>Option No.</th>
<th>FF Type</th>
<th>Logic Style</th>
<th>Delay (ps)</th>
<th>Eff. Speed (GHz)</th>
<th>Impact</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>TGFF</td>
<td>Combinational CMOS</td>
<td>219</td>
<td>9.1</td>
<td>Baseline measurement.</td>
</tr>
<tr>
<td>2</td>
<td>TGFF with Reset</td>
<td>Combinational CMOS &amp; complementary inputs</td>
<td>204</td>
<td>9.8</td>
<td>None.</td>
</tr>
<tr>
<td>3</td>
<td>TGFF</td>
<td>Ratioed (pseudo-NMOS)</td>
<td>198</td>
<td>10.1</td>
<td>80% increase in power; reduced noise margin.</td>
</tr>
<tr>
<td>4</td>
<td>TSPCRFF</td>
<td>Combinational CMOS &amp; complementary inputs</td>
<td>201</td>
<td>9.9</td>
<td>Dynamic nodes in TSPCRFF.</td>
</tr>
<tr>
<td>5</td>
<td>Tristated Inverter Latch</td>
<td>Pre-charged Domino</td>
<td>195</td>
<td>10.2</td>
<td>Increased design complexity; doubled clock load; reduced noise margin.</td>
</tr>
<tr>
<td>6</td>
<td>TGFF with Reset</td>
<td>Complementary pass-transistor logic with keepers</td>
<td>213</td>
<td>9.4</td>
<td>None.</td>
</tr>
</tbody>
</table>

#### 3.2.3 Clock Distribution

To keep the TGFFs compact by avoiding a clock inverter inside each flip-flop, both clock phases are provided from outside the FF. While the global clock distribution is single-ended, the pseudo-differential clock driver of Fig. 3.7(a) distributes the clocks locally to the FFs with a 30 ps slope. One such clock driver serves every 18 FFs (fan-out of 3) in a $70\,\mu m \times 45\,\mu m$ area (see Fig. 3.7(b)). Since both clock phases have an equal load, this driver also minimizes the overlap between clk and its complement.
3.2.4 Final Multiplexer and Current Cell Design

Fig. 3-8 shows the static 2:1 MUX per cell. The data from the second channel is shifted by half a cycle prior to the MUX. The MUX is single-ended, and the complementary DAC switch signals are generated after the MUX with a 15 ps slope by a centre-crossing pseudo-differential switch driver. The switch driver uses the same circuit as the local clock driver of Fig. 3.7(a). A high-crossing switch driver (Fig. 2-34), modified for this MUX structure, would ideally be used in the DAC, but such a driver is a challenge at 8 GS/s due to the high capacitance and contention at its cross-coupled nodes [16]. Hence, the centre-crossing driver is chosen to meet the speed and slope requirements. The DAC is sensitive to a mismatch between the driver rise and fall delays and requires a careful driver design. However, due to the pseudo-differential nature of the driver, 3 ps is the smallest mismatch that could be achieved in this design, and this mismatch is found to degrade the SNDR by 4 dB in post-layout simulation. The clock to the MUX bypasses the main clock distribution and uses a minimal buffer chain of four stages. The MUX is sensitive to the clock duty cycle, and simulations show an SNDR reduction of 4 dB for every 1% deviation from the desired 50% duty cycle.

Figure 3-8: Final MUX and DAC current cell with the timing diagram. Switch driver circuit is the same as the local clock driver of Fig. 3.7(a).

The DAC current cell, also shown in Fig. 3-8, is designed for a 0.3 Vpp-diff swing with a 100 Ω passive load and a 1.2 V supply. The current source dimensions for the 3.6% mismatch requirement are derived from foundry matching parameters. The current cells are designed for an IM3 below $-60$ dBc at 200 MHz using Eq. (2.23), which implies an output impedance greater than 4.8 kΩ. The simulated output impedance profile is shown in Fig. 3-9. The current cell uses transistors M4-M5 as the cascode pair on top of the switching pair M2-M3, which is biased in the linear region, and M1 as the current source. The seven current cells are laid out in a single column with dummies on either side to simplify route matching between the switch driver and the switches.
3.3 Measurement Results

The proposed ∆Σ DAC is implemented in a standard 65 nm CMOS process and uses only general purpose low-\(V_T\) (DDSM & DAC) and standard-\(V_T\) devices (DAC). Fig. 3-10 shows the chip photograph. The active area of the DAC is 0.13 mm\(^2\). The test chip is directly wire-bonded to an FR4 PCB.

Fig. 3-11 shows the measurement setup. A 12-bit input with a length of 8192 samples and frequency \(F_{in}\) is sent to the chip at a rate \(F_{bb}\) using a Tektronix AWG5012C pattern generator. Inside the chip, this data is up-sampled to the \(F_s\) rate by a zero-order hold operation. In addition, the two DDSM channel inputs are shorted together. Together, the zero-order hold and the shorted inputs up-sample the input data by a factor of \(2F_s/F_{bb}\), since the DAC effectively works at a \(2F_s\) sampling rate. No high-speed FIR low-pass filter is implemented to remove the up-sampling images at the DDSM input. With this simplified setup, these unfiltered images also appear at the DAC output, but they lie outside the band of interest. In a real application, however, the up-sampling images are filtered out prior to the DDSM. By keeping \(F_{in} \leq F_{bb}/4\), the unfiltered images do not produce intermodulation products in the band of interest (DC to \(F_{in}\)). A single-ended sinusoidal clock of frequency \(F_s\) is sent into the chip, where it is amplified to full swing and then converted from single-ended to differential. The duty cycle of the \(F_s\) clock is tuned off-chip before the measurements by varying the DC component of the sinusoidal clock entering the chip.
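The image locations in the measured spectrum follow from the up-sampling factor. The Python sketch below is a simplification that ignores the sinc attenuation of the zero-order hold and the noise shaping; it lists the images of a tone at \(F_{in}\), sampled at \(F_{bb}\), that fall below the Nyquist frequency of the effective \(2F_s\) rate. The function names are hypothetical.

```python
def upsampling_factor(f_s, f_bb):
    """Zero-order hold plus shorted channel inputs: the DAC runs at 2*F_s."""
    return 2 * f_s / f_bb

def image_frequencies(f_in, f_bb, f_nyq):
    """Images of a tone at f_in, originally sampled at f_bb, that fall
    below f_nyq after up-sampling (all frequencies in MHz)."""
    images = []
    k = 1
    while k * f_bb - f_in < f_nyq:
        for f in (k * f_bb - f_in, k * f_bb + f_in):
            if f < f_nyq:
                images.append(f)
        k += 1
    return images

F_S, F_BB, F_IN = 4000, 800, 200                  # MHz, values used for Fig. 3-12
print(upsampling_factor(F_S, F_BB))
imgs = image_frequencies(F_IN, F_BB, f_nyq=F_S)   # Nyquist of 2*F_s is F_s
print(imgs)
# The nearest image stays above the band of interest (DC to F_in)
assert min(imgs) > F_IN
```

With the values of Fig. 3-12 this gives an up-sampling factor of 10 and nine images between 600 MHz and 3.8 GHz, consistent with the nine out-of-band images reported for that measurement.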

![Figure 3-9: Simulated output impedance (Z_0) profile of the current cell.](image-url)

Figure 3-10: Chip photograph of the implemented ΔΣ DAC.

Figure 3-11: Measurement setup for the ΔΣ DAC with the expected spectrum at output of every block. An up-sampling filter is not used to simplify testing. Up-sampling images in the output are out of the band of interest.

**Figure 3-12:** Measured single-ended spectrum showing 8 GS/s operation with $F_s=4$ GHz, $F_{bb}=800$ MHz and input frequency, $F_{in}=200$ MHz. The noise shaping and the 9 out of band images can be seen.

**Figure 3-13:** Measured $-57$-dBc IMD3 with two $-6$ dBFS tones near 200 MHz spaced 2 MHz apart.

Figure 3-14: Output spectrum with 42 dB SNDR obtained from post-layout simulation for an 8 GS/s operation.

Fig. 3-12 shows the measured single-ended output spectrum for a 200-MHz full-scale single-tone input at 8 GS/s operation of the ΔΣ DAC. With $F_{bb} = 800$ MHz and $F_s = 4$ GHz, the main 200 MHz tone and the 9 expected images are seen along with the noise shaping, demonstrating the desired DAC operation. The DAC achieves 26-dB SNDR for the 200 MHz single-tone input while consuming 68 mW (40 mW clocking, 23 mW logic/FF, 5 mW analog) from 1-V digital and 1.2-V analog supplies. The measured IMD3 is $-57$ dBc (Fig. 3-13) for two $-6$ dBFS tones near 200 MHz placed 2 MHz apart.

Fig. 3-14 shows the post-layout simulated DAC output spectrum at 8 GS/s, which has an SNDR of 42 dB. The measured SNDR is lower than the simulated value, primarily due to a 10 dB loss caused by a test setup limitation in synchronizing the $F_s$ and $F_{bb}$ clocks. The two clock sources are locked to a common 50 MHz reference, but they are still not truly synchronous. This causes an up-sampling uncertainty in the zero-order hold operation that limits the SNDR, and it also produces the higher noise floor up to 800 MHz in the measured spectrum. Although the SNDR is limited by the measurement setup, a linearity of more than 9 bits is obtained. Table 3-3 compares this work with previous ΔΣ DACs having a sampling rate above 2.5 GS/s. The dynamic performance of the DAC deteriorates beyond 8 GS/s, and the DDSM DAC was found not to operate beyond 9 GS/s.
Table 3-3: Comparison with ΔΣ DACs having >2.5-GS/s sampling rate.

<table>
<thead>
<tr>
<th>DAC Type</th>
<th>ΔΣ DAC</th>
<th>Hybrid ΔΣ RFDAC<sup>2</sup></th>
<th>ΔΣ DAC</th>
<th>ΔΣ RFDAC</th>
<th>ΔΣ DAC (this work)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Modulator Type</td>
<td>Error Feedback</td>
<td>MASH</td>
<td>Error Feedback</td>
<td>MASH</td>
<td>Interleaved MASH</td>
</tr>
<tr>
<td>Technology</td>
<td>90nm</td>
<td>65nm</td>
<td>90nm</td>
<td>0.13μm</td>
<td>65nm</td>
</tr>
<tr>
<td>Input Bits</td>
<td>10</td>
<td>5</td>
<td>13</td>
<td>12</td>
<td>12</td>
</tr>
<tr>
<td>Output Bits</td>
<td>3</td>
<td>3</td>
<td>1</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>Order</td>
<td>2</td>
<td>3</td>
<td>3</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>Sampling Rate (GS/s)</td>
<td>3.6</td>
<td>5.4</td>
<td>4</td>
<td>2.6</td>
<td>8</td>
</tr>
<tr>
<td>Bandwidth (MHz)</td>
<td>10</td>
<td>10</td>
<td>50</td>
<td>100</td>
<td>200</td>
</tr>
<tr>
<td>SNDR (dB)</td>
<td>70</td>
<td>-</td>
<td>53</td>
<td>30</td>
<td>26</td>
</tr>
<tr>
<td>IMD3 (dBc)</td>
<td>-70</td>
<td>-</td>
<td>-</td>
<td>-51</td>
<td>-57</td>
</tr>
<tr>
<td>Area (mm<sup>2</sup>)</td>
<td>0.06</td>
<td>-</td>
<td>&lt;0.15</td>
<td>&lt;0.11<sup>1</sup></td>
<td>0.13</td>
</tr>
<tr>
<td>Power (mW)</td>
<td>16</td>
<td>&gt;50<sup>1</sup></td>
<td>54</td>
<td>40</td>
<td>68</td>
</tr>
<tr>
<td>Swing (V<sub>pp-diff</sub>)</td>
<td>0.3</td>
<td>-</td>
<td>1.3</td>
<td>0.35</td>
<td>0.3</td>
</tr>
</tbody>
</table>

<sup>1</sup> Estimated. <sup>2</sup> Hybrid DAC is a combination of ΔΣ and Nyquist DACs.

3.4 Summary

This chapter presented an 8-GS/s 200-MHz bandwidth ΔΣ DAC with 26-dB SNDR, −57-dBc IMD3 and 68-mW power consumption in 65 nm CMOS. The high sampling rate was achieved by a two-channel interleaved 4-GHz MASH 1-1 modulator structure. This allowed a single-clock solution, thus simplifying the timing of the final full-rate multiplexing. Using only seven current cells with relaxed matching requirements, this work demonstrates the potential of this predominantly digital DAC for use in the baseband of transmitters for wideband wireless communication, e.g. UWB and 60-GHz radios.
Chapter 4

Effect of Clock Duty Cycle Error on Two-channel Interleaved ΔΣ DACs

4.1 Introduction

Figure 4-1 shows the general structure of a two-channel TIDSM DAC that implements a noise transfer function (NTF) of $1 - H(z)$. The digital modulator is implemented as a $2 \times 2$ block digital filter containing the two polyphase components of $H(z)$ [45] and operates at a relaxed half-sampling-rate frequency of $\frac{f_s}{2}$ (Sec. 2.3). At high sampling rates, driving the DAC directly with the full-rate $f_s$ clock becomes challenging, or the $f_s$ clock may not be available at all. Additionally, a return-to-zero (RZ) DAC may be required for improved dynamic performance [48]. In these cases, the two polyphase outputs, $y_0$ and $y_1$, are multiplexed by the same half-rate $\frac{f_s}{2}$ clock to an effective $f_s$ sampling rate and then fed to the DAC [35, 48]. The final full-rate multiplexing before the DAC is sensitive to both edges of the $\frac{f_s}{2}$ clock, as new data is presented to the DAC on each edge. As long as the duty cycle of this clock is 50%, both channels are reconstructed for a time $1/f_s$, as desired. However, if the duty cycle is not 50%, a sampling time error is introduced into the DAC, which results in an SNDR loss.

Figure 4-2 illustrates the severity of this duty cycle error (DCE) in a 4-bit 10 GS/s two-channel wideband TIDSM DAC with a third-order NTF of $(1 - z^{-1})^3$. At an oversampling ratio (OSR) of 16 (bandwidth = 312.5 MHz), simulations show that even a 1% duty cycle error (i.e. a duty cycle of 49% or 51%) in the half-rate 5 GHz clock results in an SNDR loss of 35 dB. Achieving an exact 50% clock duty cycle at high speeds is very challenging. Although clock generators often employ duty cycle correction [66] or use a master clock that is first divided by two to achieve a 50% duty cycle, a residual DCE still remains [48]. Hence, it is important to analyze and estimate the effect of DCE on two-channel TIDSM DACs.

Figure 4-1: Block diagram of a generic two-channel interleaved $\Delta\Sigma$ DAC implementing a noise transfer function $1 - H(z)$.

Figure 4-2: Effect of 1% DCE on SNDR for a 4-bit DAC with $f_s = 10$ GHz, OSR=16 (BW=312.5 MHz) and NTF of $(1 - z^{-1})^3$.

The effect of DCE on TIDSM DACs has received very little attention in the literature. Previous works [19, 20] have focused only on the analysis of sampling time errors in non-interleaved Nyquist and $\Delta\Sigma$ DACs resulting from stochastic clock jitter, which is not applicable to a deterministic error like the DCE. In [67], the effect of time-average-frequency (TAF) and flying-adder (FA) clocks on non-interleaved Nyquist DACs has been studied and a closed-form expression for the SDR is presented. A half-rate clock with a DCE behaves similarly to an FA clock, and hence the analysis in [67] is used as a starting point for analyzing the effect of DCE on the SNDR of TIDSM DACs.

In this work, a new closed-form expression for the SNDR loss due to the DCE is derived for modulators with an NTF of the form \((1-z^{-1})^n\). It is further shown that, as with stochastic clock jitter, the effect of DCE can be mitigated by adding a low-order FIR filter between the modulator and the multiplexer that attenuates the high frequency noise [23]. A closed-form expression for estimating the SNDR loss in the presence of this filter is also developed. These expressions are useful because they allow a suitable modulator and filter order to be chosen, with the DCE taken into account, in the very early phase of the design. The method presented in this work can be extended to any other NTF.


4.2 Mathematical Formulation of the SNDR Loss

Figure 4-3 shows a clock of frequency \(f_s/2\) having a DCE of \(d_e\), i.e. a deviation of its duty cycle from 50\%, with \(d_e\) expressed as a fraction (e.g. \(d_e = 0.01\) for a 1\% DCE) in the expressions that follow. This means that the effective sampling times alternate as \(T_1T_2T_1T_2\ldots\). Let \(\delta\) be the sampling time error in each sample, given by \(\delta = |2d_eT_s|\), where \(T_s = 1/f_s\). Now, initially assume that this clock drives an interleaved Nyquist DAC (see Fig. 4.4(a)) with a single input tone at a frequency \(f = f_s/2 - f_b\) that is greater than \(f_s/4\). It has been shown in [67] and more recently in [18] that such a clock of the form \(T_1T_2T_1T_2\ldots\) produces a distortion tone at the frequency \(f_b\), i.e. the tone at \(f_s/2 - f_b\) folds back to \(f_b\), and the signal-to-distortion ratio (SDR) of this DAC in dB is given by

\[\text{SDR} = 20 \log \left( \frac{1}{2\delta f} \right) - 3.9 \tag{4.1}\]

Equation (4.1) can be rewritten as

\[\text{SDR} = 20 \log \left( \frac{f_s}{2\pi d_e f} \right) \tag{4.2}\]

Equation (4.2) gives the SDR at the DAC output and therefore includes the DAC sinc shaping. Notice that if the input tone frequency \(f\) is close to \(f_s/2\), the tone is attenuated by the DAC sinc shaping, while the distortion tone (close to 0) remains nearly unaffected by the sinc function. Since the error is introduced during the multiplexing, the SDR can also be referred to the output of the multiplexer, before the DAC (see Fig. 4-1), by removing the sinc attenuation of the tone:

$$\text{SDR}_{\text{MUX}} = 20 \log \left( \frac{f_s}{2 \pi d_e f} \right) + 20 \log \left( \frac{\pi f / f_s}{\sin(\pi f / f_s)} \right) = 20 \log \left( \frac{1}{2 d_e \sin(\pi f / f_s)} \right) \tag{4.3}$$
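The relations between Eqs. (4.1)-(4.3) can be checked numerically. The sketch below uses example values chosen only for illustration ($f_s$ = 10 GHz, $d_e$ = 1%, a tone at $0.45 f_s$). It assumes that $d_e$ is the deviation of the high-phase fraction of the half-rate clock from 50%, so that the alternating hold times are $T_1 = T_s(1 + 2d_e)$ and $T_2 = T_s(1 - 2d_e)$, which reproduces $\delta = 2 d_e T_s$. It then confirms that the 3.9 dB constant in Eq. (4.1) is approximately $20 \log(\pi/2)$ and that Eq. (4.3) is Eq. (4.2) with the DAC sinc attenuation of the tone removed.

```python
import math

def sdr_eq41(delta, f):
    """Eq. (4.1): SDR in dB for a timing error delta (s) and tone f (Hz)."""
    return 20 * math.log10(1 / (2 * delta * f)) - 3.9

def sdr_eq42(d_e, f, f_s):
    """Eq. (4.2): the same SDR written with the DCE d_e (a fraction)."""
    return 20 * math.log10(f_s / (2 * math.pi * d_e * f))

def sdr_eq43(d_e, f, f_s):
    """Eq. (4.3): SDR referred to the multiplexer output (before the DAC)."""
    return 20 * math.log10(1 / (2 * d_e * math.sin(math.pi * f / f_s)))

f_s = 10e9                     # full sampling rate; the MUX clock is f_s / 2
t_s = 1 / f_s
d_e = 0.01                     # 1 % duty cycle error (assumed example)
t1, t2 = t_s * (1 + 2 * d_e), t_s * (1 - 2 * d_e)  # alternating hold times
delta = (t1 - t2) / 2          # equals 2 * d_e * t_s
f = 0.45 * f_s                 # tone above f_s / 4; distortion folds to 0.05 f_s

print(sdr_eq41(delta, f), sdr_eq42(d_e, f, f_s))
# Removing the sinc attenuation of the tone turns Eq. (4.2) into Eq. (4.3)
x = math.pi * f / f_s
sinc_db = 20 * math.log10(math.sin(x) / x)
assert abs(sdr_eq42(d_e, f, f_s) - sinc_db - sdr_eq43(d_e, f, f_s)) < 1e-9
```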

Now, consider the case of an interleaved $\Delta\Sigma$ DAC as shown in Fig. 4.4(b). Let the main input tone be located at $f_b$, with 0 to $f_b$ being the band of interest. Analogous to the Nyquist DAC case, the shaped noise at high frequencies will cause distortion at lower frequencies. More specifically, the noise power in the frequency range from $f_s/2 - f_b$ to $f_s/2$ will be scaled according to Eq. (4.3), i.e. by the power factor $[2 d_e \sin(\pi f / f_s)]^2$, and fold back into the band from 0 to $f_b$, causing an SNDR loss. Also, note that for $f_b \ll f_s/2$, the SNDR in the desired band from 0 to $f_b$ remains nearly unaffected by the sinc shaping of the DAC, i.e. the SNDR after the multiplexer is approximately equal to the SNDR after the DAC.

Let the quantization noise power in the band of interest for an ideal TIDSM DAC be \( N_q \) and the signal (input tone) power be \( S \). Let the total folded noise power into the band due to the DCE be \( N_f \). The ideal SNDR is then given by \( S/N_q \) while \( S/(N_q + N_f) \) is the reduced SNDR. A “noise figure” term for the TIDSM DAC that specifies the amount of relative SNDR loss in dB in the presence of DCE can then be defined as

\[
F_{dB} = 10 \log \left( 1 + \frac{N_f}{N_q} \right) \tag{4.4}
\]

It can be noted that Eq. (4.4) is independent of the signal power, i.e. of the number of DAC bits, since \( N_q \) depends only on the NTF and OSR, and \( N_f \) additionally only on the DCE. For a given NTF, OSR and DCE, \( N_q \) and \( N_f \) can be computed to obtain a closed-form expression for \( F \) using Eq. (4.3) and Eq. (4.4).

4.3 Expression for SNDR Loss Due to DCE

Assume that an \( n \)-th order modulator with an NTF of the form \( (1 - z^{-1})^n \) is used and the bandwidth of interest is \( f_b \), similar to Fig. 4.4(b). Then,

\[
|NTF(f)| = \left|1 - e^{-j 2\pi f / f_s}\right|^n = \left[ 2 \sin \left( \frac{\pi f}{f_s}\right) \right]^n \tag{4.5}
\]

\( N_q \) is given by

\[
N_q = \int_{0}^{f_b} |NTF(f)|^2 df \tag{4.6}
\]

Due to the oversampling, \( f_b \ll f_s/2 \), so that \( \sin \left( \frac{\pi f}{f_s}\right) \approx \frac{\pi f}{f_s} \) in the band of interest. With \( OSR = f_s/(2f_b) \), using Eq. (4.5) in Eq. (4.6) yields

\[
N_q = \frac{\pi^{2n} f_s}{2(2n + 1)OSR^{2n+1}} \tag{4.7}
\]

Using Eq. (4.3), the folded noise power \( N_f \) can be written as

\[
N_f = \int_{f_s/2 - f_b}^{f_s/2} \left[ 2 d_e \sin \left( \frac{\pi f}{f_s}\right) \right]^2 |NTF(f)|^2 \, df = 2^{2n+2} d_e^2 \int_{f_s/2 - f_b}^{f_s/2} \sin^{2n+2} \left( \frac{\pi f}{f_s}\right) df \tag{4.8}
\]
Substituting \( f \rightarrow f_s/2 - f \), which changes the integration limits to 0 and \( f_b \), and using \( \sin(\pi/2 - x) = \cos x \) yields

\[
N_f = 2^{2n+2} d_e^2 \int_0^{f_b} \sin^{2n+2} \left( \frac{\pi (f_s/2 - f)}{f_s} \right) df = 2^{2n+2} d_e^2 \int_0^{f_b} \cos^{2n+2} \left( \frac{\pi f}{f_s} \right) df \tag{4.9}
\]

For \( f_b \ll f_s/2 \), \( \cos(\pi f/ f_s) \approx 1 \) over the band, and with \( f_b = f_s/(2\,OSR) \), Eq. (4.9) simplifies to

\[
N_f = \frac{2^{2n+1} d_e^2 f_s}{OSR} \tag{4.10}
\]
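The small-angle approximations behind Eqs. (4.7) and (4.10) can be checked against direct numerical integration of Eqs. (4.6) and (4.8). The Python sketch below does this with a plain composite Simpson rule, a choice made here for self-containment and not the method used in this work; $f_s$ is normalized to 1 and $d_e$ is given as a fraction. The function names are hypothetical.

```python
import math

def simpson(g, a, b, steps=4000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = g(a) + g(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def noise_powers_exact(n, osr, d_e, f_s=1.0):
    """N_q of Eq. (4.6) and N_f of Eq. (4.8), integrated numerically."""
    f_b = f_s / (2 * osr)

    def ntf_sq(f):  # |NTF(f)|^2 from Eq. (4.5)
        return (2 * math.sin(math.pi * f / f_s)) ** (2 * n)

    def folded(f):  # noise scaled by [2 d_e sin(pi f / f_s)]^2
        return (2 * d_e * math.sin(math.pi * f / f_s)) ** 2 * ntf_sq(f)

    n_q = simpson(ntf_sq, 0.0, f_b)
    n_f = simpson(folded, f_s / 2 - f_b, f_s / 2)
    return n_q, n_f

def noise_powers_closed(n, osr, d_e, f_s=1.0):
    """Small-angle approximations of Eqs. (4.7) and (4.10)."""
    n_q = math.pi ** (2 * n) * f_s / (2 * (2 * n + 1) * osr ** (2 * n + 1))
    n_f = 2 ** (2 * n + 1) * d_e ** 2 * f_s / osr
    return n_q, n_f

for n in (2, 3):
    for osr in (16, 10, 5):
        q_ex, f_ex = noise_powers_exact(n, osr, 0.01)
        q_cf, f_cf = noise_powers_closed(n, osr, 0.01)
        # Ratios slightly above 1: the closed forms overestimate both powers,
        # and the error grows as the OSR decreases.
        print(f"n={n}, OSR={osr}: N_q ratio {q_cf / q_ex:.3f}, "
              f"N_f ratio {f_cf / f_ex:.3f}")
```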

Now, using Eq. (4.7) and Eq. (4.10) in Eq. (4.4) yields

\[
F_{dB} = 10 \log \left[ 1 + \frac{2^{2n+2}(2n+1)\, d_e^2\, OSR^{2n}}{\pi^{2n}} \right] \tag{4.11}
\]

Thus, a closed-form expression for the SNDR loss \( F \) due to a DCE \( d_e \) has been obtained. Equation (4.11) shows that in the presence of DCE, the dominant term that contributes to the SNDR loss is \( (2\,OSR/\pi)^{2n} \), since \( 2^{2n+2} OSR^{2n}/\pi^{2n} = 4\,(2\,OSR/\pi)^{2n} \).
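Equation (4.11) is simple enough to evaluate directly during early design exploration. The sketch below, with a hypothetical function name and $d_e$ given as a fraction, evaluates it for the third-order, OSR = 16, 1% DCE case of Fig. 4-2, where it gives a loss of about 35 dB in line with the value quoted for that figure, and sweeps the DCE from 0% to 5% for the modulator orders and OSR values used in Sec. 4.4.

```python
import math

def sndr_loss_db(n, osr, d_e):
    """Eq. (4.11): SNDR loss in dB for NTF = (1 - z^-1)^n, with the duty
    cycle error d_e given as a fraction (0.01 for 1 %)."""
    ratio = (2 ** (2 * n + 2) * (2 * n + 1) * d_e ** 2 * osr ** (2 * n)
             / math.pi ** (2 * n))
    return 10 * math.log10(1 + ratio)

# Case of Fig. 4-2: third-order NTF, OSR = 16, 1 % DCE
print(round(sndr_loss_db(3, 16, 0.01), 1))

# DCE sweep from 0 % to 5 % for the cases considered in Sec. 4.4
for n in (2, 3):
    for osr in (16, 10, 5):
        row = [round(sndr_loss_db(n, osr, d / 100), 1) for d in range(6)]
        print(f"n={n}, OSR={osr}: {row}")
```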

### 4.4 Validation of Expression for SNDR Loss

In order to validate Eq. (4.11), a 10 GS/s two-channel TIDSM DAC with 13-bit digital input and 4-bit DAC is chosen. The NTFs chosen for simulation are \((1 - z^{-1})^2\) and \((1 - z^{-1})^3\) i.e. second and third-order modulators respectively. Simulations are carried out for three values of OSR i.e. 16 \( (f_b=312.5 \text{ MHz}) \), 10 \( (f_b=500 \text{ MHz}) \) and 5 \( (f_b=1 \text{ GHz}) \). These modulator orders and bandwidths are chosen as they are of potential interest in wideband applications for UWB and 60-GHz radio. The digital modulator is implemented as a discrete-time model in Matlab® while transient circuit simulations are performed for the multiplexer and the DAC in Cadence® Spectre®. Ideal multiplexer and DAC models are utilized and the DCE of the \( f_s/2 \) clock is parametrically varied from 0% to 5%. The DAC output is filtered with a Bessel low-pass filter having a bandwidth of \( f_b \) prior to measuring the SNDR. In all cases, the number of FFT points chosen is \( 2^{14} \) and a 0 dBFS single tone input of frequency \( f_b \) is used.

Figures 4.5(a) and 4.5(b) compare the simulated and estimated SNDR loss for the three OSR values and the two modulators, respectively. The estimate based on the linear quantizer model of the modulator (Eq. (4.11)) closely matches the SNDR loss obtained from transient simulation, with an error of less than 0.9 dB. This demonstrates that the analysis in the preceding sections is valid and can be used to estimate the performance of the TIDSM DAC.

Figure 4-5: Simulation versus Estimation of SNDR loss for a 10 GS/s TIDSM DAC for (a) second-order \((n=2)\) and (b) third-order \((n=3)\) modulators.

Equation (4.11) and the simulation results show that a higher OSR and a higher modulator order \(n\) result in a higher SNDR loss. This makes high-OSR and higher-order modulators more susceptible to the duty cycle problem. Higher-order modulators are used because they provide more noise shaping and hence a higher SNDR in the bandwidth of interest.
Due to the high sensitivity of Eq. (4.11) to $n$, the benefit of using a higher-order modulator can be nullified by the SNDR loss due to the DCE. In other words, above a certain value of DCE, a lower-order modulator can outperform a higher-order one for a given OSR. To demonstrate this problem, let $I_n$ be the improvement in the ideal SNDR obtained by using an $(n+1)^{th}$-order modulator instead of an $n^{th}$-order one. Then, from Eq. (4.7),

$$I_n = \frac{N_{q,n}}{N_{q,n+1}} = \frac{(2n+3) \text{OSR}^2}{(2n+1) \pi^2} \tag{4.12}$$

Similarly, the ratio $L_n$ between the SNDR loss due to the DCE for an $(n+1)^{th}$-order modulator and that for an $n^{th}$-order one is calculated from Eq. (4.11), with $F$ taken on a linear scale (the argument of the logarithm), as

$$L_n = \frac{F_{n+1}}{F_n} = \frac{\pi^{2n+2} + 2^{2n+4}(2n+3)d_e^2\,\text{OSR}^{2n+2}}{\pi^{2n+2} + 2^{2n+2}\pi^2(2n+1)d_e^2\,\text{OSR}^{2n}} \tag{4.13}$$

Now, equating $I_n$ and $L_n$ gives the limit for $d_e$ above which an $n^{th}$-order modulator starts to outperform an $(n+1)^{th}$-order one:

$$d_e = \sqrt{\frac{\pi^{2n}\left[(2n+3)\text{OSR}^2 - (2n+1)\pi^2\right]}{3(2n+3)(2n+1)2^{2n+2}\text{OSR}^{2n+2}}} \tag{4.14}$$

In order to obtain the value of $d_e$ for a comparison between a second and a third-order modulator, substituting $n=2$ in Eq. (4.14) yields

$$d_e = \sqrt{\frac{\pi^4\left(7\,\text{OSR}^2 - 5\pi^2\right)}{6720 \cdot \text{OSR}^6}} \tag{4.15}$$

For OSR=16, Eq. (4.15) gives $d_e=0.12\%$. This means that the duty cycle of the clock must lie between 49.88% and 50.12% for the third-order modulator to offer any benefit over the second-order one. This requirement is extremely stringent and becomes even stricter as the OSR increases. On the other hand, for more wideband operation with an OSR of 5, the limit on $d_e$ becomes 1.08%, which is a more relaxed requirement on the clock. Thus, a higher-order modulator is more suitable for operation at a low OSR. To check the validity of Eq. (4.15), transient simulations of the SNDR of the second- and third-order modulators are also performed for small values of $d_e$ between 0% and 0.15% at OSR=16. Fig. 4-6 shows the simulation results. With no DCE, the third-order modulator has a 14.7 dB higher simulated SNDR (15.6 dB predicted by Eq. (4.12)). As predicted by Eq. (4.15), the performance of the third-order modulator drops below that of the second-order one for $d_e$ as small as 0.12%.
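Equations (4.12) and (4.14) can likewise be evaluated numerically. The Python sketch below uses hypothetical helper names, takes \( d_e \) as a fraction, and repeats Eq. (4.11) so that it is self-contained. It also checks that, at the crossover value of \( d_e \), the extra DCE loss of the \( (n+1)^{th} \)-order modulator exactly cancels its ideal SNDR advantage \( I_n \).

```python
import math


def sndr_loss_db(n: int, osr: float, d_e: float) -> float:
    """Eq. (4.11), repeated here so that the sketch is self-contained."""
    ratio = (2 ** (2 * n + 2) * (2 * n + 1) * d_e ** 2 * osr ** (2 * n)
             / math.pi ** (2 * n))
    return 10 * math.log10(1 + ratio)


def ideal_gain_db(n: int, osr: float) -> float:
    """I_n of Eq. (4.12) in dB: ideal SNDR gain of order n+1 over order n."""
    return 10 * math.log10((2 * n + 3) * osr ** 2 / ((2 * n + 1) * math.pi ** 2))


def dce_crossover(n: int, osr: float) -> float:
    """d_e (as a fraction) from Eq. (4.14) above which order n beats order n+1."""
    num = math.pi ** (2 * n) * ((2 * n + 3) * osr ** 2 - (2 * n + 1) * math.pi ** 2)
    den = 3 * (2 * n + 3) * (2 * n + 1) * 2 ** (2 * n + 2) * osr ** (2 * n + 2)
    return math.sqrt(num / den)


for osr in (16, 5):
    d = dce_crossover(2, osr)
    extra_loss = sndr_loss_db(3, osr, d) - sndr_loss_db(2, osr, d)
    print(f"OSR={osr}: I_2 = {ideal_gain_db(2, osr):.1f} dB, "
          f"d_e limit = {100 * d:.2f} %, extra DCE loss = {extra_loss:.1f} dB")
```

For \( n=2 \), this gives a crossover of about 0.12% at OSR=16 and 1.08% at OSR=5, with \( I_2 \approx 15.6 \) dB and 5.5 dB, respectively.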

![Figure 4-6: Second-order modulator shows a better SNDR than third-order for OSR=16 and $d_e > 0.12\%$ as predicted by Eq. (4.15).](image)

![Figure 4-7: Two multi-channel MUX styles.](image)

### 4.4.1 Extension to Larger Number of Channels

It would be of interest to extend this analysis to a larger number of channels. However, to perform such an analysis, the multiplexing style must first be considered. Consider the example of the 4-channel system shown in Fig. 4.7(a). The 4:1 MUX is built from cascaded 2:1 MUX stages, with the data being progressively retimed synchronously to a higher-frequency clock [48]. This is the common approach even for larger multiplexing ratios such as 8:1 [49]. It can then be observed that, even in this case, the analog performance ultimately depends on the duty cycle of the $f_s/2$ clock, i.e. only the final 2:1 MUX stage determines the performance. Any duty cycle error in the other divided clocks, such as $f_s/4$, $f_s/8$ etc., only affects the data capture window. For correct DAC operation, there should be no data capture (setup/hold time) errors before the data arrives at the final 2:1 MUX. The two-channel analysis then remains valid even for a larger number of channels that use the style of Fig. 4.7(a).

On the other hand, consider multiplexing performed in a single step, as shown in Fig. 4.7(b) for a four-channel system. Assume that the 0°, 90°, 180° and 270° clock phases are generated by a DLL or a phase rotator and that the 4:1 multiplexing is performed directly in one step, i.e. without splitting into individual 2:1 MUXes. Following the same procedure as in Sec. 4.2, in a 4-channel Nyquist DAC an input tone \( f_{in} \) results in distortion tones at \( \pm k f_s/4 \pm f_{in} \), where \( k = 0, 1, 2, 3 \). Now consider a 4-channel 10 GS/s \( \Delta \Sigma \) DAC with a 1 GHz BW. In this case, the NTF band from 4 GHz to 5 GHz folds back into the 0–1 GHz band as before. Additionally, the NTF band between 1.5 GHz and 2.5 GHz also folds back into the 0–1 GHz band. Moreover, the amount of folded noise depends on the clock error of each channel. Overall, this leads to very complicated expressions and integrals for the SNDR loss when using the method described in [67] and Sec. 4.2. Hence, a further analysis of the SNDR loss for such a multiplexing style with a larger number of channels has not been performed in this work.

### 4.5 Mitigating DCE Effect with Digital Filtering

The analysis in the preceding sections suggests that the amount of high-frequency noise that folds back into the bandwidth of interest must be reduced in order to mitigate the effect of the DCE. This means that the high-frequency shaped noise must be filtered out prior to the multiplexer. The magnitude of the shaped noise is highest at \( f_s/2 \). Hence, introducing zeros at \( f_s/2 \) can limit the noise folded back into the desired band. Low-order FIR low-pass filters with a transfer function of the form \( G(z) = (1 + z^{-1})^m \) (where \( m \) is the filter order) are of particular interest, as they fulfill this requirement and have very small attenuation in the desired band. Moreover, these filters can be implemented in a multiplier-less architecture, making them suitable for high-speed operation. For third order and above, other filter transfer functions, e.g. \([1\ 2\ 2\ 1]\), could be of interest as they have power-of-2 coefficients [68], unlike the \([1\ 3\ 3\ 1]\) coefficients of \( (1 + z^{-1})^3 \). However, \([1\ 2\ 2\ 1]\) has only a single zero at \( f_s/2 \) and therefore provides much less attenuation close to \( f_s/2 \) than \( (1 + z^{-1})^3 \). Hence, only \( G(z) = (1 + z^{-1})^m \) is considered in this section. Figure 4-8 shows the block diagram of a TIDSM DAC with such a filter, which is also implemented with a polyphase architecture. The FIR filter must be of a low order because it increases the number of DAC bits: since the DC gain of \( G(z) \) is \( 2^m \), every one-order increase in the filter adds one DAC bit. Hence, the FIR filtering comes at the expense of the DAC cell matching. Figure 4-9 shows the frequency response of the shaped noise in the presence of such a filter.
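The properties of \( G(z) = (1+z^{-1})^m \) used above can be illustrated with a small Python sketch; the helper names are hypothetical and the evaluation point of 0.45 \( f_s \) is an arbitrary choice close to \( f_s/2 \). The sketch lists the binomial coefficients, the DC gain of \( 2^m \) (one extra DAC bit per order), and compares the attenuation near \( f_s/2 \) with that of the \([1\ 2\ 2\ 1]\) filter.

```python
import cmath
import math


def binomial_fir(m: int):
    """Coefficients of G(z) = (1 + z^-1)^m, i.e. row m of Pascal's triangle."""
    return [math.comb(m, k) for k in range(m + 1)]


def gain_db(h, f_over_fs: float) -> float:
    """|H(f)| of FIR taps h at frequency f/f_s, in dB relative to the DC gain sum(h)."""
    w = 2 * math.pi * f_over_fs
    resp = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(h))
    return 20 * math.log10(abs(resp) / sum(h))


for m in (1, 2, 3):
    h = binomial_fir(m)
    # The DC gain sum(h) = 2^m doubles per order, i.e. one extra DAC bit per order
    print(f"m={m}: taps={h}, DC gain={sum(h)}, {gain_db(h, 0.45):.1f} dB at 0.45 f_s")

print(f"[1 2 2 1]: {gain_db([1, 2, 2, 1], 0.45):.1f} dB at 0.45 f_s")
```

At 0.45 \( f_s \), \( (1+z^{-1})^3 \) attenuates by about 48 dB relative to DC, whereas \([1\ 2\ 2\ 1]\) attenuates by only about 27 dB, consistent with its single zero at \( f_s/2 \).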

It is of interest to estimate the performance of the TIDSM DAC in the presence of the filter. Hence, a closed-form expression for the SNDR loss, \( F \) can be developed in this case as well. Such an expression for the TIDSM DAC is useful for the co-design of the modulator and the filter.

### 4.6 Expression for SNDR Loss with FIR Filter

With \( G(z) = (1 + z^{-1})^m \),

\[
|G(f)| = \left|1 + e^{-j2\pi f/f_s}\right|^m = \left[ 2 \cos \left( \frac{\pi f}{f_s} \right) \right]^m
\]  

(4.16)

Then, the quantization noise power, \( N_q \), is given by

\[
N_q = \int_0^{f_b} |NTF(f)|^2 |G(f)|^2 df
\]  

(4.17)

For \( f_b << f_s/2 \), \( \cos(\pi f / f_s) \approx 1 \) and \( \sin(\pi f / f_s) \approx \pi f / f_s \), thus Eq. (4.17) simplifies to

\[
N_q = \frac{2^{2m}\pi^{2n} f_s}{2(2n+1)OSR^{2n+1}}
\]  

(4.18)
Using Eq. (4.3) and Eq. (4.16), the folded noise power $N_f$ can be written as

$$N_f = \int_{f_s/2 - f_b}^{f_s/2} \left[ 2d_e \sin \left( \frac{\pi f}{f_s} \right) \right]^2 \left| NTF(f) \right|^2 |G(f)|^2 df$$

$$N_f = 2^{2n+2m+2} d_e^2 \int_{f_s/2 - f_b}^{f_s/2} \sin^{2n+2} \left( \frac{\pi f}{f_s} \right) \cos^{2m} \left( \frac{\pi f}{f_s} \right) df$$

Substituting \( f \rightarrow f_s/2 - f \) changes the integration limits to 0 and \( f_b \), which yields

$$N_f = 2^{2n+2m+2} d_e^2 \int_0^{f_b} \sin^{2n+2} \left[ \frac{\pi}{f_s} \left( \frac{f_s}{2} - f \right) \right] \cos^{2m} \left[ \frac{\pi}{f_s} \left( \frac{f_s}{2} - f \right) \right] df \quad (4.19)$$

$$N_f = 2^{2n+2m+2} d_e^2 \int_0^{f_b} \cos^{2n+2} \left( \frac{\pi f}{f_s} \right) \sin^{2m} \left( \frac{\pi f}{f_s} \right) df$$

Using \( \cos(\pi f/f_s) \approx 1 \) and \( \sin(\pi f/f_s) \approx \pi f/f_s \) for \( f \le f_b \ll f_s/2 \) results in

$$N_f = \frac{2^{2n+1} \pi^{2m} d_e^2 f_s}{(2m+1)OSR^{2m+1}} \quad (4.20)$$

Now, using Eq. (4.4), Eq. (4.18) and Eq. (4.20), the SNDR loss, $F$ in the presence of the filter is simplified to

$$F = 10 \log \left[ 1 + \frac{2^{2(n-m+1)} (2n+1) d_e^2\, OSR^{2(n-m)}}{(2m+1)\,\pi^{2(n-m)}} \right] \quad (4.21)$$

Firstly, setting $m=0$ in Eq. (4.21) represents the condition of no filter and reduces it to Eq. (4.11), as expected. Equation (4.21) also shows intuitively the improvement in the overall SNDR due to the filter. While Eq. (4.11) is a function of $(2OSR/\pi)^{2n}$, Eq. (4.21) is a function of $(2OSR/\pi)^{2(n-m)}$. Hence, increasing the filter order $m$ improves the performance of the DAC. For $m = n$, Eq. (4.21) reduces to $F = 10 \log(1 + 4d_e^2)$, so the SNDR loss is no longer a function of the OSR and the DAC achieves near immunity to $d_e$.
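Equation (4.21) can be evaluated in the same way. In the Python sketch below (a hypothetical helper, with \( d_e \) as a fraction), \( m = 0 \) reproduces Eq. (4.11) and \( m = n \) collapses to \( 10\log(1 + 4d_e^2) \), independent of the OSR.

```python
import math


def sndr_loss_fir_db(n: int, m: int, osr: float, d_e: float) -> float:
    """SNDR loss F in dB from Eq. (4.21): NTF (1 - z^-1)^n with FIR (1 + z^-1)^m."""
    ratio = (2 ** (2 * (n - m + 1)) * (2 * n + 1) * d_e ** 2 * osr ** (2 * (n - m))
             / ((2 * m + 1) * math.pi ** (2 * (n - m))))
    return 10 * math.log10(1 + ratio)


# Third-order NTF, OSR = 16, 1 % DCE: estimated loss versus filter order m
for m in range(4):
    print(f"m={m}: F = {sndr_loss_fir_db(3, m, 16, 0.01):.2f} dB")
```

For \( n=3 \), OSR=16 and \( d_e = 1\% \), the estimated loss drops from about 35 dB without a filter to about 10.4 dB with \( m=1 \), i.e. roughly the 24 dB improvement reported in Section 4.6.1.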

### 4.6.1 Validation of SNDR Loss with FIR Filter

In order to validate the preceding analysis, transient simulations are now performed on the 10 GS/s TIDSM DAC with the filter included for $n=3$ with OSR=16 and OSR=5. The DCE is swept from 0 to 5% while the filter order is swept from 0 to 3.
Figures 4.10(a) and 4.10(b) show that the simulated SNDR loss $F$ closely matches the estimate from Eq. (4.21), with an error of less than 1.3 dB.

For the case of OSR=16 (Fig. 4.10(a)), the first-order filter ($m=1$) shows a drastic improvement in performance, e.g. a 24 dB improvement for $d_e=1\%$. However, the SNDR loss is still high even with $m=1$. A second-order filter ($m=2$) shows very good immunity to DCE, with the loss being less than 4 dB for $d_e$ as high as 5% and less than 0.5 dB for $d_e=2\%$. A third-order filter results in near immunity to the DCE, with a loss of less than 0.05 dB. For the case of OSR=5 (Fig. 4.10(b)), $m=1$ alone could be sufficient, as it shows an SNDR loss of < 1.3 dB for $d_e$ up to 2%. As mentioned previously, the immunity to DCE with an $m^{th}$-order filter comes at the cost of $m$ additional DAC bits. In other words, the overall DAC moves from being DCE-limited to being matching-limited. Hence, mismatch shaping may additionally be required in the presence of the filter.

Some measurements to validate Eqs. (4.11) and (4.21) were also possible with the second DAC prototype, which is described in Chapter 5. These measurement results are presented in Section 5.6.1.

![Figure 4-11: Hold interleaving to introduce a zero at $f = f_s/2$, i.e. implementing a filter $1 + z^{-1}$.](image)

### 4.7 Other Timing Error Mitigation Techniques

In the discrete-time domain, the timing error $e_t$ for the $k$th sample can be written as [69]

$$e_t(k) = [y(k) - y(k-1)] \frac{\Delta t(k)}{T_s}, \text{ where } \frac{\Delta t(k)}{T_s} = 2d_e \quad (4.22)$$

where $\Delta t(k)$ is the absolute timing error for the $k$th output $y(k)$, $T_s$ is the effective sampling period and $d_e$ is the duty cycle error (deviation of the duty cycle from 50%). Equation (4.22) then suggests three ways of reducing the timing error. The first is to reduce $\Delta t(k)$ through duty cycle correction, which is not the focus of this work. The second is to reduce the quantity $y(k) - y(k-1)$. In the frequency domain, this is equivalent to reducing the high-frequency content in the output relative to the frequency of interest. This can be achieved simply by increasing the number of DAC bits or by FIR filtering that introduces zeros at $f_s/2$, as described in the previous section. The third is to compensate for the error $e_t(k)$ by injecting a quantity $-e_t(k)$ into the DAC to cancel the error. The quantity $\Delta t(k)/T_s$, being static, can be measured in advance and configured into the DAC. Calculating $-e_t(k)$ for every clock cycle then involves only a subtraction and a multiplication.
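The per-sample cost of the third option can be seen in a minimal Python sketch of Eq. (4.22). It follows the equation as written, with the static value \( \Delta t(k)/T_s = 2d_e \); the function name and the initial value \( y(-1) = 0 \) are assumptions made for illustration.

```python
def timing_error(y, d_e: float, y_prev: float = 0.0):
    """e_t(k) = [y(k) - y(k-1)] * dt/T_s with dt/T_s = 2*d_e, following Eq. (4.22).

    y      : DAC output codes, in LSBs
    d_e    : duty cycle error as a fraction
    y_prev : assumed value of y(-1) before the first sample (illustrative)
    """
    dt_over_ts = 2 * d_e
    e_t = []
    for sample in y:
        # One subtraction and one multiplication per sample
        e_t.append((sample - y_prev) * dt_over_ts)
        y_prev = sample
    return e_t


y = [0, 3, 5, 5, 2]
e_t = timing_error(y, d_e=0.01)
correction = [-e for e in e_t]  # quantity to inject into the DAC
print(e_t, correction)
```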

### 4.7.1 Parallel-Path DACs or Hold Interleaving

While parallel-path DACs [52], also called hold-interleaved DACs, have been reported only for Nyquist DACs, they are also suitable for TIDSM DACs, as they implement the function $F(z) = 1 + z^{-1}$ in the analog domain. Figure 4-11 shows this architecture for two channels, wherein the outputs of both channels are held for one full cycle of the $f_s/2$ clock and added in the analog domain, with the two DACs operating on the $0^\circ$ and $180^\circ$ phases of the $f_s/2$ clock. This architecture implements the filtering transfer function $F(z) = 1 + z^{-1}$, resulting in a zero at $f_s/2$. Even in this case, the effective number of DAC bits increases by one, and the effectiveness of the filtering depends on the relative gain mismatch between the two DACs, since the addition in $F(z)$ is performed in the analog domain.
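The sensitivity of hold interleaving to DAC gain mismatch can be sketched by modelling the two paths as \( 1 + (1+\varepsilon)z^{-1} \), where \( \varepsilon \) is a hypothetical relative gain error between the two DACs. The Python sketch below evaluates this response at \( f_s/2 \), normalised to its DC gain; the model and the helper name are assumptions, not the circuit of Fig. 4-11.

```python
import cmath
import math


def hold_interleave_notch_db(eps: float, f_over_fs: float = 0.5) -> float:
    """Response of 1 + (1 + eps) z^-1, normalised to its DC gain (2 + eps), in dB.

    eps models a hypothetical relative gain mismatch between the two
    parallel-path DACs; eps = 0 would give an exact zero at f_s/2.
    """
    w = 2 * math.pi * f_over_fs
    h = 1 + (1 + eps) * cmath.exp(-1j * w)
    return 20 * math.log10(abs(h) / (2 + eps))


for eps in (0.001, 0.01, 0.05):
    print(f"mismatch {100 * eps:.1f} %: {hold_interleave_notch_db(eps):.1f} dB at f_s/2")
```

Under this model the notch depth at \( f_s/2 \) is \( \varepsilon/(2+\varepsilon) \), i.e. about −46 dB for a 1% mismatch and about −66 dB for 0.1%, instead of an ideal zero.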

### 4.7.2 Compensation Through Analog Post-correction

Since the duty cycle error \(d_e\) (or \(\Delta t/T_s\)) can be measured, the error \(e_t(k)\) can easily be calculated in the digital domain for every sample. An auxiliary DAC can then be added in parallel with the main DAC to inject \(-e_t(k)\) into the final output, as shown in Fig. 4-12. The advantage of this scheme is that it is independent of the noise transfer function and is an open-loop correction scheme without any feedback, making it suitable for high-speed operation. However, the auxiliary DAC now requires a much higher linearity than the main DAC, as the quantity \(e_t(k)\) can be at least one order of magnitude smaller than \(y(k) - y(k - 1)\), depending on the minimum correction required. Moreover, the auxiliary DAC also suffers from the duty cycle error, since it operates on the same clock as the main DAC. This reduces the effectiveness of the correction.

### 4.7.3 Compensation Through Digital Pre-correction

As an alternative to analog post-correction, the calculated error \(-e_t(k)\) can be fed back into the digital \(\Delta \Sigma\) modulator loop, as shown in Fig. 4-13 [69]. The actual correction achieved depends on the overall noise transfer function \(NTF(z) = 1 - H(z)\), as the fed-back negative error \(-e_t(k)\) is subjected to the operation \(H(z)\). Good correction can be achieved at frequencies where \(|1 - H(z)| \ll 1\). This can be written as

\[
Y(z) = X(z) + E_q(z)NTF(z) - E_t(z)(1 - NTF(z)) \quad (4.23)
\]

\[
Y(z) = X(z) + E_q(z)(1 - H(z)) - E_t(z)H(z) \quad (4.24)
\]

From these equations, a perfect pre-distortion of the input signal exists only if $Y(z) = X(z) - E_t(z)$, which can happen only where $|1 - H(z)| \ll 1$ in the band of interest. This technique has been used in [69] for clock jitter compensation, and it is also of interest to investigate it for a deterministic error such as the DCE. The advantage of this approach is that no additional restrictions are placed on the main DAC.
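One way to see why digital pre-correction depends on the NTF: if the physical timing error \( E_t(z) \) adds at the DAC output, Eq. (4.24) leaves a residual \( E_t(z)\,(1 - H(z)) = E_t(z)\,NTF(z) \). The Python sketch below, an illustration under that assumption with a hypothetical helper name, evaluates \( |1 - H(f)| \) at the band edge \( f_b = f_s/(2\,\text{OSR}) \) for NTF \( = (1 - z^{-1})^2 \).

```python
import math


def ntf_magnitude(n: int, f_over_fs: float) -> float:
    """|NTF(f)| = |1 - H(f)| for NTF(z) = (1 - z^-1)^n at frequency f/f_s."""
    return (2 * math.sin(math.pi * f_over_fs)) ** n


for osr in (5, 25):
    f_b = 1 / (2 * osr)  # band edge normalised to f_s
    print(f"OSR={osr}: residual factor |1 - H| at band edge = {ntf_magnitude(2, f_b):.3f}")
```

This prints about 0.382 for OSR=5 and 0.016 for OSR=25, so near the band edge the correction is considerably less complete at low OSR.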

### 4.8 Comparison of Different Mitigation Techniques

In order to compare the performance of these techniques, the same 10 GS/s TIDSM DAC with a 4-bit thermometer-coded DAC at the output was considered. Simulations were carried out by feeding the two-channel modulator digital output data streams to a behavioural model of the multiplexer and a DAC in which the channel skew and DAC cell mismatch could be varied parametrically. Simulations were performed with second-order ($(1 - z^{-1})^2$) and third-order ($(1 - z^{-1})^3$) noise-shaping transfer functions for a low OSR (=5) and a high OSR (=25). The duty cycle error in the half-rate clock ($f_s/2$) was swept from 0 to 5% (i.e. $\Delta t/T_s$ from 0 to 10%; a duty cycle error of 1% corresponds to a $\Delta t/T_s$ error of 2%), and the achieved SNDR was calculated in each case. Figure 4-14 shows the simulated SNDR versus $\Delta t/T_s$ for the four resulting combinations of two noise-shaping orders and two OSRs.

### 4.8.1 Performance of Correction Techniques

Figure 4-14 shows that the performance improvement from the correction techniques over the original TIDSM DAC is >13 dB for OSR=25 and >4 dB for OSR=5. However, they perform worse than first-order digital FIR filtering (also referred to as pre-filtering) by >10 dB for OSR=25 and >4 dB for OSR=5. Analog correction is found to be more effective than the digital option by about 4–6 dB. At the low OSR of 5, even just increasing the number of DAC bits by one can give a better SNDR than correction.

Additional challenges were also found in the correction techniques that make them unsuitable for high-speed implementation. The auxiliary DAC required in analog post-correction introduces severe matching restrictions on the DAC. For the simulated 4-bit DAC, if correction is required for $\Delta t/T_s$ errors of up to 2% ($d_e = 1\%$), then $\min(e_t) = 0.02$ (in units of the main DAC LSB) from Eq. (4.22). Assuming that this value of $\min(e_t)$ determines the unit cell current, the auxiliary DAC requires at least 6 bits to achieve correction, making the overall converter a 10-bit DAC. A 10-bit DAC would again be matching-limited, which may negate the benefits of using a $\Delta \Sigma$ DAC. Figure 4-15 shows the number of additional DAC bits required for error correction up to a given $\Delta t/T_s$.
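The 6-bit figure can be reproduced under one reading of the argument above: if the auxiliary unit cell equals \( \min(e_t) = \Delta t/T_s \) of one main-DAC LSB, then \( \lceil \log_2(T_s/\Delta t) \rceil \) bits of resolution are needed below the main LSB. This interpretation and the helper name are assumptions in the Python sketch below.

```python
import math


def aux_dac_extra_bits(dt_over_ts: float) -> int:
    """Extra bits below the main-DAC LSB if the auxiliary unit cell equals dt/T_s LSB.

    Resolving a step of dt/T_s of one main LSB needs ceil(log2(1 / (dt/T_s))) bits.
    """
    return math.ceil(math.log2(1 / dt_over_ts))


main_bits = 4
extra = aux_dac_extra_bits(0.02)  # d_e = 1 %  ->  dt/T_s = 2 %
print(extra, main_bits + extra)
```

For \( \Delta t/T_s = 2\% \), this gives 6 extra bits (\( \log_2 50 \approx 5.6 \), rounded up) and 10 bits overall, as quoted above.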

The main drawback of the digital implementation is that it introduces a long critical path from the outputs $y_0$ and $y_1$ back into the loop of the modulator (Fig. 4-13).
This prevents any bit-level pipelining of the adders in the modulator, severely restricting the operating speed. Moreover, unlike analog correction, the effectiveness of digital correction depends on the NTF.

### 4.8.2 Performance of FIR Filtering Techniques

Figure 4-14 and Section 4.6.1 show that the FIR filtering techniques offer better performance than compensation. Figure 4-16 shows the effect of FIR filtering on the DAC unit cell matching requirements due to the increase in the number of cells, using second-order noise shaping as an example. The graph shows the amount of DAC mismatch that produces the same average SNDR as that produced by the timing error. In other words, the plot shows the limit for DAC mismatch above which the \( \Delta \Sigma \) DAC becomes matching-limited in the presence of FIR filtering.

To interpret this plot, consider Fig. 4-14 again for the case where OSR=25 and a second-order NTF is used. For \( \Delta t / T_s = 10\% \) (\( d_e = 5\% \)), the SNDR obtained is 38 dB. First-order digital FIR filtering improves the SNDR in this case to 66 dB. Referring now to Fig. 4-16, for OSR=25 and a second-order NTF, a mismatch (\( \sigma \)) of 3% in the DAC produces a distribution with an average SNDR of 66 dB.

Although hold interleaving and first-order pre-filtering have the same transfer function, digital pre-filtering has much more relaxed matching constraints for achieving the SNDR in Fig. 4-14 for a given timing skew error. Digital pre-filtering with \( (1 + z^{-1})^2 \) introduces stringent matching requirements of \( \sigma < 0.5\% \) in the DAC, making the performance matching-limited rather than timing-skew-limited.

### 4.9 Summary

This chapter mathematically analyzes the effect of DCE on two-channel TIDSM DACs with NTF = $(1 - z^{-1})^n$. The TIDSM DAC is found to be very sensitive to this error which limits the overall performance. A closed-form expression that estimates the performance loss due to the DCE is derived. Adding a low-order FIR filter can mitigate the effect of DCE. The expression is further extended to include the effect of the filter. The presented method can be extended to other NTFs. This analysis is useful as these expressions support a duty cycle “aware” design process for wideband TIDSM DACs. A comparison of various techniques to reduce the effects of channel timing skew introduced by the DCE in interleaved ΔΣ DACs is also performed. Digital pre-filtering to reduce the high frequency shaped noise is found to be more effective than correction techniques. Compensation techniques are found to be less suitable for high speed implementations as they either introduce severe matching restrictions on the DAC (analog correction) or restrict the speed of the ΔΣ modulator (digital correction).
Chapter 5

A 11-GS/s 1.1-GHz BW TI-ΔΣ DAC for 60-GHz Radio in 65-nm CMOS

The performance of the TIDSM MASH 1-1 DAC proposed in Chapter 3 showed three limitations. Firstly, the critical path of two adders limited the maximum speed. Secondly, the high OSR of 20 resulted in a high sensitivity to the duty cycle of the sampling clock (see Eq. (4.11)). Thirdly, the DAC testing required a high-speed digital data interface, which proved to be very challenging. All three limitations are overcome in this work.

### 5.1 Introduction

The increasing demand for high-data-rate short-range wireless communication has led to the evolution of the unlicensed 60-GHz radio band (57.2-65.8 GHz) which has a continuous bandwidth of 9 GHz. This has resulted in the development of recent standards, such as WiGig (IEEE 802.11ad) [4], ECMA-387 [6] and WirelessHD [5]. These standards have divided the 60-GHz band into four channels, each having a 1.76 GHz (I+Q paths) RF channel bandwidth (BW).

DACs that form part of the transmitter baseband are hence required to have a wide bandwidth of >880 MHz (in both the I and Q paths, to enable the 1.76 GHz channel BW) and a resolution of >6-8 bits to support the different modulation schemes of these standards [7,46,70–72]. Most of the DACs for 60-GHz radio reported in the literature have so far used a conventional approach with a 2× digital interpolation of the baseband, followed by a Nyquist current-steering DAC and a fourth- or fifth-order passive LC analog anti-aliasing filter, which then connects to an up-conversion mixer [46, 70]. This approach is shown in Fig. 5.1(a). The passive filters occupy a large on-chip area and have a low quality factor [7]. While some low-area high-order wideband active filters have also been reported recently [9, 10], they are challenging to design and impact the transmitter linearity.

Figure 5-1: Comparison of different DAC based architectures for 60-GHz radio baseband.

With the advances in CMOS scaling, there is a trend of using digital processing to move the analog functionality of the RF transceivers to the digital domain for easy configurability and relaxing the analog circuit requirements. Some examples of these techniques in transmitters include oversampling/interpolation filtering to reduce the anti-aliasing filter order and the use of ΔΣ modulation to reduce the number of DAC unit cells [8, 25–27, 30]. However, these techniques have been applied only for relatively low channel-bandwidth standards (< 160 MHz) e.g. WLAN, WiMAX, UMTS, WCDMA and UN-II bands where the carrier frequencies are only a few gigahertz.

Applying similar techniques to 60-GHz radio is challenging due to its large BW which results in a very high speed requirement from the digital processing. Nevertheless, there is an emerging trend towards digital architectures for the 60-GHz band. A 7-bit oversampling interpolation digital FIR filter before the DAC operating at 9.6 GS/s is presented in [7]. This oversampling filter, along with the sinc response of a Nyquist DAC, can satisfy the spectral mask of the WiGig standard without an
anti-aliasing filter and allow the DAC to directly connect to the mixer. This digital oversampling based architecture is shown in Fig. 5.1(b). However, this architecture now requires a Nyquist DAC with at least 6-8 bits of resolution and operating at a high sample rate of ∼10 GS/s. The design of this DAC is challenging as this may require the use of analog techniques such as use of sub-DACs or dual-current cells with higher matching requirements, special DAC switching schemes e.g. quad-switching, clocking schemes with extensive phase calibration and threshold voltage calibration of the switch to correct timing errors [18, 48, 73].

A third potential architecture, which still uses a digital oversampling filter but replaces the Nyquist DAC with a ∆Σ DAC, is shown in Fig. 5.1(c). In this scenario, the ∆Σ DAC can further enable this trend towards digital architectures by using digital processing to reduce the number of DAC unit cells and hence the overall DAC complexity. However, ∆Σ DACs have the drawback of a large out-of-band shaped quantization noise, which needs to be filtered out to meet the spectral mask of the WiGig standard. If the order of the filtering can be restricted to first or second order, then a good trade-off between the high filter order of the conventional transmitter (Fig. 5.1(a)) and the large complexity of a high-speed Nyquist DAC in the interpolation-based architecture (Fig. 5.1(b)) can be achieved. The ∆Σ DAC (Fig. 5.1(c)) can thus present an intermediate digital solution.

Based on the foregoing discussion, the ∆Σ DAC is required to work at ∼10 GS/s and provide a BW > 880 MHz. ∆Σ DACs have not been targeted for this high bandwidth and sample rate because of the speed limitation of the integrator (feedback path) in conventional digital ∆Σ modulators (DSM). Hence, time-interleaved ∆Σ modulators (TIDSM) that use a poly-phase decomposition (loop-unrolling) of the integrator are required to relax the critical path in the modulator [33–35,40,45]. Using this concept, MASH based TIDSMs that achieve 8 GS/s [35, 40] and a ∆Σ DAC with 200 MHz BW [35] have been previously reported. However, these loop-unrolled architectures are eventually limited by the critical path of the integrator and the final full-rate-multiplexing that makes the >10 GS/s speed very challenging. Hence, this work presents a two-channel ∆Σ DAC (Fig. 5.1(c)) that reduces the critical path of a conventional loop-unrolled MASH TIDSM by modifying the execution order of the computations, while the two-channel architecture allows a single clock design with a simplified final multiplexing.

A two-channel LA-TIDSM MASH 1-1 DAC with an 8-bit digital input and a 4-bit DAC that achieves 11 GS/s and 1.1 GHz bandwidth is presented in this work. This DAC, together with a second-order low-pass filter, can support the spectral mask of the IEEE 802.11ad WiGig standard for the 60-GHz band. The remainder of this chapter is organized as follows. Sections 5.2 and 5.3 describe the modulator choice and the LA-TIDSM architecture, respectively. Sections 5.4 and 5.5 describe the implementation of the LA-TIDSM DAC and the testing methodology using an on-chip testing memory. Finally, the measurement results are presented in Section 5.6.

### 5.2 Modulator Architecture

In order to support modulation schemes from BPSK to 16-QAM, the \( \Delta \Sigma \) DAC is targeted for a >40 dB SNDR in a bandwidth of 880 MHz [26]. The DSM should operate at a multiple of the reference sampling rate, which is 1.76 GHz in Single Carrier (SC) mode for the WiGig standard. The order of the DSM also affects the out-of-band quantization noise and hence the filter order required to meet the spectral mask. In addition to these constraints, the number of channels in the TIDSM based DAC and the choice of the final full-rate-multiplexing (serializer) strategy also influences the choice of the DSM order and the achievable SNDR.

Referring to Chapter 4, Figure 4-1 shows a generic two-channel TIDSM architecture that is obtained by loop-unrolling a conventional DSM. Recall that the two-channel TIDSM shown in Figure 4-1 implements an NTF\((z) = 1 - H(z)\) and operates at a relaxed half-sampling rate of \( f_s/2 \). The DSM is implemented as a \( 2 \times 2 \) block digital filter that contains the two poly-phase components of \( H(z) \) [45]. The two outputs are then multiplexed by the same half-rate clock to the full sampling rate \( f_s \). While a larger number of channels can further relax the critical path in the DSM [34], the final full-rate multiplexing then requires accurate multiphase clock generation, which is challenging at high frequencies [40, 48]. Hence, two-channel TIDSM DACs are of particular interest, as they use only a single half-rate clock for the DSM and the multiplexing, thus keeping a low clocking complexity while still relaxing the DSM critical path. The multiplexing and the overall DAC performance of this two-channel architecture are sensitive to the duty cycle of the \( f_s/2 \) clock. A duty cycle error (DCE) in the \( f_s/2 \) clock, i.e. a deviation from 50% duty cycle, directly impacts the SNDR of the TIDSM DAC, since it results in a timing skew between the two channels. The SNDR loss results from the folding of the high-frequency shaped noise between \( f_s/2 - f_b \) and \( f_s/2 \) back into the BW of interest between 0 and \( f_b \). It has been shown earlier in [38] and in Eq. (4.11) of Chapter 4 that the loss in SNDR, \( L_{\Delta \Sigma} \), in a TIDSM DAC due to a DCE of \( d_e \% \) is given by

\[
L_{\Delta \Sigma}|_{dB} = 10 \log \left[ 1 + \frac{2^{2n+2}(2n+1)d_e^2\text{OSR}^{2n}}{\pi^{2n}} \right] \quad (5.1)
\]  

where NTF\((z) = (1 - z^{-1})^n\), \( n \) is the DSM order and OSR is the oversampling ratio. Although duty cycle correction, or a double-frequency clock that is divided down to achieve a 50% duty cycle, can be employed to mitigate this problem, some residual DCE still remains [48, 66]. This suggests that extra SNDR is required as a margin to accommodate some amount of DCE. It can further be noted that the DCE does not affect the SFDR of the \( \Delta \Sigma \) DAC: the interleaving spurs resulting from the band between 0 and \( f_{in} \) appear in the band between \( f_s/2 - f_{in} \) and \( f_s/2 \), and are hence also filtered out by the anti-aliasing filter.

Table 5-1 shows the possible alternatives for the TIDSM DAC under the above constraints. The SNDR is estimated for a 0 dBFS sine wave at 880 MHz and a DCE of 1%, i.e. a clock duty cycle between 49% and 51% (a 2 ps timing error at 10 GS/s). To estimate the required filter order, the baseband signal is assumed to be first up-sampled and pulse-shaped with a 0.25 roll-off root-raised-cosine (RRC) filter prior to the TIDSM [46]. The TIDSM takes 8-bit input data from the filter and uses NTF(z) = (1 – z⁻¹)ⁿ.

Table 5-1: Different modulator options for the 880 MHz bandwidth.

<table>
<thead>
<tr>
<th>Option No.</th>
<th>Mod. Order</th>
<th>Samp. Freq. (GHz) / OSR</th>
<th>DAC Bits</th>
<th>Ideal SNDR (dB) @ 880 MHz</th>
<th>SNDR Loss from 1% DCE (dB)</th>
<th>Eff. SNDR (dB)</th>
<th>LP Filter Order</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>8.8/5</td>
<td>4</td>
<td>42.3</td>
<td>0.8</td>
<td>41.5</td>
<td>2</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
<td>10.56/6</td>
<td>3</td>
<td>40.0</td>
<td>1.5</td>
<td>38.5</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>10.56/6</td>
<td>4</td>
<td>45.4</td>
<td>1.5</td>
<td>43.9</td>
<td>2</td>
</tr>
<tr>
<td>4</td>
<td>2</td>
<td>12.32/7</td>
<td>4</td>
<td>49.1</td>
<td>2.5</td>
<td>46.6</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>2</td>
<td>12.32/7</td>
<td>3</td>
<td>43.6</td>
<td>2.5</td>
<td>41.1</td>
<td>2</td>
</tr>
<tr>
<td>6</td>
<td>3</td>
<td>8.8/5</td>
<td>4</td>
<td>47.1</td>
<td>5.9</td>
<td>41.2</td>
<td>2</td>
</tr>
<tr>
<td>7</td>
<td>3</td>
<td>10.56/6</td>
<td>4</td>
<td>51.1</td>
<td>9.9</td>
<td>41.2</td>
<td>2</td>
</tr>
</tbody>
</table>

Table 5-1 shows that the fourth option, with an OSR of 7 and a first-order filter, is the most desirable, but its 12.32 GS/s sample rate is very challenging. The third option, with 10.56 GS/s, a 4-bit DAC and a second-order filter, is the next best and achieves the targeted 40 dB SNDR. Table 5-1 and Eq. (5.1) further show that a third-order DSM does not yield a better SNDR in the presence of 1% DCE. Thus, a second-order TIDSM with a 4-bit DAC operating at >10.56 GS/s is chosen as the design target. The unit-cell current matching (σ) of the 4-bit thermometer-coded DAC was chosen such that the SNDR loss due to mismatch is smaller than that caused by a 1% DCE; Monte-Carlo simulations showed that σ < 1.1% satisfies this requirement. Fig. 5-2 shows that the WiGig spectral mask can be met with the chosen DAC option and a second-order filter.

Figure 5-2: Filtering with a second-order LPF for a second-order ΔΣ 4-bit DAC at 10.56 GS/s with 16-QAM encoded random data.
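Eq. (5.1) can be evaluated directly to reproduce the DCE column of Table 5-1. The Python sketch below (function and variable names are illustrative) takes the order, sample rate and ideal SNDR of each option from the table, derives OSR = \( f_s / (2 \times 880\ \text{MHz}) \), applies Eq. (5.1) with \( d_e = 0.01 \) and subtracts the loss from the ideal SNDR. The ideal SNDR values themselves are copied from the table, not recomputed.

```python
import math


def dce_sndr_loss_db(n, osr, de):
    """SNDR loss of Eq. (5.1) for NTF(z) = (1 - z^-1)^n.

    n: modulator order, osr: oversampling ratio, de: duty-cycle error as a
    fraction (0.01 for a 1% DCE)."""
    ratio = 2 ** (2 * n + 2) * (2 * n + 1) * de ** 2 * osr ** (2 * n) / math.pi ** (2 * n)
    return 10 * math.log10(1 + ratio)


BW_GHZ = 0.88
# (option, order n, fs in GS/s, ideal SNDR in dB), copied from Table 5-1
OPTIONS = [(1, 2, 8.80, 42.3), (2, 2, 10.56, 40.0), (3, 2, 10.56, 45.4),
           (4, 2, 12.32, 49.1), (5, 2, 12.32, 43.6), (6, 3, 8.80, 47.1),
           (7, 3, 10.56, 51.1)]

results = {}
for opt, n, fs, ideal in OPTIONS:
    osr = round(fs / (2 * BW_GHZ))
    loss = dce_sndr_loss_db(n, osr, de=0.01)
    results[opt] = (osr, round(loss, 1), round(ideal - loss, 1))
    print(f"option {opt}: OSR={osr}, loss={loss:.1f} dB, eff. SNDR={ideal - loss:.1f} dB")
```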

**5.3 Proposed Look-Ahead Time-interleaved Modulator**

As mentioned in Chapters 2 and 3, the traditional MASH DSM architecture, which consists of a cascade of first-order error-feedback (EFB) DSMs (Fig. 2-9), is a very attractive candidate for high-speed implementation for two main reasons [26, 27]. Firstly, its critical path is the shortest possible, corresponding to one adder delay, and is confined within each individual modulator. Any critical path spanning different cascade stages can be pipelined, as it lies in a forward path [34]. Secondly, a cascade of first-order modulators is inherently stable. A conventional first-order EFB modulator with the integrator critical path is shown in Fig. 5-3, where the \( q \) LSBs of the input signal \( x \) enter the integrator. The carry generated by the integrator is then added to the remaining \( p \) MSBs of \( x \). The integrator bit-width is determined by the number of DAC bits required. Figure 5-4 (also Fig. 2-21) shows the first-order loop-unrolled two-channel TI EFB DSM, which operates at half the sampling rate but whose critical path now consists of two adder delays (Adders A and B). Although adders A and B can be optimized for very high speed, they ultimately limit the modulator speed [35] (Chapter 3). An effective sampling rate of 10 GS/s cannot be reached with this two-channel architecture in a standard 1 V 65 nm CMOS technology if purely static CMOS logic, with its robust noise margins, and more than 1 bit per pipeline stage are to be used (see Table 3-1 in Sec. 3.2.1).
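The difference between the structures of Fig. 5-3 and Fig. 5-4 can be illustrated with a bit-accurate behavioural sketch in Python. It assumes an unsigned input split into \( p \) MSBs and \( q \) LSBs, ignores pipelining and the MASH cascade, and uses hypothetical function names. The loop-unrolled version produces exactly the same output sequence at half the clock rate, but within each half-rate cycle adder B must wait for adder A.

```python
def efb_first_order(x, q):
    """Full-rate first-order EFB DSM (cf. Fig. 5-3). The q LSBs of each input
    word are accumulated modulo 2**q; the integrator carry is added to the
    remaining p MSBs to form the output."""
    mask = (1 << q) - 1
    s, y = 0, []
    for xk in x:
        t = s + (xk & mask)          # single adder in the feedback loop
        s = t & mask
        y.append((xk >> q) + (t >> q))
    return y


def efb_two_channel(x, q):
    """Loop-unrolled two-channel version (cf. Fig. 5-4): per half-rate cycle,
    adder A (CH0) feeds adder B (CH1), giving a two-adder critical path."""
    mask = (1 << q) - 1
    s1, y = 0, []
    for x0, x1 in zip(x[0::2], x[1::2]):
        t0 = s1 + (x0 & mask)        # adder A
        s0 = t0 & mask
        t1 = s0 + (x1 & mask)        # adder B waits for S0
        s1 = t1 & mask
        y += [(x0 >> q) + (t0 >> q), (x1 >> q) + (t1 >> q)]
    return y


x = [0x5A] * 16                      # constant 8-bit input: p = 4 MSBs, q = 4 LSBs
print(sum(efb_first_order(x, q=4)))                     # 90: mean output is 0x5A / 2**4
print(efb_two_channel(x, 4) == efb_first_order(x, 4))   # True
```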

The main reason for the speed limitation of this first-order EFB TIDSM is that adder B has to wait for the result of adder A, i.e. the two adders (or channels) are coupled, as shown in Fig. 5-5(a). If the two channels/adders could be decoupled, the two additions could take place in parallel within the integrator, thus speeding it up (Fig. 5-5(b)). To achieve this decoupling, a pre-computation corresponding to the intermediate value computed in Fig. 5-5(a) is performed before the loop. If the carry obtained with this pre-computation (or look-ahead) turns out to be incorrect, a post-decode block after the integrator corrects it. In summary, part of the computation is moved out of the integrator feedback loop, both to before the integrator (look-ahead) and to after it (post-decode).

To arrive at the proposed LA solution, consider again the first-order EFB two-channel TIDSM of Fig. 5-4. The DSM has an input width of \( l = p + q \) bits, of which the \( q \) LSBs enter the feedback path, i.e. the integrator. The two carry signals \((C_0, C_1)\) generated by the integrator are added to the \( p \) MSBs of the two channels, respectively, to obtain the noise-shaped output. Let \( x_{0,\text{LSB}} \) and \( x_{1,\text{LSB}} \) be the lower \( q \) bits of the two channel inputs that enter the integrator. Then, the following equations can be written for the \( k^{th} \) sample of the two generated sum \((S_0, S_1)\) and carry \((C_0, C_1)\) signals.

\[
S_0(k) = [S_1(k - 1) + x_{0,\text{LSB}}(k)] \mod 2^q \quad (5.2)
\]
\[
S_1(k) = [S_0(k) + x_{1,\text{LSB}}(k)] \mod 2^q \quad (5.3)
\]
\[
C_0(k) = \left\lfloor \frac{S_1(k - 1) + x_{0,\text{LSB}}(k)}{2^q} \right\rfloor \quad (5.4)
\]
\[
C_1(k) = \left\lfloor \frac{S_0(k) + x_{1,\text{LSB}}(k)}{2^q} \right\rfloor \quad (5.5)
\]

where \( \left\lfloor \cdot \right\rfloor \) denotes the floor operation, which in this case can only take the value 0 or 1. Substituting Eq. (5.2) into Eq. (5.3) gives

\[
S_1(k) = \left[ \left[ S_1(k - 1) + x_{0,\text{LSB}}(k) \right] \mod 2^q + x_{1,\text{LSB}}(k) \right] \mod 2^q \quad (5.6)
\]

Equation (5.6) represents the two coupled adders. If any generated carry is ignored, i.e. only the result modulo \( 2^q \) is retained, the additions are associative and commutative, and Eq. (5.6) can be rewritten as

\[
S_1(k) = [(x_{0,\text{LSB}}(k) + x_{1,\text{LSB}}(k)) \mod 2^q + S_1(k - 1)] \mod 2^q \quad (5.7)
\]

Equation (5.7) shows that the first addition, \( x_{0,\text{LSB}}(k) + x_{1,\text{LSB}}(k) \), can be pre-computed (look-ahead) before entering the feedback loop, since both inputs are readily available, i.e. \( S_1 \) can be computed independently of \( S_0 \). Rewriting Eq. (5.7) gives

\[
S_1(k) = [x_{L,\text{LSB}}(k) + S_1(k - 1)] \mod 2^q \quad (5.8)
\]

where \( x_{L,\text{LSB}}(k) = (x_{0,\text{LSB}}(k) + x_{1,\text{LSB}}(k)) \mod 2^q \) represents only the sum part of this addition, i.e. its lower \( q \) bits (and not the carry generated by the addition). Eq. (5.2) and Eq. (5.8) show that \( S_0 \) and \( S_1 \) can be computed in parallel, which decouples the two adders A and B. This parallel computation improves the operating speed by reducing the critical path from the two adder delays of Eq. (5.6) to only one.
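The claim that \( S_1 \) no longer depends on \( S_0 \), while only the CH1 carry is affected by the reordering, can be verified exhaustively for a small word length. The Python sketch below (function names are hypothetical; \( q = 4 \) is an arbitrary choice) compares Eq. (5.6) with Eq. (5.8) for all states and inputs, and counts the cases in which the carry of the reordered addition differs from the true CH1 carry of Eq. (5.5).

```python
from itertools import product


def s1_coupled(s1_prev, x0, x1, q):
    """Eq. (5.6): S1 obtained through S0, i.e. two chained additions."""
    m = 2**q
    return ((s1_prev + x0) % m + x1) % m


def s1_lookahead(s1_prev, x0, x1, q):
    """Eq. (5.8): x_L = (x0 + x1) mod 2**q is formed before the loop, so a single
    addition with the fed-back state S1(k-1) remains inside the loop."""
    m = 2**q
    x_l = (x0 + x1) % m
    return (s1_prev + x_l) % m


def carries(s1_prev, x0, x1, q):
    """Return (C1, CL1): the true CH1 carry of Eq. (5.5) and the carry produced
    by the reordered look-ahead addition."""
    m = 2**q
    c1 = ((s1_prev + x0) % m + x1) // m
    cl1 = (s1_prev + (x0 + x1) % m) // m
    return c1, cl1


q = 4
cases = list(product(range(2**q), repeat=3))
assert all(s1_coupled(*c, q) == s1_lookahead(*c, q) for c in cases)
wrong = sum(1 for c in cases if len(set(carries(*c, q))) == 2)
print(f"S1 always matches; CL1 != C1 in {wrong} of {len(cases)} cases")
```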

Figure 5-6 shows the proposed LA-TIDSM, which implements Eq. (5.2) and Eq. (5.8) in parallel by moving the pre-computation of the intermediate partial sum \( x_{L,\text{LSB}} \) ahead of the loop.
However, compared with the loop-unrolled TIDSM (Fig. 5-4), this reordering of the additions for computing $S_1$ produces an incorrect carry from the loop for the second channel (CH1) in some cases. If the carry generated from $S_1(k-1)+x_{L,LSB}$ (Eq. (5.8)) in the LA-TIDSM is called $CL_1$, then in general $CL_1 \neq C_1$, where $C_1$ (Eq. (5.5)) is the correct expected carry for CH1 of Fig. 5-4. Note that the carry of CH0 is not affected by this change in the order of the additions, i.e. $CL_0 = C_0$. Hence, for the modulators of Fig. 5-4 and Fig. 5-6 to be functionally equivalent, the expected carry $C_1$ must be correctly decoded before it is passed on to the final addition with the $p$ MSBs.

To decode the correct value of $C_1$, the carry $CF_0$ generated by the pre-addition of $x_{0,LSB}$ and $x_{1,LSB}$ is also propagated forward (Fig. 5-6). The information needed to calculate $C_1$ is embedded in $CF_0$, $CL_0$ and $CL_1$. The truth table for predicting $C_1$ from $CF_0$, $CL_0$ and $CL_1$ is shown in Table 5-2. Simplifying the truth table, with the impossible (X) entries treated as don't-cares, gives the following expression:

$$C_1 = CF_0 CL_1 + \overline{CL_0} (CF_0 + CL_1) \quad (5.9)$$

The derivation of this truth table, which establishes the functional equivalence of the TIDSM and the proposed LA-TIDSM, is given in Section 5.3.2.

The pre-computation in Eq. (5.8) costs one adder delay, the same as the integrator, while the post-decoding of $C_1$ in Eq. (5.9) takes less than one adder delay. Since both lie outside the feedback loop, the pre-computation and post-decoding logic do not limit the speed of the modulator.
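A behavioural Python sketch of the two-channel LA-TIDSM integrator makes the equivalence easy to test. The post-decode is written as \( C_1 = CF_0 CL_1 + \overline{CL_0}(CF_0 + CL_1) \), the expression that reproduces every defined row of Table 5-2; it is a majority function of \( CF_0 \), \( CL_1 \) and \( \overline{CL_0} \), i.e. the same logic as a full-adder carry. Function names, word length and the random stimulus are illustrative assumptions.

```python
import random


def tidsm_carries(x0, x1, q):
    """Reference: loop-unrolled two-channel TIDSM (Fig. 5-4), Eqs. (5.2)-(5.5)."""
    m, s1, out = 2**q, 0, []
    for a, b in zip(x0, x1):
        t0 = s1 + a                       # adder A
        t1 = t0 % m + b                   # adder B, waits for S0
        s1 = t1 % m
        out.append((t0 // m, t1 // m))    # (C0, C1)
    return out


def la_tidsm_carries(x0, x1, q):
    """Two-channel LA-TIDSM (Fig. 5-6): look-ahead, parallel loop, post-decode."""
    m, s1, out = 2**q, 0, []
    for a, b in zip(x0, x1):
        x_l, cf0 = (a + b) % m, (a + b) // m        # look-ahead, outside the loop
        t0, t1 = s1 + a, s1 + x_l                   # independent one-adder paths
        cl0, cl1 = t0 // m, t1 // m
        s1 = t1 % m                                 # Eq. (5.8)
        c1 = (cf0 & cl1) | ((1 - cl0) & (cf0 | cl1))  # post-decode
        out.append((cl0, c1))                       # C0 = CL0 needs no correction
    return out


random.seed(7)
q = 5
x0 = [random.randrange(2**q) for _ in range(5000)]
x1 = [random.randrange(2**q) for _ in range(5000)]
print(la_tidsm_carries(x0, x1, q) == tidsm_carries(x0, x1, q))   # True
```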
Table 5-2: Truth Table to compute the correct value of carry, $C_1$ from $CF_0, CL_0$ and $CL_1$.

<table>
<thead>
<tr>
<th>Case No.</th>
<th>$CF_0$</th>
<th>$CL_0$</th>
<th>$CL_1$</th>
<th>Expected $C_1$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>7</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

5.3.1 Numerical Example of LA-TIDSM

An example of the look-ahead approach is presented here using decimal numbers in order to explain the post-decode block clearly. Assume that the integrator can hold values between 0 and 9. Let the value stored in the integrator, $S_1(k - 1) = 3$. Let the two channel inputs $x_{0,LSB}(k)$ and $x_{1,LSB}(k)$ be 6 and 8 respectively. Then, using Eq. (5.2)–Eq. (5.5), the following result is obtained for the conventional TIDSM of Fig. 5-4: $S_0(k) = 9, C_0(k) = 0, S_1(k) = 7$ and $C_1(k) = 1$.

Now considering the LA-TIDSM of Fig. 5-6, we get $x_{L,LSB}(k) = 4$ and $CF_0(k) = 1$. Inside the integrator, the following result is obtained: $S_0(k) = 9, CL_0(k) = 0, S_1(k) = 7$ and $CL_1(k) = 0$. The values of $S_0(k)$ and $S_1(k)$ are thus calculated correctly. Also, $CL_0 = C_0$, while $C_1 \neq CL_1$. Hence, the correct value of $C_1$ has to be predicted from $CF_0$, $CL_0$ and $CL_1$ using the truth table in Table 5-2. For $CF_0 = 1, CL_0 = 0$ and $CL_1 = 0$, the table gives $C_1 = 1$, which is the correct value expected from a conventional TIDSM.
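The decimal example can be replayed with a short Python sketch. The function names are hypothetical, and the modulus is 10 instead of \( 2^q \), exactly as in the example; the post-decode line uses the expression that reproduces Table 5-2.

```python
def conventional(s1_prev, x0, x1, base=10):
    """Loop-unrolled TIDSM (Fig. 5-4): returns (S0, C0, S1, C1)."""
    t0 = s1_prev + x0
    s0, c0 = t0 % base, t0 // base
    t1 = s0 + x1
    return s0, c0, t1 % base, t1 // base


def look_ahead(s1_prev, x0, x1, base=10):
    """LA-TIDSM (Fig. 5-6): returns (S0, CL0, S1, CL1, CF0, decoded C1)."""
    x_l, cf0 = (x0 + x1) % base, (x0 + x1) // base
    t0, t1 = s1_prev + x0, s1_prev + x_l
    s0, cl0, s1, cl1 = t0 % base, t0 // base, t1 % base, t1 // base
    c1 = (cf0 & cl1) | ((1 - cl0) & (cf0 | cl1))     # post-decode (Table 5-2)
    return s0, cl0, s1, cl1, cf0, c1


print(conventional(3, 6, 8))   # (9, 0, 7, 1)
print(look_ahead(3, 6, 8))     # (9, 0, 7, 0, 1, 1)
```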

5.3.2 Proof of Equivalency between TIDSM and LA-TIDSM

The critical part of the LA-TIDSM is deriving the truth table for correctly decoding $C_1$ (Table 5-2) that makes it functionally equivalent to the TIDSM. In this section, only the $q$ LSBs of $x_0$ and $x_1$ are used, and hence the LSB suffix for these variables is dropped. Consider the sequence of operations in a TIDSM (Fig. 5-4). Let the integrator output of the previous clock cycle, $S_1(k - 1)$, be called $S_1$ for the remainder of this section. The carry $C_1$ of the TIDSM is then obtained by combining Eq. (5.2), Eq. (5.3) and Eq. (5.5) and rewriting them as

$$ C_1 = 1 \text{ if } F > 2^q - 1 \text{ else } C_1 = 0. \quad (5.10) $$

where

$$ F = \left[ (S_1 + x_0) \mod 2^q \right] + x_1. \quad (5.11) $$

Now, looking at the LA-TIDSM in Fig. 5-6, $C_1$ needs to be correctly predicted from $CF_0$, $CL_0$ and $CL_1$ i.e. $F$ must be estimated for the eight different cases. The following identities are used in the proof for any two $q$-bit unsigned numbers, $a$ and $b$.

\[ a + b \leq 2^q - 1 \implies a + b = (a + b) \mod 2^q \quad (5.12) \]
\[ a + b \leq 2^q - 1 \implies (a + b) \mod 2^q \leq 2^q - 1 \quad (5.13) \]
\[ a + b > 2^q - 1 \implies a + b = [(a + b) \mod 2^q] + 2^q \quad (5.14) \]

Only two of the eight cases of Table 5-2 are proved here; the remaining cases follow by the same procedure.

**Case 4** ($CF_0 = 1$, $CL_0 = 0$, $CL_1 = 0$)

\[ CF_0 = 1 \implies x_0 + x_1 > 2^q - 1 \quad (5.15) \]
\[ CL_0 = 0 \implies S_1 + x_0 \leq 2^q - 1 \quad (5.16) \]
\[ CL_1 = 0 \implies S_1 + [(x_0 + x_1) \mod 2^q] \leq 2^q - 1 \quad (5.17) \]

From Eq. (5.15), $x_0 + x_1 > 2^q - 1$, and hence $S_1 + x_0 + x_1 > 2^q - 1$. Since Eq. (5.16) and Eq. (5.12) give $(S_1 + x_0) \mod 2^q = S_1 + x_0$, we have

\[ [(S_1 + x_0) \mod 2^q] + x_1 > 2^q - 1 \]
\[ \implies F > 2^q - 1 \implies C_1 = 1 \quad (5.18) \]

**Case 5** ($CF_0 = 1$, $CL_0 = 0$, $CL_1 = 1$)

\[ CF_0 = 1 \implies x_0 + x_1 > 2^q - 1 \quad (5.19) \]
\[ CL_0 = 0 \implies S_1 + x_0 \leq 2^q - 1 \quad (5.20) \]
\[ CL_1 = 1 \implies S_1 + [(x_0 + x_1) \mod 2^q] > 2^q - 1 \quad (5.21) \]

Using Eq. (5.19) in Eq. (5.21), we have

\[ S_1 + x_0 + x_1 - 2^q > 2^q - 1 \]
\[ \implies S_1 + x_0 + x_1 > 2^{(q+1)} - 1 \quad (5.22) \]

Now, using Eq. (5.20) in Eq. (5.22), we get $x_1 > 2^q$, which is impossible since $x_1$ is a $q$-bit number, i.e. $x_1 \leq 2^q - 1$. Hence, this combination cannot occur, and $C_1 = X$ (don't care).

Extending this procedure to the remaining six cases yields the truth table of Table 5-2.
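As a cross-check of the case-by-case proof, Table 5-2 can also be derived by brute force for a small word length. The Python sketch below (\( q = 3 \) is an arbitrary illustrative choice; the function name is hypothetical) enumerates all \( (S_1, x_0, x_1) \), records the true carry given by Eq. (5.10) and Eq. (5.11) for each observed \( (CF_0, CL_0, CL_1) \), and marks combinations that never occur as don't-cares.

```python
from itertools import product


def derive_truth_table(q):
    """Map each (CF0, CL0, CL1) to the set of true C1 values observed over all
    q-bit (S1, x0, x1); combinations that never occur are marked 'X'."""
    m = 2**q
    seen = {}
    for s1, x0, x1 in product(range(m), repeat=3):
        cf0 = (x0 + x1) // m                       # look-ahead carry
        cl0 = (s1 + x0) // m                       # CH0 loop carry
        cl1 = (s1 + (x0 + x1) % m) // m            # CH1 loop carry, Eq. (5.8)
        c1 = ((s1 + x0) % m + x1) // m             # true carry, Eqs. (5.10) and (5.11)
        seen.setdefault((cf0, cl0, cl1), set()).add(c1)
    return {key: seen.get(key, "X") for key in product((0, 1), repeat=3)}


for row, value in derive_truth_table(q=3).items():
    print(row, value)
```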
5.3.3 LA-TIDSM for Larger Number of Channels

Although the final implementation targets two channels, the LA-TIDSM scales easily to any number of channels. It is therefore of interest to compare the performance and behaviour of the LA-TIDSM with the conventional implementation for a larger number of channels. While the critical path of a conventional M-channel TIDSM is M adder delays, for an LA-TIDSM it always remains one adder delay, independent of the number of channels. In an M-channel LA-TIDSM, M-1 look-ahead additions, i.e. $x_{0,\text{LSB}} + x_{1,\text{LSB}} + x_{2,\text{LSB}} + \cdots + x_{M-2,\text{LSB}} + x_{M-1,\text{LSB}}$, are performed prior to the integrator, and the M-1 carry signals resulting from these additions are propagated forward. Consider the three-channel implementation shown in Fig. 5-7.

The correct carry of Channel 2, $C_2$, must be decoded from the information in the two forward carry signals, $CF_0$ and $CF_1$, and the three integrator carry outputs, $CL_0$, $CL_1$ and $CL_2$. Using these five inputs, a truth table for $C_2$ is constructed by a process similar to that of Section 5.3.2. This 32-entry truth table is shown in Table 5-3. From this table, the expression for $C_2$ is obtained as

$$C_2 = CF_1 CL_2 + \overline{CL_1}(CF_1 + CL_2) \quad (5.23)$$

Figure 5-7: A three channel LA-TIDSM implementation.
Table 5-3: Carry computation truth table for $C_2$ in a three-channel LA-TIDSM.

<table>
<thead>
<tr>
<th>Case No.</th>
<th>$CF_0$</th>
<th>$CF_1$</th>
<th>$CL_0$</th>
<th>$CL_1$</th>
<th>$CL_2$</th>
<th>Expected $C_2$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>7</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>8</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>9</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>10</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>11</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>12</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>13</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>14</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>15</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>16</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>17</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>18</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>19</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>20</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>21</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>22</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>23</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>24</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>25</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>26</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>X</td>
</tr>
<tr>
<td>27</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>28</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>29</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
</tr>
<tr>
<td>30</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>31</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Comparing Eq. (5.9) and Eq. (5.23) shows that they have the same form, i.e. the post-decoding function involves only the forward carry of the preceding look-ahead addition and the integrator carry outputs of the current and the previous channel. Note that $C_0 = CL_0$ requires no correction, while $C_1$ is decoded as in the two-channel case using Table 5-2. Based on this, the expression for the correct carry $C_i$ of the $i^{th}$ channel, $i \neq 0$, can be generalized for an M-channel LA-TIDSM as
\[ C_i = CF_{i-1} CL_i + \overline{CL_{i-1}}(CF_{i-1} + CL_i) \quad (5.24) \]

where \( 1 \leq i \leq M - 1 \). Thus, the LA-TIDSM scales very well to more channels, with no theoretical effect on the critical path of the modulator: the maximum delay remains one adder delay in the integrator section. The look-ahead section, being a forward path, can be pipelined to one adder delay per stage (not shown in Fig. 5-7).
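The generalization can be checked with a behavioural Python sketch of an M-channel LA-TIDSM (function names and test stimulus are illustrative assumptions). The look-ahead chain forms the prefix sums \( x_{L,i} \) modulo \( 2^q \) with forward carries \( CF_{i-1} \), all M integrator additions use the same fed-back state, and each \( C_i \) for \( i \geq 1 \) is decoded from \( CF_{i-1} \), \( CL_{i-1} \) and \( CL_i \) with \( C_i = CF_{i-1} CL_i + \overline{CL_{i-1}}(CF_{i-1} + CL_i) \). One way to see why the same decode works for every channel is that the true carry satisfies \( C_i = CF_{i-1} + CL_i - CL_{i-1} \), which can only take the values 0 or 1, so the two remaining input combinations are don't-cares.

```python
def conventional_tidsm(frames, q):
    """M-channel loop-unrolled TIDSM: channel i adds to the sum of channel i-1,
    so the loop contains M chained adders. frames[k][i] = q LSBs of CH i, clock k."""
    m, s, out = 2**q, 0, []
    for frame in frames:
        carries = []
        for xi in frame:
            t = s + xi
            s, c = t % m, t // m
            carries.append(c)
        out.append(carries)
    return out


def la_tidsm(frames, q):
    """M-channel LA-TIDSM behavioural sketch with the generalized post-decode."""
    m, s, out = 2**q, 0, []
    for frame in frames:
        # Look-ahead chain (forward path): x_l[i] = (x_0 + ... + x_i) mod 2^q,
        # cf[i] holds CF_{i-1}, the carry of x_l[i-1] + x_i (cf[0] is unused)
        x_l, cf = [frame[0]], [0]
        for xi in frame[1:]:
            t = x_l[-1] + xi
            x_l.append(t % m)
            cf.append(t // m)
        # Integrator: M independent additions on the same state S_{M-1}(k-1)
        cl = [(s + v) // m for v in x_l]
        s = (s + x_l[-1]) % m
        # Post-decode: C_0 = CL_0, C_i = CF_{i-1} CL_i + not(CL_{i-1}) (CF_{i-1} + CL_i)
        carries = [cl[0]] + [(cf[i] & cl[i]) | ((1 - cl[i - 1]) & (cf[i] | cl[i]))
                             for i in range(1, len(frame))]
        out.append(carries)
    return out


frames = [[(13 * k + 7 * i * i + i) % 16 for i in range(4)] for k in range(1000)]
print(la_tidsm(frames, q=4) == conventional_tidsm(frames, q=4))   # True
```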

It is of interest to compare the design space of the conventional TIDSM and the LA-TIDSM as a function of the number of channels, \( M \), and the pipeline depth, \( D \) (i.e. the number of bits per pipeline stage). Assume that each pipeline stage in both implementations is built from static 1-bit carry-select (CS) adders and TGFFs, as in Chapter 3. Recall from Chapter 3, and as also shown in [34], that the total delay per pipeline stage of a conventional TIDSM is \( (M + D - 1)T_{add} + T_{FF} \), where \( T_{add} \) is the delay of one 1-bit CS adder. For an LA-TIDSM, this becomes \( DT_{add} + T_{FF} + T_{buf} \), where \( T_{buf} \) is a buffering delay. In practice, therefore, the total delay of the LA-TIDSM is not completely independent of the number of channels, as suggested earlier: since the integrator output \( S_{M-1} \) is fed back into all the channels of the integrator at every clock cycle, it must be buffered, and this buffer delay \( T_{buf} \) depends on the number of channels. Hence, the LA-TIDSM is faster by \( (M - 1)T_{add} - T_{buf} \). In practice, however, the observed improvement is larger than \( (M - 1)T_{add} - T_{buf} \). The reason is that, as recalled from Chapter 3, the adders in a conventional TIDSM do not all have equal delays. The first \( M - 2 \) channels have slower adders than the \( (M - 1) \)th channel, because their adders have both the carry chain and the sum path in the critical path, while the last channel has only the carry chain in the critical path. In contrast, all the LA-TIDSM adders have only the carry chain in the critical path.

Figure 5-8 shows a delay-space exploration of the LA-TIDSM and the conventional TIDSM as a function of the number of channels and the pipeline depth. As expected, the LA-TIDSM delay is only weakly sensitive to the number of channels, and its advantage grows as the number of channels increases. Fig. 5-9 shows the relative improvement (speed-up) in total delay over the conventional TIDSM, calculated as \( (1 - \text{Delay}_{\text{LA-TIDSM}}/\text{Delay}_{\text{conv TIDSM}}) \). Larger relative gains in total delay are obtained as the number of channels increases, while for a fixed number of channels the relative improvement is only a weak function of the pipeline depth (except for \( M=2 \)). These simulations indicate that an effective throughput of >20 GS/s is possible with four or more channels.
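The first-order delay model above can be tabulated with a few lines of Python. All numerical values (\( T_{add} \), \( T_{FF} \), and a \( T_{buf} \) that grows logarithmically with the fan-out \( M \)) are hypothetical placeholders, not the simulated values behind Fig. 5-8 and Fig. 5-9, and the model ignores the unequal adder delays of the conventional TIDSM discussed above. It only illustrates how the speed-up \( 1 - \text{Delay}_{\text{LA}}/\text{Delay}_{\text{conv}} \) grows with \( M \).

```python
import math


def conv_delay(m, d, t_add, t_ff):
    """Conventional M-channel TIDSM, D bits per pipeline stage: (M + D - 1) T_add + T_FF."""
    return (m + d - 1) * t_add + t_ff


def la_delay(m, d, t_add, t_ff, t_buf):
    """LA-TIDSM: D T_add + T_FF + T_buf(M); only the feedback buffer depends on M."""
    return d * t_add + t_ff + t_buf(m)


def speed_up(m, d, t_add, t_ff, t_buf):
    """Relative improvement 1 - Delay_LA / Delay_conv."""
    return 1 - la_delay(m, d, t_add, t_ff, t_buf) / conv_delay(m, d, t_add, t_ff)


def t_buf_ps(m):
    """Hypothetical buffer delay (ps) for fanning S_{M-1} out to M channels."""
    return 8 * math.log2(m)


T_ADD_PS, T_FF_PS = 25, 55      # hypothetical placeholder values
for m in (2, 4, 8):
    print(m, [round(speed_up(m, d, T_ADD_PS, T_FF_PS, t_buf_ps), 2) for d in (2, 3, 4)])
```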

5.3.4 Alternative Implementations

Alternative implementations of the LA-TIDSM are also possible. Referring to Eq. (5.5), where \( C_1(k) = \lfloor (S_0(k) + x_{1,\text{LSB}}(k))/2^q \rfloor \), there is another way of computing $C_1$ instead of the post-decode block. Since $C_1$ is not required within the loop, it can be calculated by replicating the operation $(S_0 + x_{1,\text{LSB}})$ outside the loop. However, this technique is inefficient, as it requires an extra adder and does not improve the critical path within the loop.

The TIDSM structure of Fig. 5-4 and its enhancement, the LA-TIDSM of Fig. 5-6, are obtained by a TI/poly-phase decomposition of the delay element $z^{-1}$ in the integrator transfer function $H(z) = z^{-1} / (1 - z^{-1})$ (see also Sec. 2.2 and Figs. 2-20 and 2-21). An alternative TI implementation of the MASH architecture has very recently been proposed in [40], using a poly-phase decomposition of the full integrator transfer function $H(z)$ instead. This implementation also has a one-adder critical path within the loop; a two-channel implementation of this architecture is shown in Fig. 5-10, where the two integrators run independently of each other. However, this leads to inefficient carry-calculation logic for $C_0$ and $C_1$ in the forward path of the modulator, which increases the number of adders required. Consider a modulator with $(p + q)$ input bits. The two-channel first-order EFB LA-TIDSM (Fig. 5-6) requires only the hardware equivalent of three $(p + q)$-bit adders, while the implementation in [40] requires eight. For a three-channel implementation, the adder counts of the two implementations become 5 and 21, respectively. The LA-TIDSM thus requires at least 62.5% and 76% fewer adders than [40] for two- and three-channel implementations, respectively, at similar performance. While the LA-TIDSM adds only two adders per additional channel, the adder count of [40] grows much faster (from 8 to 21 when going from two to three channels), making it inefficient in area and power.

Figure 5-10: An alternative two-channel TIDSM architecture of [40], which also has a one-adder critical path but requires 8 adders in total.

5.4 High-Speed LA-TIDSM DAC Design

5.4.1 Modulator Design

An 8-bit-input, two-channel LA-TIDSM with 4-bit output is implemented in a MASH 1-1 configuration consisting of a cascade of two first-order EFB DSMs. Each of the two EFB DSMs is pipelined into 2-bit sections, as shown in Fig. 5-11. Only purely static, custom-designed CMOS logic is used. The FFs are conventional static transmission-gate flip-flops (TGFFs), while the 2-bit additions are carried out using 1-bit carry-select full adders (FAs). A NOR gate at the end of the addition synchronously resets the integrator. Since the NOR gate is inverting, Adder 2 generates the complemented sum and carry, whereas Adder 1 generates them in true polarity. This requirement of different output polarities from the two adders, however, has no impact on the total delay.

Figure 5-11: A 2-bit pipeline slice of a first-order EFB LA-TIDSM. Grey colour represents the LA part. Thin lines are used for the CH0 path and thick lines for the CH1 path.

Table 5-4 lists the post-layout simulated delay contributions of the components in the critical path formed by the feedback loop. The simulations are carried out at 1 V, 75°C and the typical corner in a standard 65-nm CMOS process, using general-purpose (GP) transistors and a maximum-RC extracted layout. Adder 1 is inherently slower than Adder 2 because it must produce the complementary carries that Adder 2 receives, and it has a two-gate delay. Adder 2, on the other hand, receives complementary carry inputs, does not need to produce complementary outputs and has only a one-gate delay. The output FF for $S_1$ is replicated so that one copy of the output goes to the next MASH stage while the other goes back into the feedback loop. The total delay of 181 ps implies a maximum half-rate clock frequency of 5.52 GHz and an effective rate of 11.05 GS/s. Compared to the 2-bit TIDSM pipeline of [35], this represents a 37 ps reduction in delay, or a 17% speed-up of the critical path.
Table 5-4: Post-layout simulated delay of the integrator (Fig. 5-11) at 1 V, 75°C, typical corner.

<table>
<thead>
<tr>
<th>Block</th>
<th>Load</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>FF Output Delay</td>
<td>2 Inverters</td>
<td>32</td>
</tr>
<tr>
<td>Buffer</td>
<td>2 XOR, 1 NAND, 1 NOR</td>
<td>16</td>
</tr>
<tr>
<td>Adder 1 (input→cout)</td>
<td>2 XOR</td>
<td>63</td>
</tr>
<tr>
<td>Adder 2 (cin→cout)</td>
<td>1 NOR</td>
<td>22</td>
</tr>
<tr>
<td>Reset NOR gate</td>
<td>2 FF</td>
<td>25</td>
</tr>
<tr>
<td>FF Setup Time</td>
<td></td>
<td>23</td>
</tr>
<tr>
<td><strong>Total Delay</strong></td>
<td></td>
<td><strong>181</strong></td>
</tr>
</tbody>
</table>
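The timing figures quoted above follow directly from Table 5-4. The short Python calculation below sums the entries, converts the total into the maximum half-rate clock frequency and the two-channel effective sample rate, and computes the speed-up relative to the 218 ps (181 ps + 37 ps) delay implied for the pipeline of [35].

```python
# Critical-path delays from Table 5-4 (ps)
DELAYS_PS = {
    "FF output": 32,
    "Buffer": 16,
    "Adder 1 (input->cout)": 63,
    "Adder 2 (cin->cout)": 22,
    "Reset NOR": 25,
    "FF setup": 23,
}
total_ps = sum(DELAYS_PS.values())
f_half_ghz = 1e3 / total_ps          # half-rate clock in GHz
rate_gsps = 2 * f_half_ghz           # two interleaved channels -> effective GS/s
prev_ps = total_ps + 37              # 2-bit TIDSM pipeline of [35]
print(total_ps, round(f_half_ghz, 2), round(rate_gsps, 2), round(1 - total_ps / prev_ps, 2))
# 181 5.52 11.05 0.17
```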

Figure 5-12: Final 2:1 Multiplexer with high-crossing switch driver.

5.4.2 Final Multiplexer and DAC Current Cell Design

Figure 5-12 shows the final 2:1 full-rate multiplexing (MUX) scheme and the switch driver. The 4-bit output of the LA-TIDSM is converted to a 15-bit thermometer code prior to the final multiplexing. The thermometer coding of the second-channel data, CH1, is moved to the falling edge of the clock through a half-cycle path shift of the CH1 data from the LA-TIDSM. The half-cycle path at the input of the MUX has a 70 ps delay and hence easily meets timing. Since the switch driver must generate complementary outputs, the pseudo-differential multiplexing with the cross-coupled inverters I1 and I2 helps to nominally equalize the delays of the complementary outputs. The switch driver is made high-crossing through the two cross-coupled NMOS transistors Mn1 and Mn2 [17]. The crossover point is set at 0.7 V, since setting it any higher yields no further improvement in the dynamic performance of the whole DAC. The switch driver is designed for 15 ps rise and fall times when connected to the current-steering DAC. The MUX uses two 1 V supplies, one for the clock distribution and one for the switch driver, and each of these rails has 100 pF of on-chip decoupling.
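The data path in front of the switch drivers can be sketched behaviourally in Python: binary-to-thermometer conversion of each 4-bit LA-TIDSM output, followed by 2:1 interleaving of the CH0 and CH1 words to the full rate. The function names and the cell-0-first ordering of the thermometer word are illustrative assumptions; the actual decoder and its cell ordering may differ.

```python
def to_thermometer(code, n_bits=4):
    """Map an n-bit unsigned code to 2**n - 1 unit-cell enables (cell 0 first)."""
    if not 0 <= code < 2**n_bits:
        raise ValueError("code out of range")
    return [1 if cell < code else 0 for cell in range(2**n_bits - 1)]


def full_rate_mux(ch0_words, ch1_words):
    """2:1 MUX: each fs/2 period outputs the CH0 word, then the CH1 word."""
    out = []
    for w0, w1 in zip(ch0_words, ch1_words):
        out.extend([w0, w1])
    return out


ch0 = [to_thermometer(c) for c in (3, 15, 0)]
ch1 = [to_thermometer(c) for c in (4, 8, 1)]
stream = full_rate_mux(ch0, ch1)
print([sum(word) for word in stream])    # [3, 4, 15, 8, 0, 1]: unit cells switched on
```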

Figure 5-13 shows the DAC current cell used. The current source $M_1$ uses a low-$V_t$ low-power (LP) NMOS device and is designed for a 0.6% current mismatch $\sigma$ [13] with an overdrive voltage of 360 mV. The matching is over-designed relative to the 1.1% requirement of Section 5.2 because the DAC also supports a modulator bypass mode, in which it is driven directly from the memory by 4-bit data from any other NTF for testing purposes. The switches $M_2$ and $M_3$ use fast low-$V_t$ GP devices and operate in the linear region. The cascodes $M_4$ and $M_5$ on top of the switches are sized for an output impedance that gives $>50$ dB SFDR up to a 1.1 GHz BW. An output impedance greater than 3.3 k$\Omega$ is required to achieve 50 dB SFDR at 1.1 GHz. The output impedance profile of the current cell is shown in Fig. 5-14 and meets the requirement for a 1.1 GHz BW. The cascodes use 1.2 V low-$V_t$ LP NMOS devices, which provide some additional headroom compared to the 1 V GP devices. Cascoding on top of the switches also prevents the switch driver signals from coupling to the DAC output. For measurement purposes, the DAC has a differential 100 $\Omega$ on-chip source termination and is interfaced to a spectrum analyzer through an off-chip 2:1 centre-tapped transformer with 1.1 GHz bandwidth. This setup ensures proper impedance matching for the DAC.

Deep n-well structures are used extensively to reduce substrate noise coupling from the digital blocks. The NMOS devices of the MUX and the switch driver are placed in small distributed deep n-wells, while the 4-bit DAC, which consists only of NMOS devices, is placed in a separate large deep n-well. The 15 current cells are laid out in a single column, with the odd- and even-numbered cells placed on either side of the centre, respectively, to mitigate gradient errors. The clock distribution to the 15 MUX switch-driver cells is carefully matched with an H-tree, and the NMOS devices of the distribution buffers are also placed in small distributed deep n-wells.

Figure 5-13: DAC current cell interfaced with a centre-tapped 2:1 transformer.
5.5 Chip Implementation and Testing Methodology

A prototype IC is fabricated in a standard 65-nm CMOS technology and mounted in a JLCC-68 package. It integrates an 8-bit two-channel LA-TIDSM, a 4-bit DAC and a 1-Kbit memory to enable full-speed testing of the DAC. Figure 5-15 shows the chip photograph and Fig. 5-16 shows the overall testing methodology. The memory is built from static TGFFs and laid out as a 32 × 32-bit array, with each location being 8 bits wide. The memory is written serially at low speed and then read internally at full speed during DAC operation. This is achieved by first fetching four memory locations incrementally using a lower-frequency $f_s/4$ clock. This 32-bit data is split into two 16-bit streams representing the odd and even samples. These two streams are then multiplexed by the $f_s/2$ clock to obtain two 8-bit words that are fed to the LA-TIDSM. The memory allows a 128-point-deep signal to be tested, and hence the minimum frequency bin spacing of the input signal is $f_s/128$. For all SFDR and IM3 measurements, a $\pm$0.5 LSB dithered input signal is used so that the nonlinearity components are not masked, while no dithering is used for the SNDR measurements.
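The read path can be modelled behaviourally to check that the fetch, split and multiplex steps preserve the full-rate sample order. In the Python sketch below, the function name, the test pattern and the assignment of even samples to CH0 and odd samples to CH1 are assumptions made for illustration.

```python
def read_test_memory(mem):
    """Behavioural model of the full-speed read path (word ordering assumed).

    Per fs/4 cycle, four consecutive 8-bit locations (32 bits) are fetched and
    split into an even-sample and an odd-sample 16-bit stream; the fs/2 clock
    then delivers one 8-bit word of each stream per cycle to CH0 and CH1."""
    if len(mem) % 4:
        raise ValueError("memory depth must be a multiple of 4 words")
    ch0, ch1 = [], []
    for base in range(0, len(mem), 4):          # one fs/4 cycle
        fetch = mem[base:base + 4]              # 4 x 8 bits = 32 bits
        even, odd = fetch[0::2], fetch[1::2]    # two 16-bit streams
        for e, o in zip(even, odd):             # two fs/2 cycles
            ch0.append(e)
            ch1.append(o)
    return ch0, ch1


memory = [(37 * i) % 256 for i in range(128)]   # 128 x 8 bit = 1 Kbit test pattern
ch0, ch1 = read_test_memory(memory)
interleaved = [w for pair in zip(ch0, ch1) for w in pair]
print(interleaved == memory)                    # True: full-rate order is preserved
```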

The entire chip, including the pads, occupies an area of 1.5 mm × 0.9 mm. The high-speed $f_s/2$ clock is fed into the chip as a sinusoidal differential signal and amplified to rail-to-rail levels on chip. A static CMOS pseudo-differential clock distribution is used. The duty cycle is set by the cross-coupled inverters in the clock distribution, and hence no external duty-cycle calibration of the input clock is performed. Simulations of the clock distribution across process corners and temperature show a duty cycle between 49.5% and 50.5% at the switch driver, which is within the targeted 1% DCE.

5.5.1 Layout Considerations

Figure 5-17 shows the floor plan of the DAC, the switch driver and the overall clock distribution. A correct-by-construction layout was used to minimize bit-to-bit timing errors, i.e. the heights of the switch driver and the DAC current cell were made exactly equal. The pitch of the power grid was 3.635 $\mu$m, and the switch driver required a height of $3.635 \times 4 = 14.54 \mu$m. Fitting the DAC cell exactly within this height therefore required it to be wide and thin, which took a few iterations because the current-source and cascode transistors are much larger than the switch driver transistors. The switch driver outputs are thus matched by design. As the fifteen drivers and current cells are placed in parallel, an H-tree pseudo-differential clock distribution with good bit-to-bit matching is also enabled. The clock path from the input pads to the switch driver is short and consists of only 7 inverter stages. The pseudo-differential clock inverter (CI) shown in Figure 5-17 is used as the building block of the clock distribution.

While the switch driver receives an early clock due to its short path, the FFs in the modulator receive a late clock because of the long global clock-distribution paths. This results in an 80 ps clock skew and shrinks the window for transferring data from the modulator to the switch driver to about 100 ps (180 ps cycle time). Careful control of the clock skew, verified by simulation, is therefore required to ensure sufficient timing margin on this path. At 1 V, typical corner and 75°C, this path showed a 35 ps margin.
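The size of this transfer window follows from simple arithmetic. The helper below is a hypothetical illustration that assumes the full 80 ps skew reduces one $f_s/2$ cycle directly; clock-to-Q, setup and wire delays, which determine the simulated 35 ps margin, are not modelled.

```python
def transfer_window_ps(fs_hz: float, skew_ps: float) -> tuple[float, float]:
    """Cycle time of the f_s/2 clock and the window left for the
    modulator-to-switch-driver transfer when the launching FFs see a clock
    that is late by skew_ps relative to the capturing switch driver."""
    t_cycle_ps = 1e12 / (fs_hz / 2)
    return t_cycle_ps, t_cycle_ps - skew_ps


t_cycle, window = transfer_window_ps(11e9, 80.0)
print(f"cycle = {t_cycle:.1f} ps, window = {window:.1f} ps")
# cycle = 181.8 ps, window = 101.8 ps
```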
5.6 Measurement Results

The LA-TIDSM DAC achieves an effective sample rate of 11 GS/s. Since the 3 dB bandwidth of the transformer is 1.1 GHz, all measurements are restricted to this bandwidth. Figure 5-18 shows the measured wideband spectrum and the noise shaping at 11 GS/s with a 1.1 GHz input tone. Fig. 5-19 shows that the measured SNDR is 39 dB in a 1.1 GHz bandwidth. Fig. 5-20 shows a measured IM3 of −49 dBc with two −6 dBFS tones located at 945 MHz and 1117 MHz. Due to the limited depth of the test memory, the closest possible spacing between two coherent sinusoidal tones is 170 MHz. For the harmonic-distortion measurements, 428 MHz is the highest tone frequency whose HD2 and HD3 lie close to the 0-1.1 GHz band. The measured HD2/HD3 are 56 dB/53 dB respectively, as shown in Fig. 5-21.
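The tone frequencies above are consistent with coherent tones placed on odd FFT bins of the 128-point record. The sketch below assumes odd bin indices (a common coherent-sampling choice, since an odd k is coprime with N = 128) and $f_s$ = 11 GS/s; this bin-selection rule and the helper names are assumptions, not a procedure stated in the text.

```python
FS = 11e9
N = 128
DF = FS / N  # minimum bin spacing, ~85.9 MHz


def nearest_odd_bin(f_hz: float) -> int:
    """Closest odd FFT bin to f_hz (odd k is coprime with N = 128)."""
    x = f_hz / DF
    k = round(x)
    if k % 2 == 0:
        # choose whichever odd neighbour is closer to the requested frequency
        k = k - 1 if x < k else k + 1
    return k


def alias(f_hz: float) -> float:
    """Fold a frequency into the first Nyquist zone [0, FS/2]."""
    f = f_hz % FS
    return FS - f if f > FS / 2 else f


for f in (945e6, 1117e6, 428e6):
    k = nearest_odd_bin(f)
    print(f"{f / 1e6:.0f} MHz -> bin {k} ({k * DF / 1e6:.1f} MHz)")

k = nearest_odd_bin(428e6)
print("HD2, HD3 [MHz]:", [round(alias(h * k * DF) / 1e6, 1) for h in (2, 3)])
print("closest coherent two-tone spacing [MHz]:", round(2 * DF / 1e6, 1))
```

Under these assumptions, 945 MHz and 1117 MHz map to bins 11 and 13, two odd bins apart, which gives the roughly 170 MHz minimum two-tone spacing; the 428 MHz tone maps to bin 5, whose HD2 and HD3 fall near 859 MHz and 1289 MHz.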

Figure 5-23 shows the measured SFDR (0-1.1 GHz band), SNDR (0 to input frequency) and IM3 (at the centre frequency) versus input frequency at 11 GS/s. The figure shows that an SFDR of >53 dB and an IM3 below −49 dBc are achieved in the 0-1.1 GHz band. The measured SNDR is 42 dB (ENOB 6.8 bits) for the WiGig 880 MHz bandwidth and 39 dB (ENOB 6.2 bits) in a 1.1 GHz BW. The total measured power consumption is 117 mW from the 1 V digital (90 mW) and 1.2 V analog (27 mW) supplies. The power and area breakdown of the ΔΣ DAC is shown in Table 5-5.

Figure 5-18: Measured wideband spectrum with a 1.1 GHz input tone at 11 GS/s.

Figure 5-19: Measured 39 dB SNDR with a 1.1 GHz tone at 11 GS/s and no dithering.

In order to evaluate only the final MUX and estimate the DCE in the ΔΣ DAC, the 4-b DAC is configured as a wideband Nyquist DAC that is driven directly from the memory through the modulator bypass path in the chip. A 4-b unshaped single-tone signal at \( f_{in} = 2.83 \) GHz is used. This results in a measured interleaving spur of \(-36.9\) dBc at \( f_s/2 - f_{in} = 2.67 \) GHz at 1 V supply, as shown in Fig. 5-22. The timing error \( \Delta t \) is then calculated using (5.25):

\[
\text{SFDR} = 20 \log_{10} \left( \frac{1}{\pi f_{in} \Delta t} \right) \quad (5.25)
\]
This yields $\Delta t = 1.6$ ps, or an estimated DCE of 0.88% of the $f_s/2$ clock period. Using Eq. (5.1), this DCE is found to cause a relative SNDR loss of 1.2 dB for the IEEE 802.11ad 880 MHz BW and 0.6 dB for the 1.1 GHz BW.
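The arithmetic from the measured spur to $\Delta t$ and the DCE can be reproduced by inverting (5.25). This sketch assumes that the DCE is expressed relative to the $f_s/2$ clock period $2/f_s$, which is consistent with the 1.6 ps and 0.88% figures quoted above; the helper names are hypothetical.

```python
import math


def timing_error_from_spur(spur_dbc: float, f_in_hz: float) -> float:
    """Invert (5.25), SFDR = 20*log10(1 / (pi * f_in * dt)), for dt in seconds."""
    return 1.0 / (math.pi * f_in_hz * 10 ** (spur_dbc / 20))


def dce_from_timing_error(dt_s: float, fs_hz: float) -> float:
    """Express dt as a fraction of the f_s/2 clock period (2 / f_s)."""
    return dt_s / (2.0 / fs_hz)


dt = timing_error_from_spur(36.9, 2.83e9)
dce = dce_from_timing_error(dt, 11e9)
print(f"dt = {dt * 1e12:.2f} ps, DCE = {100 * dce:.2f} %")  # about 1.6 ps and 0.88 %
```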

Table 5-5: Power and Area Breakdown of the DAC by function.

<table>
<thead>
<tr>
<th>Function</th>
<th>Power (mW)</th>
<th>Area</th>
</tr>
</thead>
<tbody>
<tr>
<td>DAC (1.2 V)</td>
<td>27</td>
<td>DAC</td>
</tr>
<tr>
<td>(incl. clock distr.)</td>
<td></td>
<td>300×60 µm²</td>
</tr>
<tr>
<td>MUX (1 V)</td>
<td>18</td>
<td>MUX</td>
</tr>
<tr>
<td>(incl. clock distr.)</td>
<td></td>
<td>280×85 µm²</td>
</tr>
<tr>
<td>ΔΣ Logic (1 V)</td>
<td>30</td>
<td>ΔΣ Mod.</td>
</tr>
<tr>
<td>ΔΣ Clock Distr. (1 V)</td>
<td>42</td>
<td>-</td>
</tr>
<tr>
<td>Total</td>
<td>117</td>
<td>Total</td>
</tr>
<tr>
<td></td>
<td></td>
<td>0.14 mm²</td>
</tr>
</tbody>
</table>

Figure 5-22: Measured interleaving spur of −36.9 dBc at 2.67 GHz with a 2.83 GHz tone to estimate the DCE.

In order to measure the spectral mask, single-carrier 16-QAM encoded random data with a frequency bin spacing of $\sim 80$ MHz between 0 and 880 MHz is generated and RRC pulse-shaped in Matlab with an 18th-order RRC filter with a 0.25 roll-off factor. This data is loaded into the memory for the mask measurement. No additional external low-pass filtering is used; the filtering is provided by the combination of the 1.1 GHz interfacing transformer, bonding-wire inductance, JLCC socket capacitance and the PCB track. This overall combination provides a 1.5th-order low-pass response between 0.95 and 1.9 GHz and a 2.3rd-order low-pass response between 1.9 and 3 GHz. Fig. 5-24 shows the measured spectral mask under these conditions at 10.56 GS/s. The mask of the IEEE 802.11ad (WiGig) standard is met, and the out-of-band quantization noise from the second-order ΔΣ modulator is not a limiting factor.
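A stimulus of the kind used for the mask measurement can be sketched in Python. This is a minimal sketch, assuming 2 samples per symbol and reading "18th-order" as a 19-tap FIR; the oversampling factor, the symbol mapping and the helper names (`rrc_taps`, `shaped_16qam`) are assumptions, since the text does not give the Matlab code. The RRC taps use the standard closed-form impulse response, with the two removable singularities handled explicitly.

```python
import numpy as np


def rrc_taps(num_taps: int, beta: float, sps: int) -> np.ndarray:
    """Root-raised-cosine FIR taps (unit energy), time in symbol periods."""
    t = (np.arange(num_taps) - (num_taps - 1) / 2) / sps
    h = np.empty(num_taps)
    for i, ti in enumerate(t):
        if np.isclose(ti, 0.0):
            h[i] = 1.0 - beta + 4 * beta / np.pi
        elif np.isclose(abs(ti), 1 / (4 * beta)):
            h[i] = (beta / np.sqrt(2)) * (
                (1 + 2 / np.pi) * np.sin(np.pi / (4 * beta))
                + (1 - 2 / np.pi) * np.cos(np.pi / (4 * beta)))
        else:
            num = (np.sin(np.pi * ti * (1 - beta))
                   + 4 * beta * ti * np.cos(np.pi * ti * (1 + beta)))
            h[i] = num / (np.pi * ti * (1 - (4 * beta * ti) ** 2))
    return h / np.sqrt(np.sum(h ** 2))


def shaped_16qam(num_symbols: int, sps: int = 2, beta: float = 0.25,
                 order: int = 18, seed: int = 0) -> np.ndarray:
    """Random 16-QAM symbols, zero-stuffed by sps and RRC filtered."""
    rng = np.random.default_rng(seed)
    levels = np.array([-3, -1, 1, 3])
    sym = rng.choice(levels, num_symbols) + 1j * rng.choice(levels, num_symbols)
    up = np.zeros(num_symbols * sps, dtype=complex)
    up[::sps] = sym
    return np.convolve(up, rrc_taps(order + 1, beta, sps), mode="same")


x = shaped_16qam(64)   # 128 complex baseband samples
```

Before loading into an 8-bit memory, the samples would still have to be scaled and quantized; that step is omitted here.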

Table 5-6 shows the comparison of this LA-TIDSM DAC with previously reported ΔΣ DACs having a sample rate >2.5 GS/s. This work represents an improvement of over five times in the measured bandwidth and is the first ΔΣ DAC to achieve a sample rate >10 GS/s and a BW >1 GHz. High-speed DSMs have also been used in hybrid DACs (a combination of Nyquist and ΔΣ DACs) [27, 40] and frequency synthesizers [34]. Table 5-7 shows a comparison with these previously reported high-speed digital ΔΣ modulators operating above 5 GHz. The high-speed ΔΣ modulator space is dominated by the MASH architecture, and this LA-TIDSM achieves the highest speed.

A 11-GS/s 1.1-GHz BW TI-ΔΣ DAC for 60-GHz Radio in 65-nm CMOS

Figure 5-23: Measured SFDR (in 0-1.1 GHz band), SNDR (0 to input frequency) and IM3 (centre frequency) versus frequency at 11 GS/s.

Figure 5-24: Measured spectral mask with 16-QAM encoded random data at 10.56 GS/s.

Since the aim of this LA-TIDSM DAC is to provide a third alternative to the traditional Nyquist DAC based architecture (Fig. 5.1(a)) and the oversampled high-speed Nyquist DAC architecture (Fig. 5.1(b)), it is of interest to compare the performance of this DAC with other previously reported DACs with these characteristics and a
Table 5-6: Comparison with complete $\Delta\Sigma$ DACs having $>$2.5-GS/s sampling rate.

<table>
<thead>
<tr>
<th>Paper</th>
<th>[8]</th>
<th>[25]</th>
<th>[26]</th>
<th>[35]</th>
<th>This Work</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mod. Type</td>
<td>EFB</td>
<td>EFB</td>
<td>MASH</td>
<td>TI MASH</td>
<td>LA-TI MASH</td>
</tr>
<tr>
<td>Tech.</td>
<td>90nm</td>
<td>90nm</td>
<td>0.13 $\mu$m</td>
<td>65nm</td>
<td>65nm</td>
</tr>
<tr>
<td>Input/Output Bits</td>
<td>10/3</td>
<td>13/1</td>
<td>12/3</td>
<td>12/3</td>
<td>8/4</td>
</tr>
<tr>
<td>Order</td>
<td>2</td>
<td>3</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>Speed (GS/s)</td>
<td>3.6</td>
<td>4</td>
<td>2.6</td>
<td>8</td>
<td>11</td>
</tr>
<tr>
<td>BW (MHz)</td>
<td>10</td>
<td>50</td>
<td>100</td>
<td>200</td>
<td>1100</td>
</tr>
<tr>
<td>SNDR (dB)</td>
<td>70</td>
<td>53</td>
<td>30</td>
<td>26</td>
<td>39</td>
</tr>
<tr>
<td>IM3 (-dBc)</td>
<td>70</td>
<td>-</td>
<td>51</td>
<td>57</td>
<td>49</td>
</tr>
<tr>
<td>Area (mm$^2$)</td>
<td>-</td>
<td>&lt;0.15</td>
<td>&lt;0.11$'$</td>
<td>0.13</td>
<td>0.14</td>
</tr>
<tr>
<td>Power (mW)</td>
<td>16</td>
<td>54</td>
<td>40</td>
<td>68</td>
<td>117</td>
</tr>
<tr>
<td>Output swing $V_{pp,diff}$ (V) into 50 $\Omega$</td>
<td>0.3$'$</td>
<td>1.3</td>
<td>0.35</td>
<td>0.3</td>
<td>0.5</td>
</tr>
</tbody>
</table>

$'$ Estimated.


Table 5-8 shows this comparison. For the high-speed DACs reported in [57] and [18], the performance in the 0-1.1 GHz bandwidth has been extracted so that a comparison at similar bandwidths can be made. The overall SFDR in this work is similar to that of these Nyquist DACs, and the overall figure-of-merit (FOM) [74] is comparable. Since about 75% of the power in the $\Delta\Sigma$ DAC is consumed by the digital part, this DAC can benefit from further CMOS scaling, which would improve its FOM. An area comparison with [70] and [58] is more direct because these DACs are also designed in 65 nm CMOS. The $\Delta\Sigma$ DAC in this work occupies 1.6 times the area of the Nyquist DAC presented in [70]. In [58], a very compact DAC is presented, but it uses a high-performance analog transistor with a 1.5 times better matching parameter $A_{vt}$. If normal low-$V_t$ low-power transistors are used, the $\Delta\Sigma$ DAC would occupy 2 times the area of the Nyquist DAC of [58]. This indicates that the two-channel TI-$\Delta\Sigma$ DAC consumes more area than Nyquist DACs due to its increased digital processing. If area is a constraint, a TIDSM with a larger number of channels can help to reduce it [34].
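The effect of the matching parameter on area can be estimated with Pelgrom's mismatch model, $\sigma(\Delta V_t) = A_{vt}/\sqrt{WL}$. This model is an assumption here; the text does not state how the area figures were normalized. For a fixed mismatch, the gate area scales with $A_{vt}^2$, so a 1.5 times better $A_{vt}$ corresponds to roughly 2.25 times less gate area in the current cells:

```python
def area_ratio_for_equal_matching(avt_ratio: float) -> float:
    """Pelgrom model: sigma(dVt) = A_vt / sqrt(W*L). Keeping sigma fixed,
    the gate area W*L scales with A_vt**2."""
    return avt_ratio ** 2


print(area_ratio_for_equal_matching(1.5))  # 2.25
```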

The DAC clock spur can be a concern in transceivers using frequency-division duplexing (FDD), such as LTE or W-CDMA, where transmit and receive operations occur simultaneously in closely spaced bands. The DAC clock can leak through the antenna duplexer into the receive band and degrade receiver performance [75]. IEEE 802.11ad-compliant 60-GHz radio transceivers, on the other hand, use time-division duplexing (TDD), in which transmit and receive operations share the same band, with separate antennas and no duplexer [70, 71]. Thus, receiver performance is less affected by the DAC clock spur.
Table 5-7: Comparison with other Digital ΔΣ Modulators with > 5 GHz speed.

<table>
<thead>
<tr>
<th>Paper</th>
<th>Tech. (nm)</th>
<th>Freq. (GHz)</th>
<th>Type</th>
<th>P (mW)</th>
<th>Area (mm²)</th>
</tr>
</thead>
<tbody>
<tr>
<td>[27] ISSCC’08</td>
<td>65</td>
<td>5.4</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
<tr>
<td>[35] TCAS-II’13</td>
<td>65</td>
<td>8</td>
<td>12b, 2nd ord 2-ch TI-MASH</td>
<td>62</td>
<td>0.075</td>
</tr>
<tr>
<td>[40] JSSC’15</td>
<td>65</td>
<td>8</td>
<td>12b, 3rd ord 8-ch TI-MASH</td>
<td>&lt;165</td>
<td>–</td>
</tr>
<tr>
<td>This Work</td>
<td>65</td>
<td>11</td>
<td>8b, 2nd ord 2-ch LA TI-MASH</td>
<td>70</td>
<td>0.098</td>
</tr>
</tbody>
</table>

Table 5-8: Comparison of this work with wideband Nyquist DACs.

<table>
<thead>
<tr>
<th>Paper</th>
<th>Tech. (nm)</th>
<th>Freq. (GHz)</th>
<th>Type</th>
<th>P (mW)</th>
<th>Area (mm²)</th>
</tr>
</thead>
<tbody>
<tr>
<td>This Work</td>
<td>65</td>
<td>11</td>
<td>8b, 2nd ord 2-ch LA TI-MASH</td>
<td>70</td>
<td>0.098</td>
</tr>
<tr>
<td>CICC’09</td>
<td>60</td>
<td>11</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
<tr>
<td>JSSC’13</td>
<td>65</td>
<td>11</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
<tr>
<td>VLSI’11</td>
<td>65</td>
<td>11</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
<tr>
<td>JSSC’08</td>
<td>65</td>
<td>11</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
<tr>
<td>TVLSI’14</td>
<td>65</td>
<td>11</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
<tr>
<td>JSSC’15</td>
<td>65</td>
<td>11</td>
<td>5b, 3rd ord MASH</td>
<td>&gt;48</td>
<td>–</td>
</tr>
</tbody>
</table>

5.6.1 Additional Duty Cycle Measurements

While all measurements in the previous section were made at a 1 V supply, the duty cycle of the clock can be varied by varying the supply of the MUX clock distribution. The DCE can then be estimated using the interleaving-spur technique of Fig. 5-22. This makes it possible to characterize the effect of DCE on the interleaving alone, and to check the improvement obtained with an FIR filter as discussed in Section 4.5, i.e. to verify Eq. (4.11) and Eq. (4.21). For this purpose, the modulator was configured in bypass mode. A MASH 1-1 shaped signal with a 2-bit output at 11 GS/s and a −9.3 dBFS input tone (chosen for stability of the DSM) at 601 MHz was created in Matlab® and fed directly to the MUX and DAC through the memory, and SNDR measurements were made for different DCE values. This shaped signal was then passed through an FIR filter with transfer function $1 + z^{-1}$ (Section 4.5) in Matlab®, resulting in a 3-bit signal. The MUX and DAC were driven by this 3-bit data from the memory, and SNDR measurements were again made for different DCE values. Figure 5-25 shows the measured SNDR compared with the expected SNDR. Without the FIR filter, the SNDR drops with increasing DCE as expected. With the first-order FIR filter, the SNDR remains almost unchanged, indicating near immunity to duty-cycle variation.
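The two stimuli used in this experiment can be sketched in Python. This is a hypothetical behavioural model, assuming an 8-bit unsigned modulator input and a textbook accumulator-based MASH 1-1; it is not the Matlab code used for the measurement, and the function names are illustrative. It shows why the MASH 1-1 output needs 2 bits and why the $1 + z^{-1}$ filter, which has a null at $f_s/2$, raises the word length to 3 bits.

```python
import numpy as np


def mash11(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Digital MASH 1-1: two cascaded first-order accumulators whose carries
    are combined as y = c1 + (1 - z^-1) c2, giving y in {-1, 0, 1, 2}."""
    m = 1 << bits
    acc1 = acc2 = c2_prev = 0
    y = np.empty(len(x), dtype=int)
    for n, xn in enumerate(x):
        acc1 += int(xn)
        c1 = int(acc1 >= m)      # first-stage 1-bit quantizer (carry out)
        acc1 -= m * c1           # residue = first-stage quantization error
        acc2 += acc1             # second stage integrates that residue
        c2 = int(acc2 >= m)
        acc2 -= m * c2
        y[n] = c1 + c2 - c2_prev
        c2_prev = c2
    return y


def fir_1_plus_zinv(y: np.ndarray) -> np.ndarray:
    """H(z) = 1 + z^-1: null at f_s/2, output in {-2, ..., 4} (7 levels, 3 bits)."""
    return y + np.concatenate(([0], y[:-1]))


fs, fin, n = 11e9, 601e6, 4096
amp = 10 ** (-9.3 / 20) * 127.5          # -9.3 dBFS re. an 8-bit full scale
x = np.round(127.5 + amp * np.sin(2 * np.pi * fin / fs * np.arange(n))).astype(int)
y = mash11(x)            # 2-bit shaped stream (4 levels)
y_fir = fir_1_plus_zinv(y)  # 3-bit stream fed to the MUX and DAC
```

The $f_s/2$ null is the point of the filter: DCE folds shaped noise near $f_s/2$ back into the signal band, so suppressing that region before the MUX reduces the sensitivity to the duty cycle, at the cost of more DAC levels.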

Figure 5-25: Effect of DCE on a 2-b modulator at 11 GS/s and input frequency of 601 MHz (OSR=9.15).
Chapter 6

Conclusions and Future Work

With the advances in CMOS scaling, there is a growing trend of digitally-assisted analog circuits. $\Delta\Sigma$ DACs belong to this class of circuits and relax the requirements on the analog components by reducing the number of DAC current cells and the reconstruction filter order through digital signal processing and oversampling. $\Delta\Sigma$ DACs have traditionally been used in high-resolution applications such as audio DACs. Over the last decade, however, they have found increasing use in RF transmitters for relatively low-bandwidth standards such as Wi-Fi and WiMAX. Using $\Delta\Sigma$ DACs for wideband standards such as UWB or 60-GHz radio, which have hundreds of megahertz of bandwidth, requires a very high sample rate. The aim of this dissertation has been to further enhance the sampling rate and bandwidth of $\Delta\Sigma$ DACs through time-interleaving.

Chapter 2 presented the limitations of conventional $\Delta\Sigma$ DACs and the need for time-interleaved DSMs. The trade-offs between the number of channels and the ease of the final multiplexing were analyzed. The suitability of MASH DSMs for high-speed operation was presented. Various options for the DAC current cell design were also discussed.

A two-channel TIDSM allows a single-clock design with simplified multiplexing, but it requires the individual channels to operate at a high frequency. Chapter 3 presented an analysis of the critical path of the two-channel MASH DSM. This led to an optimized adder pipeline using static CMOS logic for the integrator, which enabled an 8 GS/s, 200-MHz prototype $\Delta\Sigma$ DAC that achieved >9-bit linearity while consuming 68 mW in 65-nm CMOS technology. The analog part of the prototype consisted of only seven current cells. The obtained SNDR of 26 dB was lower than expected due to a measurement-setup limitation. Nevertheless, this prototype showed potential for wideband operation and further performance improvement.

The final multiplexing of the two-channel TIDSM DAC was found to be sensitive to the duty cycle of the half-sample-rate clock. A new expression that estimates
the loss in SNDR due to a non-50% duty cycle was presented in Chapter 4. The effectiveness of different techniques such as FIR filtering and compensation to mitigate this problem was studied. It was observed that a near immunity to DCE was possible through FIR filtering of the high-frequency noise, but this has a trade-off with the DAC cell matching.

In order to further enhance the sampling rate of the TIDSM, a new look-ahead TIDSM was presented in Chapter 5. The two channels were decoupled within the integrator by moving part of the computation to before and after the feedback loop. This enabled a TIDSM DAC prototype that achieved 11 GS/s and 1.1 GHz bandwidth with −49 dBc IM3, 53 dB SFDR and 39 dB SNDR while consuming 117 mW in 65-nm CMOS. This DAC used only fifteen analog current cells and satisfied the spectral mask of the IEEE 802.11ad WiGig standard with a second-order reconstruction filter. Full-speed DAC testing was enabled by an on-chip 1-Kb memory. This was the first reported ΔΣ DAC to achieve a >10 GS/s sampling rate and >1 GHz BW, demonstrating the wideband potential of ΔΣ DACs for moderate-resolution wireless transmitters.

In summary, this dissertation has extended the bandwidth of ΔΣ DACs by one order of magnitude [26] and doubled the sampling rate of DSMs [27] as compared to the existing literature at the start of this dissertation.

6.1 Future Work

This dissertation focused only on two-channel TIDSMs because of the ease of multiplexing. On the other hand, using many channels can reduce the speed required from each channel. This would allow the use of NTFs other than the $NTF(z) = (1 - z^{-1})^m$ type used in this work. Higher in-band SNDR is possible by optimizing the locations of the NTF zeros instead of placing them all at DC [24, 25]. This requires more time-consuming operations in the DSM, such as multiplications (or multiplications implemented through additions), which are easier to achieve at lower clock rates. However, to use this multi-channel approach, new timing-calibration techniques for accurate multiplexing, or data equalization/pre-distortion, must be investigated [76], since ΔΣ DACs are very sensitive to clock timing errors as discussed in Chapter 4.
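The benefit of moving the NTF zeros away from DC can be checked numerically. The sketch below assumes a second-order NTF with a complex-conjugate zero pair at $\pm\omega_z$ and compares $\omega_z = 0$, i.e. $(1 - z^{-1})^2$, with $\omega_z = \omega_B/\sqrt{3}$, which minimizes the in-band noise in the small-angle approximation; the OSR value is only illustrative and the helper name is hypothetical.

```python
import numpy as np


def inband_noise(zero_w: float, w_band: float, points: int = 20001) -> float:
    """Average |NTF(e^jw)|^2 over [0, w_band] for the second-order NTF
    1 - 2cos(zero_w) z^-1 + z^-2 (zeros at e^{+/- j zero_w})."""
    w = np.linspace(0.0, w_band, points)
    return float(np.mean((2 * np.cos(w) - 2 * np.cos(zero_w)) ** 2))


osr = 9.15                      # illustrative OSR (the value quoted in Fig. 5-25)
w_b = np.pi / osr               # band edge in rad/sample
dc = inband_noise(0.0, w_b)                  # both zeros at DC: (1 - z^-1)^2
opt = inband_noise(w_b / np.sqrt(3), w_b)    # zeros moved to w_b / sqrt(3)
gain_db = 10 * np.log10(dc / opt)
print(f"in-band noise reduction: {gain_db:.2f} dB")
```

Under these assumptions, the in-band quantization noise drops by roughly 3.5 dB; the price is the non-trivial coefficient $2\cos\omega_z$ in the loop, i.e. exactly the kind of multiplication that is easier at a lower per-channel clock rate.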

There is also a growing trend of high-speed wideband Nyquist DACs with high resolution, e.g. 12-14 bits, for cable TV applications [16, 73, 77]. For high resolution, the literature shows that hybrid DACs (Nyquist + ΔΣ DACs) have the potential for high linearity [27, 40]. A hybrid DAC is shown in Fig. 6-1, wherein only the $q$ LSBs (typically $q \geq p$) are passed through a DSM while the $p$ MSBs are sent directly to another DAC. The outputs of the two sub-DACs are added to produce the final output. The aim is to combine the advantages of Nyquist DACs (flat noise floor) and ΔΣ DACs (relaxed current cells). Using this technique, a 500-MHz BW hybrid DAC with 12-bit linearity has recently been presented in [40]. Since TI-DSMs can operate at higher speeds, this technique could potentially be used to achieve high linearity over larger bandwidths.
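The MSB/LSB segmentation of Fig. 6-1 can be illustrated with a small Python sketch. It is a hypothetical model, assuming an unsigned code, a first-order 1-bit error-feedback DSM on the LSB path, and an LSB sub-DAC whose single unit has the same weight ($2^q$) as one MSB step; actual hybrid DACs such as [40] may be partitioned differently.

```python
def split_code(code: int, p: int, q: int) -> tuple[int, int]:
    """Split an unsigned (p+q)-bit code into its p MSBs and q LSBs."""
    assert 0 <= code < 1 << (p + q)
    return code >> q, code & ((1 << q) - 1)


def hybrid_output(code: int, p: int, q: int, n: int) -> list[int]:
    """n output samples: MSBs go straight to the Nyquist sub-DAC (weight 2**q);
    the LSBs drive a first-order DSM whose 1-bit output also has weight 2**q."""
    msb, lsb = split_code(code, p, q)
    acc, out = 0, []
    for _ in range(n):
        acc += lsb
        carry = acc >> q              # 1-bit DSM output
        acc &= (1 << q) - 1           # residue stays in the accumulator
        out.append((msb + carry) << q)  # sum of the two sub-DAC outputs
    return out


samples = hybrid_output(0xABC, p=4, q=8, n=256)
print(sum(samples) / len(samples) == 0xABC)  # True
```

Over $2^q$ samples the DSM emits exactly `lsb` ones, so the average output equals the input code even though each sample uses only MSB-weighted levels.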

Finally, the increased frequency obtained by time-interleaving is also beneficial for band-pass DSMs and digital mixers, as a higher IF is possible. Figure 6-2 shows a potential digital-IF architecture using TI-DSMs. The effective sample rate is \( f_s \) while the digital-IF frequency is \( f_s/4 \). Note that even in this case the overall output is sensitive to the duty cycle of the \( f_s/2 \) clock. Figure 6-3 shows the simulated spectrum of a 10 GS/s DAC with an IF of 2.5 GHz for a single-tone input of 250 MHz. The single tone is shifted to 2.75 GHz by the mixing. Assume that the channel lies in the 2.25-2.75 GHz range. Due to DCE, the shaped noise between 2.25 GHz and 2.5 GHz folds onto the band between 2.5 and 2.75 GHz, and vice versa. However, as this noise is at a low level, its effect on the SNDR is smaller than in the case of a simple low-pass DSM. On the other hand, the single tone at 2.75 GHz results in an in-band image at 2.25 GHz due to the DCE, which was not the case in low-pass DSMs. Similar to Chapter 4, the effect of DCE on this architecture could be studied further. Furthermore, by using the second image of a DAC [25], it would also be possible to transfer the baseband directly to an RF carrier frequency of 7.5 GHz in this case without explicitly using an upconversion mixer. This architecture could be useful for the UWB standard, which has channel bandwidths of 528 MHz and carrier frequencies from 3.4-10.3 GHz [63].
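The frequency translation described above can be sketched with the trivial $f_s/4$ local-oscillator sequences. This sketch assumes quadrature (I/Q) baseband inputs and an ideal, DCE-free output; that is one way to obtain the single 2.75 GHz tone described for Fig. 6-3, but the text does not specify the mixer implementation, and the helper name is hypothetical.

```python
import numpy as np


def digital_if_ssb(f0_hz: float, fs_hz: float, n: int) -> np.ndarray:
    """Quadrature up-conversion of a baseband tone to f_s/4 + f0 using the
    f_s/4 LO sequences cos(pi*k/2) = 1,0,-1,0 and sin(pi*k/2) = 0,1,0,-1."""
    k = np.arange(n)
    i_bb = np.cos(2 * np.pi * f0_hz / fs_hz * k)
    q_bb = np.sin(2 * np.pi * f0_hz / fs_hz * k)
    lo_cos = np.array([1, 0, -1, 0])[k % 4]
    lo_sin = np.array([0, 1, 0, -1])[k % 4]
    return i_bb * lo_cos - q_bb * lo_sin


fs, n = 10e9, 1000                 # 10 MHz bins, so all tones fall on bins
y = digital_if_ssb(250e6, fs, n)
spec = np.abs(np.fft.rfft(y)) / n
freqs = np.fft.rfftfreq(n, d=1 / fs)
print(f"peak at {freqs[np.argmax(spec)] / 1e9:.2f} GHz")
print(f"image at 2.25 GHz: {spec[225]:.1e}")
```

Because the LO values are only 0 and ±1, the mixing reduces to sign changes and zeroing, which keeps an $f_s/4$ digital IF cheap at these rates. With no DCE the 2.25 GHz image is at numerical-noise level; modelling the DCE would be needed to reproduce the image spur of Fig. 6-3.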
Figure 6-2: Digital IF with TIDSM DACs.

Figure 6-3: Digital IF of 2.5 GHz using a 10 GS/s TIDSM DAC. DCE causes a close-in image spur.
References




Appendix A

Published Papers

Journals


The following journal paper based on Chapter 5 is not included in the appendix since its final print version is not yet available:


Conferences


Papers

The articles associated with this thesis have been removed for copyright reasons. For more details about these see:

http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-120626