A 128 Gbps PAM-4 Feed Forward Equaliser with Optimized 1UI Pulse Generator in 65 nm CMOS
Jiawei Wang, Hao Xu, Ziqiang Wang, Haikun Jia, Hanjun Jiang, Chun Zhang, Zhihua Wang
DOI: https://doi.org/10.1049/cds2.12151
2023-01-01
Abstract: A quarter-rate PAM-4 FFE employing an INCC 1UIPG is implemented in 65 nm CMOS. The proposed INCC 1UIPG reduces the average transition time by ~20%, reduces clocking power consumption by ~1.5X, and lowers jitter amplification by about 2 to 5 dB compared with previous works. Along with the bandwidth- and power-efficient partially segmented, tailless, 1-stage front-end architecture, the proposed FFE achieves a 128 Gbps PAM-4 data rate in an area of 0.014 mm².
This letter presents a 4-level Pulse Amplitude Modulation (PAM-4) Feed Forward Equaliser (FFE) with a novel Internal Node Charge Controlled 1-Unit Interval Pulse Generator (INCC 1UIPG). A partially segmented architecture and a tailless 1-stage front end are chosen to reduce the overall load capacitance for better bandwidth and power performance. The proposed INCC 1UIPG adopts a stacking-reduced structure and precisely controls the internal nodes, demonstrating advantages in speed, power, and jitter, and showing better potential for working at an ultra-high baud rate. The wider bandwidth and faster transition edges allow the equaliser to work at 128 Gbps with an area of 0.014 mm² in 65 nm CMOS.
The ever-increasing bandwidth demand in high-performance computing and other applications continues to push up the data rate of wireline communication systems, with some protocols already requiring data rates in excess of 50 Gbaud, which poses serious challenges to transceiver design. State-of-the-art transmitters (TXs) adopt a hybrid architecture that combines the advantages of their analogue [1, 2] and Digital-to-Analogue Converter (DAC)-based [3] counterparts, offering high resolution and low complexity together with flexible and efficient Finite Impulse Response (FIR) tuning; this is called the segmented FFE architecture [4-6] in this letter. To further ease bandwidth pressure, high-speed TXs tend to reduce the number of full-rate nodes. By combining the 4:1 MUX into the pre-driver, the authors in ref. [2] reduce this number to 2, with the internal full-rate nodes peaked by inductors. However, this technique is not suitable for DAC-based and segmented TXs because of area considerations. Another approach merges the 4:1 pre-driver into the driver [1, 4, 6], thereby eliminating all internal full-rate nodes; this is called a 1-stage front end in this letter. In a 1-stage front end, the total capacitance of the output stage becomes even more critical, since it determines the achievable bandwidth and the overall power dissipation. At the extreme, some designs employ a tailless CML driver to obtain the smallest size for a specified output swing [5, 6]. Given that this letter targets an aggressive 128 Gbps data rate in 65 nm CMOS technology, a segmented architecture and a 1-stage tailless front end are chosen, with the FIR taps designed to be only partially adjustable to ease the bandwidth pressure even further.
Contrary to this trend in the front end, the high-performance full-rate 1UIPGs widely used in quarter-rate architectures tend to adopt a multi-stage structure [5, 6] to improve speed, because it is difficult to optimise both edges of the pulse in a single stage, which usually requires 3-stacked devices [1, 3]. The authors in ref. [4] proposed a pre-charged structure that generates the 1UI pulse in a single-stage circuit. However, this technique is not suitable for a tailless CML driver, in which any pre-charge level is translated to the output immediately. The authors in ref. [6] adopt a 2-stage structure to avoid 3-stacked paths. Unfortunately, the 2-stacked devices on the critical path and the undriven internal nodes ultimately limit the achievable speed. To address these drawbacks, the proposed 2-stage INCC 1UIPG optimises the two edges of the pulse separately, reduces device stacking on critical paths, and properly controls the internal nodes, showing the best potential for working at ultra-high speed.
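The rates involved in this target are tied together by simple arithmetic, which the following minimal Python sketch makes explicit. It assumes 2 bits per PAM-4 symbol and a 4-way (quarter-rate) interleave; the function and key names are illustrative and are not taken from the original design.

```python
# Illustrative timing arithmetic for a quarter-rate PAM-4 transmitter.
# Function and key names are hypothetical; only the 128 Gbps target,
# PAM-4 signalling and quarter-rate clocking come from the letter.
BITS_PER_PAM4_SYMBOL = 2


def timing_budget(data_rate_gbps, interleave=4):
    """Derive baud rate, unit interval, Nyquist frequency and clock rate."""
    baud_gbd = data_rate_gbps / BITS_PER_PAM4_SYMBOL
    return {
        "baud_gbd": baud_gbd,                 # symbols per ns
        "ui_ps": 1000.0 / baud_gbd,           # one unit interval in ps
        "nyquist_ghz": baud_gbd / 2,          # half the baud rate
        "clock_ghz": baud_gbd / interleave,   # quarter-rate clock
    }


budget = timing_budget(128.0)
for name, value in budget.items():
    print(f"{name} = {value}")
```

For the 128 Gbps target this gives 64 Gbaud, a 15.625 ps unit interval, a 32 GHz Nyquist frequency, and a 16 GHz quarter-rate clock, which is the window each 1UI pulse has to fit into.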
Figure 1 shows the overall architecture of the proposed equaliser (half circuits). The data path is divided into MSB and LSB blocks to generate the PAM-4 output, where the MSB block is composed of two identical LSB blocks for good linearity. Each block is further divided into three groups of slices, X1, X2, and X6, forming a 3-bit DAC. X1 and X2 can be configured as a main tap or a post tap as required, while X6 is fixed as a main tap. The FIR timing is generated at 1/8 rate with the C8 clock. Subsequently, the X1 and X2 slices select data with different timing under the control of FFE_DAC<1:0> to be configured as different taps, while X6 experiences a matching delay. The selected 4-bit parallel data become time-interleaved 1UI pulses in the proposed INCC 1UIPG and are finally combined in the 4:1 tailless CML driver. When assigned as a post tap, the output current of the drivers of the X1 and X2 slices can be further continuously adjusted through the bias of their cascode transistors.
Figure 1: Feed Forward Equaliser (FFE) architecture (half circuits).
The proposed equaliser adopts a partially segmented quarter-rate architecture and a 1-stage tailless front end to reduce the overall load capacitance and achieve the aggressive target of 128 Gbps. A 3-bit DAC provides coarse tuning, and fine tuning is implemented in the analogue domain, forming a segmented architecture. Since the X1 and X2 slices can be allocated to the main tap when 'strong' equalisation is not required, the equaliser is more bandwidth- and power-efficient than its analogue counterpart, in which the main tap driver itself must be sized to deliver the specified output swing and every equalisation tap driver introduces additional loading. At the same time, the DAC can remain simple, with low circuit complexity and small parasitic capacitance. (A 'pure' DAC-based TX needs many more bits and complex calibration for resolution and linearity considerations.)
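As a rough behavioural illustration of the partially segmented tap allocation described above, the sketch below maps an FFE_DAC<1:0> code to normalised main and post tap weights and applies them as a 2-tap FIR to PAM-4 symbols. Several details are assumptions made only for this example: that bit 0 moves X1 and bit 1 moves X2 to the post tap, that the post tap has negative polarity, that the analogue fine tuning can be modelled as a scale factor `fine` between 0 and 1, and all function names.

```python
# Behavioural sketch of partially segmented FFE tap allocation.
# ASSUMPTIONS (not from the letter): FFE_DAC bit 0 -> X1, bit 1 -> X2;
# the post tap is subtracted; `fine` in [0, 1] models cascode-bias tuning.
SLICE_UNITS = {"X1": 1, "X2": 2, "X6": 6}   # X6 is always a main tap
TOTAL_UNITS = sum(SLICE_UNITS.values())      # 9 unit slices per block


def tap_weights(ffe_dac, fine=1.0):
    """Return (main, post) tap weights normalised to the full block."""
    post_units = 0
    if ffe_dac & 0b01:
        post_units += SLICE_UNITS["X1"]
    if ffe_dac & 0b10:
        post_units += SLICE_UNITS["X2"]
    main = (TOTAL_UNITS - post_units) / TOTAL_UNITS
    post = -fine * post_units / TOTAL_UNITS if post_units else 0.0
    return main, post


def pam4_level(msb, lsb):
    """The MSB block is two LSB blocks, so it carries twice the weight."""
    return 2 * (2 * msb - 1) + (2 * lsb - 1)   # one of -3, -1, +1, +3


def equalise(symbols, main, post):
    """y[n] = main * x[n] + post * x[n-1], with x[-1] taken as 0."""
    out, prev = [], 0
    for x in symbols:
        out.append(main * x + post * prev)
        prev = x
    return out


main, post = tap_weights(0b11)               # X1 and X2 both as post tap
symbols = [pam4_level(1, 1), pam4_level(1, 1), pam4_level(0, 0)]
print(main, post, equalise(symbols, main, post))
```

In this model, moving X1 and X2 to the post tap attenuates a repeated symbol while a full-swing transition keeps its full amplitude; setting `ffe_dac` to 0 returns all nine units to the main tap.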
Moreover, the front end is designed to be partially adjustable: the largest X6 slices are fixed as the main tap, which allows their cascode transistors to be removed to further reduce the driver size for the same output swing. This greatly reduces the load capacitance at the cost of tuning flexibility.
Figure 2 shows three 1UIPGs with different structures and their timing diagrams at 64 Gbaud, with the critical paths in stage 1 marked in red. Figure 2a adopts a single-stage structure, which uses the falling edge of CKQ and the rising edge of CKI to select the low level of the data, where M3 is used to control the internal charge of N2. This structure achieves a 112 Gbps PAM-4 data rate in [1] and 224 Gbps in [3], both in 10 nm CMOS. Although the charge of internal node N2 is properly controlled, there is a 3-stacked charging path (M1-M2-M4), which leads to a slow rising edge at the output and makes it difficult to reach full swing at a high baud rate.
Figure 2: Comparison of 3 types of 1UIPGs under 64 Gbaud.
The authors in ref. [6] adopt a two-stage architecture to avoid 3-stacked paths, as shown in Figure 2b. Using the rising edge of CKQ and the falling edge of CKI to select the high level of the data, this structure achieves a PAM-4 data rate of 200 Gbps in 28 nm CMOS. In the first stage, when the data is high and the rising edge of CKQ arrives, OUT1, which is originally high, is pulled down. In the second stage, M6 pre-charges N2 when CKIB is pulled down, and the falling edge of OUT1 controls M6 and M7 to charge OUT2, thus producing its rising edge. Subsequently, the rising edge of CKIB controls M8 to discharge OUT2, thus producing its falling edge. The pre-charged 2-stacked path allows OUT2 to have a faster rising edge. However, the speed of this structure is still limited for the following reasons.
Firstly, the falling edge of OUT1, which is used to produce the final 1UI pulse, is generated by a 2-stacked path in which N1 must be discharged first when M2 and M3 try to pull down OUT1. More importantly, when CKIB changes from low to high, OUT1 remains low for a period, so M8 needs to discharge not only OUT2 but also N2 at the same time, which leads to a slow falling edge of the final 1UI pulse.
The proposed INCC 1UIPG is shown in Figure 2c. Unlike (b), this 2-stage structure uses the rising edge of OUT1 and the falling edge of CKQB to generate the final 1UI pulse. In the first stage, the falling edge of CKI controls the single transistor M1 to produce the rising edge of OUT1 when the data is high. Since M2 has already been turned off, node N1 no longer affects this charging process. Considering that the falling edge of OUT1 is non-critical and N1 can be pre-discharged by M3, the related transistors can be made smaller, which further expands the bandwidth of OUT1. Meanwhile, CKQ generates CKQB through an inverter, matching the delay of the CKI path to ensure an accurate 1UI pulse width under PVT variations. In the second stage, M6 pre-charges N2 when OUT1 is low, and the rising edge of OUT2 is finally generated by the falling edge of CKQB. It is important to note that M8 and M9 discharge OUT2 and N2 simultaneously at the rising edge of OUT1, accelerating the falling edge of the final 1UI pulse. In this two-stage structure, the bandwidth of the intermediate node OUT1 is further optimised and all the internal nodes (N1 and N2) are properly controlled, resulting in a higher-performance 1UI pulse.
Figure 3 shows a use case of the three aforementioned 1UIPGs. Use cases 1, 2, and 3 are obtained by using structures (a), (b), and (c) in Figure 2 as the 1UIPG in Figure 3, respectively. Note that the three use cases have the same input clock and data buffers and employ a 4:1 multiplexer of the same size for a fair comparison (marked in red in Figure 3).
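Independently of which transistor-level structure is used, the logical job of a 1UIPG in a quarter-rate transmitter can be sketched behaviourally: each of the four data lanes is gated by a 1UI window derived from two adjacent quarter-rate clock phases, and the four pulse trains are merged by the 4:1 multiplexer. The model below is an idealised, zero-delay abstraction with hypothetical function names; it does not represent the specific clock edges, pre-charging, or internal-node control of the circuits in Figure 2.

```python
# Idealised behavioural model of 1UI pulse generation and 4:1 muxing.
# Time is counted in whole UIs; all names are illustrative assumptions.

def quarter_rate_clock(phase, n_ui):
    """50 % duty clock with a 4-UI period, high for 2 UIs from `phase`."""
    return [1 if (t - phase) % 4 < 2 else 0 for t in range(n_ui)]


def one_ui_window(lane, n_ui):
    """Window for `lane`: its own phase AND NOT the next phase (1 UI wide)."""
    ck = quarter_rate_clock(lane, n_ui)
    ck_next = quarter_rate_clock(lane + 1, n_ui)
    return [a & (1 - b) for a, b in zip(ck, ck_next)]


def serialise(bits):
    """Split bits into 4 quarter-rate lanes, gate each lane, then merge."""
    assert len(bits) % 4 == 0, "pad the stream to a multiple of 4 bits"
    n_ui = len(bits)
    out = [0] * n_ui
    for lane in range(4):
        lane_data = bits[lane::4]                  # one bit per 4-UI frame
        window = one_ui_window(lane, n_ui)
        for t in range(n_ui):
            pulse = lane_data[t // 4] & window[t]  # 1UI pulse of this lane
            out[t] |= pulse                        # mux merges the lanes
    return out


bits = [1, 0, 1, 1, 0, 0, 1, 0]
assert serialise(bits) == bits   # the full-rate stream is reconstructed
```

Because each window is exactly one UI wide and the four windows do not overlap, the merged output reproduces the full-rate bit stream; in a real circuit the quality of that reconstruction depends on how fast the pulse edges are.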
From the analysis and simulation results, we can explain the following properties of the proposed INCC 1UIPG.
Figure 3: A use case of the three aforementioned 1UIPGs.
Figure 4 shows simulation results of the 1UI pulses over PVT variations for the three use cases at 64 Gbaud. As shown in Figure 4a, use case #1 has the largest rise time due to the 3-stacked charging path. Figure 4b illustrates the limited fall time of use case #2 due to the uncontrolled internal nodes. Figure 4c compares the average transition time of the 1UI pulses. Use case #3 shows the best performance with the help of 2-stacked dynamic logic and proper INCC. Compared with use cases #1 and #2, its average transition time is reduced by 22% and 17%, respectively, at the TT corner.
Figure 4: Simulation results of 1UI pulses over PVT variations of the 3 use cases. (a) Rise time, (b) fall time, and (c) average time.
A faster slew rate of the 1UI pulse speeds up the charging and discharging of the 4:1 multiplexer output, extends the bandwidth, and therefore reduces its deterministic jitter (DJ). Moreover, the sharper slope at the transition point of the pulse generator and multiplexer outputs reduces the conversion of their intrinsic voltage noise into jitter. Figure 5 shows the 4:1 multiplexer output DJ of the three use cases at 64 Gbaud and 80 Gbaud. Use case #3 shows the lowest output DJ, demonstrating its potential to work at higher baud rates.
Figure 5: Simulation results of 4:1 multiplexer output of the 3 use cases.
Reducing device stacking on a critical path also allows smaller device sizes, which reduces the total loading of the clock and data paths and therefore the power consumption of their buffers. Minimising the clock loading is attractive because it reduces the design effort of the clocking network, which must take speed, jitter, and power consumption fully into consideration.
Specifically, the critical edges of the proposed INCC 1UIPG (the rising edge of OUT1 and the falling edge of CKQB, see Figure 2) are each generated by a stacking-free transistor (M1 and M5). M2 cuts off the pull-down path and shields node N1 when M1 charges OUT1, so M1 can be small in size, just as in an inverter. The falling edge of OUT1 is non-critical, so M2 and M3 can be sized even smaller. By contrast, the critical edge in use case #2, the falling edge of OUT1, is generated by the stacked devices M2 and M3, and N1 cannot be discharged in advance, so the related transistors cannot be small (M3 is twice the size of M2 in use case #2, increasing clock loading by about 1.5X). Similarly, M2 is twice the size of M4 and M1 is three times the size of M4 in use case #1. Figure 6 shows the power breakdown of the three use cases. Since the fan-out factor of the buffers cannot be large for speed and jitter reasons (we use FO2 for the 16 GHz clock in 65 nm CMOS), the heavier loading of the data and clock paths leads to more buffer stages, greater total power dissipation, and more clock jitter. Considering the large number of slices in an actual TX (~6X of the use case is needed for a 1.2 Vppd output swing), these power savings are very attractive.
Figure 6: Power breakdown of the 3 use cases.
Stacked devices also underperform in terms of jitter amplification due to their poor slope. We designed a simulation to verify this. As shown in Figure 7, a small jitter impulse (1 ps in this simulation) is injected into one of the quarter-rate clocks (C0 in this simulation). The clock and data buffers are removed in this simulation, and an ideal clock source with a fixed slope is used as a substitute to eliminate the impact of the multi-stage buffers. By recording the transient response of the pulse generator and multiplexer outputs while transmitting repeating clock patterns in the three use cases, we obtain their jitter impulse response (JIR).
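The post-processing of such a recorded response can be sketched as follows: the edge-timing deviations are normalised to the injected jitter and transformed with a discrete-time Fourier transform to give a gain in dB versus normalised frequency. This is a minimal sketch under stated assumptions, not the authors' script: the function name, the frequency grid, and the toy input samples are hypothetical, and frequency is expressed in cycles per recorded edge.

```python
import cmath
import math


def jitter_transfer_db(jir_ps, injected_ps, n_freq=33):
    """DTFT magnitude (dB) of a jitter impulse response.

    jir_ps      -- timing deviation of successive output edges, in ps
    injected_ps -- size of the injected jitter impulse, in ps
    Returns (f, gain_db) pairs for f from 0 to 0.5 cycles per edge.
    """
    h = [x / injected_ps for x in jir_ps]          # normalise to injection
    result = []
    for k in range(n_freq):
        f = 0.5 * k / (n_freq - 1)
        H = sum(hn * cmath.exp(-2j * math.pi * f * n)
                for n, hn in enumerate(h))
        mag = max(abs(H), 1e-12)                   # avoid log10(0)
        result.append((f, 20 * math.log10(mag)))
    return result


# Toy input only (NOT data from the letter): a 1 ps injection whose effect
# lingers on the following edge at half amplitude.
toy_jir_ps = [1.0, 0.5]
for f, gain_db in jitter_transfer_db(toy_jir_ps, injected_ps=1.0, n_freq=5):
    print(f"f = {f:.3f}  gain = {gain_db:.2f} dB")
```

A response confined to a single edge gives 0 dB at every frequency, so any gain above 0 dB in such a plot indicates jitter amplification.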
After normalising the JIRs to the input injection, we obtained the corresponding jitter transfer functions (JTFs) of the three use cases by Discrete-time Fourier Transform.
Figure 7: Jitter amplification simulation method.
Figure 8 shows the simulated JIR and JTF at 64 Gbaud. Use case #3 shows a milder JIR and about 5 dB and 2 dB lower jitter amplification than use cases #1 and #2, respectively.
Figure 8: Simulated jitter impulse response (JIR) and jitter transfer function (JTF) of the 3 use cases.
The FFE prototype chip is fabricated in 65 nm CMOS technology with a core area of 0.014 mm², as shown in Figure 9a. Figure 9b shows the post-layout simulation results of the proposed INCC 1UIPG working at 64 Gbaud. The 1UI pulse eye, with a 10.83 ps rise time and an 11.33 ps fall time, is shown in Figure 9c. The pulse is full-swing and fast enough to drive the subsequent tailless CML transistors. The power breakdown of the FFE is shown in Figure 9d; it covers only the high-speed data path of the TX prototype, since the design of the high-performance clocking network is not discussed in this letter and its power consumption is therefore not included. The FFE slices (FFE selectors + D4 buffers + INCC 1UIPGs, as shown in Figure 2) consume about half of the power of the data path. The driver stage consumes about 45.6 mW to provide a ~1 Vppd output swing.
Figure 9: Layout details and post-layout simulation results of the Feed Forward Equaliser (FFE) prototype chip.
The channel responses with 2.7 dB, 5.7 dB, and 10.3 dB insertion loss at the Nyquist frequency (32 GHz) are shown in Figure 9e. Figure 9f shows the 128 Gbps PRBS15 eye after a 2.7 dB channel loss. Figures 9g to 9j compare the 128 Gbps PRBS15 PAM-4 eye with and without TX FFE under 5.7 dB and 10.3 dB channel loss, respectively. By adjusting the coefficients of the segmented equaliser appropriately, the eye can be opened up to 0.49 UI with approximately 95 mVppd height per sub-eye for a 10.3 dB loss.
Table 1 summarises the performance of the proposed FFE and compares it with the high-speed data paths of reported quarter-rate PAM-4 TXs.
A quarter-rate PAM-4 FFE employing an INCC 1UIPG is implemented in 65 nm CMOS. The proposed INCC 1UIPG reduces the average transition time by ~20%, reduces clocking power consumption by ~1.5X, and lowers jitter amplification by about 2 to 5 dB compared with previous works. Along with the bandwidth- and power-efficient partially segmented, tailless, 1-stage front-end architecture, the proposed FFE achieves a 128 Gbps PAM-4 data rate in an area of 0.014 mm².
Jiawei Wang: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing. Hao Xu: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – review and editing. Ziqiang Wang: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review and editing. Haikun Jia: Methodology, Resources, Software, Validation, Writing – review and editing. Hanjun Jiang: Methodology, Resources, Software, Writing – review and editing. Chun Zhang: Methodology, Resources, Software, Writing – review and editing. Zhihua Wang: Funding acquisition, Methodology, Project administration, Resources, Supervision.
This work is supported by the Shenzhen Science and Technology Program (No. JCYJ20180306170609470) and the Key Research and Development Plan of Shandong Province (No. 2022CXGC010109).
The authors declare that they do not have any conflicts of interest.
The data that support the findings of this study are available from the corresponding author upon reasonable request.