Addressing Transient and Permanent Faults in NoC With Efficient Fault-Tolerant Deflection Router
Chaochao Feng, Zhonghai Lu, Axel Jantsch, Minxuan Zhang, Zuocheng Xing
DOI: https://doi.org/10.1109/TVLSI.2012.2204909
2013-01-01
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abstract: The continuing decrease in the feature size of integrated circuits increases their susceptibility to transient and permanent faults. This paper proposes a fault-tolerant solution for a bufferless network-on-chip (NoC). It comprises an on-line fault-diagnosis mechanism to detect both transient and permanent faults, a link-level error control scheme that combines hybrid automatic repeat request (ARQ) and forward error correction (FEC) to handle transient faults, and a reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to tolerate permanent faults without deadlock or livelock. A hierarchical-routing-table-based algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR router. Synthesis results show that, compared with the FTDR router, the FTDR-H router reduces the area by 27% in an 8 × 8 network. Simulation results demonstrate that under synthetic workloads, in the presence of permanent link faults, the throughput of an 8 × 8 network with the FTDR and FTDR-H algorithms is on average 14% and 23% higher than with the fault-on-neighbor (FoN) aware deflection routing algorithm and the cost-based deflection routing algorithm, respectively. Under real application workloads, the FTDR-H algorithm achieves a 20% lower hop count on average than the FoN algorithm. For transient faults, the performance of the FTDR router degrades gracefully even at a high fault rate. We also implement the fault-tolerant deflection router, which achieves 400 MHz in TSMC 65-nm technology.
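
To make the idea of table-driven fault-tolerant deflection routing concrete, the following is a minimal sketch of a generic Q-routing-style scheme. It is not the FTDR algorithm as specified in the paper. The table layout, the function names (`update_estimate`, `route`), the port labels, and the use of infinity to mark a faulty link are all assumptions made for illustration. Each router keeps, per destination and output port, an estimated hop count that it refreshes from its neighbor's best estimate; a packet takes the best free, non-faulty port and is deflected to the next best one when the preferred port is occupied.

```python
# Illustrative sketch only: a generic Q-routing-style table, not the paper's FTDR.
FAULTY = float("inf")  # assumed marker for a port whose link is permanently faulty


def update_estimate(table, dest, port, neighbor_best, link_ok=True):
    """Refresh the hop estimate for reaching `dest` through `port`.

    `neighbor_best` is the neighbor's own best estimate to `dest`;
    one hop is added for the link to that neighbor.
    """
    table[dest][port] = 1 + neighbor_best if link_ok else FAULTY
    return table[dest][port]


def route(table, dest, free_ports):
    """Pick the free, non-faulty port with the lowest estimate.

    If the preferred port is taken, the packet is deflected to the
    next best free port. Returns None if no usable port is free.
    """
    entries = table[dest]
    usable = [p for p in entries if entries[p] != FAULTY and p in free_ports]
    return min(usable, key=lambda p: entries[p]) if usable else None


# Hypothetical router state: estimated hops to destination "D" per output port.
table = {"D": {"N": 3, "E": 2, "S": 5, "W": 4}}
update_estimate(table, "D", "E", 0, link_ok=False)  # east link diagnosed as faulty
first_choice = route(table, "D", {"N", "S", "W"})   # best remaining port
deflected = route(table, "D", {"S", "W"})           # north is busy: deflect
```

In this sketch a faulty port is simply never selected, and estimates adapt as neighbors report new values, which is the general mechanism by which learned routing tables steer packets around permanent faults.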