M2STaR: A Multimode Spatio-Temporal Redundancy Design for Fault-Tolerant Coarse-Grained Reconfigurable Architectures
Xiangyu Kong, Jianfeng Zhu, Xingchen Man, Guihuan Song, Yi Huang, Chenchen Deng, Pengfei Gou, Shouyi Yin, Shaojun Wei, Leibo Liu
DOI: https://doi.org/10.1109/tcad.2023.3239563
IF: 2.9
2023-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: Coarse-grained reconfigurable architectures (CGRAs) can provide both energy efficiency and performance for embedded systems, and thus they are increasingly deployed in aerospace, automotive engineering, and security, where reliability is also a main criterion. However, state-of-the-art fault-tolerant strategies for CGRAs apply either a temporal or a spatial scheme, including redundancy, periodic detection, workload balancing, and reconfiguration, and fail to exploit the dynamic and partial reconfiguration of CGRAs. In addition, vulnerable judging circuits and inflexible mode shifting are bottlenecks in the reliability design of fault-tolerant CGRAs. This article proposes a novel multimode fault-tolerant framework for CGRAs that combines spatial-redundant data paths with temporal-redundant voters, thereby reducing the vulnerable judging circuits while balancing performance and reliability. The framework also enables the reliability level to change at runtime via an online configuration transformation method based on precompiled patterns. Within the proposed framework, we systematically searched the design space spanning various combinations of the mainstream schemes, using a Markov process model to compare their effectiveness, and accordingly selected five points as the available modes in our design after comprehensive consideration of fault tolerance and time overhead on the CGRA. The framework is comprehensively evaluated on a cycle-accurate CGRA simulator, considering both permanent and transient faults. The experimental results show that the fault coverage rate for single transient or permanent faults increases from 71.74% to 93.84%, which means the fault tolerance of the system is increased by 31.03% compared with state-of-the-art methods. There are also substantial improvements in mean-time-to-failure (MTTF) and reconfiguration latency over baseline designs.
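
The abstract does not spell out the Markov process model used to compare redundancy schemes. As a rough illustration of how such a model can yield an MTTF figure, the sketch below treats a k-of-n redundant group as a pure-death Markov chain. It assumes independent modules with a constant failure rate, no repair, and a perfect voter; these assumptions, and the name `mttf_k_of_n`, are hypothetical and not taken from the paper.

```python
def mttf_k_of_n(n: int, k: int, failure_rate: float) -> float:
    """MTTF of a k-of-n redundant group modeled as a pure-death Markov chain.

    Assumptions (illustrative only, not the paper's model):
      - each module fails independently at a constant rate `failure_rate`
      - no repair or reconfiguration takes place
      - the voter itself never fails
    With i modules still working, the chain leaves that state at rate
    i * failure_rate, so the expected time spent there is 1 / (i * failure_rate).
    The group fails once fewer than k modules remain.
    """
    if not (1 <= k <= n):
        raise ValueError("require 1 <= k <= n")
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    # Sum the expected holding times of every operational state (n down to k).
    return sum(1.0 / (i * failure_rate) for i in range(k, n + 1))


if __name__ == "__main__":
    lam = 1e-4  # hypothetical failures per hour
    print("simplex MTTF:", mttf_k_of_n(1, 1, lam))
    print("TMR (2-of-3) MTTF:", mttf_k_of_n(3, 2, lam))
```

Under these assumptions, triple modular redundancy without repair has a lower MTTF (5/(6λ)) than a single module (1/λ), even though it tolerates any single fault. This is one reason a design might pair redundancy with reconfiguration and expose several modes rather than rely on a fixed scheme.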