A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

Khanh N Dang, Michael Meyer, Yuichi Okuyama, Abderazek Ben Abdallah
DOI: https://doi.org/10.1007/s11227-016-1951-0
2020-03-21
Abstract: The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the decline in hardware reliability of three-dimensional Network-on-Chip (3D-NoC) systems caused by the continued shrinking of technology nodes. Specifically, it focuses on tolerating both soft errors and hard faults in highly integrated, high-performance many-core 3D-NoC systems. As transistor sizes shrink, operating voltages drop, and power densities rise, these systems become increasingly susceptible to permanent faults (caused by, for example, time-dependent dielectric breakdown and electromigration) and transient errors (caused by, for example, crosstalk, radiation particles, and cosmic rays). A hard fault can cause congestion across the whole network, while a soft error can silently corrupt data; both severely affect the reliability and performance of the system.
To address these problems, the paper proposes a low-overhead soft-error and hard-fault-tolerant architecture, design, and management scheme: the 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO) system. Using efficient mechanisms and algorithms, the system detects and recovers from soft errors that occur in the routing pipeline stages, and it uses reconfigurable components to handle permanent faults in links, input buffers, and crossbars. As a result, 3D-FETO keeps operating under different hard-fault and soft-error conditions, ensuring graceful performance degradation while minimizing additional hardware complexity and remaining power efficient.
The main contributions of the paper include: 1. A new self-adaptive 3D router architecture, based on a robust hardware reconfiguration mechanism, that can detect and recover from soft errors in the routing pipeline stages. 2. An efficient online-controlled fault detection and diagnosis scheme.
Through these contributions, 3D-FETO aims to provide a reliable solution for future 3D-NoC systems facing increasingly severe hardware reliability challenges.
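The idea of detecting and recovering from a soft error in a routing pipeline stage can be illustrated in software. The sketch below is not the 3D-FETO design, which this summary does not describe in detail. It assumes a generic temporal-redundancy approach: a stage is executed twice, a mismatch is treated as a detected soft error, and the stage is re-executed until two results agree. The names `route_xyz` and `run_with_redundancy`, the port labels, and the use of dimension-order (XYZ) routing are all hypothetical choices made for illustration.

```python
from collections import Counter


def route_xyz(cur, dst):
    """Next output port under dimension-order (XYZ) routing in a 3D mesh.

    cur and dst are (x, y, z) tuples. XYZ routing is an assumption made
    for this sketch, not something stated in the summary.
    """
    port_names = (("EAST", "WEST"), ("NORTH", "SOUTH"), ("UP", "DOWN"))
    for axis, (positive, negative) in enumerate(port_names):
        if dst[axis] > cur[axis]:
            return positive
        if dst[axis] < cur[axis]:
            return negative
    return "LOCAL"  # the packet has reached its destination router


def run_with_redundancy(stage, max_retries=3):
    """Run a pipeline stage redundantly and recover from a transient upset.

    stage is a zero-argument callable that may occasionally return a
    corrupted value. Returns (value, error_detected).
    """
    first, second = stage(), stage()
    if first == second:
        return first, False  # both runs agree: no error detected

    # Mismatch: a soft error hit one of the runs. Re-execute until any
    # value has been produced twice, then accept that value.
    results = [first, second]
    for _ in range(max_retries):
        results.append(stage())
        value, votes = Counter(results).most_common(1)[0]
        if votes >= 2:
            return value, True
    raise RuntimeError("stage results never agreed; possible permanent fault")


def make_stage(outputs):
    """Build a stage that replays a fixed list of outputs (fault injection)."""
    it = iter(outputs)
    return lambda: next(it)


if __name__ == "__main__":
    good = route_xyz((0, 0, 0), (2, 1, 0))
    # The first execution is corrupted; the following ones are correct.
    stage = make_stage(["WEST", good, good])
    print(run_with_redundancy(stage))
```

Re-executing only on a mismatch keeps the fault-free path short, which matches the low-overhead goal in spirit. Repeated disagreement is reported as an exception here, since a persistent mismatch points to a hard fault, which the paper handles with reconfigurable components instead.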