FASHION: Fault-Aware Self-Healing Intelligent On-chip Network

Pengju Ren, Michel A. Kinsy, Mengjiao Zhu, Shreeya Khadka, Mihailo Isakov, Aniruddh Ramrakhyani, Tushar Krishna, Nanning Zheng
DOI: https://doi.org/10.48550/arXiv.1702.02313
2017-02-08
Abstract: To avoid packet loss and deadlock scenarios that arise due to faults or power gating in multicore and many-core systems, the network-on-chip needs to possess resilient communication and load-balancing properties. In this work, we introduce the Fashion router, a self-monitoring and self-reconfiguring design that allows the on-chip network to dynamically adapt to component failures. First, we introduce a distributed intelligence unit, called Self-Awareness Module (SAM), which allows the router to detect permanent component failures and build a network connectivity map. Using local information, SAM adapts to faults, guarantees connectivity and deadlock-free routing inside the maximal connected subgraph, and keeps routing tables up-to-date. Next, to reconfigure network links or virtual channels around faulty/power-gated components, we add bidirectional link and unified virtual channel structure features to the Fashion router. This version of the router, named Ex-Fashion, further mitigates the negative system performance impacts, leads to a larger maximal connected subgraph, and sustains a relatively high degree of fault tolerance. To support the router, we develop a fault diagnosis and recovery algorithm executed by the Built-In Self-Test, self-monitoring, and self-reconfiguration units at runtime to provide fault-tolerant system functionalities. The Fashion router places no restriction on topology, position, or number of faults. It drops 54.3-55.4% fewer nodes for the same number of faults (between 30 and 60 faults) in an 8x8 2D-mesh than other state-of-the-art solutions. It is scalable and efficient. The area overheads are 2.311% and 2.659% when implemented in 8x8 and 16x16 2D-meshes using the TSMC 65nm library at 1.38GHz clock frequency.
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses how to avoid packet loss and deadlock when the Network-on-Chip (NoC) of a multi-core or many-core system experiences faults or power gating. It focuses on three issues:

1. **Deadlock**: When routers or links are disconnected, this can lead to circular dependencies among network resources and thus to network deadlock. These circular dependencies must be removed from the channel-dependency graph of each new topology.
2. **Performance**: Faults reduce the path diversity of the NoC and increase congestion on some of the remaining paths. Adaptive routing can alleviate part of the impact, but it usually has to remove some channel dependencies or prohibit some routing paths, which further reduces path diversity.
3. **Scalability**: As the number of cores in the system increases, the NoC grows with it. A distributed solution is therefore required, because centralized solutions do not scale.

To solve these problems, the paper proposes FASHION, a self-healing intelligent on-chip network architecture. Its main contributions include:

- **Distributed intelligence unit (Self-Awareness Module, SAM)**: Allows routers to automatically detect permanent component faults and to generate a network connectivity graph through a distributed spanning-tree search algorithm with a computational complexity of \(O(|L|)\), where \(|L|\) is the number of links in the network.
- **Hardware self-adjustment technique**: Guarantees connectivity and deadlock-free routing within the maximal connected subgraph, with a computational complexity of \(O(|R||L|)\), where \(|R|\) is the number of nodes in the network.
- **Bidirectional links and unified virtual channel structure**: Further enhance network connectivity and maintain a high degree of fault tolerance.
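The connectivity-map step described above can be illustrated with a small sketch. This is not the paper's distributed hardware algorithm: it is a centralized breadth-first search in Python, and the names `mesh_links` and `spanning_tree` are hypothetical helpers introduced only for this example. It assumes a 2D mesh whose faulty links have already been identified, and it builds a spanning tree of the component that contains the chosen root; nodes absent from the tree are the ones that would be dropped.

```python
from collections import deque


def mesh_links(width, height):
    """Return the set of undirected links of a width x height 2D mesh."""
    links = set()
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < height:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links


def spanning_tree(root, links):
    """BFS from root over the surviving links; returns {node: parent}."""
    neighbors = {}
    for link in links:
        a, b = tuple(link)
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in parent:  # first visit: record the tree edge
                parent[nxt] = node
                queue.append(nxt)
    return parent


# Illustrative fault pattern (assumed, not from the paper): both links of
# corner router (2, 2) in a 3x3 mesh have failed.
all_links = mesh_links(3, 3)
faulty = {frozenset({(1, 2), (2, 2)}), frozenset({(2, 1), (2, 2)})}
tree = spanning_tree((0, 0), all_links - faulty)
dropped = {(x, y) for x in range(3) for y in range(3)} - set(tree)
```

Each surviving link is examined a constant number of times, so the work grows linearly with the number of links. A distributed version would have every router exchange this information with its neighbors rather than run one global search.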
In addition, the FASHION architecture can be extended to NoC power gating. Its reconfiguration algorithm provides deadlock-free paths for any irregular topology, which strengthens existing NoC power-gating schemes. The paper verifies the advantages of the FASHION architecture in terms of computational complexity, scalability, and fault tolerance through experiments. In particular, according to the abstract, it drops 54.3-55.4% fewer nodes than other state-of-the-art solutions for the same number of faults (between 30 and 60) in an 8x8 2D mesh, while maintaining high network performance.
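The deadlock condition mentioned above is commonly stated in terms of the channel-dependency graph: for deterministic routing, an acyclic graph is sufficient for deadlock freedom. The sketch below is a generic cycle check in Python, not the paper's reconfiguration algorithm. The function `has_cycle` and the channel names `c0` to `c3` are assumptions made for illustration.

```python
def has_cycle(dependencies):
    """Return True if the directed graph {channel: [next channels]} has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    nodes = set(dependencies)
    for targets in dependencies.values():
        nodes.update(targets)
    color = {n: WHITE for n in nodes}
    for start in nodes:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(dependencies.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GRAY:  # back edge: a circular dependency
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(dependencies.get(nxt, ()))))
                    break
            else:
                color[node] = BLACK  # all successors explored
                stack.pop()
    return False


# Four channels around a ring, each waiting on the next (hypothetical example).
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
# Prohibiting the turn from c3 to c0 removes the circular dependency.
broken = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": []}
ring_deadlocks = has_cycle(ring)
broken_deadlocks = has_cycle(broken)
```

Prohibiting a turn breaks the cycle at the cost of one routing option, which is the path-diversity trade-off the summary describes.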