Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-Core Computing Clusters for Reliable Processing in Space

Michael Rogenmoser, Yvan Tortorella, Davide Rossi, Francesco Conti, Luca Benini

DOI: https://doi.org/10.1145/3635161

2023-11-14

Abstract: Space Cyber-Physical Systems (S-CPS) such as spacecraft and satellites strongly rely on the reliability of onboard computers to guarantee the success of their missions. Relying solely on radiation-hardened technologies is extremely expensive, and developing inflexible architectural and microarchitectural modifications to introduce modular redundancy within a system leads to significant area increase and performance degradation. To mitigate the overheads of traditional radiation hardening and modular redundancy approaches, we present a novel Hybrid Modular Redundancy (HMR) approach, a redundancy scheme that features a cluster of RISC-V processors with a flexible on-demand dual-core and triple-core lockstep grouping of computing cores with runtime split-lock capabilities. Further, we propose two recovery approaches, software-based and hardware-based, trading off performance and area overhead. Running at 430 MHz, our fault-tolerant cluster achieves up to 1160 MOPS on a matrix multiplication benchmark when configured in non-redundant mode and 617 and 414 MOPS in dual and triple mode, respectively. A software-based recovery in triple mode requires 363 clock cycles and occupies 0.612 mm², representing a 1.3% area overhead over a non-redundant 12-core RISC-V cluster. As a high-performance alternative, a new hardware-based method provides rapid fault recovery in just 24 clock cycles and occupies 0.660 mm², namely ~9.4% area overhead over the baseline non-redundant RISC-V cluster. The cluster is also enhanced with split-lock capabilities to enter one of the redundant modes with minimum performance loss, allowing execution of a mission-critical or a performance section, with <400 clock cycles overhead for entry and exit. The proposed system is the first to integrate these functionalities on an open-source RISC-V-based compute device, enabling finely tunable reliability vs. performance trade-offs.
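The figures quoted in the abstract can be put on a common scale with a little arithmetic. The sketch below is only an illustration: it takes the clock frequency, MOPS values, and recovery cycle counts directly from the abstract, while the function and dictionary names are hypothetical and not taken from the paper.

```python
# Figures quoted in the abstract; all names below are illustrative only.
CLOCK_HZ = 430e6  # cluster clock frequency (430 MHz)
MOPS = {"non_redundant": 1160, "dual": 617, "triple": 414}
RECOVERY_CYCLES = {"software": 363, "hardware": 24}


def relative_throughput(mode: str) -> float:
    """Throughput of a mode as a fraction of the non-redundant mode."""
    return MOPS[mode] / MOPS["non_redundant"]


def cycles_to_seconds(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a clock-cycle count into wall-clock time at the given frequency."""
    return cycles / clock_hz


if __name__ == "__main__":
    for mode in ("dual", "triple"):
        print(f"{mode}: {relative_throughput(mode):.1%} of non-redundant throughput")
    for method, cycles in RECOVERY_CYCLES.items():
        print(f"{method} recovery: {cycles_to_seconds(cycles) * 1e9:.1f} ns")
```

Under these numbers, dual mode retains roughly 53% and triple mode roughly 36% of the non-redundant throughput, and the two recovery paths take about 844 ns (software) and about 56 ns (hardware) at 430 MHz.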

Systems and Control, Hardware Architecture

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the reliability of computing systems in the space environment, in particular radiation-induced soft errors in space cyber-physical systems (S-CPS) such as spacecraft and satellites. Specifically:

1. **High cost and performance degradation**: Traditional radiation-hardening techniques (such as Radiation Hardening by Design, RHBD) improve system reliability, but they are very expensive and lead to a significant area increase and performance degradation.
2. **Limitations of traditional redundancy methods**: Existing modular redundancy methods (such as Dual Modular Redundancy, DMR, and Triple Modular Redundancy, TMR) improve fault tolerance, but they usually adopt a fixed redundancy scheme, and replicating execution in space or time severely affects performance and power consumption.
3. **The need for flexibility and configurability**: To serve different workloads (such as high-performance computing tasks and critical tasks), a flexible and configurable redundancy scheme is required, one that can adjust the redundancy mode dynamically at runtime to balance reliability and performance.

To this end, the paper proposes a novel Hybrid Modular Redundancy (HMR) method based on a RISC-V multi-core computing cluster, with the following characteristics:

- **Flexible dual-core and triple-core lockstep grouping**: It supports Dual-Core Lock-Step (DCLS) and Triple-Core Lock-Step (TCLS) modes, configured on demand.
- **Fast fault-recovery mechanism**: Two recovery methods, software-based and hardware-assisted, are proposed to quickly restore the system state after a fault.
- **Runtime-programmable split-lock mechanism**: It allows rapid switching between mission-critical code sections and high-performance code sections, minimizing configuration overhead.

Through these innovations, the HMR method can provide higher performance and lower resource cost than fixed redundancy schemes while preserving reliability, thereby better meeting the needs of space computing systems.
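The difference between the two lockstep modes described above can be illustrated with a generic comparison and voting sketch. This is not the paper's hardware implementation, which compares core outputs in logic rather than in software. It is a minimal model of the general DMR/TMR principle, and the names `dcls_check` and `tcls_vote` are hypothetical.

```python
from typing import Optional, Tuple


def dcls_check(a: int, b: int) -> bool:
    """Dual lockstep: a mismatch can be detected, but not attributed to a core."""
    return a == b


def tcls_vote(outputs: Tuple[int, int, int]) -> Tuple[int, Optional[int]]:
    """Triple lockstep: majority vote over three core outputs.

    Returns (voted_value, index_of_disagreeing_core), with None as the index
    when all three agree. Raises ValueError when no two outputs match.
    """
    a, b, c = outputs
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    raise ValueError("no majority: all three outputs differ")


if __name__ == "__main__":
    print(dcls_check(0x1234, 0x1234))          # outputs agree
    print(dcls_check(0x1234, 0x1235))          # mismatch detected, culprit unknown
    print(tcls_vote((0x1234, 0x1235, 0x1234)))  # core 1 outvoted
```

The sketch shows why triple mode can mask a single faulty core and identify it for recovery, whereas dual mode can only detect that an error occurred.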

On-Demand Redundancy Grouping: Selectable Soft-Error Tolerance for a Multicore Cluster

Hybrid Hardening Approach for a Fault-Tolerant RISC-V System-On-Chip

A RISC-V Fault-Tolerant Soft-Processor Based on Full/Partial Heterogeneous Dual-Core Protection

A New Task Model for COTS-based N-modular Redundant Systems

High Reliability Computer Platform Using Quadruple Modular Redundancy

Enhancing Fault Awareness and Reliability of a Fault-Tolerant RISC-V System-on-Chip

Feedback-Based Low-Power Soft-Error-Tolerant Design for Dual-Modular Redundancy

Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters

Trikarenos: A Fault-Tolerant RISC-V-based Microcontroller for CubeSats in 28nm

Culsans: An Efficient Snoop-based Coherency Unit for the CVA6 Open Source RISC-V application processor

A Redundancy Mechanism under Single Chip Multiprocessor Architecture

Analysis of Redundancy Techniques for Electronics Design—Case Study of Digital Image Processing

Optmr: Optimal Data Flow Graph Partitioning for Triple Modular Redundancy Against Hardware Trojan in Reconfigurable Hardware

Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

Design and Experimental Investigation of Trikarenos: A Fault-Tolerant 28nm RISC-V-based SoC

Implementation of a Reconfigurable Computing System for Space Applications

Elzar: Triple Modular Redundancy using Intel Advanced Vector Extensions (technical report)

Neutron Irradiation Testing and Analysis of a Fault-Tolerant RISC-V System-on-Chip

Software-Based Fault Recovery via Adaptive Diversity for COTS Multi-Core Processors