TRIX: Low-Skew Pulse Propagation for Fault-Tolerant Hardware

Christoph Lenzen,Ben Wiederhake
DOI: https://doi.org/10.48550/arXiv.2010.01415
2020-10-04
Abstract:The vast majority of hardware architectures use a carefully timed reference signal to clock their computational logic. However, standard distribution solutions are not fault-tolerant. In this work, we present a simple grid structure as a more reliable clock propagation method and study it by means of simulation experiments. Fault-tolerance is achieved by forwarding clock pulses on arrival of the second of three incoming signals from the previous layer. A key question is how well neighboring grid nodes are synchronized, even without faults. Analyzing the clock skew under typical-case conditions is highly challenging. Because the forwarding mechanism involves taking the median, standard probabilistic tools fail, even when modeling link delays just by unbiased coin flips. Our statistical approach provides substantial evidence that this system performs surprisingly well. Specifically, in an "infinitely wide" grid of height~$H$, the delay at a pre-selected node exhibits a standard deviation of $O(H^{1/4})$ ($\approx 2.7$ link delay uncertainties for $H=2000$) and skew between adjacent nodes of $o(\log \log H)$ ($\approx 0.77$ link delay uncertainties for $H=2000$). We conclude that the proposed system is a very promising clock distribution method. This leads to the open problem of a stochastic explanation of the tight concentration of delays and skews. More generally, we believe that understanding our very simple abstraction of the system is of mathematical interest in its own right.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?