A Cost-Efficient Router Architecture for HPC Inter-Connection Networks: Design and Implementation

Yi Dai, Kai Lu, Liquan Xiao, Jinshu Su
DOI: https://doi.org/10.1109/tpds.2018.2873337
IF: 5.3
2018-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: High-radix routers with lower latency and higher bandwidth play an increasingly important role in constructing large-scale interconnection networks such as those used in supercomputers and datacenters. The tile-based crossbar approach partitions a single large crossbar into many small tiles and can considerably reduce the complexity of arbitration while providing higher throughput than the conventional switch implementation. However, it does not scale well because of power consumption, placement, and routing problems. Inspired by non-saturated throughput theory, this paper proposes a scalable router microarchitecture, termed Multiport Binding Tile-based Router (MBTR). By aggregating multiple physical ports into a single tile, a high-radix router can be flexibly organized into different tile arrays, so the number of tiles and the hardware overhead can be considerably reduced. For a radix-64 router, MBTR achieves up to a 50-75% reduction in memory consumption and wire area compared with a hierarchical switch. We theoretically deduce the necessary and sufficient conditions for the asymmetrical crossbar to achieve non-saturated relative 100 percent throughput. Based on this result, we analyze the MBTR throughput and derive the condition that the MBTR design parameters must satisfy to yield 100 percent throughput. We further discuss how to trade off MBTR parameters under the constraints of performance, power, and area. Simulation results show that MBTR is indistinguishable from the YARC router in terms of throughput and delay, and can even outperform it by reducing potential contention for output ports. We have fabricated a 36-port MBTR chip at 28 nm, providing 100 Gb/s bidirectional bandwidth per port, with a fall-through latency of just 30 ns. Internally it runs at 9.6 Tb/s, thus offering a speedup of 1.34x.
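The tile-reduction idea in the abstract can be illustrated with a small arithmetic sketch. This is not the authors' model: the function name `tile_count` and the assumption that the radix divides evenly by the number of ports bound per tile are hypothetical, chosen only to show how binding more ports per tile shrinks the tile array.

```python
def tile_count(radix: int, ports_per_tile: int) -> int:
    """Number of tiles when each tile binds `ports_per_tile` physical ports.

    Hypothetical helper: assumes every tile binds the same number of ports
    and that the radix is an exact multiple of that number.
    """
    if ports_per_tile <= 0:
        raise ValueError("ports_per_tile must be positive")
    if radix % ports_per_tile != 0:
        raise ValueError("radix must be divisible by ports_per_tile")
    return radix // ports_per_tile


if __name__ == "__main__":
    # One port per tile is the baseline; binding more ports means fewer tiles.
    for bound in (1, 2, 4, 8):
        print(f"radix 64, {bound} port(s) per tile: {tile_count(64, bound)} tiles")
```

The sketch covers only the tile count. The memory and wire-area savings reported in the abstract depend on buffer and wiring details that the abstract does not give.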