Abstract: Massive multiuser (MU) multiple-input multiple-output (MIMO) enables concurrent transmission of multiple users to a multi-antenna basestation (BS). To detect the users' data using linear equalization, the BS must perform preprocessing, which requires, among other tasks, the inversion of a matrix whose dimension equals the number of user data streams. Explicit inversion of large matrices is notoriously difficult to implement due to high complexity, stringent data dependencies that lead to high latency, and high numerical precision requirements. We propose a novel preprocessing architecture based on the block-LDL matrix factorization, which improves parallelism and, hence, reduces latency. We demonstrate the effectiveness of our architecture through (i) massive MU-MIMO system simulations with mmWave channel vectors and (ii) measurements of a 22FDX ASIC, which is, to our knowledge, the first fabricated preprocessing engine for massive MU-MIMO with 64 BS antennas and 16 single-antenna users. Our ASIC reaches a clock frequency of 870 MHz while consuming 416 mW. At its peak throughput, the ASIC preprocesses 1.44 M 64$\times$16 matrices per second at a latency of only 0.7 $\mu$s.
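The figures quoted in the abstract can be combined into a few derived quantities. The sketch below is only back-of-the-envelope arithmetic: it assumes that the 416 mW power, the 870 MHz clock, and the 1.44 M matrices per second all refer to the same operating point, which the abstract does not state explicitly. The function name `derived_metrics` is hypothetical.

```python
# Figures quoted in the abstract.
CLOCK_HZ = 870e6               # clock frequency
POWER_W = 416e-3               # power consumption
THROUGHPUT_MAT_PER_S = 1.44e6  # 64x16 matrices preprocessed per second
LATENCY_S = 0.7e-6             # preprocessing latency


def derived_metrics():
    """Derive per-matrix quantities from the quoted figures.

    Assumption: all four figures describe the same operating point.
    """
    return {
        # Energy spent per preprocessed matrix, in nanojoules.
        "energy_per_matrix_nj": POWER_W / THROUGHPUT_MAT_PER_S * 1e9,
        # Average number of clock cycles between two finished matrices.
        "cycles_per_matrix": CLOCK_HZ / THROUGHPUT_MAT_PER_S,
        # Latency expressed in clock cycles.
        "latency_cycles": LATENCY_S * CLOCK_HZ,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.1f}")
```

Under that assumption, the numbers work out to roughly 289 nJ per matrix, about 604 clock cycles between finished matrices, and a latency of 609 clock cycles.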

What problem does this paper attempt to address?

The paper addresses how to efficiently implement the preprocessing step of linear minimum mean-square error (LMMSE) data detection in massive multiuser (MU) multiple-input multiple-output (MIMO) systems. Specifically, it proposes a new preprocessing architecture that reduces the complexity and latency of matrix preprocessing, so as to meet the throughput and latency requirements of modern wireless communication systems.

### Specific Background of the Problem

1. **Challenges in Massive MU-MIMO Systems**:
   - Massive MU-MIMO systems allow multiple users to transmit data to a multi-antenna base station (BS) simultaneously.
   - To detect the users' signals, the BS must perform preprocessing, which includes inverting a matrix whose dimension equals the number of user data streams.
   - Explicit matrix inversion becomes rapidly more expensive as the number of users grows, which leads to high complexity, high latency (due to stringent data dependencies), and strict numerical precision requirements.
2. **Limitations of Existing Methods**:
   - Existing hardware implementations (based on, for example, Cholesky, LU, LDL, or QR decomposition) can perform the matrix inversion, but still suffer from excessive complexity and latency in massive systems.
   - Approximate methods reduce complexity, but sacrifice bit-error-rate performance and perform poorly under certain conditions (for example, when the number of BS antennas is not much larger than the number of users).

### The Solution in the Paper

The paper proposes a preprocessing architecture based on the block-LDL (BLDL) matrix decomposition. Its main features are:

- **Improving Parallelism**: The BLDL decomposition operates on blocks of matrix entries rather than on single entries, so multiple data items can be processed in parallel, which reduces latency.
- **Avoiding Forward Substitution**: The architecture skips the forward-substitution step, which further reduces complexity.
- **Sharing Hardware Resources**: Hardware resources are shared to reduce silicon area.

### Experimental Verification

The paper verifies the effectiveness of the proposed architecture in two ways:

- **System Simulation**: Massive MU-MIMO system simulations with millimeter-wave (mmWave) channel vectors.
- **ASIC Measurement**: Measurements of a fabricated 22FDX ASIC. The chip runs at a clock frequency of 870 MHz, consumes 416 mW, preprocesses 1.44 M 64×16 matrices per second, and has a latency of only 0.7 µs.

In conclusion, the paper targets the complexity and latency of the preprocessing step in massive MU-MIMO systems. Its BLDL-based architecture achieves higher throughput and lower latency, and thereby meets the requirements of modern wireless communication systems.
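To make the block-LDL idea concrete, the following is a minimal floating-point sketch in NumPy, not the paper's architecture. It factors the regularized Gram matrix `A = H^H H + n0*I` as `A = L D L^H`, with `L` block unit lower triangular and `D` block diagonal, and then uses the factors to compute an LMMSE estimate. The block size `bs=4`, the noise variance `n0`, and the function names `block_ldl` and `lmmse_estimate` are assumptions made for illustration. The sketch still performs forward substitution, which the paper avoids, and it does not model fixed-point arithmetic or hardware sharing.

```python
import numpy as np


def block_ldl(A, bs):
    """Factor a Hermitian positive-definite A as L @ D @ L^H using bs x bs blocks."""
    n = A.shape[0]
    assert n % bs == 0, "matrix size must be a multiple of the block size"
    L = np.eye(n, dtype=complex)       # block unit lower triangular factor
    D = np.zeros((n, n), dtype=complex)  # block diagonal factor
    S = A.astype(complex)              # working copy (running Schur complement)
    for j in range(0, n, bs):
        jj = slice(j, j + bs)          # current diagonal block
        rest = slice(j + bs, n)        # rows/columns below and right of it
        D[jj, jj] = S[jj, jj]
        # Only a small bs x bs inverse is needed per step.
        D_inv = np.linalg.inv(D[jj, jj])
        # All blocks of this block column are independent of each other.
        L[rest, jj] = S[rest, jj] @ D_inv
        # Schur-complement update of the trailing submatrix.
        S[rest, rest] -= L[rest, jj] @ D[jj, jj] @ L[rest, jj].conj().T
    return L, D


def lmmse_estimate(H, y, n0, bs=4):
    """LMMSE estimate (H^H H + n0 I)^{-1} H^H y computed via block-LDL factors."""
    U = H.shape[1]
    A = H.conj().T @ H + n0 * np.eye(U)
    L, D = block_ldl(A, bs)
    # A generic solver stands in for dedicated triangular substitution here.
    z = np.linalg.solve(L, H.conj().T @ y)   # forward substitution
    w = np.linalg.solve(D, z)                # block-diagonal solve
    return np.linalg.solve(L.conj().T, w)    # backward substitution


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 64 BS antennas, 16 users, i.i.d. Gaussian channel (an assumption; not mmWave).
    H = (rng.standard_normal((64, 16)) + 1j * rng.standard_normal((64, 16))) / np.sqrt(2)
    y = rng.standard_normal(64) + 1j * rng.standard_normal(64)
    x_bldl = lmmse_estimate(H, y, n0=0.1)
    x_ref = np.linalg.solve(H.conj().T @ H + 0.1 * np.eye(16), H.conj().T @ y)
    print("matches direct solve:", np.allclose(x_bldl, x_ref))
```

With a block size of 4, a 16×16 matrix needs only four sequential factorization steps instead of sixteen, and the work inside each step consists of small matrix products that can run in parallel. This is the general reason a block factorization shortens the dependency chain; the block size actually used in the ASIC is not given in the text above.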

A 1.2 mm$^2$ 416 mW 1.44 Mmat/s 64$\times$16 Matrix Preprocessing ASIC for Massive MIMO in 22FDX

A 1.58 Gbps/W 0.40 Gbps/mm$^2$ ASIC Implementation of MMSE Detection for $128\times 8$ 64-QAM Massive MIMO in 65 nm CMOS

A Jammer-Mitigating 267 Mb/s 3.78 mm$^2$ 583 mW 32$\times$8 Multi-User MIMO Receiver in 22FDX

A 46 Gbps 12 pJ/b Sparsity-Adaptive Beamspace Equalizer for mmWave Massive MIMO in 22FDX

Parallel Photonic Acceleration Processor for Matrix-Matrix Multiplication

Low-Computing-Load, High-Parallelism Detection Method Based on Chebyshev Iteration for Massive MIMO Systems with VLSI Architecture

Transceiver Design in Millimeter Wave Full-Duplex Multi-User Massive MIMO Communication Systems

A 2.92-Gb/s/W and 0.43-Gb/s/MG Flexible and Scalable CGRA-Based Baseband Processor for Massive MIMO Detection

Near-Optimal Hybrid Processing for Massive MIMO Systems via Matrix Decomposition

Large-Scale MIMO Detection for 3GPP LTE: Algorithms and FPGA Implementations

A 28-GHz Beam-Space MIMO RX With Spatial Filtering and Frequency-Division Multiplexing-Based Single-Wire IF Interface

Intelligent Surface-Aided Transmitter Architectures for Millimeter Wave Ultra Massive MIMO Systems

On the Low-Complexity, Hardware-Friendly Tridiagonal Matrix Inversion for Correlated Massive MIMO Systems

Finite-Precision Arithmetic Transceiver for Massive MIMO Systems

A Parallel Early-Pruned K-Best MIMO Signal Detector Up to 1.9 Gb/s

High Throughput MIMO-OFDM Detection with Graphics Processing Units

A 128/256-point pipeline FFT/IFFT processor for MIMO OFDM system IEEE 802.16e

Energy-Efficient Multi-Antenna Hybrid Block Diagonalization Precoding and Combining for MmWave Massive Multi-User MIMO Systems

Efficient DSP and Circuit Architectures for Massive MIMO: State-of-the-Art and Future Directions

A 0.58-mm$^2$ 2.76-Gb/s 79.8-pJ/b 256-QAM Message-Passing Detector for a 128 × 32 Massive MIMO Uplink System

High-Throughput Accelerator for Exact-MMSE Soft-Output Detection in Open RAN Systems