LockillerTM: Enhancing Performance Lower Bounds in Best-Effort Hardware Transactional Memory

Li Wan, Fu Chao, Qiang Li, Jun Han
DOI: https://doi.org/10.1109/ipdps57955.2024.00081
2024-01-01
Abstract: Concurrent access to shared data has long been a challenge in developing multi-threaded programs and a performance bottleneck in Chip-Multiprocessor (CMP) systems. The challenge has been exacerbated by the need to scale up processor cores and network bandwidth to meet the low-latency demands of ever-expanding data processing. Existing commercial best-effort Hardware Transactional Memory (HTM) is a common and effective solution. However, its architectural constraints prevent transactions from surviving exceptions or cache overflows and from coexisting with a non-speculative fallback path, leading to unstable performance and waning popularity. In this paper, we propose three lightweight mechanisms that mitigate these limitations of the best-effort HTM architecture and improve performance stability. The first is the recovery mechanism, which supports the dynamic revocation of toxic conflicting requests and dramatically reduces the likelihood of livelocks. The second is the HTMLock mechanism, a hardware and software co-design that allows transactions using HTM and those using locks to run concurrently unless they encounter an actual conflict. The third is the switchingMode mechanism, which enables a running transaction to proactively attempt to switch to HTMLock mode in the event of a non-conflict-induced abort. We extend the Gem5 infrastructure to validate and evaluate our mechanisms in a 32-core tiled CMP system. Experimental studies show that LockillerTM outperforms the coarse-grained locking scheme on the STAMP benchmarks, except for the yada workload, irrespective of thread count and cache size. Furthermore, compared to best-effort HTM and state-of-the-art HTM, our approach achieves average speedups of 1.86x and 1.57x, respectively, across all benchmarks and thread counts under a typical cache size, and maximum speedups of 7.79x and 6.73x, respectively, in high-contention benchmarks under an extreme scenario with only an 8KB L1 cache and 32 threads.
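
For readers unfamiliar with the fallback path mentioned in the abstract, the following is a minimal software sketch of the conventional best-effort HTM usage pattern: retry a transaction speculatively a bounded number of times, then serialize on a global lock. It illustrates the baseline that the paper sets out to improve, not LockillerTM's own mechanisms. The names `run_transaction`, `try_speculative`, `fallback_lock`, and `max_retries` are hypothetical and chosen for illustration only; real hardware transactions cannot be expressed in Python, so the speculative attempt is modelled as a callback.

```python
import threading

# Hypothetical single global lock guarding the non-speculative fallback path.
fallback_lock = threading.Lock()


def run_transaction(body, try_speculative, max_retries=3):
    """Run `body` speculatively, falling back to a global lock after repeated aborts.

    `try_speculative(body)` is an assumed stand-in for one hardware
    transaction attempt: it returns True if the attempt committed
    and False if it aborted.
    Returns a (path, attempts) tuple describing how the body completed.
    """
    for attempt in range(1, max_retries + 1):
        if try_speculative(body):
            return ("htm", attempt)
    # All speculative attempts aborted: execute under the lock instead.
    with fallback_lock:
        body()
    return ("lock", max_retries)


# Usage: one simulated attempt that commits, one that always aborts.
counter = {"n": 0}


def increment():
    counter["n"] += 1


def commits(body):
    body()
    return True


def always_aborts(body):
    return False


first = run_transaction(increment, commits)
second = run_transaction(increment, always_aborts)
```

In this baseline, a workload whose transactions keep aborting ends up serialized on the lock, which is the kind of performance instability the abstract attributes to best-effort HTM.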