Abstract: The Massimult project aims to design and implement an innovative CPU architecture based on combinator reduction with a novel combinator base and a new abstract machine. The evaluation of programs within this architecture is inherently highly parallel and localized, allowing for faster computation, reduced energy consumption, improved scalability, enhanced reliability, and increased resistance to attacks. In this paper, we introduce the machine language LambdaM, detail its compilation into KVY assembler code, and describe the abstract machine Matrima. The best part of Matrima is its ability to exploit inherent parallelism and locality in combinator reduction, leading to significantly faster computations with lower energy consumption, scalability across multiple processors, and enhanced security against various types of attacks. Matrima can be simulated as a software virtual machine and is intended for future hardware implementation.

What problem does this paper attempt to address?

This paper sets out to design and implement a new parallel CPU architecture based on combinator reduction. Specifically, the Massimult project aims to improve on existing computing architectures by introducing a novel combinator base and a new abstract machine. The main problems and goals mentioned in the paper are:

1. **Improve parallelism and locality**:
   - The von Neumann architecture relies on sequential execution of operations. Modern CPUs have multiple cores to support parallel processing, but software must be specifically designed to make full use of them. An architecture based on combinator reduction, by contrast, can provide inherent parallel execution without programmers having to consider parallelization explicitly.
   - Computation in the combinator reduction model can be regarded as a tree whose branches can all evolve concurrently until they reach a completed state. This model allows for more efficient parallel computing.
2. **Reduce energy consumption**:
   - Since combinator reduction is based on local operations, the chip can selectively activate only the parts that are actually required, at a very fine-grained level, thereby significantly reducing energy consumption.
   - The inherent parallelism also reduces the need to maximize the clock frequency, saving further energy.
3. **Enhance scalability and reliability**:
   - The new architecture allows computation to be scaled across multiple processors, improving the system's scalability.
   - Parallelism and locality enhance the system's reliability and its ability to resist various types of attacks.
4. **Unify the processing of programs and data**:
   - In graph reduction, programs and data are processed uniformly, unlike in the von Neumann architecture, where code and data are stored and processed separately. This reduces the need for global read-write operations and, in turn, the complexity of the cache mechanism.
5. **Advantages of theoretical basis and programming paradigm**:
   - Combinator reduction is based on lambda calculus and combinatory logic, theoretical models that are closely related to the functional programming paradigm. Functional programming avoids side effects and mutable state, so it is better suited to parallel execution and makes more efficient use of computing resources.
   - The strong theoretical basis of functional programming helps in reasoning about program behavior and supports tools for automatic analysis and verification of code, which in turn supports the development of high-quality, error-free, safe, and reliable software.
6. **Hardware implementation and optimization**:
   - The paper introduces the LambdaM machine language and its compilation into KVY assembly code, and describes the design and functions of the Matrima abstract machine. Matrima can be simulated in software as a virtual machine and is planned for hardware implementation in the future.
   - The project combines in-depth research with strong engineering capabilities: it defines new combinator codes and develops new reduction mechanisms for parallel and speculative evaluation. The next goal is to achieve competitive performance through GPU simulation and to better understand code optimization in speculative evaluation.

In summary, this paper aims to address the deficiencies of existing computing architectures in parallelism, energy consumption, scalability, and security through an innovative CPU architecture design, and to promote the development of functional programming languages and combinator reduction techniques.
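The idea of a computation tree whose branches reduce independently can be illustrated with a small sketch. It does not use the paper's KVY combinator base or the Matrima machine, whose details are not given here; it assumes the classic S, K, I combinators instead, and every name in it (`App`, `app`, `step_root`, `parallel_step`, `normalize`) is hypothetical. Each pass is simulated sequentially, but the two recursive calls in `parallel_step` touch disjoint subtrees, which is what would let them run concurrently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class App:
    """Application node: fn applied to arg. Leaves are plain strings."""
    fn: "Term"
    arg: "Term"


Term = Union[str, App]


def app(*terms):
    """Left-associative application: app(f, a, b) builds ((f a) b)."""
    result = terms[0]
    for t in terms[1:]:
        result = App(result, t)
    return result


def step_root(t):
    """Apply one SKI rule at the root; return None if no rule matches."""
    if isinstance(t, App):
        # I x -> x
        if t.fn == "I":
            return t.arg
        if isinstance(t.fn, App):
            # K x y -> x
            if t.fn.fn == "K":
                return t.fn.arg
            # S f g x -> (f x) (g x)
            if isinstance(t.fn.fn, App) and t.fn.fn.fn == "S":
                f, g, x = t.fn.fn.arg, t.fn.arg, t.arg
                return App(App(f, x), App(g, x))
    return None


def parallel_step(t):
    """One pass: rewrite the root if it is a redex, else reduce both branches.

    Returns the new term and the number of rewrites fired in this pass.
    The two recursive calls share no state, so they could run in parallel.
    """
    reduced = step_root(t)
    if reduced is not None:
        return reduced, 1
    if isinstance(t, App):
        fn, fired_fn = parallel_step(t.fn)
        arg, fired_arg = parallel_step(t.arg)
        return App(fn, arg), fired_fn + fired_arg
    return t, 0


def normalize(t, max_passes=100):
    """Repeat passes until nothing fires; max_passes guards against divergence."""
    passes = 0
    while passes < max_passes:
        t, fired = parallel_step(t)
        if fired == 0:
            break
        passes += 1
    return t, passes


if __name__ == "__main__":
    # Two independent redexes, (I a) and (K b c), fire in the same pass.
    term = app("pair", app("I", "a"), app("K", "b", "c"))
    print(parallel_step(term))
    # S K K behaves like the identity: it reduces to x in two passes.
    print(normalize(app("S", "K", "K", "x")))
```

Because this sketch reduces arguments eagerly, it can do work that a lazy strategy would skip, and it may fail to terminate on terms that a normal-order strategy would normalize; the `max_passes` bound is a crude guard against that. This is only loosely analogous to the speculative evaluation the summary mentions.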

Massimult: A Novel Parallel CPU Architecture Based on Combinator Reduction
