Abstract: This work focuses on an efficient Agile design methodology for domain-specific accelerators. We employ feature-by-feature enhancement of a vertical development stack and apply it to the TVM/VTA inference accelerator. We have enlarged the VTA design space and enabled end-to-end support for additional workloads. This has been accomplished by augmenting the VTA micro-architecture and instruction set architecture (ISA), as well as by enhancing the TVM compilation stack to support a wide range of VTA configurations. The Chisel-based VTA tsim implementation has been enhanced with fully pipelined versions of the ALU and GEMM execution units. In tsim, memory width can now range from 8 to 64 bytes. Field widths have been made more flexible to support larger scratchpads. New instructions have been added: element-wise 8-bit multiplication to support depthwise convolution, and load with a choice of pad values to support max pooling. Support for more layers and better double buffering has also been added. Fully pipelining the ALU and GEMM units helps significantly: running ResNet-18 under the default configuration takes 4.9x fewer cycles with minimal area change. Configurations that reduce cycle count by a further 11.5x, at a cost of 12x greater area, can also be instantiated. Many points on the area-performance Pareto curve are shown, showcasing the balance of execution unit sizing, memory interface width, and scratchpad sizing. Finally, VTA is now able to run MobileNet 1.0 and all layers of ResNets, including the previously disabled pooling and fully connected layers. The TVM/VTA architecture has always featured end-to-end workload evaluation on RTL in minutes. With our modifications, it now offers a much greater number of feasible configurations with a wide range of cost vs. performance. All capabilities mentioned are available in open-source forks, and a subset of these capabilities has already been upstreamed.
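To illustrate why a load with a selectable pad value matters for max pooling, the following is a minimal sketch in plain Python. It does not model the VTA load instruction or its encoding; the function `max_pool_1d`, its parameters, and the sample data are hypothetical and chosen only for illustration. The point it shows is that zero padding can corrupt the maximum when all activations in a window are negative, whereas padding with the smallest int8 value cannot.

```python
INT8_MIN = -128  # smallest signed 8-bit value; a neutral element for max()


def max_pool_1d(row, window, stride, pad, pad_value):
    """Max-pool a 1-D row of int8 values after padding both ends.

    Hypothetical helper for illustration only; not part of TVM or VTA.
    """
    padded = [pad_value] * pad + list(row) + [pad_value] * pad
    return [
        max(padded[i:i + window])
        for i in range(0, len(padded) - window + 1, stride)
    ]


# All-negative activations, as can occur before a ReLU or after quantization.
row = [-5, -3, -7, -2]

# Padding with 0 injects a value larger than any real input at the borders.
zero_padded = max_pool_1d(row, window=3, stride=1, pad=1, pad_value=0)

# Padding with INT8_MIN never wins the max, so border results stay correct.
min_padded = max_pool_1d(row, window=3, stride=1, pad=1, pad_value=INT8_MIN)

print(zero_padded)  # [0, -3, -2, 0]
print(min_padded)   # [-3, -3, -2, -2]
```

With zero padding, the two border outputs become 0 even though no input element is 0; with the minimum-value padding, every output is a true maximum over real inputs.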
