Abstract: As the demand for neural network operations on edge devices increases, energy-efficient neural network inference solutions become necessary. To this end, this paper proposes a compact 4-bit number format (SD4) for neural network weights. In addition to significantly reducing the amount of neural network data transmission, SD4 also reduces the neural network convolution operation from multiplication and addition (MAC) to only addition. MNIST and CIFAR-10 CNNs with SD4 weights achieve results similar to their FP32-trained counterparts. The difference between the top-1 accuracy of the 4-bit ResNet CNN for ImageNet and the baseline FP32 CNN is less than 0.5%. In the hardware design, we have implemented a multiplier-less convolution acceleration circuit. Compared with the 8-bit weight circuit, the power consumption and area of a 4-bit 3 × 3 convolution circuit are reduced by nearly 50%. This work also proposes a systematic CNN deployment solution consisting of software CNN training and hardware acceleration. The proposed FPGA-based accelerator for VGG7 image classification achieves a peak throughput of 345.6 GOPS when running at a 100-MHz clock rate. The proposed convolution accelerator's power consumption and energy efficiency are 1.19 W and 289.5 GOPS/W, respectively. Compared to the CPU implementation of VGG7-128 inference, the multiplier-less acceleration circuit is 4.8 times faster and achieves 384 times higher energy efficiency.
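
The claim that convolution can be reduced from MAC to addition only can be illustrated with a minimal sketch. The abstract does not give the SD4 encoding, so the sketch below swaps in a simpler, hypothetical scheme: each weight is assumed to be 0 or a signed power of two, stored as a `(sign, shift)` pair. The function names and the sample values are illustrative assumptions, not the paper's design.

```python
# Hypothetical illustration only: weights are assumed to be 0 or +/- 2**shift.
# This is NOT the SD4 format, whose encoding the abstract does not specify.

def shift_add_dot(activations, weights):
    """Dot product using only shifts, additions, and subtractions.

    activations: list of ints
    weights: list of (sign, shift) pairs meaning sign * 2**shift,
             with sign in {-1, 0, +1}
    """
    acc = 0
    for a, (sign, shift) in zip(activations, weights):
        if sign == 0:
            continue              # zero weight contributes nothing
        term = a << shift         # a * 2**shift without a multiplier
        acc = acc + term if sign > 0 else acc - term
    return acc


def reference_dot(activations, weights):
    """Same dot product computed with ordinary multiplication, for checking."""
    return sum(a * sign * (2 ** shift) for a, (sign, shift) in zip(activations, weights))


# One flattened 3 x 3 input patch and one flattened 3 x 3 kernel (made-up values).
acts = [3, 5, 2, 7, 1, 4, 6, 8, 9]
wts = [(1, 0), (-1, 1), (0, 0), (1, 2), (-1, 0), (1, 1), (0, 0), (-1, 2), (1, 0)]

result = shift_add_dot(acts, wts)
print(result, result == reference_dot(acts, wts))
```

In hardware, a fixed shift is just wiring, so under this assumed scheme each kernel tap costs one adder input rather than a multiplier, which is the general idea behind a multiplier-less convolution circuit.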

A Multiplier-Less Convolutional Neural Network Inference Accelerator for Intelligent Edge Devices