Abstract:In mobile and edge devices, always-on keyword spotting (KWS) is an essential function to detect wake-up words. Recent works achieved extremely low power dissipation down to $\sim500$ nW [1]. However, most of them adopt noise-dependent training, i.e. training for a specific signal-to-noise ratio (SNR) and noise type [1], and therefore their accuracies degrade for different SNR levels and noise types that are not targeted in the training (Fig. 9.9.1, top left). To improve robustness, so-called noise-independent training can be considered, which is to use the training data that includes all the possible SNR levels and noise types [2]. But, this approach is challenging for an ultra-low-power device since it demands a large neural network to learn all the possible features. A neural network of a fixed size has its own memory capacity limit and reaches a plateau in accuracy if it has to learn more than its limit (Fig. 9.9.1, top right). On the other hand, it is known that biological acoustic systems employ a simpler process, called divisive energy normalization (DN), to maintain accuracy even in varying noise conditions [3]. In this work, therefore, by adopting such a DN, we prototype a normalized acoustic feature extractor chip (NAFE) in 65nm. The NAFE can take an acoustic signal from a microphone and produce spike-rate coded features. We pair NAFE with a spiking neural network (SNN) classifier chip [4], creating the end-to-end KWS system. The proposed system achieves 89-to-94% accuracy across -5 to 20dB SNRs and four different noise types on HeySnips [5], while the baseline without DN achieves a much lower accuracy of 71-87%. NAFE consumes up to 109nW and the KWS system 570nW.

An Energy-Efficient Binarized Neural Network Using Analog-Intensive Feature Extraction for Keyword and Speaker Verification Wakeup.

Sub-mW Keyword Spotting on an MCU: Analog Binary Feature Extraction and Binary Neural Networks

A 510-nW Wake-Up Keyword-Spotting Chip Using Serial-FFT-Based MFCC and Binarized Depthwise Separable CNN in 28-nm CMOS

A Background-Noise and Process-Variation-Tolerant 109nW Acoustic Feature Extractor Based on Spike-Domain Divisive-Energy Normalization for an Always-On Keyword Spotting Device

NS-FDN: Near-Sensor Processing Architecture of Feature-Configurable Distributed Network for Beyond-Real-Time Always-on Keyword Spotting

Optimization and evaluation of energy-efficient mixed-signal MFCC feature extraction architecture

MSP-MFCC: Energy-Efficient MFCC Feature Extraction Method With Mixed-Signal Processing Architecture for Wearable Speech Recognition Applications

Towards Energy-Preserving Natural Language Understanding with Spiking Neural Networks

Low Bits: Binary Neural Network for Vad and Wakeup

Energy-efficient MFCC Extraction Architecture in Mixed-Signal Domain for Automatic Speech Recognition

Adder Neural Networks for Speaker Verification

An Ultra-Low Power Binarized Convolutional Neural Network-Based Speech Recognition Processor with On-Chip Self-Learning.

9.1 μW keyword spotting processor based on optimized MFCC and small‐footprint TENet in 28‐nm CMOS

Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting

NS-KWS: joint optimization of near-sensor processing architecture and low-precision GRU for always-on keyword spotting

More is Less: Domain-Specific Speech Recognition Microprocessor Using One-Dimensional Convolutional Recurrent Neural Network

Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification

End-to-End Feature Learning for Text-Independent Speaker Verification

A Low-Power Keyword Spotting System with High-Order Passive Switched-Capacitor Bandpass Filters for Analog-MFCC Feature Extraction

FPGA Implementation of Keyword Spotting System Using Depthwise Separable Binarized and Ternarized Neural Networks

DeltaKWS: A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM