Keyword Spotting Using the MAX78000

Abstract

Audio assistants have become very popular, with a range of applications from household to automotive and industrial products and IoT. Such devices constantly listen to their surroundings and wake up on pretrained keywords to execute certain commands. Power consumption is a key factor for many such resource-constrained edge applications, where connectivity to the cloud for processing raw data is not feasible. The MAX78000 is a new breed of Artificial Intelligence (AI) microcontroller built to enable neural networks to execute at ultra-low power and live at the edge of the IoT. In this document, we showcase the implementation of a keyword spotting application on the MAX78000. The machine learning model is built with Maxim’s development flow on PyTorch, trained with a subset of Google’s speech command dataset with 20 keywords, and deployed on the MAX78000EVKIT.

Introduction

The use of digital assistants powered by voice-activated user interfaces has increased drastically in recent years. While some products rely heavily on cloud connectivity to execute speech recognition algorithms and natural language processing on powerful remote servers, it is not feasible for low-power devices to constantly stream audio to the cloud for processing. In particular, the detection of wake-up keywords and of a limited set of command words is expected to be completed locally to optimize power consumption and reduce latency in IoT and edge applications. Such applications are efficiently covered by the MAX78000, an ultra-low power microcontroller with a convolutional neural network (CNN) accelerator.

CNNs are very popular in modeling acoustic systems, especially keyword detection. CNNs, like regular neural networks, are constructed as a series of neurons with weights and biases followed by a nonlinearity. However, a convolutional layer looks at only a local area covering a fraction of the previous layer's output neurons at a time and slides this window over the previous layer (Figure 1). A pooling layer is frequently used in tandem with the convolution to downsample the previous layer's output. Such operations are the heart of the MAX78000 CNN architecture, which employs 64 parallel processors, each with a pooling unit, a convolutional engine, and a dedicated weight memory.
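The sliding-window and pooling operations described above can be illustrated with a minimal NumPy sketch. It is only a software illustration of the arithmetic, not the MAX78000 hardware implementation; the function names, the kernel, and the input values are hypothetical.

```python
import numpy as np

def max_pool1d(x, size):
    """Downsample by keeping the maximum of each non-overlapping window."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def conv1d_valid(x, kernel, bias=0.0):
    """Slide `kernel` over `x` one step at a time (no padding, stride 1)."""
    k = len(kernel)
    out = np.empty(len(x) - k + 1)
    for i in range(len(out)):
        # Each output neuron sees only a local window of the previous layer.
        out[i] = np.dot(x[i:i + k], kernel) + bias
    return out

def relu(x):
    """Nonlinearity applied after the weighted sum."""
    return np.maximum(x, 0.0)

# Hypothetical input: pool first, then convolve, then apply the nonlinearity.
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 0.0, 7.0])
kernel = np.array([-1.0, 0.0, 1.0])

pooled = max_pool1d(x, 2)
features = relu(conv1d_valid(pooled, kernel, bias=-1.5))
print(pooled, features)
```

Pooling halves the number of samples before the kernel slides over them, which is why the two operations are commonly paired.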

This application note examines how to implement a keyword spotting application on the MAX78000, an ultra-low power microcontroller with a CNN accelerator. Twenty keywords were selected from the second version of the Google speech commands dataset to train the keyword spotting demonstration (KWS20).

Figure 1. Basic operation of the CNN.

MAX78000

The MAX78000 ^[1] is a new breed of Artificial Intelligence (AI) microcontroller built to enable neural networks to execute at ultra-low power and live at the edge of the IoT. This product combines the most energy-efficient AI processing with Maxim's proven ultra-low power microcontrollers. The hardware-based CNN accelerator enables battery-powered applications to execute AI inferences while spending only microjoules of energy. This makes it an ideal architecture for keyword spotting applications. The MAX78000 features an Arm^® Cortex^®-M4 with FPU CPU for efficient system control with an ultra-low-power deep neural network accelerator. Figure 2 shows the top-level architecture of the MAX78000.

Figure 2. The architecture of the MAX78000.

The MAX78000 evaluation kit provides a platform to leverage the capabilities of the MAX78000 to build new generations of AI devices. The EV kit features onboard hardware like a digital microphone, serial port, camera module support, and a 3.5in touch-enabled color, thin-film transistor (TFT) display ^[2] (Figure 3) for the KWS20 demo application.

Figure 3. Keyword spotting demo on the MAX78000EVKIT.

MAX78000 Development Flow

The PyTorch or TensorFlow-Keras toolchain can be used to develop a model for the MAX78000. The model is created with a series of defined subclasses representing the hardware. Some operations like pooling or activations are fused to 1D or 2D convolution layers, and fully connected layers. Rounding and clipping are also added to match the hardware.

The model is trained with floating-point weights and training data. Weights can be quantized either during training (quantization aware training) or after training (post-training quantization). The result of quantization can be evaluated over the evaluation dataset to check the accuracy degradation due to weight quantization.
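As an illustration of post-training quantization, the sketch below maps floating-point weights to signed 8-bit integers with one scale factor per tensor and measures the rounding error. This symmetric scheme is an assumption chosen for clarity; it is not necessarily the scheme the MAX78000 tools apply, and the function names are hypothetical.

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Symmetric per-tensor quantization to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1        # 127 for 8 bits
    qmin = -(2 ** (bits - 1))         # -128 for 8 bits
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clip to the representable range.
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map integer weights back to floating point to inspect the error."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(16, 3))    # stand-in for trained weights
q, scale = quantize_weights(w)
error = float(np.max(np.abs(w - dequantize(q, scale))))
print(q.min(), q.max(), error)
```

The worst-case rounding error is half a quantization step, which is the kind of degradation the evaluation pass on the quantized model is meant to expose.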

The MAX78000 synthesizer tool (ai8xize) accepts the PyTorch checkpoint or TensorFlow exported ONNX files as an input, as well as the model description in the YAML format. A sample data file (.npy file) is provided to the synthesizer as well to verify the synthesized model on the hardware. The inference outcome for this data is compared with the expected output of the presynthesis model.

The MAX78000 synthesizer automatically generates C code that can be compiled and executed on the MAX78000. The C code includes Application Programming Interface (API) calls that load the weights and the provided sample data to the hardware, execute an inference on the sample data, and compare the classification outcome with the expected result as a pass/fail sanity test. This generated C code can be used as an example to create your own applications. Figure 4 shows the overall development flow of the MAX78000.

Figure 4. Development flow of the MAX78000.

Keyword Spotting Methodologies

1. Mel Frequency Cepstrum Coefficient Feature Extraction

Mel Frequency Cepstrum Coefficient (MFCC) extraction is a well-known and popular feature extraction method ^[3]. The purpose of feature extraction is to represent the speech signal with a set of known components that are relevant for the classification.

The MFCC are obtained by decomposing the signal with a filter bank. The result is the Discrete Cosine Transform (DCT) of the real logarithm of the short-term energy on the Mel frequency scale. More specifically, the MFCC computation pipeline includes windowing the speech signal into frames, performing a Fast Fourier Transform (FFT) to find the power spectrum of each frame, filter bank processing on the Mel scale, and finally a DCT of the log-scaled, Mel-filtered power spectrum (Figure 5).
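The pipeline above can be sketched in NumPy as follows. The frame length, hop size, FFT size, number of filters, and number of coefficients are assumed values chosen for illustration and do not come from the demo; the helper names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def dct2(x, n_out):
    """Unnormalized DCT-II along the last axis, first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    # 1. Window the speech signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Power spectrum of each frame via the FFT.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 3. Filter bank processing on the Mel scale.
    energies = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    # 4. Logarithm followed by the DCT (small offset avoids log(0)).
    return dct2(np.log(energies + 1e-10), n_coeffs)

# One second of a synthetic 440 Hz tone with a little noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
signal = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.normal(size=t.size)
coeffs = mfcc(signal)
print(coeffs.shape)
```

Each row of the result holds the coefficients of one frame, which is the two-dimensional representation a classifier would consume.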

Figure 5. MFCC processing on Arm (initial model).

To pursue this approach, the speech data is preprocessed on the microcontroller to generate the MFCC: the FFT, filtering, log, and DCT must be implemented in the firmware of the Arm processor of the MAX78000. Next, the CNN performs the inferences on the MFCC of the speech data samples. This model was initially investigated for this application.

2. MFCC Approximation Using the CNN

An alternative approach was investigated to create two separate CNNs and improve efficiency. An MFCC estimator network (Melspectogram net) was trained to provide an approximation of the actual MFCC for a given waveform. A second KWS20 classifier network was employed to classify the keywords from the estimated MFCC. The CNN accelerator runs the MFCC and KWS20 network sequentially in this method. The MFCC operation converts time series samples into a two-dimensional (2D) space. This is modeled using a series of 1D convolutional layers. The KWS20 classifier receives 2D image-like data and passes it through a few 2D convolutional layers.

Figure 6. MFCC approximation on the CNN (the second approach).

3. Raw Data Processing with CNN

A single combined network is trained with the raw data in this third demo approach to identify the classes rather than have two separate CNNs for the MFCC and classification. This method simplifies the training and reduces the size of the network without any significant performance degradation. The network comprises a series of 1D convolutional layers mimicking the MFCC approximator, followed by a few 2D convolutional layers. A dense layer at the end generates the maximum likelihood of each class. This approach (Figure 7) is selected to construct the KWS20 demo, as elaborated in the following sections.

Figure 7. Demo model: combined CNN with raw data as input.

Dataset and Augmentation

This exercise uses version 2 of the speech commands dataset created by Google ^[4][5]. The dataset consists of over 100k utterances of 35 different words, stored as one-second .wav files sampled at 16kHz. Twenty of the 35 words were chosen as the desired classes, and the rest were labeled as the unknown class. Table 1 shows the selected keywords.

Table 1. Selected set of 20 keywords in demo
Class Code	Word	Number of Utterances	Class Code	Word	Number of Utterances	Class Code	Word	Number of Utterances
0	Up	3723	7	No	3941	14	Five	4052
1	Down	3917	8	On	3845	15	Six	3860
2	Left	3801	9	Off	3745	16	Seven	3998
3	Right	3778	10	One	3890	17	Eight	3787
4	Stop	3872	11	Two	3880	18	Nine	3934
5	Go	3880	12	Three	3727	19	Zero	4052
6	Yes	4044	13	Four	3728	20	Unknown	28375
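The class codes in Table 1 can be expressed as a simple lookup, sketched below. The lower-case word spelling and the function name are assumptions made for illustration; any word outside the 20 keywords falls into the unknown class.

```python
# Keyword order follows the class codes in Table 1.
KEYWORDS = [
    "up", "down", "left", "right", "stop", "go", "yes",
    "no", "on", "off", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine", "zero",
]
UNKNOWN_CLASS = len(KEYWORDS)   # class code 20

CLASS_CODE = {word: code for code, word in enumerate(KEYWORDS)}

def label_for(word):
    """Return the class code for a spoken word (hypothetical helper)."""
    return CLASS_CODE.get(word.lower(), UNKNOWN_CLASS)

print(label_for("Stop"), label_for("zero"), label_for("bed"))
```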

The number of utterances of the unknown class is significantly higher than the others as it includes the aggregate of all the remaining 15 classes. This results in the overtraining of the unknown class compared to the rest. The weight of the unknown class in the CrossEntropyLoss function (PyTorch) is set to 0.14 of the weight of other classes to address this issue.
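To show what the reduced class weight does, the sketch below computes a weighted cross-entropy in NumPy. It is intended to follow the weighted-mean reduction that PyTorch's CrossEntropyLoss applies when a weight tensor is supplied; the function name and the example inputs are hypothetical.

```python
import numpy as np

NUM_CLASSES = 21        # 20 keywords plus the unknown class
UNKNOWN_CLASS = 20

class_weights = np.ones(NUM_CLASSES)
class_weights[UNKNOWN_CLASS] = 0.14   # value quoted in the text

def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample cross-entropy losses."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(targets)), targets]
    w = weights[targets]
    # Samples of the unknown class contribute 0.14x as much as the others.
    return float(np.sum(w * per_sample) / np.sum(w))

# Hypothetical batch: uniform predictions, two keyword and two unknown samples.
logits = np.zeros((4, NUM_CLASSES))
targets = np.array([0, 5, 20, 20])
loss = weighted_cross_entropy(logits, targets, class_weights)
print(loss)
```

With the reduced weight, misclassified unknown samples pull the gradients less strongly, which offsets their larger share of the training set.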

Each waveform is augmented twice with additional noise, time shift, and random stretch to further enhance the dataset, tripling the original dataset size. The augmentation improves the network performance in a real environment with background noise. Figure 8 illustrates an utterance of Stop before and after augmentation.

The augmented dataset is partitioned into training, validation, and testing sets (Table 2).
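A minimal sketch of such an augmentation step is shown below. The noise level, shift range, and stretch range are assumed values, and the linear-interpolation stretch is one simple option rather than the method used by the training script.

```python
import numpy as np

def augment(wave, rng, noise_std=0.01, max_shift=1600, stretch_range=(0.9, 1.1)):
    """Return a stretched, shifted, noisy copy of `wave` with the same length."""
    n = len(wave)
    # Random stretch: resample the waveform by linear interpolation.
    rate = rng.uniform(*stretch_range)
    positions = np.arange(0, n, rate)
    stretched = np.interp(positions, np.arange(n), wave)
    out = np.zeros(n)
    m = min(n, len(stretched))
    out[:m] = stretched[:m]             # truncate or zero-pad back to n samples
    # Random time shift with zero fill.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros(n)
    if shift >= 0:
        shifted[shift:] = out[:n - shift]
    else:
        shifted[:n + shift] = out[-shift:]
    # Additive Gaussian noise.
    return shifted + rng.normal(0.0, noise_std, n)

rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)
# The original plus two augmented copies gives three waveforms per utterance.
dataset = [wave] + [augment(wave, rng) for _ in range(2)]
print(len(dataset), dataset[1].shape)
```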

The default TensorFlow data format is channel last. The expected input shape is (batch_size, width, channels) for the Conv1D operation and (batch_size, height, width, channels) for the Conv2D operation.

Samples are read sequentially and stored in 128 rows to make 1 x 128 x 128 tensors for training purposes. Figure 9a shows an image representation of the Stop data samples prior to being sent to the CNN.

On the other hand, the format of the synthesizer is channel first, like PyTorch. The training script generates an example sample data file for each class to be used by the synthesizer for verification. These sample class data files are converted to the channel-first format by transposing the dataset sample. Figure 9b shows the same sample transposed for the synthesizer.
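The reshaping and transposition described above can be sketched as follows. Zero-padding the one-second waveform to 128 x 128 = 16384 samples is an assumption, since the text does not state how the 16000 samples fill the tensor; the function names are hypothetical.

```python
import numpy as np

def to_training_tensor(samples):
    """Store samples sequentially in 128 rows of 128 values (channel last)."""
    padded = np.zeros(128 * 128, dtype=samples.dtype)   # assumed zero padding
    n = min(len(samples), padded.size)
    padded[:n] = samples[:n]
    return padded.reshape(1, 128, 128)    # (batch, width, channels)

def to_synthesizer_sample(tensor):
    """Swap the last two axes to obtain the channel-first layout."""
    return tensor.transpose(0, 2, 1)      # (batch, channels, width)

samples = np.arange(16000, dtype=np.int16)    # stand-in for one utterance
train = to_training_tensor(samples)
synth = to_synthesizer_sample(train)
print(train.shape, train[0, 1, 0], synth[0, 0, 1])
```

Sample number 128 starts the second row of the training tensor and moves to the second column after the transpose.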

Table 2. KWS20 Dataset size
Category	Number of Utterances
Training	197751
Validation	21972
Testing	68250

Figure 8. Waveform of Stop before and after augmentation.

Figure 9. Waveform of Stop represented by a 128 x 128 image:
a. Being fed to the network for training.
b. Transposed for use with the synthesis script.

CNN Model Training

The combined keyword spotting CNN is trained to classify the raw data. The model consists of two back-to-back CNNs: 1D (Conv1D) and 2D (Conv2D) convolutional networks. The Conv1D CNN includes four layers and extracts speech features. The Conv2D CNN comprises five layers, followed by a fully connected layer to classify the utterances. The model is trained with an augmented dataset for 20 keywords (Table 1). Figure 10 shows the CNN model.

Figure 10. Keyword spotting model in PyTorch.

The model training is executed by the following script:

(ai8x-training) $ ./train_kws20.sh

The script automatically downloads the Google speech commands version 2 dataset, expands it using the augmentation technique described above, and completes the training. Figure 11 shows the result of the model training.

Figure 11. Example of model training.

CNN Model Quantization

The CNN weights generated during training must be quantized to 8 bits. The weight quantization is done by executing this script:

(ai8x-synthesis) $ ./quantize_kws20.sh

The quantized model can be evaluated by executing the following script:

(ai8x-training) $ ./evaluate_kws20.sh

Figure 12 shows the result of the model evaluation.

Figure 12. Example of the model evaluation and confusion matrix after quantization.

CNN Model Synthesis

The network synthesis script generates pass/fail example C code, which includes the functions needed to initialize the MAX78000 CNN accelerator, load the quantized CNN weights and the provided input sample, and unload the classification results to compare them against the expected output. The synthesis tool requires three inputs (Figure 4):

Quantized PyTorch checkpoint file or TensorFlow model exported to the ONNX format.
Network model YAML description.
A sample input with the expected result to include in the generated C code for verification.

The synthesis script generates the output files shown in Figure 13.

Figure 13. Generated MAX78000 example source code.

The example code can be compiled and deployed to the MAX78000. Figure 14 shows the result of the execution with the confidence level per class.

Figure 14. Keyword spotting model execution result.

The bare-bones C code is used as the base to build the KWS20 demo platform. The CNN initialization, weights (kernels), and helper functions to load/unload the weights and samples are ported from the generated example code to the KWS20 demo described in the next section.

KWS20 Demonstration Platform

The KWS20 firmware demonstrates the detection of keywords on the MAX78000 EV kit shown in Figure 15. The onboard I²S microphone samples an 18-bit, 16kHz audio signal and streams it to the MAX78000. A simple highpass filter removes the DC offset of the microphone, and the samples are stored in a circular buffer. The signal level is averaged over 128-sample windows and compared to an adjustable threshold to find the beginning of a word. A level below this threshold is categorized as the silence prior to the word in an utterance. The beginning of a word is detected once the signal level passes the threshold. One second of 16kHz, 8-bit samples is needed to start an inference on the CNN accelerator. The signal level is also monitored to find the end of the spoken word: the inference can start if the average level falls and stays below an adjustable threshold for several back-to-back 128-sample windows or if the 16k samples have already been collected. The inference on the CNN accelerator for this network takes about 2.5ms. The inference result and the confidence level are shown on the display and the serial port ^[6]. Figure 16 summarizes the processing flow in the KWS20 demo firmware.
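The word-boundary logic can be sketched offline in Python as shown below; the actual firmware is written in C and processes a streaming circular buffer. The filter coefficient, the thresholds, the count of 20 silent windows, and the reading of "16k samples" as 128 x 128 are all assumptions made for illustration.

```python
import numpy as np

WINDOW = 128                          # samples per averaging window (from the text)
SAMPLES_PER_INFERENCE = 128 * 128     # assumed reading of "16k samples"

def highpass(x, alpha=0.995):
    """One-pole DC blocker: y[n] = x[n] - x[n-1] + alpha * y[n-1]."""
    y = np.zeros(len(x))
    prev_x = x[0] if len(x) else 0.0  # start from the first sample to avoid a transient
    prev_y = 0.0
    for n, xn in enumerate(x):
        prev_y = xn - prev_x + alpha * prev_y
        prev_x = xn
        y[n] = prev_y
    return y

def find_utterance(audio, start_threshold, end_threshold, silent_windows=20):
    """Return (start, end) sample indices to send for inference, or None."""
    filtered = highpass(np.asarray(audio, dtype=float))
    n_windows = len(filtered) // WINDOW
    levels = np.abs(filtered[:n_windows * WINDOW]).reshape(n_windows, WINDOW).mean(axis=1)
    start = None
    quiet = 0
    for w, level in enumerate(levels):
        if start is None:
            if level > start_threshold:
                start = w * WINDOW            # beginning of a word
            continue
        end = (w + 1) * WINDOW
        quiet = quiet + 1 if level < end_threshold else 0
        # Stop after enough silent windows or once enough samples are collected.
        if quiet >= silent_windows or end - start >= SAMPLES_PER_INFERENCE:
            return start, end
    return None

# Synthetic test signal: silence, a 440 Hz burst, silence, all with a DC offset.
burst = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(4096) / 16000.0)
audio = 0.5 + np.concatenate([np.zeros(4096), burst, np.zeros(8192)])
segment = find_utterance(audio, start_threshold=0.1, end_threshold=0.05)
print(segment)
```

Separate start and end thresholds keep a brief dip in level inside a word from ending the capture early.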

Figure 15. KWS20 on the MAX78000 EV kit.

Figure 16. Processing flow of the KWS20 demo firmware.

Conclusion

This application note demonstrates the implementation of a 20-keyword detection model and its final deployment on the ultra-low power MAX78000 platform for resource-constrained edge or IoT applications. It also highlights the MAX78000 architecture and describes the development flow to build a machine learning model, showcasing keyword spotting as the target application.

References

^[1] MAX78000 Datasheet

^[2] MAX78000 Evaluation Kit Datasheet

^[3] Ricardo Lopez-Ruiz (2018), From Natural to Artificial Intelligence, Algorithms and Applications

^[4] Pete Warden, "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition", Apr. 2018

^[5] http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

^[6] KWS20 Demo README