Developing Power-optimized Applications on the MAX78000
Abstract
Power consumption is a key factor for edge Artificial Intelligence (AI) applications where the entire system is powered by small battery cells and is expected to operate for months without recharging or replacing the batteries. The MAX78000 ultra-low power AI microcontroller is built to target such applications at the edge of the IoT. In this document, various configurations are described to enable users to develop power-optimized applications on the MAX78000, along with benchmarking examples. The power optimization methods are applied to two case study applications—Keyword Spotting with 20 keywords (KWS20) and Face Identification (FaceID), and the reported results can be used as a guideline for the user’s application.
Introduction
The MAX78000 is an ultra-low power microcontroller with a dedicated Convolutional Neural Network (CNN) accelerator. This architecture enables the development of very power efficient AI applications in energy constrained environments. The MAX78000 provides a variety of options to facilitate developing low power applications (elaborated in datasheet [1]), such as different oscillators, clock sources, and operation modes that can be selected based on a target application. The following sections contain an overview of such options with practical examples and actual measurements on the MAX78000 EV kit. For more information and details, follow the referenced materials throughout this document.
Note: The measurement results presented in this application note are based on running example codes on a MAX78000 EV kit. The actual result on the user's platform and specific applications can vary.
MAX78000
The MAX78000 is a new generation of AI microcontrollers built to enable the execution of neural networks at ultra-low power and live at the edge of the IoT. This product combines the most energy-efficient AI processing with Maxim Integrated's proven ultra-low power microcontrollers. The hardware-based CNN accelerator enables battery-powered applications to execute AI inferences while spending only microjoules of energy. The MAX78000 features an Atarm® Cortex®-M4 with FPU microcontroller for efficient system control with an ultra-low power deep neural network accelerator. A RISC-V core is also integrated and can execute application and control codes as well as drive the CNN accelerator. Figure 1 shows the top-level architecture of the MAX78000.
The MAX78000 evaluation kit (EV kit) provides a platform to leverage the capabilities of the MAX78000 to build new generations of AI devices. The EV kit features onboard hardware like a digital microphone, serial ports, camera module support, and a 3.5in touch-enabled color thin-film transistor (TFT) display [2]. It also includes the circuitry to monitor and display the power level. The MAX34417 [3] monitors VCOREA and VCOREB voltage as well as the current of the
MAX78000, and reports the accumulated power to the MAX32625 which is used as the power data processor that also controls the power display, as shown in Figure 2.
The power monitor operates in two modes; instantaneous, which displays average power, supply current, and voltage and Windowed Energy Accumulation, which measures the CNN power or total System Power based on triggering events. The instantaneous measurements are typically used for quick, coarse estimates of the power level. On the other hand, the windowed measurements are ideal to measure the accumulated energy between the user defined start and complete instances. These events are triggered by the toggling of two GPIOs of the MAX78000 (P0.2 and P0.3) which are also connected to LED1 and LED2 as well as ports of the MAX32625 power data processor. For more information on using the power monitor, refer to [4].
MAX78000 Clocks and Operating Modes
The MAX78000 supports different clock sources and low power operating modes that can be jointly configured to achieve optimized power and performance. In this application note, a brief description of each mode is provided to guide the reader to set up the modes according to the desired application. For more detailed descriptions, refer to [5].
Clocks
The MAX78000 includes multiple configurable clocks used by different peripherals. The user can configure the clock sources as needed to select the combination of performance and power efficiency. The selected System Oscillator (SYS_OSC) is the clock source for most internal blocks. The following oscillator sources are available and can be selected as SYS_OSC:
- Internal Primary Oscillator (IPO) – 100MHz:
IPO is the fastest frequency oscillator and draws the most power. When entering Low Power Mode (LPM), this oscillator can be powered down. - Internal Secondary Oscillator (ISO) – 60MHz:
This is a low power internal secondary oscillator that is the Power-On Reset default for SYS_OSC. - Internal Nano-Ring Oscillator (INRO) – 8kHz-30kHz:
INRO is an ultra-low power internal oscillator that can be selected as the SYS_OSC and is always enabled. The frequency is configurable to 8kHz, 16kHz, or 30kHz (default). - Internal Baud Rate Oscillator (IBRO) – 7.3728MHz:
This is a very low power internal oscillator that can be selected as SYS_OSC. This clock can optionally be used as a dedicated baud rate clock for the UARTs. - External Real-Time Clock Oscillator (ERTCO) – 32.768kHz:
ERTCO is an extremely low power internal oscillator that can be selected as the SYS_OSC. The ERTCO can optionally use a 32.768kHz input clock or an 8kHz independent nano-ring oscillator instead of an external crystal. This oscillator is the default clock for the Real-Time Clock (RTC) and is enabled automatically as soon as the RTC is enabled.
The selected SYS_OSC is the input to the system oscillator divider to generate the System Clock (SYS_CLK). The divider is selectable from 0 to 7, resulting in divisors of 1, 2, 4, 8, 16, 32, 64, or 128. SYS_CLK drives the Arm Cortex-M4 with FPU core, and all Advanced High-Performance Bus (AHB) masters in the system. After passing through a fixed divide by two divider, it also can generate the Advanced Peripheral Bus (APB) clock feeding the CNN accelerator, RISC-V, and peripherals. For more detailed information on the clock sources, configurations, and the list of registers to set, refer to section 3.1 of [5].
The API to setup clock is listed in Table 1.
API | Description |
int MXC_SYS_ClockSourceEnable (mxc_sys_system_clock_t clock) | Enables a clock source without selecting it; the following sources are available: MXC_SYS_CLOCK_IPO, MXC_SYS_CLOCK_IBRO, MXC_SYS_CLOCK_ISO, MXC_SYS_CLOCK_INRO, MXC_SYS_CLOCK_ERTCO, MXC_SYS_CLOCK_EXTCLK |
int MXC_SYS_Clock_Select (mxc_sys_system_clock_t clock) | Selects the clock source to be used for SYS_CLK; see the Clocks section for the list |
void SystemCoreClockUpdate(void) | Update the system core clock according to the selected clock source |
void MXC_SYS_ClockEnable (mxc_sys_periph_clock_t clock) | Enables peripheral clocks: MXC_SYS_PERIPH_CLOCK_GPIO0 to 2 MXC_SYS_PERIPH_CLOCK_DMA MXC_SYS_PERIPH_CLOCK_SPI0 and 1 MXC_SYS_PERIPH_CLOCK_UART0 to 3 MXC_SYS_PERIPH_CLOCK_I2C0 to 2 MXC_SYS_PERIPH_CLOCK_TMR0 to 5 MXC_SYS_PERIPH_CLOCK_ADC MXC_SYS_PERIPH_CLOCK_CNN MXC_SYS_PERIPH_CLOCK_PT MXC_SYS_PERIPH_CLOCK_TRNG MXC_SYS_PERIPH_CLOCK_SMPHR MXC_SYS_PERIPH_CLOCK_OWIRE MXC_SYS_PERIPH_CLOCK_CRC MXC_SYS_PERIPH_CLOCK_AES MXC_SYS_PERIPH_CLOCK_I2S MXC_SYS_PERIPH_CLOCK_PCIF MXC_SYS_PERIPH_CLOCK_WDT0 MXC_SYS_PERIPH_CLOCK_CPU1 // RISC-V clock MXC_SYS_PERIPH_CLOCK_WDT1 MXC_SYS_PERIPH_CLOCK_LPCOMP1 to 3 |
void MXC_SYS_ClockDisable (mxc_sys_periph_clock_t clock) | Disable peripheral clocks; see the Clocks section for the list |
Operating Modes
The MAX78000 includes multiple operating modes and the ability to fine tune power options to optimize performance and power. The system supports the following operating modes, as summarized in Table 2.
Operation Mode | Arm | RISC-V | Oscillators | System | CNN Quadrants | CNN RAM | Peripherals |
ACTIVE | On | On | All available | Available | Active, Configurable | Active, Configurable | Available |
SLEEP | Retention | On/ Retention | All available | Available | Active, Configurable | Active, Configurable | Available |
LPM | Retention | On/ Retention | ISO, IBRO, ERTCO, INRO |
0,1: Retention 2,3: Available |
Active, Configurable | Active, Configurable | Available |
UPM | Retention | Retention | IBRO, ERTCO, INRO |
Retention | Optionally off | Selectable retention | Retention, LPUART0, LPTMR0-1 LPWDT0, LPCOMP0-3, GPIO, WUT, RTC available |
STANDBY | Retention | Retention | IBRO, ERTCO, INRO |
Retention | Off | Selectable retention | Retention, WUT, RTC, COMP0, GPIO available |
BACKUP | Off | Off | ERCO | Configurable retention |
Off | Selectable retention | Off WUT, RTC, COMP0, GPIO available |
POWER DOWN |
Off | Off | Off | Off | Off | Off | Off, Configurable P3.0/1 |
ACTIVE Mode:
In this mode, both the Arm and the RISC-V cores can execute the application code and all digital and analog peripherals are available on demand. Dynamic clocking disables peripherals that are not in use, providing the optimal mix of high performance and low power consumption. Each of the peripherals can be individually enabled during active mode or powered down. The CNN and each of the four CNNx16_n Processor Arrays and their associated memories can be powered down or set to active as needed.
SLEEP Mode:
This mode can be used when either Arm or RISC-V executes the code, during which the other goes to sleep to reduce the power. The clocks can be optionally enabled to speed up the wakeup process. The CNN is available for use and each of the four CNNx16_n Quadrants are individually configurable for power down. All peripherals are on unless explicitly disabled
To enter SLEEP Mode: SCB->SCR &= ~SCB_SCR_SLEEPDEEP_Msk; // Needed just once Either directly set sleep mode bit in MXC_GCR->pm register, or use WFI instruction: • MXC_GCR->pm &= ~MXC_F_GCR_PM_MODE; MXC_GCR->pm |= MXC_S_GCR_PM_MODE_CM4; • Or use WFI: o Arm: __WFI(); // Enter to sleep and wait for interrupt to wake up o RISC-V: asm volatile("wfi"); To enter into this mode on Arm, MXC_LP_EnterSleepMode() API call can also be used.
Low Power Mode (LPM):
This mode can be used if RISC-V is expected to collect and move data from enabled peripherals while Arm is in deepsleep. If RISC-V is not used and is in sleep mode, Arm can also process the data and switch to LPM while peripherals or CNN are active and wake up on interrupts. Arm, DMA, and SPI0 are in state retention. The CNN quadrants, memory, and other peripherals are active and configurable. Note that CNN can only wake up RISC-V but not Arm in LPM. In this case, the Wake Up Timer (WUT) can be used to wake up Arm if the CNN processing time is known. Alternatively, the CNN interrupt can wake up RISC-V, which in turn can wake up Arm and then RISC-V can go back to sleep.
Before going to LPM: Make sure the clock source is enabled during LPM in MXC_GCR->pm, for example: MXC_GCR->pm &= ~MXC_F_GCR_PM_ISO_PD; // enable ISO during sleep Power Sequencer Registers (PWRSEQ) are used to configure clock source and wake up source. The clock source for peripherals and RISC-V needs to be specified. By default, it uses ISO during LPM. If needed, set this field to 1 to use the PCLK (APB Clock = SYS_CLK/2) PWRSEQ_LPCN ->lpcn |= MXC_F_PWRSEQ_LPCN_LPMCLKSEL; The wakeup source of the low power peripheral (e.g GPIO, RISC-V) needs to be enabled. For example: MXC_PWRSEQ->lppwen |= MXC_F_PWRSEQ_LPPWEN_CPU1; // wake up on RISC-V interrupt
To enter LPM:
SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk; // Needed just once
/* Write 1 to clear */
MXC_PWRSEQ->lpwkst0 = 0xFFFFFFFF;
MXC_PWRSEQ->lpwkst1 = 0xFFFFFFFF;
MXC_PWRSEQ->lppwst = 0xFFFFFFFF;
MXC_MCR->ctrl |= MXC_F_MCR_CTRL_ERTCO_EN; // enable ERTCO
Either directly set sleep mode bit in MXC_GCR->pm register, or use WFI instruction:
- MXC_GCR->pm &= ~MXC_F_GCR_PM_MODE; MXC_GCR->pm |= MXC_S_GCR_PM_MODE_LPM;
- Or use WFI:
- Arm:__WFI(); // Enter to deepsleep, wait for interrupt to wake up
In LPM, IPO is powered down and IBRO, ERTCO, INRO, or ISO should be used.
Note: If the Arm core goes to LPM, RISC-V can continue execution and go to sleep as needed. If the entry into LPM needs to be managed between two cores, Multiprocessor Communications can be used to ensure both are in a known state. Refer to section 3.3.3 in [5] for more information.
Micro Power Mode (UPM):
This mode is used for extremely low power consumption while using a minimal set of peripherals to provide wakeup capability. In this mode, the state of Arm and RISC-V, system, and CNN RAM as well as non-UPM peripherals are retained. The CNN quadrants can be powered off. IPO and ISO oscillators are powered down and IBRO, ERTCO, and INRO are enabled. The following peripherals can be used as source of wake up: WUT, LPUART0, LPTMR0, LPTMR1, LPWDT0, LPCOMP0-LPCOMP3, and GPIO
To enter UPM Mode: /* Write 1 to clear */ MXC_PWRSEQ->lpwkst0 = 0xFFFFFFFF; MXC_PWRSEQ->lpwkst1 = 0xFFFFFFFF; MXC_PWRSEQ->lppwst = 0xFFFFFFFF; MXC_MCR->ctrl |= MXC_F_MCR_CTRL_ERTCO_EN; // enable ERTCO Set UPM mode bit in MXC_GCR->pm register: MXC_GCR->pm &= ~MXC_F_GCR_PM_MODE; MXC_GCR->pm |= MXC_S_GCR_PM_MODE_UPM; To enter into this mode on Arm, MXC_LP_EnterDeepSleepMode() API call can also be used.
This mode is most beneficial in low activity cycle applications to save energy.
To ensure the deterministic entry to UPM mode, Arm and RISC-V might need synchronization using Multiprocessor Communication. For more information, refer to section 3.3.4 in [5].
STANDBY Mode:
In this mode, the system state is retained while RTC is running; Arm, RISC-V, system RAM, GPIO, and peripherals also retain their state. CNN quadrants are off and CNN memory has selectable retention. IPO and ISO oscillators are powered down and ERTCO, INRO, and IBRO are enabled. This mode provides further power reduction if duty cycling of an application is needed.
To enter STANDBY Mode: /* Write 1 to clear */ MXC_PWRSEQ->lpwkst0 = 0xFFFFFFFF; MXC_PWRSEQ->lpwkst1 = 0xFFFFFFFF; MXC_PWRSEQ->lppwst = 0xFFFFFFFF; MXC_MCR->ctrl |= MXC_F_MCR_CTRL_ERTCO_EN; // enable ERTCO Set STANDBY mode bit in MXC_GCR->pm register: MXC_GCR->pm &= ~MXC_F_GCR_PM_MODE; MXC_GCR->pm |= MXC_S_GCR_PM_MODE_STANDBY; To enter into this mode on Arm, MXC_LP_EnterStandbyMode() API call can also be used.
To ensure the deterministic entry to STANDBY mode, Arm and RISC-V might need to synchronize using Multiprocessor Communication. For more information, refer to section 3.3.5 in [5].
BACKUP Mode:
This mode is used to retain System RAM. In this mode, Arm, RISC-V, all peripherals, and all oscillators except ERTCO are powered off. System RAM 0-3 can be independently configured to retain state using PWRSEQ_LPCN registers.
Entering BACKUP mode does not require synchronization between the Arm and RISC-V, and either core can set GCR_PM.mode to BACKUP, which causes the device to immediately enter BACKUP.
To enter BACKUP Mode: /* Write 1 to clear */ MXC_PWRSEQ->lpwkst0 = 0xFFFFFFFF; MXC_PWRSEQ->lpwkst1 = 0xFFFFFFFF; MXC_PWRSEQ->lppwst = 0xFFFFFFFF; Set BACKUP mode bit in MXC_GCR->pm register: MXC_GCR->pm &= ~MXC_F_GCR_PM_MODE; MXC_GCR->pm |= MXC_S_GCR_PM_MODE_BACKUP To enter into this mode on Arm, MXC_LP_EnterBackupMode() API call can also be used.
Power Down Mode (PDM):
This mode is used during product level distribution and storage. Arm, RISC-V, memories, peripherals, and oscillators are powered down. There is no data retention in this mode, but values in the flash memory are preserved. In this mode, VREGI POR voltage monitor is operational.
Exit from PDM is possible with an external reset (RSTN) or a wakeup event on either P3.0 or P3.1 if configured.
To enter PDM Mode: MXC_GCR->pm &= ~MXC_F_GCR_PM_MODE; MXC_GCR->pm |= MXC_S_GCR_PM_MODE_POWERDOWN To enter into this mode on Arm, MXC_LP_ShutDownMode() API call can also be used.
Wakeup Configuration
Arm and RISC-V need to be configured to wake up in case of a wakeup event. For the list of wakeup sources in different operating modes, refer to section 3.4 in [5].
Arm Wakeup Configuration:
Interrupts should be enabled. In case of SLEEP, the Arm core wakes up after__WFI() in response to any interrupt.
For a low power mode, the wakeup source needs to be enabled:
- In Power Sequencer Registers (PWRSEQ): Enable peripherals and specific GPIO source in Low Power Peripheral Wakeup Enable registers (PWRSEQ_LPWKEN0 to 3, PWRSEQ_LPPWEN)
- In Global Control Registers (GCR): Enable GPIO Wakeup Enable and RTC Alarm Wakeup Enable in Power Management Register (GCR_PM)
Refer to sections 3.11 and 3.13 of [5].
Example of Arm wakeup configuration: Wakeup from LPM on GPIO: MXC_GCR->pm |= MXC_F_GCR_PM_GPIO_WE; // enable wakeup from GPIOs MXC_PWRSEQ->lpwken2 |= (1 << 7); // Pb2: GPIO2.7 is selected for wakeup Wakeup from LPM on RISC-V interrupt: MXC_PWRSEQ->lppwen |= MXC_F_PWRSEQ_LPPWEN_CPU1; // wakeup on RISC-V interrupt
RISC-V Wakeup Configuration:
To configure the RISC-V to wake up on an interrupt, the IRQ as well as the EVENT need to be enabled.
Example of RISC-V wakeup configuration: __enable_irq(); NVIC_EnableIRQ(CNN_IRQn); NVIC_EnableEVENT(CNN_IRQn);
Table 3 shows an example of Arm and RISC-V code. Arm goes to LPM (deepsleep) and RISC-V initializes and starts CNN, and goes to sleep until CNN inference is complete. Afterwards, a CNN interrupt wakes up RISC-V, which then wakes up ARM as well.
For complete code of this example, refer to [6].
Arm | RISC-V |
// mnist-riscv int main(void) { MXC_ICC_Enable(MXC_ICC0); // Enable cache // Switch to 100 MHz clock MXC_SYS_Clock_Select(MXC_SYS_CLOCK_IPO); SystemCoreClockUpdate(); // Set RISC-V boot address MXC_FCR->urvbootaddr = (uint32_t) &__FlashStart_; // Enable Sempahore clock MXC_SYS_ClockEnable(MXC_SYS_PERIPH_CLOCK_SMPHR); // Set wakeup ISR NVIC_SetVector(RISCV_IRQn, WakeISR); // DO NOT DELETE THIS LINE: MXC_Delay(SEC(2)); // Let debugger interrupt if needed // Enable RISC-V clock MXC_SYS_ClockEnable(MXC_SYS_PERIPH_CLOCK_CPU1); // Go to LPM, wait for RISC-V to wake up MXC_LP_ClearWakeStatus(); SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk; // SLEEPDEEP=1 __WFI(); // Go to deepsleep, Let RISC-V run return 0; } void WakeISR(void) { MXC_SEMA->irq0 = MXC_F_SEMA_IRQ0_EN & ~MXC_F_SEMA_IRQ0_CM4_IRQ; } |
int main(void) { Debug_Init(); // Set up RISCV JTAG MXC_ICC_Enable(MXC_ICC1); // Enable cache // Enable peripheral, CNN interrupt, CNN clock // CNN clock: 50 MHz div 1 cnn_enable(MXC_S_GCR_PCLKDIV_CNNCLKSEL_PCLK, MXC_S_GCR_PCLKDIV_CNNCLKDIV_DIV1); cnn_init(); // Bring state machine into consistent state cnn_load_weights(); // Load kernels cnn_load_bias(); cnn_configure(); // Configure state machine load_input(); // Load data input cnn_start(); // Start CNN processing while (cnn_time == 0) asm volatile("wfi"); // Sleep, Wait for CNN complete // Now signal Arm to wake up MXC_SEMA->irq0 = MXC_F_SEMA_IRQ0_EN | MXC_F_SEMA_IRQ0_CM4_IRQ; } int cnn_enable(uint32_t clock_source, uint32_t clock_divider) { // Reset all domains, restore power to CNN MXC_BBFC->reg3 = 0xf; // Reset MXC_BBFC->reg1 = 0xf; // Mask memory MXC_BBFC->reg0 = 0xf; // Power MXC_BBFC->reg2 = 0x0; // Iso MXC_BBFC->reg3 = 0x0; // Reset MXC_GCR->pclkdiv = (MXC_GCR->pclkdiv & ~(MXC_F_GCR_PCLKDIV_CNNCLKDIV | MXC_F_GCR_PCLKDIV_CNNCLKSEL)) | clock_divider | clock_source; // Enable CNN clock MXC_SYS_ClockEnable(MXC_SYS_PERIPH_CLOCK_CNN); // Enable Interrupt and wakeup on CNN complete __enable_irq(); NVIC_EnableIRQ(CNN_IRQn); NVIC_EnableEVENT(CNN_IRQn); return CNN_OK; } |
Power Consumption in Different Modes
In developing an application, the user can switch the MAX78000 to different operation modes and schedule the tasks accordingly to save power. An understanding of power consumption in each operation mode, as well as the time needed to boot up or to wake up, helps the user select the appropriate mode and plan the duty cycle of the tasks. The Power Monitor can be used in System Power mode [4] to measure power consumption and wakeup or bootup time in several cases. In this test, RISC-V and peripherals are off and Arm is tested in different operation modes. For the wakeup time test, RISC-V is used as the wakeup source of Arm. Figure 3 shows the result of this experiment. ACTIVE and SLEEP power were measured using ISO oscillator at 60MHz. Other low power modes employed ERTCO (32kHz) during the sleep period (API default).
Bootup time (defined as the time from power-up to the start of executing the main code) and wakeup time with different clock frequencies are also tested, as reported in Figure 4. Wakeup time is scaled with selected clock frequency. The bootup time is measured with power on default clock (ISO at 60MHz).
CNN Power
The core of an AI application is the inference, performed by the CNN accelerator. Depending on the application, the inference can happen continuously on incoming data or periodically at certain time intervals. The CNN inference can start once the input data is ready (e.g. KWS case study), or in FIFO mode (e.g. FaceID example) where it starts as soon as enough data to begin the process is stored in the FIFO. The CNN power consumption is measured in three phases:
- Loading weights (kernels): Happens once to load weights into CNN memory in active mode
- Loading input data: Every time there is a new inference. In FIFO mode, it can overlap with inference
- Inference: Operates on input data and generates the classification result
The Power Monitor uses GPIO signals to trigger each event and measure the power of each phase in CNN Power Mode [4]. To conduct measurements, two examples in the Maxim SDK were selected: KWS20 and FaceID. Each example was trained [7], synthesized [8], and flashed into the MAX78000 EV kit. The synthesis script can be instructed to generate the code for Arm only (default: RISC-V is off), or for RISC-V (with --deepsleep --riscv --riscv-debug flags) where Arm initializes RISC-V and goes to sleep. The synthesis script can also include GPIO signaling to instruct the Power Monitor to run in CNN Power Mode (--energy). The energy, time, inactive (I) and active (A) power measurements are displayed on the Power Monitor display. The results of testing KWS20 and FaceID in different operation modes and clock sources (IPO at 100MHz and ISO at 60MHz) are shown in Table 4 and Table 5. Note that FIFO is used in the case of FaceID, so inference time includes loading time as well. According to the results, the inference time and energy using ARM only as well as ARM and RISC-V are similar when CNN is used with the same clock. Loading weights and data take longer with slower clock sources, causing a significant increase in energy. The user should consider the energy needed to load kernels and inputs to decide how to schedule the application. In cases in which the activity cycle is small, reloading the weight might improve the overall power consumption. The inference time and energy are not expected to improve in SLEEP mode or LPM, as they are measured only during the activity window of CNN. However, they significantly affect the overall power consumption of applications, as shown later.
The case in which both ARM and RISC-V were asleep during idle time was selected to test the effect of oscillator frequency by applying different clock frequencies and sources. RISC-V drives loading weights and inputs, as well as manages CNN. By reducing the clock frequency, the inference time and energy go up exponentially. This suggests that executing the loading and inference at the highest speed and then going to a longer sleep period tends to improve the overall energy consumption. The result of this experiment is shown in Figure 5 and Figure 6.
KWS20: Operation Mode |
Clock Frequency (MHz) | Kernel Loading | Input Loading | Inference | |||||||||||
Arm | RISC-V | CNN | E(mJ) | t(ms) | I(mW) | A(mW) | E(mJ) | t(ms) | I(mW) | A(mW) | E(mJ) | t(ms) | A(mW) | ||
Arm Only | Always ACTIVE | 100 | N/A | 50 | 0.146 | 11.6 | 7.2 | 19.8 | 0.013 | 1.1 | 7.2 | 19.27 | 0.14 | 2 | 37.45 |
SLEEP | 100 | N/A | 50 | 0.148 | 11.6 | 7.02 | 19.8 | 0.013 | 1.1 | 7.02 | 19.33 | 0.136 | 2 | 35.82 | |
LPM | 60 | N/A | 30 | 0.154 | 19.3 | 5.33 | 13.29 | 0.014 | 1.8 | 5.33 | 13.06 | 0.139 | 4.1 | 18.32 | |
Arm and RISC-V | Always ACTIVE | 100 | 50 | 50 | 0.42854 | 34.72 | 5.47 | 17.81 | 0.04001 | 3.31 | 10.45 | 22.52 | 0.1401 | 1.99 | 21.86 |
Only Arm SLEEP | 100 | 50 | 50 | 0.04036 | 3.31 | 5.5 | 17.71 | 0.14077 | 2.01 | 21.89 | |||||
Both SLEEP | 100 | 50 | 50 | 0.03829 | 3.15 | 5.47 | 17.63 | 0.14055 | 2.19 | 21.41 | |||||
Both SLEEP | 60 | 30 | 30 | 0.487 | 50.3 | 3.48 | 13.16 | 0.0481 | 5.53 | 3.63 | 12.31 | 0.14595 | 3.38 | 13.06 | |
LPM | 60 | 30 | 30 | 0.0456 | 4.7 | 3.48 | 12.92 | 0.1424 | 3.5 | 13.67 |
FaceID: Operation Mode |
Clock Frequency (MHz) | Kernel Loading | Input Loading | FIFO Loading + Inference |
|||||||||||
Arm | RISC-V | CNN | E(mJ) | t(ms) | I(mW) | A(mW) | E(mJ) | t(ms) | I(mW) | A(mW) | E(mJ) | t(ms) | A(mW) | ||
Arm Only | Always ACTIVE | 100 | N/A | 50 | 0.158 | 12.2 | 7.03 | 20.01 | 0.132 | 12.1 | 7.03 | 17.91 | 0.435 | 14 | 38.19 |
SLEEP | 100 | N/A | 50 | 0.156 | 12.2 | 7.22 | 20.01 | 0.13 | 12.1 | 7.22 | 17.93 | 0.43 | 14 | 38 | |
LPM | 60 | N/A | 30 | 0.188 | 20.3 | 4.6 | 13.5 | 0.162 | 20 | 4.6 | 12.61 | 0.472 | 24 | 24.1 | |
Arm and RISCV | Always ACTIVE | 100 | 50 | 50 | 0.26 | 36 | 3.52 | 10.64 | 0.30203 | 24.82 | 14.54 | 26.72 | 0.57958 | 24.37 | 38.39 |
Only Arm SLEEP | 100 | 50 | 50 | 0.29813 | 24.32 | 5.46 | 17.47 | 0.57823 | 24.87 | 29.3 | |||||
Both SLEEP | 100 | 50 | 50 | 0.29899 | 24.82 | 5.48 | 17.53 | 0.57921 | 24.37 | 29.18 | |||||
Both SLEEP | 60 | 30 | 30 | 0.53 | 52 | 3.45 | 13.48 | 0.36954 | 41.47 | 3.57 | 12.49 | 0.65009 | 40.72 | 19.52 | |
LPM | 60 | 30 | 30 | 0.328 | 36.5 | 3.45 | 12.45 | 0.608 | 36.1 | 20.36 | |||||
LPM, Fast FIFO | 60 | 30 | 30 | N/A | N/A | N/A | N/A | 0.524 | 25.2 | 24.29 | |||||
LPM | 60 | 60 | 60 | N/A | N/A | N/A | N/A | 0.254 | 18.2 | 6.02 | 19.19 | 0.533 | 18 | 35.58 | |
LPM, Fast FIFO | 60 | 60 | 60 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.471 | 12.6 | 43.57 |
Retaining Weights in CNN Memory
Each of the four CNN Mask RAMs (the CNN weights storage) can be configured to retain the weights in case of UPM, BACKUP, or STANDBY. On the other hand, CNN configuration registers are lost in any of those power saving modes (except SLEEP and LPM) regardless of CNN Mask RAM retention setting and need to be reloaded.
For some low duty cycle applications, the user can choose to turn off Mask RAMs to further reduce power consumption during sleep. In this case, the effect of extra power used to repeatedly reload weights should be considered. Table 6 shows the power consumption increase from 0.17mW to 0.38mW during UPM when CNN weights are retained for the FaceID example. In this case, the CNN configuration reloading takes a fraction of a millisecond. However, if CNN Mask RAMs are off, the total reloading time of weights and configuration is greater than 12 milliseconds. The user can study such tradeoffs to optimize the schedule and mode of operations for their applications. The following examples show how to turn on/off weight retention and how to reinitialize CNN after wakeup.
Example: Turn on weight retention MXC_BBFC->reg3 = 0xf; // Reset on MXC_BBFC->reg1 = 0x0; // turn off CNN RAM0-3 power MXC_BBFC->reg3 = 0x0; // Reset off
Example: Loading weights and configuring CNN with CNN Mask RAMs retention // wake up from UPM, BACKUP, or STANDBY – CNN memory was retained cnn_init(); // Bring state machine into consistent state cnn_load_bias(); cnn_configure(); // Configure state machine … // proceed to loading data
Example: Turn off weight retention prior to sleep (must turn back on after waking up) MXC_BBFC->reg3 = 0xf; // Reset on MXC_BBFC->reg1 = 0x0; // turn off CNN RAM0-3 power MXC_BBFC->reg3 = 0x0; // Reset off /* get ready to go to sleep */ MXC_LP_EnterDeepSleepMode(); // Arm enters LPM, waits for wakup /* restore power to CNN RAM0-3 after wakeup, or prior to next cnn_load() */ MXC_BBFC->reg3 = 0xf; // Reset MXC_BBFC->reg1 = 0xf; // turn on CNN RAM0-3 power MXC_BBFC->reg3 = 0x0; // Reset
Example: Loading weights and configuring CNN without CNN Mask RAMs retention // Power up or wake up from UPM, BACKUP or STANDBY, CNN memory not retained cnn_init(); // Bring state machine into consistent state cnn_load_weights(); // Load kernels cnn_load_bias(); cnn_configure(); // Configure state machine … // proceed to loading data
UPM | Loading Weights/CNN Config After Waking up from UPM | |||
CNN Weight Retention | P(mW) | E(µJ) | T(ms) | P(mW) |
Off | 0.172 | 242.38 | 12.16 | 19.93 |
On | 0.387 | 1.956 | 0.1089 | 17.95 |
Application Power
When developing an application, it is important to choose the operation mode of each core, clock frequency, and duty cycle. Typically, applications running on Arm can be executed at a higher frequency (100MHz, vs. 50MHz APB clock) and have more optimized code, which can result in lower overall energy used. If there are parallel tasks that can be assigned to both Arm and RISC-V cores, it is important to use the proper operation mode for each core. In SLEEP and LPM, cores can operate or sleep independently. However, in other power saving modes like UPM or STANDBY, both cores are powered down simultaneously. In this case, Arm and RISC-V might need to coordinate in the case that they both go to low power mode using Multiprocessor Communications as described in section 8.2 of [5].
For applications that require occasional short activity periods, UPM, STANDBY, BACKUP, or SHUTDOWN mode can be an option. The user can evaluate the energy cost of maintaining weights in CNN memory during the sleep periods and compare with the energy of reloading weights prior to each activity window. The Power Monitor circuit on the MAX78000 EV kit can be used to measure the overall system power within a time window [4]. This helps the user to profile the application power at run time as needed.
Case Study 1: Keyword Spotting (KWS20) Demo on Arm and CNN
This example application demonstrates the recognition of a set of 20 keywords using the MAX78000 EV kit [9]. The application reads I2S microphone samples at 16kHz and monitors the threshold on 128 sample windows. Once it goes above the silence threshold, the application captures one second worth of samples, runs CNN inference on the samples, and displays the classification result on the TFT display.
In this benchmarking example, RISC-V is not used and the Arm core handles the entire task. Several modes of operation were tested. The Power Monitor in System Power Mode is used to measure the total energy in ten second intervals, both when there is complete silence (i.e. no inference or TFT update) and when keywords are uttered at a rate of one word per second (ten inferences and TFT updates). The result is shown in Table 7. In ACTIVE mode, the Arm core always operates. The power in this mode is about 20mW, regardless of silence or detection as the inference energy is a fraction of total power consumption. By using LPM during the idle time of collecting I2S samples as well as the inference time, and by turning off the CNN clock after inferences (a major reduction in power), the power goes down to 8.3mW (6.58mW during silence).
Energy avg during 10s (silence) | Energy avg during 10s (10xwords) | |||||||
SYS Clock | CNN Clock | Arm Mode | E(mJ) | t(s) | P(mW) | E(mJ) | t(s) | P(mW) |
IPO |
50Mhz | ACTIVE | 200.9 | 10.002 | 20.09 | 206.2 | 10.045 | 20.53 |
SLEEP during inference |
200.4 | 10.001 | 20.04 | 204.3 | 10.001 | 20.43 | ||
SLEEP between I2S interrupts & during inference | 155 | 9.63 | 16.1 | 209.3 | 12.01 | 17.4 | ||
LPM between I2S interrupts & during inference | 146.7 | 9.63 | 15.2 | 211.5 | 12.35 | 17.1 | ||
LPM between I2S interrupts and during inference, CNN clock off after completion | 63.5 | 9.62 | 6.58 | 101.9 | 12.18 | 8.3 | ||
Same as above, TFT off | 66.51 | 9.63 | 6.75 | 67.4 | 9.64 | 6.99 |
The average power consumption and activity duty cycle of the last two power optimized operation modes in Table 7 are analyzed and demonstrated in Figure 7.
Case Study 2-A: FaceID Demo on Arm and CNN
This demo application demonstrates the identification of subjects using facial recognition on the MAX78000. The FaceID CNN model generates an embedding of the size 512 from live images captured from the EV kit camera. The distance of the embedding from the embeddings of known subjects is calculated, and if it is greater than a threshold, the best candidate is selected. The database of known embeddings is created from pictures of the subjects and is integrated with the firmware. The inference is executed on a 160x120 face box of each image. To enhance the performance of identification, the inference is executed three times on a frame as well as each time the face box is slightly moved on the picture.
In this case study, RISC-V is off and the entire task is driven by Arm and executed sequentially. The Arm goes to SLEEP mode during inference and UPM or STANDBY between frames for about half a second. Table 8 summarizes the execution time and power of main functions of the FaceID. Table 9 displays the time and power consumption during each operation mode.
Processing One Frame | Energy Avg | ||||
---|---|---|---|---|---|
SYS Clock | CNN Clock | Arm duty cycle | E(µJ ) | t(ms) | P(mW) |
ARM-100MHz | 50MHz | Image capture | 1189.8 | 171.2 | 6.95 |
CNN kernel load (for STANDBY or UPM) | 935.3 | 37.5 | 24.94 | ||
3 x CNN run | 1164.7 | 42 | 27.73 | ||
3 x CNN unload | 1.6 | 0.15 | 10.7 | ||
3x Distance calculation | 470.4 | 48 | 9.8 | ||
Display image on TFT | 620.5 | 49.4 | 12.56 | ||
Total | 4382.2 | 348.25 |
Mode | Active | UPM | Sleep | Total |
Time (ms) | 227 | 498 | 160 | 885 |
Power (mW) | 18.14 | 0.3 | 6.92 | |
Energy (µJ) | 4118 | 150 | 1112 | 5380 |
Mode | Active | Standby | Sleep | Total |
Time (ms) | 227 | 497 | 160 | 884 |
Power (mW) | 18.14 | 0.03 | 6.92 | |
Energy (µJ) | 4118 | 15 | 1112 | 5245 |
Case Study 2-B: FaceID Demo on Arm, RISC-V, and CNN
In this scenario, Arm handles the TFT control and display (only accessible through Arm on MAX78000 EV kit) and RISC-V drives image capture from the camera, CNN loading and inference, and distance calculation of the embedding against the database. Shared memory and mailboxes are employed to coordinate the Arm and RISC-V communication.
Table 10 summarizes the execution time and power during major functions of the FaceID in this mode. After processing a frame, Arm and RISC-V go to STANDBY mode for about half a second like the previous case.
Processing One Frame | Energy Avg | ||||
SYS Clock | CNN Clock | RISC-Arm | E(µJ) | t(ms) | P(mW) |
ARM-100MHz RISCV-60MHz | 50MHz | Image capture | 1392.6 | 169 | 8.24 |
CNN kernel load | 818.3 | 35.5 | 23.05 | ||
3 x CNN run | 7910.1 | 327 | 24.19 | ||
3 x CNN unload | 9.5 | 0.9 | 10.5 | ||
3x Distance calculation | 699.6 | 66 | 10.6 | ||
Display image on TFT (on Arm in parallel with RISC-V processing) | 60 | 12.48 | |||
Total | 10830 | 658.9 |
Mode | Active | Standby | Sleep | Total |
Time (ms) | 448 | 500 | 169 | 1117 |
Power (mW) | 18.14 | 0.05 | 7.89 | |
Energy (µJ) | 8126.72 | 25 | 1333.41 | 9485.13 |
By comparing the two implementations of FaceID in Figure 10, it is clear that the total energy to process a frame is significantly lower by using Arm, which is primarily due to its higher speed as opposed to RISC-V (100MHz, vs 50MHz) resulting in a shorter duration of the ACTIVE state.
Summary
This application note covers an overview of power optimization techniques supported by the MAX78000 and the results when used with Keyword Spotting and FaceID applications. The following summary highlights some general recommendations to improve the power consumption:
- Faster is generally better. Generally, using higher clock rates makes the execution faster and the impact of constant static power is reduced, improving the overall power consumption.
- Single core is preferred if there are not many parallel tasks to perform. Running tasks on RISC-V is slower due to its clock rate, and tends to result in an increase in energy usage compared to Arm.
- Take advantage of UPM, BACKUP, and STANDBY when there is no activity. Consider LPM when Arm can sleep while some peripherals are running.
- Load kernels (weights) once and retain in memory to avoid reloading energy in UPM, BACKUP, and STANDBY modes. However, if the duty cycle of activity is very low, consider disabling weight retention or using POWERDOWN mode and reloading the weights in every cycle.
- Make sure to turn off the CNN clock after CNN processing is complete and turn it on before accessing CNN again in the next activity period.
- Use LPM when CNN is running and Arm is waiting for inference to complete. To wake up, either use RISC-V or the wakeup timer. A CNN interrupt cannot wake up Arm directly in LPM.
- Using fast FIFO or quad fast FIFO (supported only if RISC-V drives CNN) significantly improves input loading time and inference energy (--fast-fifo, --fast-fifo-quad) [8].
参考电路
[2] MAX78000 Evaluation Kit Data Sheet
[4] MAX78000 Power Monitor and Energy Benchmarking Guide
[6] https://github.com/MaximIntegratedAI/MAX78000_SDK/tree/master/Examples/MAX78000/CNN/mnist-riscv
[7] https://github.com/MaximIntegratedAI/ai8x-training, please see README for instructions.
[8] https://github.com/MaximIntegratedAI/ai8x-synthesis, please see README for instructions.
[9] https://github.com/MaximIntegratedAI/MAX78000_SDK/tree/master/Examples/MAX78000/CNN/kws20_demo
[10] https://github.com/MaximIntegratedAI/MAX78000_SDK/tree/master/Examples/MAX78000/CNN/faceid_evkit